How to speed up the NLP text annotation process

Text annotation techniques like sequence labeling are vital in NLP, but are tedious, time-consuming, and expensive. Here’s a shortcut to save time.


When you’re building a Natural Language Processing model, the text annotation process is usually the most laborious and expensive part for your business. Tools like Doccano can speed up text annotation, but you can save hours or days more by computationally annotating some of the data yourself, reducing the amount of work your human annotators need to undertake. Here’s how it’s done.

Examine the Doccano JSON import format

If you export data from Doccano you’ll see that it’s in JSONL (JSON Lines), a dialect of JavaScript Object Notation in which each line is a standalone JSON object. Any sequence labeling data you want to import back into Doccano needs to follow this format. You need one object per item, with its own id, text, meta, annotation_approver, and a list of labels giving the positions of the labels found within the text.

{"id": 3863, 
 "text": "amd ryzen 7 eight core 1700x 3.80ghz socket am4 processor retail", 
 "meta": {}, 
 "annotation_approver": null, 
 "labels": [[0, 3, "brand"], 
            [29, 33, "speed"], 
            [4, 11, "range"], 
            [23, 28, "model"], 
            [12, 22, "cores"], 
            [37, 47, "socket"], 
            [58, 64, "packaging"]]
}
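To sanity-check that the label offsets line up with the text, you can parse a line of the export with Python’s json module and slice the string. A minimal sketch, using two of the labels from the record above:

```python
import json

# One record from a Doccano JSONL export: each line is a standalone JSON object
line = ('{"id": 3863, "text": "amd ryzen 7 eight core 1700x 3.80ghz '
        'socket am4 processor retail", "meta": {}, '
        '"annotation_approver": null, '
        '"labels": [[0, 3, "brand"], [4, 11, "range"]]}')

record = json.loads(line)

# Each label is [start, end, name]; slicing the text recovers the span
for start, end, name in record['labels']:
    print(name, '->', record['text'][start:end])
# brand -> amd
# range -> ryzen 7
```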

Load your data

For this example, I’m conducting sequence labeling on the PriceRunner CPUs dataset. This contains thousands of CPU names in various formats and is a great test dataset for product attribute extraction models. These require you to sequence label the data so you can train an NLP model that generalises and spots other similar attributes without explicit training. We actually only need the product_title column.

import pandas as pd

df = pd.read_csv('data/pricerunner_cpus.csv',
                 names=['product_id', 'product_title', 'vendor_id', 'cluster_id',
                        'cluster_label', 'category_id', 'category_label'])
df = df[['product_title']]
df.head()
product_title
0 amd ryzen 7 eight core 1700x 3.80ghz socket am...
1 amd ryzen 7 1700x 8 core am4 cpu/processor
2 amd ryzen 7 1700x 3.4ghz 16mb l3 processor
3 amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh...
4 open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ...

Create attribute lists

Next we’ll identify some of the attributes in the dataset and create some lists of attributes to find in the product names. I’ve selected the brand (e.g. AMD or Intel); the range (e.g. Core i3 or Ryzen 7); the number of cores; the number of threads; the cache size in megabytes; the packaging (e.g. box or tray); and the clock speed. There’s a lot of variation in the ways all of these are written, so you’ll need to be creative. Note that longer variants are listed first (e.g. '16 core' before '6 core'), so the matcher doesn’t stop at a substring of a longer attribute.

brands = ['amd', 'intel']
ranges = ['core i3', 'core i5', 'core i7', 'core i9', 'xeon', 'celeron', 'pentium',
          'ryzen 3', 'ryzen 5', 'ryzen 7', 'ryzen 9', 'threadripper', 'epyc',
          'phenom', 'a8', 'a10', 'x4', 'opteron', 'a4', 'sempron']
cores = ['16 core', '8 core', '6 core', 'hexa core', '4 core', '2 core', 'dual core']
threads = ['32 thread', '16 thread', '8 thread', '4 thread', '2 thread']
cache = ['8mb', '9mb', '12mb', '16mb',
         '8 mb', '9 mb', '12 mb', '16 mb']
socket = ['am4', '1366', '1150', '1151']
packaging = ['box', 'tray']
clock_speed = ['2.0ghz', '2.1ghz', '2.2ghz', '2.3ghz', '2.4ghz', '2.5ghz', '2.6ghz', '2.7ghz', '2.8ghz', '2.9ghz', '3.0ghz',
               '3.1ghz', '3.2ghz', '3.3ghz', '3.4ghz', '3.5ghz', '3.6ghz', '3.7ghz', '3.8ghz', '3.9ghz', '4.0ghz',

               '2.00ghz', '2.10ghz', '2.20ghz', '2.30ghz', '2.40ghz', '2.50ghz', '2.60ghz', '2.70ghz', '2.80ghz', '2.90ghz', '3.00ghz',
               '3.10ghz', '3.20ghz', '3.30ghz', '3.40ghz', '3.50ghz', '3.60ghz', '3.70ghz', '3.80ghz', '3.90ghz', '4.00ghz',

               '2.0 ghz', '2.1 ghz', '2.2 ghz', '2.3 ghz', '2.4 ghz', '2.5 ghz', '2.6 ghz', '2.7 ghz', '2.8 ghz', '2.9 ghz', '3.0 ghz',
               '3.1 ghz', '3.2 ghz', '3.3 ghz', '3.4 ghz', '3.5 ghz', '3.6 ghz', '3.7 ghz', '3.8 ghz', '3.9 ghz', '4.0 ghz']
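Long hand-typed lists like clock_speed are easy to get wrong (a single missing space is hard to spot). One option is to generate the variants programmatically instead; a sketch, assuming speeds between 2.0 and 4.0 GHz in 0.1 steps:

```python
# Clock speeds from 2.0 to 4.0 GHz in 0.1 GHz steps
speeds = [s / 10 for s in range(20, 41)]

# One and two decimal places, with and without a space before 'ghz'
clock_speed = ([f'{s:.1f}ghz' for s in speeds]
               + [f'{s:.2f}ghz' for s in speeds]
               + [f'{s:.1f} ghz' for s in speeds]
               + [f'{s:.2f} ghz' for s in speeds])

print(clock_speed[:3])  # ['2.0ghz', '2.1ghz', '2.2ghz']
```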

Examine coverage

Next we’ll create a function called find_attribute() which will search each product_title in the dataframe and return the first value it finds from our list of attributes. This works well on product titles, where attribute values tend not to be repeated, but you’ll need to modify the function for larger bodies of text. Run the function on each of the attribute lists so it assigns any matching values to the dataframe. You’ll probably want to put these in a for loop to avoid the repetition shown below.

def find_attribute(attributes, text):
    # Return the first attribute from the list found in the text, or '' if none match
    return next((x for x in attributes if x in text), "")

df['brand'] = df.apply(lambda x: find_attribute(brands, x['product_title']), axis=1)
df['range'] = df.apply(lambda x: find_attribute(ranges, x['product_title']), axis=1)
df['cores'] = df.apply(lambda x: find_attribute(cores, x['product_title']), axis=1)
df['threads'] = df.apply(lambda x: find_attribute(threads, x['product_title']), axis=1)
df['cache'] = df.apply(lambda x: find_attribute(cache, x['product_title']), axis=1)
df['socket'] = df.apply(lambda x: find_attribute(socket, x['product_title']), axis=1)
df['packaging'] = df.apply(lambda x: find_attribute(packaging, x['product_title']), axis=1)
df['clock_speed'] = df.apply(lambda x: find_attribute(clock_speed, x['product_title']), axis=1)
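As noted, the repetition above can be collapsed into a loop over a dictionary mapping column names to attribute lists. A self-contained sketch, using shortened lists and a single-row stand-in dataframe so it runs on its own:

```python
import pandas as pd

def find_attribute(attributes, text):
    # Return the first attribute found in the text, or '' if none match
    return next((x for x in attributes if x in text), '')

df = pd.DataFrame({'product_title': ['amd ryzen 7 1700x 8 core am4']})

# Map each output column to its attribute list, then loop once
attribute_lists = {
    'brand': ['amd', 'intel'],
    'cores': ['8 core', '6 core'],
    'socket': ['am4', '1151'],
}

for column, values in attribute_lists.items():
    df[column] = df['product_title'].apply(lambda t, v=values: find_attribute(v, t))

print(df.loc[0, 'brand'], df.loc[0, 'cores'], df.loc[0, 'socket'])
# amd 8 core am4
```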

Now examine the dataframe and see what sort of coverage you get. How many of the products have all of their attributes filled in? Go back to your lists and add any other common terms you can see so more of the attributes can be annotated by the sequence labeler. Keep going until you’ve filled in most of the gaps.

df[['product_title','brand','range','cores','threads','cache','socket','packaging','clock_speed']].head(10)
product_title brand range cores threads cache socket packaging clock_speed
0 amd ryzen 7 eight core 1700x 3.80ghz socket am... amd ryzen 7 am4 3.80ghz
1 amd ryzen 7 1700x 8 core am4 cpu/processor amd ryzen 7 8 core am4
2 amd ryzen 7 1700x 3.4ghz 16mb l3 processor amd ryzen 7 16mb 3.4ghz
3 amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh... amd ryzen 7 8 core 16 thread 3.8ghz
4 open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ... amd ryzen 7 8 core box 3.8 ghz
5 amd ryzen 7 1700x 8 core 16 thread am4 cpu/pro... amd ryzen 7 8 core 16 thread am4
6 wof processor amd ryzen 7 1700x 8 x 3.4 ghz oc... amd ryzen 7 3.4 ghz
7 amd ryzen 7 1700x cpu am4 3.4ghz 3.8 turbo 8 c... amd ryzen 7 8 core am4 3.4ghz
8 amd prozessor cpu ryzen 7 sockel am4 1700x 8 x... amd ryzen 7 16 mb am4 box
9 intel core intel core i7 7700 processor 8m cac... intel core i7

Add labels

When you’re happy with the coverage you’ve got and most of the attributes in the product names are being identified and added to the dataframe attribute columns, we can create the data we need for labeling.

The labels required in NLP models need to define the start and end positions of the string and the label to which it should be assigned. For example, [0, 3, "brand"] says that we’ve identified the brand attribute at position 0-3 (i.e. it’s AMD). This is achieved with the attribute_strpos() function I wrote below.

def attribute_strpos(attributes, text, label):
    # Find the first matching attribute and return its [start, end, label] span
    attribute = next((x for x in attributes if x in text), "")

    if attribute:
        start = text.find(attribute)
        end = start + len(attribute)
        return [start, end, label]

    return None

Next, we need a function that takes each of the lists of attributes and runs attribute_strpos() against a product title, returning a Python list containing the positions of all of the attributes identified in the text. That’s done with the annotate() function below, which loops over each entry in the attributes dictionary and appends any labels returned to the tags list.

def annotate(text, attributes):
    tags = []

    for label, values in attributes.items():

        attribute = attribute_strpos(values, text, label)
        if attribute:
            tags.append(attribute)

    return tags

attributes = {
    'brand': brands,
    'range': ranges,
    'cores': cores,
    'threads': threads,
    'cache': cache,
    'clock_speed': clock_speed,
    'socket': socket,
    'packaging': packaging,
}
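Before running these over the whole dataframe, it’s worth checking they behave as expected on a single title with a trimmed-down attributes dictionary. A quick, self-contained check (the functions are repeated so the snippet runs on its own):

```python
def attribute_strpos(attributes, text, label):
    # Find the first matching attribute and return its [start, end, label] span
    attribute = next((x for x in attributes if x in text), '')
    if attribute:
        start = text.find(attribute)
        return [start, start + len(attribute), label]
    return None

def annotate(text, attributes):
    # Collect a span for every attribute list that matches the text
    tags = []
    for label, values in attributes.items():
        span = attribute_strpos(values, text, label)
        if span:
            tags.append(span)
    return tags

attributes = {'brand': ['amd', 'intel'], 'socket': ['am4']}
print(annotate('amd ryzen 7 1700x am4', attributes))
# [[0, 3, 'brand'], [18, 21, 'socket']]
```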

Annotate data

Finally, we can create a lambda function to run the annotate() function on each product. This runs very quickly and returns a column of labels containing the attributes found and their locations within each product name.

df['labels'] = df.apply(lambda x: annotate(x['product_title'], attributes), axis=1)
df.head()
product_title brand range cores threads cache socket packaging clock_speed labels
0 amd ryzen 7 eight core 1700x 3.80ghz socket am... amd ryzen 7 am4 3.80ghz [[0, 3, brand], [4, 11, range], [29, 36, clock...
1 amd ryzen 7 1700x 8 core am4 cpu/processor amd ryzen 7 8 core am4 [[0, 3, brand], [4, 11, range], [18, 24, cores...
2 amd ryzen 7 1700x 3.4ghz 16mb l3 processor amd ryzen 7 16mb 3.4ghz [[0, 3, brand], [4, 11, range], [25, 29, cache...
3 amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh... amd ryzen 7 8 core 16 thread 3.8ghz [[0, 3, brand], [4, 11, range], [23, 29, cores...
4 open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ... amd ryzen 7 8 core box 3.8 ghz [[9, 12, brand], [13, 20, range], [35, 41, cor...

Export as JSON

Now we’ve got our data, all we need to do is export it. Doccano requires the id, text, meta, annotation_approver, and the labels, so we’ll first reformat the dataframe so it’s got the right fields. The Pandas rename function is perfect for this.

df.rename(columns={'product_title': 'text'}, inplace=True)
df['meta'] = [{}] * len(df)
df['annotation_approver'] = None
df[['text','meta','annotation_approver','labels']].head()
text meta annotation_approver labels
0 amd ryzen 7 eight core 1700x 3.80ghz socket am... [[0, 3, brand], [4, 11, range], [29, 36, clock...
1 amd ryzen 7 1700x 8 core am4 cpu/processor [[0, 3, brand], [4, 11, range], [18, 24, cores...
2 amd ryzen 7 1700x 3.4ghz 16mb l3 processor [[0, 3, brand], [4, 11, range], [25, 29, cache...
3 amd ryzen 7 1700x 95 w 8 core/16 threads 3.8gh... [[0, 3, brand], [4, 11, range], [23, 29, cores...
4 open box amd ryzen 7 1700x 3.8 ghz 8 core 95w ... [[9, 12, brand], [13, 20, range], [35, 41, cor...

The trickiest bit is getting the JSON format right. You can export the contents of the dataframe as JSON using the Pandas to_json() function, but the default output isn’t compatible with Doccano. To make it work you need to pass orient='records' and set lines=True, which writes one JSON object per line in JSONL format. Output the JSON to a file so you can upload it.

df[['text', 'meta', 'annotation_approver', 'labels']].to_json('output.json', orient='records', lines=True)
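It’s worth reading the file back to confirm that each line parses as standalone JSON, which is what Doccano expects. A sketch, using a hypothetical two-row frame in place of the real dataframe:

```python
import json
import pandas as pd

# Hypothetical stand-in for the real dataframe
df = pd.DataFrame({
    'text': ['amd ryzen 7 1700x', 'intel core i7 7700'],
    'meta': [{}, {}],
    'annotation_approver': [None, None],
    'labels': [[[0, 3, 'brand']], [[0, 5, 'brand']]],
})

df.to_json('output.json', orient='records', lines=True)

# Each line of a JSONL file should parse as its own JSON object
with open('output.json') as f:
    records = [json.loads(line) for line in f]

print(records[0]['labels'])  # [[0, 3, 'brand']]
```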

Import into Doccano

Finally, all you need to do is login to Doccano, create a new project and use the import tool to upload your JSON. Doccano will automatically extract any labels from your JSON and create them. It even creates keyboard shortcuts for you.

Doccano from JSON

If you check the statistics tab, you should see that most of your records have now been annotated with the common terms you identified. All your human annotators need to do now is go through and add anything missing, correct any issues, and approve the annotations. For my dataset, this saved days of work.

Doccano stats

Matt Clarke, Thursday, March 04, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.