Tokenization is a data science technique that breaks the text in a sentence up into a list of distinct words or tokens. It’s a crucial first step in preprocessing text data for Natural Language Processing or NLP.
Before you can run most NLP machine learning techniques, you’ll usually need to tokenize your data. In this quick project, I’ll show you how you can use Python’s Natural Language Toolkit (NLTK) to take text data from a Pandas dataframe and return a tokenized list of words using the `punkt` tokenizer.
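To make the idea concrete before we bring in NLTK, here’s a naive tokenizer sketched in a few lines of plain Python using a regular expression. This is only an illustration with a made-up sentence; NLTK’s `punkt`-based tokenizer, used later in this project, handles far more edge cases such as contractions and abbreviations.

```python
import re

sentence = "Tokenization breaks text into distinct tokens."

# A naive tokenizer: grab runs of word characters, ignoring punctuation.
# This is only a sketch; NLTK's word_tokenize is far more robust.
tokens = re.findall(r"\w+", sentence)
print(tokens)  # ['Tokenization', 'breaks', 'text', 'into', 'distinct', 'tokens']
```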
To get started, open a Jupyter notebook and import the `pandas` and `nltk` packages. We’ll be using Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization. If you don’t have `nltk` installed, you can install it by entering `pip3 install nltk` in your terminal.
```python
import pandas as pd
import nltk
```
Next, import your data into a Pandas dataframe. For demonstration purposes, I’m importing a dataset of titles and descriptions from the Practical Data Science website. Two columns, `title` and `description`, contain text that we can tokenize using NLTK and Python.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/titles_and_descriptions.csv')
df.head()
```
| | url | title | description |
|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | Learn how to create a virtual environment for ... |
| 1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | In time series datasets dates often hold the k... |
| 2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | Cleverly filling in the gaps when numeric data... |
| 3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | The confusion matrix can tell you more about y... |
| 4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | Learn how to use the mean encoding technique t... |
While not essential, when using NLP you’ll usually want to analyse all the available text rather than the text in a single column. We can merge the text in the two columns together using concatenation via the `+` operator. Adding `+ ' ' +` means that a space is inserted where the end of the `title` meets the beginning of the `description`, so we don’t accidentally invent new words.
```python
df['text'] = df['title'] + ' ' + df['description']
```
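A quick plain-Python illustration of why the space matters (the strings below are made up for the example):

```python
title = "How to interpret the confusion matrix"
description = "The confusion matrix can tell you more"

# Without a space, the title's last word fuses with the description's
# first word, inventing a token ('matrixThe') that doesn't exist
joined_badly = title + description
joined_well = title + ' ' + description

print('matrixThe' in joined_badly)  # True
print('matrixThe' in joined_well)   # False
```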
Next, we need to ensure that any `NaN` values have been dropped from the column and that we’re dealing only with string values. If you miss this step, NLTK will throw an error saying `TypeError: expected string or bytes-like object`.
```python
# Drop rows where the text column is NaN, then cast the rest to strings.
# Using dropna(subset=...) on the dataframe itself ensures the rows are
# actually removed; calling dropna() on the column alone won't update df.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
```
| | url | title | description | text |
|---|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | Learn how to create a virtual environment for ... | How to create a Python virtual environment for... |
| 1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | In time series datasets dates often hold the k... | How to engineer date features using Pandas In ... |
| 2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | Cleverly filling in the gaps when numeric data... | How to impute missing numeric values in your d... |
| 3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | The confusion matrix can tell you more about y... | How to interpret the confusion matrix The conf... |
| 4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | Learn how to use the mean encoding technique t... | How to use mean encoding in your machine learn... |
Finally, we can use NLTK to create our tokenizer function. The command `nltk.download('punkt')` will fire up the NLTK downloader and tell it to install the `punkt` data, a pre-trained tokenizer model that NLTK uses to break text up into individual values. The trailing semicolon simply suppresses the cell’s return value in Jupyter.
```python
nltk.download('punkt');
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
```
To make it easier to apply tokenization to our Pandas dataframe column, and to allow us to re-use the function in any other NLP projects we might tackle later, we’ll make a little function. This takes the text from a Pandas column and returns the list of tokens produced by `word_tokenize`. The list comprehension then uses `isalpha()` to keep only purely alphabetic tokens, discarding punctuation and numbers.
```python
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
```
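To see exactly what the `isalpha()` filter does, here’s a minimal sketch using a hand-written token list standing in for `word_tokenize` output (the tokens below are made up for illustration):

```python
# Hypothetical tokens, standing in for word_tokenize output
tokens = ["How", "to", "interpret", "the", "confusion", "matrix",
          ".", "F1", "(", "2022", ")"]

# Keep only purely alphabetic tokens: punctuation and numbers are dropped,
# and so are mixed alphanumerics like "F1", since isalpha() is strict
words = [w for w in tokens if w.isalpha()]
print(words)  # ['How', 'to', 'interpret', 'the', 'confusion', 'matrix']
```

Note that this strictness means tokens such as version numbers or model names containing digits are discarded too, which may or may not be what you want in your own project.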
The last thing we need to do is run our function on the Pandas `text` column, which we can do by applying a `lambda` function row by row (via `axis=1`). This passes each row’s `text` value into `tokenize`, uses NLTK to tokenize it, and stores the result in a new Pandas column called `tokenized` that contains a Python list of tokens for each row.
```python
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
```
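Since the function only needs the value of the `text` column, the same result can also be obtained slightly more simply by applying it to the Series directly, avoiding the row-wise `axis=1` apply. A minimal sketch of the pattern, using a tiny made-up dataframe and `str.split` standing in for the NLTK tokenizer (so it runs without downloading `punkt`):

```python
import pandas as pd

# Tiny made-up dataframe standing in for the real dataset
df = pd.DataFrame({'text': ['How to interpret the confusion matrix',
                            'How to engineer date features using Pandas']})

# Applying to the Series passes each cell's string straight to the
# function, so no lambda or axis=1 is needed; str.split stands in
# for the tokenize function defined above
df['tokenized'] = df['text'].apply(str.split)
print(df['tokenized'][0])  # ['How', 'to', 'interpret', 'the', 'confusion', 'matrix']
```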
| | tokenized |
|---|---|
| 0 | [How, to, create, a, Python, virtual, environm... |
| 1 | [How, to, engineer, date, features, using, Pan... |
| 2 | [How, to, impute, missing, numeric, values, in... |
| 3 | [How, to, interpret, the, confusion, matrix, T... |
| 4 | [How, to, use, mean, encoding, in, your, machi... |
Matt Clarke, Monday, May 09, 2022