Tokenization is a data science technique that breaks the text in a sentence up into a list of distinct words or tokens. It’s a crucial first step in preprocessing text data for Natural Language Processing or NLP.
Before you can run most NLP machine learning techniques, you’ll usually need to tokenize your data. In this quick project, I’ll show you how you can use Python’s Natural Language Toolkit (NLTK) to take text data from a Pandas dataframe and return a tokenized list of words using the `punkt` tokenizer.
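To make the idea concrete before we bring in NLTK, here’s a naive tokenizer sketched in a few lines of plain Python using a regular expression. This is only an illustration with a made-up sentence; NLTK’s `punkt`-based tokenizer, used later in this project, handles far more edge cases such as contractions and abbreviations.

```python
import re

sentence = "Tokenization breaks text into distinct tokens."

# A naive tokenizer: grab runs of word characters, ignoring punctuation.
# This is only a sketch; NLTK's word_tokenize is far more robust.
tokens = re.findall(r"\w+", sentence)
print(tokens)  # ['Tokenization', 'breaks', 'text', 'into', 'distinct', 'tokens']
```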
To get started, open a Jupyter notebook and import the `pandas` and `nltk` packages. We’ll be using Pandas to load and manipulate our data, and the Natural Language Toolkit (NLTK) to perform the tokenization. If you don’t have `nltk` installed, you can install it by entering `pip3 install nltk` in your terminal.
```python
import pandas as pd
import nltk
```
Next, import your data into a Pandas dataframe. For demonstration purposes, I’m importing a dataset of titles and descriptions from the Practical Data Science website. Two columns, `title` and `description`, contain text that we can tokenize using NLTK and Python.
```python
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/titles_and_descriptions.csv')
df.head()
```
| | url | title | description |
|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | Learn how to create a virtual environment for ... |
| 1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | In time series datasets dates often hold the k... |
| 2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | Cleverly filling in the gaps when numeric data... |
| 3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | The confusion matrix can tell you more about y... |
| 4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | Learn how to use the mean encoding technique t... |
While not essential, when using NLP you’ll usually want to analyse all the available text rather than the text in a single column. We can merge the text in the two columns together using concatenation via the `+` operator. Adding `+ ' ' +` means that a space is inserted where the end of the `title` meets the beginning of the `description`, so we don’t accidentally invent new words.
```python
df['text'] = df['title'] + ' ' + df['description']
```
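A quick plain-Python illustration of why the space matters (the strings below are made up for the example):

```python
title = "How to interpret the confusion matrix"
description = "The confusion matrix can tell you more"

# Without a space, the title's last word fuses with the description's
# first word, inventing a token ('matrixThe') that doesn't exist
joined_badly = title + description
joined_well = title + ' ' + description

print('matrixThe' in joined_badly)  # True
print('matrixThe' in joined_well)   # False
```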
Next, we need to ensure that any `NaN` values have been dropped from the column and that we’re dealing only with string values. If you miss this step, NLTK will throw an error saying `TypeError: expected string or bytes-like object`.
```python
# Drop rows where the text column is NaN, then cast the rest to strings.
# Using dropna(subset=...) on the dataframe itself ensures the rows are
# actually removed; calling dropna() on the column alone won't update df.
df.dropna(subset=['text'], inplace=True)
df['text'] = df['text'].astype(str)
df.head()
```
| | url | title | description | text |
|---|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | Learn how to create a virtual environment for ... | How to create a Python virtual environment for... |
| 1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | In time series datasets dates often hold the k... | How to engineer date features using Pandas In ... |
| 2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | Cleverly filling in the gaps when numeric data... | How to impute missing numeric values in your d... |
| 3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | The confusion matrix can tell you more about y... | How to interpret the confusion matrix The conf... |
| 4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | Learn how to use the mean encoding technique t... | How to use mean encoding in your machine learn... |
Finally, we can use NLTK to create our tokenizer function. The command `nltk.download('punkt')` will fire up the NLTK downloader and tell it to install the `punkt` data, a pre-trained tokenizer model that NLTK uses to break text up into individual values. The trailing semicolon simply suppresses the cell’s return value in Jupyter.
```python
nltk.download('punkt');
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
```
To make it easier to apply tokenization to our Pandas dataframe column, and to allow us to re-use the function in any other NLP projects we might tackle later, we’ll make a little function. This takes the text from a Pandas column and returns the list of tokens produced by `word_tokenize`. The list comprehension then uses `isalpha()` to keep only purely alphabetic tokens, discarding punctuation and numbers.
```python
def tokenize(column):
    """Tokenizes a Pandas dataframe column and returns a list of tokens.

    Args:
        column: Pandas dataframe column (i.e. df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. [Donald, Trump, tweets]
    """
    tokens = nltk.word_tokenize(column)
    return [w for w in tokens if w.isalpha()]
```
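To see exactly what the `isalpha()` filter does, here’s a minimal sketch using a hand-written token list standing in for `word_tokenize` output (the tokens below are made up for illustration):

```python
# Hypothetical tokens, standing in for word_tokenize output
tokens = ["How", "to", "interpret", "the", "confusion", "matrix",
          ".", "F1", "(", "2022", ")"]

# Keep only purely alphabetic tokens: punctuation and numbers are dropped,
# and so are mixed alphanumerics like "F1", since isalpha() is strict
words = [w for w in tokens if w.isalpha()]
print(words)  # ['How', 'to', 'interpret', 'the', 'confusion', 'matrix']
```

Note that this strictness means tokens such as version numbers or model names containing digits are discarded too, which may or may not be what you want in your own project.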
The last thing we need to do is run our function on the Pandas `text` column, which we can do by applying a `lambda` function row by row (via `axis=1`). This passes each row’s `text` value into `tokenize`, uses NLTK to tokenize it, and stores the result in a new Pandas column called `tokenized` that contains a Python list of tokens for each row.
```python
df['tokenized'] = df.apply(lambda x: tokenize(x['text']), axis=1)
df[['tokenized']].head()
```
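Since the function only needs the value of the `text` column, the same result can also be obtained slightly more simply by applying it to the Series directly, avoiding the row-wise `axis=1` apply. A minimal sketch of the pattern, using a tiny made-up dataframe and `str.split` standing in for the NLTK tokenizer (so it runs without downloading `punkt`):

```python
import pandas as pd

# Tiny made-up dataframe standing in for the real dataset
df = pd.DataFrame({'text': ['How to interpret the confusion matrix',
                            'How to engineer date features using Pandas']})

# Applying to the Series passes each cell's string straight to the
# function, so no lambda or axis=1 is needed; str.split stands in
# for the tokenize function defined above
df['tokenized'] = df['text'].apply(str.split)
print(df['tokenized'][0])  # ['How', 'to', 'interpret', 'the', 'confusion', 'matrix']
```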
| | tokenized |
|---|---|
| 0 | [How, to, create, a, Python, virtual, environm... |
| 1 | [How, to, engineer, date, features, using, Pan... |
| 2 | [How, to, impute, missing, numeric, values, in... |
| 3 | [How, to, interpret, the, confusion, matrix, T... |
| 4 | [How, to, use, mean, encoding, in, your, machi... |
Matt Clarke, Monday, May 09, 2022