How to use NLTK for POS tagging in Pandas

Learn how to use NLTK for Part of Speech tagging in Pandas to analyse the text in a dataframe column, extract the POS tags, and pull out the nouns, verbs, and numbers. It's great for Python SEO analysis and for generating model features.


The Natural Language Toolkit (NLTK) is a powerful Python package for performing a wide range of common NLP tasks, including Part of Speech tagging or POS tagging for short.

In this example, we’ll load some text data in a Pandas dataframe and then use NLTK’s POS tagging feature to identify the word classes or lexical categories for each word in the string, so we can extract them, analyse them, or make model features from them. It’s a particularly useful tool in Python SEO projects.

How does Part of Speech tagging work?

Part of Speech tagging is an NLP process that takes a string of text and returns a structured response identifying the word class, or lexical or grammatical category, of each word in the string. Several NLP packages are now capable of POS tagging, so the process is quite simple and robust.

In NLTK, POS tagging is powered by the Averaged Perceptron Tagger model, which is a port of a module from the TextBlob package. This model was pre-trained on Wall Street Journal text and is able to identify whether a particular element is a verb, noun, adjective, adverb, pronoun, preposition, conjunction, numeral, interjection, determiner, or article. Such models are language-specific, so you’ll need to load one that works for text in the language you intend to analyse.
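For example, NLTK also bundles a Russian tagger that you can select via the lang argument of pos_tag. Here’s a minimal sketch, assuming the averaged_perceptron_tagger_ru resource name matches your NLTK version:

import nltk

# Assumption: resource names can differ between NLTK versions.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_ru')

# pos_tag accepts a lang argument; 'rus' selects the Russian model.
tokens = nltk.word_tokenize("Мама мыла раму")
print(nltk.pos_tag(tokens, lang='rus'))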

As with other NLP techniques, some text preprocessing is required for Part of Speech tagging to work correctly. The first step is to tokenize the data: take a string such as “Noel is the most talented Gallagher brother” and convert it into a Python list of individual elements or “tokens”, giving ['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother'].

string = "Noel is the most talented Gallagher brother"
tokenized = nltk.word_tokenize(string.lower())
tokenized
['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother']

Once that’s done, it’s simply a case of passing the tokenized data to the pos_tag() method, which uses the Averaged Perceptron Tagger and returns a structured object containing the POS tag for each token in the list. Here’s a really simple example. For each token, we get back a tuple containing the word and its POS tag code.

In our simple example, NN means a singular noun (such as “noel”, “gallagher”, or “brother”); VBZ means a verb in the third person singular present tense (such as “is”); DT means a determiner (such as “the”); RBS means a superlative adverb (such as “most”); and JJ means an adjective (such as “talented”). There are, of course, many other Part of Speech tags.

tagged = nltk.pos_tag(tokenized)
tagged
[('noel', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('talented', 'JJ'),
 ('gallagher', 'NN'),
 ('brother', 'NN')]
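As an aside, if the fine-grained Penn Treebank tags are more detail than you need, pos_tag can also map them onto NLTK’s simplified universal tagset (NOUN, VERB, ADJ, and so on). A minimal sketch, assuming the universal_tagset resource has been downloaded:

import nltk

# The coarse-grained mapping lives in the 'universal_tagset' resource.
nltk.download('universal_tagset')

tokenized = nltk.word_tokenize("Noel is the most talented Gallagher brother".lower())

# tagset='universal' collapses fine-grained tags into coarse classes,
# e.g. NN, NNS, and NNP all become NOUN.
print(nltk.pos_tag(tokenized, tagset='universal'))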

NLTK Part of Speech tag list

NLTK has a wide range of Part of Speech tags. Here’s a full list of NLTK POS tags and their meanings so you can better interpret the part of speech for each token in your text data.

POS Tag  Meaning                                    Example
CC       Coordinating conjunction                   and
CD       Cardinal number                            one, two
DT       Determiner                                 the, a
EX       Existential there                          there
FW       Foreign word                               dolce
IN       Preposition or subordinating conjunction   of, in, by
JJ       Adjective                                  big
JJR      Adjective, comparative                     bigger
JJS      Adjective, superlative                     biggest
LS       List item marker                           1)
MD       Modal                                      could, will
NN       Noun, singular or mass                     turnip, badger
NNS      Noun, plural                               turnips, badgers
NNP      Proper noun, singular                      Edinburgh
NNPS     Proper noun, plural                        Smiths
PDT      Predeterminer                              all, both
POS      Possessive ending                          's
PRP      Personal pronoun                           I, you, he
PRP$     Possessive pronoun                         my, your, his
RB       Adverb                                     quickly
RBR      Adverb, comparative                        more quickly
RBS      Adverb, superlative                        most quickly
RP       Particle                                   up, off
TO       Infinitive marker                          to
UH       Interjection                               oh, oops
VB       Verb, base form                            take
VBD      Verb, past tense                           took
VBG      Verb, gerund or present participle         taking
VBN      Verb, past participle                      taken
VBP      Verb, non-3rd person singular present      take
VBZ      Verb, 3rd person singular present          takes
WDT      Wh-determiner                              which, that, what
WP       Wh-pronoun                                 what, who
WP$      Possessive wh-pronoun                      whose
WRB      Wh-adverb                                  how, where, when
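You don’t need to memorise this table: NLTK can print the official definition and examples for any tag via its built-in help, assuming the tagsets resource is available in your version:

import nltk

# The tag documentation is stored in the 'tagsets' resource.
nltk.download('tagsets')

# Print the definition and examples for a single tag...
nltk.help.upenn_tagset('RBS')

# ...or pass a regular expression to describe a whole family of tags.
nltk.help.upenn_tagset('NN.*')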

Import the packages

In this project we’ll load a Pandas dataframe, apply Part of Speech tagging to a column of text using NLTK, and then extract specific POS tags based on their type, so we can better understand the dataset. To get started, open a Jupyter notebook and import the Pandas and NLTK packages. You may need to install NLTK via pip first.

import pandas as pd
import nltk

# Widen the Pandas display settings so full column values are shown
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

NLTK uses a range of underlying packages, models, and datasets that you will need to load in order to complete the next steps. We’ll be using punkt and the averaged_perceptron_tagger model, so you’ll first need to download these by executing the commands below.

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
True

Load the data

Next, we need to load a dataset into Pandas. I’m using a dummy dataset I created that is based on meta titles and meta descriptions from this website. You can load it remotely via my GitHub account. For POS tagging to work, you’ll need to fill any NaN values with an empty string, otherwise you’ll get an error.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/titles_and_descriptions.csv')
df = df[['title']].fillna('')
df.head()
title
0 How to create a Python virtual environment for Jupyter
1 How to engineer date features using Pandas
2 How to impute missing numeric values in your dataset
3 How to interpret the confusion matrix
4 How to use mean encoding in your machine learning models

Tokenize the data

Before you can apply any NLP techniques you will need to tokenize the data, converting the original text string stored in the Pandas column into a Python list of tokens. We’ll create a column called tokenized to store the output, convert the string to lowercase using str.lower(), and then use the apply() function to run nltk.word_tokenize() on each row.

df['tokenized'] = df['title'].str.lower().apply(nltk.word_tokenize)
df.head()
title tokenized
0 How to create a Python virtual environment for Jupyter [how, to, create, a, python, virtual, environment, for, jupyter]
1 How to engineer date features using Pandas [how, to, engineer, date, features, using, pandas]
2 How to impute missing numeric values in your dataset [how, to, impute, missing, numeric, values, in, your, dataset]
3 How to interpret the confusion matrix [how, to, interpret, the, confusion, matrix]
4 How to use mean encoding in your machine learning models [how, to, use, mean, encoding, in, your, machine, learning, models]

Apply POS tagging to the tokenized data

To apply POS tagging to our tokenized data we can follow a similar process. We’ll create a new column called tagged, then use the apply() function to run the nltk.pos_tag method on each row of tokenized data. It returns a list of tuples containing each word and its POS tag.

df['tagged'] = df['tokenized'].apply(nltk.pos_tag)
df[['tagged']].head()
tagged
0 [(how, WRB), (to, TO), (create, VB), (a, DT), (python, JJ), (virtual, JJ), (environment, NN), (for, IN), (jupyter, NN)]
1 [(how, WRB), (to, TO), (engineer, VB), (date, NN), (features, NNS), (using, VBG), (pandas, NNS)]
2 [(how, WRB), (to, TO), (impute, VB), (missing, VBG), (numeric, JJ), (values, NNS), (in, IN), (your, PRP$), (dataset, NN)]
3 [(how, WRB), (to, TO), (interpret, VB), (the, DT), (confusion, NN), (matrix, NN)]
4 [(how, WRB), (to, TO), (use, VB), (mean, JJ), (encoding, VBG), (in, IN), (your, PRP$), (machine, NN), (learning, NN), (models, NNS)]

Extract the nouns from the text

The useful part of POS tagging comes when you extract words with specific POS tag types. To show how this works, let’s create a new column called nouns and run it on the tagged column that contains the POS tagged data we just created. We’ll use a lambda function to loop over the words and tags to return only those that are nouns.

In NLTK nouns can be tagged with any of four different values: NN for singular nouns, NNS for plural nouns, NNP for singular proper nouns, and NNPS for plural proper nouns. When you run this, NLTK will extract the nouns and put them in the new column.

df['nouns'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['NN', 'NNS', 'NNP', 'NNPS']])
df[['title', 'nouns']].head()
title nouns
0 How to create a Python virtual environment for Jupyter [environment, jupyter]
1 How to engineer date features using Pandas [date, features, pandas]
2 How to impute missing numeric values in your dataset [values, dataset]
3 How to interpret the confusion matrix [confusion, matrix]
4 How to use mean encoding in your machine learning models [machine, learning, models]

Extract the verbs from the text

We can use the same lambda function approach on the POS tagged data to identify the verbs in each piece of text. In NLTK verbs can be tagged as: VB for base form verbs, VBD for past tense verbs, VBG for gerunds or present participles, VBN for past participles, VBP for non-3rd person singular present tense verbs, and VBZ for 3rd person singular present tense verbs.

df['verbs'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']])
df[['title', 'verbs']].head()
title verbs
0 How to create a Python virtual environment for Jupyter [create]
1 How to engineer date features using Pandas [engineer, using]
2 How to impute missing numeric values in your dataset [impute, missing]
3 How to interpret the confusion matrix [interpret]
4 How to use mean encoding in your machine learning models [use, encoding]
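Since the nouns and verbs steps differ only in the list of tags, you could optionally wrap the pattern in a small reusable helper. This extract_pos function is a hypothetical convenience of my own, not part of NLTK or Pandas:

# Hypothetical helper: keep the words whose POS tag is in a given set.
def extract_pos(tagged, tags):
    """Return the words from a list of (word, tag) tuples whose tag is in tags."""
    return [word for word, tag in tagged if tag in tags]

# The same call now works for nouns, verbs, or any other tag family.
df['nouns'] = df['tagged'].apply(extract_pos, tags={'NN', 'NNS', 'NNP', 'NNPS'})
df['verbs'] = df['tagged'].apply(extract_pos, tags={'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'})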

Extract the numbers or cardinal digits from the text

Another extremely useful feature of POS tagging is the ability to extract numbers or “cardinal numbers” (tagged CD) from text. In machine learning these often make powerful features, so they’re often worth extracting and storing. The neat thing about using NLTK for this is that it even works on text representations of numbers, so “three” and “four” are both identified as cardinal numbers, even though they’re non-numeric.

df['cardinal_digits'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['CD']])
df[['title', 'cardinal_digits']].sort_values(by='cardinal_digits', ascending=False).head()
title cardinal_digits
175 How to scrape Google results in three lines of Python code [three]
93 How to write better code using DRY and Do One Thing [one]
114 How to preprocess text for NLP in four easy steps [four]
81 The four Python data science libraries you need to learn [four]
27 Dell Precision 7750 mobile data science workstation review [7750]
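To turn these extractions into the model features mentioned at the start, one simple option is to count how many words of each class appear in each title. Here’s a quick sketch building on the columns created above; the count column names are my own:

# Hypothetical feature columns: the number of words of each class per title.
# Series.str.len() returns the length of each list in an object column.
df['noun_count'] = df['nouns'].str.len()
df['verb_count'] = df['verbs'].str.len()
df['digit_count'] = df['cardinal_digits'].str.len()

df[['title', 'noun_count', 'verb_count', 'digit_count']].head()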

Matt Clarke, Tuesday, September 27, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.