How to use NLTK for POS tagging in Pandas

The Natural Language Toolkit (NLTK) is a powerful Python package for performing a wide range of common NLP tasks, including Part of Speech tagging or POS tagging for short.

In this example, we’ll load some text data in a Pandas dataframe and then use NLTK’s POS tagging feature to identify the word classes or lexical categories for each word in the string, so we can extract them, analyse them, or make model features from them. It’s a particularly useful tool in Python SEO projects.

How does Part of Speech tagging work?

Part of Speech tagging is an NLP process that takes a string of text and then returns a structured response that identifies the word class, or lexical or grammatical category for each word in the string. Several NLP packages are now capable of POS tagging, so the process is now quite simple and robust.

In NLTK, POS tagging is powered by the Averaged Perceptron Tagger model, which is a port of a module from the TextBlob package. This model was pre-trained on text from the Wall Street Journal and can identify whether a particular token is a verb, noun, adjective, adverb, pronoun, preposition, conjunction, numeral, interjection, or determiner (a class that includes articles). Such models are language-specific, so you’ll need to load one that works for text in the language you intend to analyse.

As with other NLP techniques, some text preprocessing is required for Part of Speech tagging to work correctly. The first step is to tokenize the data, which converts a string such as “Noel is the most talented Gallagher brother” into a Python list of individual elements or “tokens”: ['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother'].

import nltk

string = "Noel is the most talented Gallagher brother"
tokenized = nltk.word_tokenize(string.lower())

tokenized

['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother']

Once that’s done, it’s simply a case of passing the tokenized data to the pos_tag function, which uses the Averaged Perceptron Tagger and returns a list containing the POS tag for each token. For each token, we get back a tuple containing the word and its POS tag code.

In our simple example NN means a singular noun (such as “noel”, “gallagher”, or “brother”); VBZ means a verb in the third person singular present tense (such as “is”); DT means determiner (such as “the”); RBS means superlative adverb (such as “most”), and JJ means adjective (such as “talented”). There are, of course, many other Part of Speech tags.

tagged = nltk.pos_tag(tokenized)

tagged

[('noel', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('talented', 'JJ'),
 ('gallagher', 'NN'),
 ('brother', 'NN')]

NLTK’s default English tagger uses the Penn Treebank tag set. Here’s a list of its word-level POS tags and their meanings (punctuation and symbol tags are omitted) so you can better interpret the part of speech for each token in your text data.

POS Tag	Meaning	Example
CC	Coordinating conjunction	and
CD	Cardinal number	one, two
DT	Determiner	the, a
EX	Existential there	there
FW	Foreign word	dolce
IN	Preposition or subordinating conjunction	of, in, by
JJ	Adjective	big
JJR	Adjective, comparative	bigger
JJS	Adjective, superlative	biggest
LS	List item marker	1)
MD	Modal	could, will
NN	Noun, singular or mass	turnip, badger
NNS	Noun, plural	turnips, badgers
NNP	Proper noun, singular	Edinburgh
NNPS	Proper noun, plural	Smiths
PDT	Predeterminer	all, both
POS	Possessive ending	's
PRP	Personal pronoun	I, you, he
PRP$	Possessive pronoun	my, your, his
RB	Adverb	quickly
RBR	Adverb, comparative	more quickly
RBS	Adverb, superlative	most quickly
RP	Particle	up, off
TO	Infinitive marker	to
UH	Interjection	oh, oops
VB	Verb, base form	take
VBD	Verb, past tense	took
VBG	Verb, gerund or present participle	taking
VBN	Verb, past participle	taken
VBP	Verb, non-3rd person singular present	take
VBZ	Verb, 3rd person singular present	takes
WDT	Wh-determiner	which, that, what
WP	Wh-pronoun	what, who
WP$	Possessive wh-pronoun	whose
WRB	Wh-adverb	how, where, when
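
If you want the meanings alongside the codes in your own output, one option is to keep part of the table above in a dictionary and look each tag up. The sketch below is only an illustration: the TAG_MEANINGS dictionary and the describe() helper are hypothetical names, not part of NLTK, and the dictionary covers just the five tags from the earlier example.

```python
# A small subset of the tag table above, keyed by tag code (illustrative only)
TAG_MEANINGS = {
    "NN": "Noun, singular or mass",
    "VBZ": "Verb, 3rd person singular present",
    "DT": "Determiner",
    "RBS": "Adverb, superlative",
    "JJ": "Adjective",
}

# Tagged output copied from the earlier example
tagged = [
    ("noel", "NN"),
    ("is", "VBZ"),
    ("the", "DT"),
    ("most", "RBS"),
    ("talented", "JJ"),
    ("gallagher", "NN"),
    ("brother", "NN"),
]


def describe(tagged_tokens, meanings=TAG_MEANINGS):
    """Return (word, tag, meaning) triples, flagging tags missing from the lookup."""
    return [(word, tag, meanings.get(tag, "Unknown tag")) for word, tag in tagged_tokens]


described = describe(tagged)
for word, tag, meaning in described:
    print(f"{word:<10} {tag:<4} {meaning}")
```

Extending the dictionary with the remaining rows of the table would let the same helper describe any tagged text.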

Import the packages

In this project we’ll load a Pandas dataframe, use NLTK to apply Part of Speech tagging to a column of text, and then extract specific POS tags based on their type, so we can better understand the dataset. To get started, open a Jupyter notebook and import the Pandas and NLTK packages. You may need to install NLTK via pip first.

import pandas as pd
import nltk

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

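NLTK uses a range of underlying packages, models, and datasets that you will need to load in order to complete the next steps. We’ll be using the punkt tokenizer and the averaged_perceptron_tagger model, so you’ll first need to download these by executing the commands below. Depending on your NLTK version, the resource names may differ: recent releases use punkt_tab and averaged_perceptron_tagger_eng. If you see a LookupError, download the resource named in the error message.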

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

True

Load the data

Next, we need to load a dataset into Pandas. I’m using a dummy dataset I created that is based on meta titles and meta descriptions from this website. You can load it remotely via my GitHub account. For POS tagging to work, you’ll need to fill any NaN values with an empty string, otherwise you’ll get an error.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/titles_and_descriptions.csv')
df = df[['title']].fillna('')

df.head()

	title
0	How to create a Python virtual environment for Jupyter
1	How to engineer date features using Pandas
2	How to impute missing numeric values in your dataset
3	How to interpret the confusion matrix
4	How to use mean encoding in your machine learning models
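
To see what the fillna('') step does, here is a minimal sketch on a tiny made-up dataframe (not the real dataset) in which one title is deliberately missing. After filling, every value in the column is a string, which is what the tokenizer expects.

```python
import pandas as pd

# Tiny illustrative frame; the second title is deliberately missing
df = pd.DataFrame({"title": ["How to use NLTK for POS tagging", None]})

# Count the missing titles before filling them
n_missing = int(df["title"].isna().sum())

# Replace missing values with empty strings so every row holds a str
df["title"] = df["title"].fillna("")

# Confirm that nothing other than strings is left in the column
all_strings = all(isinstance(value, str) for value in df["title"])
print(n_missing, all_strings)
```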

Tokenize the data

Before you can apply any NLP techniques to the data you will need to tokenize it, converting the original text string in each row to a Python list of tokens. We’ll convert the string to lowercase using str.lower(), use the apply() function to run the nltk.word_tokenize() function on each row, and store the result in a new column called tokenized.

df['tokenized'] = df['title'].str.lower().apply(nltk.word_tokenize)

df.head()

	title	tokenized
0	How to create a Python virtual environment for Jupyter	[how, to, create, a, python, virtual, environment, for, jupyter]
1	How to engineer date features using Pandas	[how, to, engineer, date, features, using, pandas]
2	How to impute missing numeric values in your dataset	[how, to, impute, missing, numeric, values, in, your, dataset]
3	How to interpret the confusion matrix	[how, to, interpret, the, confusion, matrix]
4	How to use mean encoding in your machine learning models	[how, to, use, mean, encoding, in, your, machine, learning, models]

Apply POS tagging to the tokenized data

To apply POS tagging to our tokenized data we can follow a similar process. We’ll create a new column called tagged, and we’ll then use the apply() function to run the nltk.pos_tag function on each row of tokenized data. For each row, it will return a list of tuples that contain the word and its POS tag.

df['tagged'] = df['tokenized'].apply(nltk.pos_tag)

df[['tagged']].head()

	tagged
0	[(how, WRB), (to, TO), (create, VB), (a, DT), (python, JJ), (virtual, JJ), (environment, NN), (for, IN), (jupyter, NN)]
1	[(how, WRB), (to, TO), (engineer, VB), (date, NN), (features, NNS), (using, VBG), (pandas, NNS)]
2	[(how, WRB), (to, TO), (impute, VB), (missing, VBG), (numeric, JJ), (values, NNS), (in, IN), (your, PRP$), (dataset, NN)]
3	[(how, WRB), (to, TO), (interpret, VB), (the, DT), (confusion, NN), (matrix, NN)]
4	[(how, WRB), (to, TO), (use, VB), (mean, JJ), (encoding, VBG), (in, IN), (your, PRP$), (machine, NN), (learning, NN), (models, NNS)]
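
A list of tuples in each cell is awkward to aggregate, so for analysis it can help to reshape the tagged data into one row per word. The sketch below assumes a dataframe shaped like the one above and rebuilds two of its rows by hand so that it runs without NLTK; the long_df and tag_counts names are just for illustration.

```python
import pandas as pd

# Two rows of tagged output copied from the table above
df = pd.DataFrame({
    "title": [
        "How to engineer date features using Pandas",
        "How to interpret the confusion matrix",
    ],
    "tagged": [
        [("how", "WRB"), ("to", "TO"), ("engineer", "VB"), ("date", "NN"),
         ("features", "NNS"), ("using", "VBG"), ("pandas", "NNS")],
        [("how", "WRB"), ("to", "TO"), ("interpret", "VB"), ("the", "DT"),
         ("confusion", "NN"), ("matrix", "NN")],
    ],
})

# Build one record per (word, tag) pair, keeping the title it came from
rows = [
    {"title": title, "word": word, "tag": tag}
    for title, tagged in zip(df["title"], df["tagged"])
    for word, tag in tagged
]
long_df = pd.DataFrame(rows)

# Count how often each POS tag appears across all titles
tag_counts = long_df["tag"].value_counts()
print(tag_counts)
```

In this long format, ordinary Pandas operations such as groupby() and value_counts() work directly on the word and tag columns.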

Extract the nouns from the text

The useful part of POS tagging comes when you extract words with specific POS tag types. To show how this works, let’s create a new column called nouns from the tagged column that contains the POS tagged data we just created. We’ll use a lambda function containing a list comprehension to loop over the words and tags and return only those words that are nouns.

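In NLTK nouns can be tagged with any of four different values: NN for singular nouns, NNS for plural nouns, NNP for singular proper nouns, and NNPS for plural proper nouns. When you run this, the nouns are extracted and stored in the new column. Note that because we lowercased the titles before tagging, names such as “python” and “jupyter” were not tagged as proper nouns above (“python” was tagged JJ), so some names may be missing from the results.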

df['nouns'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['NN', 'NNS', 'NNP', 'NNPS']])

df[['title', 'nouns']].head()

	title	nouns
0	How to create a Python virtual environment for Jupyter	[environment, jupyter]
1	How to engineer date features using Pandas	[date, features, pandas]
2	How to impute missing numeric values in your dataset	[values, dataset]
3	How to interpret the confusion matrix	[confusion, matrix]
4	How to use mean encoding in your machine learning models	[machine, learning, models]
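
Once the nouns are in a column of lists, a common next step is to count them across the whole dataset. This is a minimal sketch that assumes a nouns column like the one above; it recreates the five rows shown by hand so it runs on its own, and the noun_counts name is just for illustration.

```python
import pandas as pd

# The nouns column from the five rows shown above
df = pd.DataFrame({
    "nouns": [
        ["environment", "jupyter"],
        ["date", "features", "pandas"],
        ["values", "dataset"],
        ["confusion", "matrix"],
        ["machine", "learning", "models"],
    ]
})

# explode() gives each noun its own row, value_counts() tallies them
noun_counts = df["nouns"].explode().value_counts()
print(noun_counts.head())
```

With only five short titles every noun appears once, but on the full dataset the same two calls would surface the most frequently used nouns.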

Extract the verbs from the text

We can use the same lambda function approach on the POS tagging data to identify the verbs in each piece of text. In NLTK verbs can be tagged as: VB for base form verbs, VBD for past tense verbs, VBG for gerunds or present participles, VBN for past participles, VBP for present tense verbs that are not third person singular, and VBZ for present tense verbs in the third person singular.

df['verbs'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']])

df[['title', 'verbs']].head()

	title	verbs
0	How to create a Python virtual environment for Jupyter	[create]
1	How to engineer date features using Pandas	[engineer, using]
2	How to impute missing numeric values in your dataset	[impute, missing]
3	How to interpret the confusion matrix	[interpret]
4	How to use mean encoding in your machine learning models	[use, encoding]
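
Since all the noun tags start with NN and all the verb tags start with VB, the two lambda functions can be folded into one reusable helper that matches on tag prefixes. This is a sketch rather than an NLTK feature: words_with_tag_prefix is a hypothetical name, and the tagged list is copied from the last row of the tagged output shown earlier.

```python
def words_with_tag_prefix(tagged_tokens, prefixes):
    """Return the words whose POS tag starts with any of the given prefixes."""
    prefixes = tuple(prefixes)  # str.startswith() accepts a tuple of options
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]


# Tagged tokens for "How to use mean encoding in your machine learning models"
tagged = [
    ("how", "WRB"), ("to", "TO"), ("use", "VB"), ("mean", "JJ"),
    ("encoding", "VBG"), ("in", "IN"), ("your", "PRP$"),
    ("machine", "NN"), ("learning", "NN"), ("models", "NNS"),
]

verbs = words_with_tag_prefix(tagged, ["VB"])
nouns = words_with_tag_prefix(tagged, ["NN"])
print(verbs)
print(nouns)
```

On a dataframe this could be applied with something like df['tagged'].apply(words_with_tag_prefix, prefixes=['VB']). Prefix matching is broad by design, so 'RB' would also pick up RBR and RBS; pass the full tag codes when you need an exact match.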

Extract the cardinal numbers from the text

Another extremely useful feature of POS tagging is the ability to extract cardinal numbers, which NLTK tags as CD, from text. In machine learning these can make useful features, so they are often worth extracting and storing. The neat thing about using NLTK for this is that it even works on text representations of numbers, so “three” and “four” are both identified as cardinal numbers, even though they’re not written as digits.

df['cardinal_digits'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag == 'CD'])

df[['title', 'cardinal_digits']].sort_values(by='cardinal_digits', ascending=False).head()

	title	cardinal_digits
175	How to scrape Google results in three lines of Python code	[three]
93	How to write better code using DRY and Do One Thing	[one]
114	How to preprocess text for NLP in four easy steps	[four]
81	The four Python data science libraries you need to learn	[four]
27	Dell Precision 7750 mobile data science workstation review	[7750]
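
To use these values as numeric model features, the extracted tokens still need converting to numbers, because “three” and “7750” are both strings at this point. The sketch below is one simple way to do that under stated assumptions: WORD_TO_NUMBER and to_number() are hypothetical names, and the lookup only covers the words one to ten, so anything else returns None.

```python
# Hypothetical lookup covering only the number words one to ten
WORD_TO_NUMBER = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def to_number(token):
    """Convert a CD-tagged token to an int, or return None if it is not recognised."""
    cleaned = token.lower().replace(",", "")
    if cleaned.isdecimal():
        return int(cleaned)
    return WORD_TO_NUMBER.get(cleaned)


# Tokens taken from the cardinal_digits column above
tokens = ["three", "one", "four", "7750"]
numbers = [to_number(token) for token in tokens]
print(numbers)
```

Returning None for unrecognised tokens keeps the helper from guessing, so you can decide separately how to handle values such as fractions or larger number words.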

Matt Clarke, Tuesday, September 27, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.