The Natural Language Toolkit (NLTK) is a powerful Python package for performing a wide range of common NLP tasks, including Part of Speech tagging, or POS tagging for short.
In this example, we’ll load some text data in a Pandas dataframe and then use NLTK’s POS tagging feature to identify the word classes or lexical categories for each word in the string, so we can extract them, analyse them, or make model features from them. It’s a particularly useful tool in Python SEO projects.
Part of Speech tagging is an NLP process that takes a string of text and returns a structured response identifying the word class, or lexical or grammatical category, of each word in the string. Several NLP packages are now capable of POS tagging, so the process is quite simple and robust.
In NLTK, POS tagging is powered by the Averaged Perceptron Tagger model, which is a port of a module from the TextBlob package. This model was pre-trained on text from the Wall Street Journal and is able to identify whether a particular element is a verb, noun, adjective, adverb, pronoun, preposition, conjunction, numeral, interjection, determiner, or article. Such models are language-specific, so you’ll need to load one that works for text in the language you intend to analyse.
As with other NLP techniques, some text preprocessing is required for Part of Speech tagging to work correctly. The first step is to tokenize the data, which converts a string such as “Noel is the most talented Gallagher brother” into a Python list of individual elements or “tokens”, giving ['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother'].
string = "Noel is the most talented Gallagher brother"
tokenized = nltk.word_tokenize(string.lower())
tokenized
['noel', 'is', 'the', 'most', 'talented', 'gallagher', 'brother']
Once that’s done, it’s simply a case of passing the tokenized data to the pos_tag function, which uses the Averaged Perceptron Tagger and returns a list containing the POS tag for each token. Here’s a really simple example. For each token, we get back a tuple containing the word and its POS tag code.
In our simple example, NN means a singular noun (such as “noel”, “gallagher”, or “brother”); VBZ means a verb in the third person singular present tense (such as “is”); DT means determiner (such as “the”); RBS means superlative adverb (such as “most”); and JJ means adjective (such as “talented”). There are, of course, many other Part of Speech tags.
tagged = nltk.pos_tag(tokenized)
tagged
[('noel', 'NN'),
('is', 'VBZ'),
('the', 'DT'),
('most', 'RBS'),
('talented', 'JJ'),
('gallagher', 'NN'),
('brother', 'NN')]
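Because the result is just a list of (word, tag) tuples, it can be regrouped with plain Python. The sketch below copies the tagged output shown above into a literal, so it runs without calling NLTK, and collects the words under each tag. The name words_by_tag is only illustrative.

```python
from collections import defaultdict

# Tagged output copied from the example above, so no NLTK call is needed here.
tagged = [('noel', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('most', 'RBS'),
          ('talented', 'JJ'), ('gallagher', 'NN'), ('brother', 'NN')]

# Collect the words under each POS tag, preserving their original order.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(dict(words_by_tag))
```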
NLTK has a wide range of Part of Speech tags, which come from the Penn Treebank tagset. Here’s a list of the main NLTK POS tags and their meanings so you can better interpret the part of speech for each token in your text data.
POS Tag | Meaning | Example |
---|---|---|
CC | Coordinating conjunction | and |
CD | Cardinal number | one, two |
DT | Determiner | the, a |
EX | Existential there | there |
FW | Foreign word | dolce |
IN | Preposition or subordinating conjunction | of, in, by |
JJ | Adjective | big |
JJR | Adjective, comparative | bigger |
JJS | Adjective, superlative | biggest |
LS | List item marker | 1) |
MD | Modal | could, will |
NN | Noun, singular or mass | turnip, badger |
NNS | Noun, plural | turnips, badgers |
NNP | Proper noun, singular | Edinburgh |
NNPS | Proper noun, plural | Smiths |
PDT | Predeterminer | all, both |
POS | Possessive ending | 's |
PRP | Personal pronoun | I, you, he |
PRP$ | Possessive pronoun | my, your, his |
RB | Adverb | quickly |
RBR | Adverb, comparative | faster, more |
RBS | Adverb, superlative | fastest, most |
RP | Particle | up, off |
TO | Infinitive marker | to |
UH | Interjection | oh, oops |
VB | Verb, base form | take |
VBD | Verb, past tense | took |
VBG | Verb, gerund or present participle | taking |
VBN | Verb, past participle | taken |
VBP | Verb, non-3rd person singular present | take |
VBZ | Verb, 3rd person singular present | takes |
WDT | Wh-determiner | which, that, what |
WP | Wh-pronoun | what, who |
WP$ | Possessive wh-pronoun | whose |
WRB | Wh-adverb | how, where, when |
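If the fine-grained tags are more detail than you need, one option is to collapse them into broader word classes. The following is a minimal sketch, not part of NLTK: the COARSE_CLASSES mapping and the coarse_class function are hypothetical names, and the grouping only covers the noun, verb, adjective, and adverb tags from the table above. Everything else, including WRB, falls back to “other”.

```python
# Hypothetical grouping of the tags in the table above into broad word classes.
COARSE_CLASSES = {
    'noun': {'NN', 'NNS', 'NNP', 'NNPS'},
    'verb': {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'},
    'adjective': {'JJ', 'JJR', 'JJS'},
    'adverb': {'RB', 'RBR', 'RBS'},
}

def coarse_class(tag):
    """Return the broad word class for a POS tag, or 'other' if unmapped."""
    for name, tags in COARSE_CLASSES.items():
        if tag in tags:
            return name
    return 'other'

for tag in ['NNS', 'VBZ', 'JJR', 'RBS', 'WRB']:
    print(tag, coarse_class(tag))
```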
In this project, we’ll load a Pandas dataframe, use NLTK to apply Part of Speech tagging to a column of text, and then extract specific POS tags based on their type, so we can better understand the dataset. To get started, open a Jupyter notebook and import the Pandas and NLTK packages. You may need to install NLTK via pip first.
import pandas as pd
import nltk
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
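NLTK uses a range of underlying packages, models, and datasets that you will need to load in order to complete the next steps. We’ll be using the punkt tokenizer and the averaged_perceptron_tagger model, so you’ll first need to download these by executing the commands below. If you’re on a recent NLTK release and get a LookupError that names punkt_tab or averaged_perceptron_tagger_eng, download those resources in the same way.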
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
True
Next, we need to load a dataset into Pandas. I’m using a dummy dataset I created that is based on meta titles and meta descriptions from this website. You can load it remotely via my GitHub account. For POS tagging to work, you’ll need to fill any NaN values with an empty string, otherwise you’ll get an error.
df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/titles_and_descriptions.csv')
df = df[['title']].fillna('')
df.head()
| | title |
---|---|
0 | How to create a Python virtual environment for Jupyter |
1 | How to engineer date features using Pandas |
2 | How to impute missing numeric values in your dataset |
3 | How to interpret the confusion matrix |
4 | How to use mean encoding in your machine learning models |
Before you can apply any NLP techniques to the data, you will need to tokenize it, which converts the original text string in each row of the Pandas column to a Python list. We’ll create a column called tokenized to store the data in, convert the string to lowercase using str.lower(), and then use the apply() function to run the nltk.word_tokenize() function on each row.
df['tokenized'] = df['title'].str.lower().apply(nltk.word_tokenize)
df.head()
| | title | tokenized |
---|---|---|
0 | How to create a Python virtual environment for Jupyter | [how, to, create, a, python, virtual, environment, for, jupyter] |
1 | How to engineer date features using Pandas | [how, to, engineer, date, features, using, pandas] |
2 | How to impute missing numeric values in your dataset | [how, to, impute, missing, numeric, values, in, your, dataset] |
3 | How to interpret the confusion matrix | [how, to, interpret, the, confusion, matrix] |
4 | How to use mean encoding in your machine learning models | [how, to, use, mean, encoding, in, your, machine, learning, models] |
To apply POS tagging to our tokenized data, we can follow a similar process. We’ll create a new column called tagged, and we’ll then use the apply() function to run the nltk.pos_tag function on each row of tokenized data. For each row, it will return a list of tuples that contain the word and its POS tag.
df['tagged'] = df['tokenized'].apply(nltk.pos_tag)
df[['tagged']].head()
| | tagged |
---|---|
0 | [(how, WRB), (to, TO), (create, VB), (a, DT), (python, JJ), (virtual, JJ), (environment, NN), (for, IN), (jupyter, NN)] |
1 | [(how, WRB), (to, TO), (engineer, VB), (date, NN), (features, NNS), (using, VBG), (pandas, NNS)] |
2 | [(how, WRB), (to, TO), (impute, VB), (missing, VBG), (numeric, JJ), (values, NNS), (in, IN), (your, PRP$), (dataset, NN)] |
3 | [(how, WRB), (to, TO), (interpret, VB), (the, DT), (confusion, NN), (matrix, NN)] |
4 | [(how, WRB), (to, TO), (use, VB), (mean, JJ), (encoding, VBG), (in, IN), (your, PRP$), (machine, NN), (learning, NN), (models, NNS)] |
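A quick way to get a feel for the tagged data is to count how often each tag occurs. The sketch below is one possible approach using collections.Counter. To stay runnable without NLTK or the dataset, it uses rows 1 and 3 of the output above copied into a literal. The variable names rows and tag_counts are illustrative.

```python
from collections import Counter

# Rows 1 and 3 of the tagged column, copied from the output above.
rows = [
    [('how', 'WRB'), ('to', 'TO'), ('engineer', 'VB'), ('date', 'NN'),
     ('features', 'NNS'), ('using', 'VBG'), ('pandas', 'NNS')],
    [('how', 'WRB'), ('to', 'TO'), ('interpret', 'VB'), ('the', 'DT'),
     ('confusion', 'NN'), ('matrix', 'NN')],
]

# Flatten every (word, tag) tuple across all rows and count the tags.
tag_counts = Counter(tag for row in rows for _, tag in row)
print(tag_counts.most_common(3))

# With the dataframe from this tutorial, the same idea would be:
# tag_counts = Counter(tag for row in df['tagged'] for _, tag in row)
```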
The useful part of POS tagging comes when you extract words with specific POS tag types. To show how this works, let’s create a new column called nouns from the tagged column that contains the POS tagged data we just created. We’ll use a lambda function containing a list comprehension to loop over the words and tags and return only the words that are nouns.
In NLTK, nouns can be tagged with any of four different values: NN for singular nouns, NNS for plural nouns, NNP for singular proper nouns, and NNPS for plural proper nouns. When you run this, the nouns will be extracted and put in the new column.
df['nouns'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['NN', 'NNS', 'NNP', 'NNPS']])
df[['title', 'nouns']].head()
| | title | nouns |
---|---|---|
0 | How to create a Python virtual environment for Jupyter | [environment, jupyter] |
1 | How to engineer date features using Pandas | [date, features, pandas] |
2 | How to impute missing numeric values in your dataset | [values, dataset] |
3 | How to interpret the confusion matrix | [confusion, matrix] |
4 | How to use mean encoding in your machine learning models | [machine, learning, models] |
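Since all four noun tags start with NN, an alternative to listing them is to match on the tag prefix. This is a small sketch under that assumption. The helper name words_with_tag_prefix is hypothetical, and the sample row is row 0 of the tagged column copied from the earlier output. Be aware that a prefix match is deliberately loose, so a prefix such as PRP would match both PRP and PRP$.

```python
def words_with_tag_prefix(tagged, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

# Row 0 of the tagged column, copied from the output earlier in this tutorial.
row_0 = [('how', 'WRB'), ('to', 'TO'), ('create', 'VB'), ('a', 'DT'),
         ('python', 'JJ'), ('virtual', 'JJ'), ('environment', 'NN'),
         ('for', 'IN'), ('jupyter', 'NN')]

nouns = words_with_tag_prefix(row_0, 'NN')
print(nouns)

# With the dataframe from this tutorial, this would be used as:
# df['nouns'] = df['tagged'].apply(lambda x: words_with_tag_prefix(x, 'NN'))
```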
We can use the same lambda function approach on the POS tagging data to identify the verbs in each piece of text. In NLTK, verbs can be tagged as: VB for base form verbs, VBD for past tense verbs, VBG for gerunds or present participles, VBN for past participles, VBP for present tense verbs that are not third person singular, and VBZ for third person singular present tense verbs.
df['verbs'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']])
df[['title', 'verbs']].head()
| | title | verbs |
---|---|---|
0 | How to create a Python virtual environment for Jupyter | [create] |
1 | How to engineer date features using Pandas | [engineer, using] |
2 | How to impute missing numeric values in your dataset | [impute, missing] |
3 | How to interpret the confusion matrix | [interpret] |
4 | How to use mean encoding in your machine learning models | [use, encoding] |
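If the goal is to build model features rather than inspect the words themselves, the extracted tags can be reduced to simple counts. The sketch below shows one way this might look. The pos_counts function and the feature names noun_count and verb_count are assumptions for illustration, and the sample row is row 4 of the tagged column copied from the earlier output.

```python
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}
VERB_TAGS = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def pos_counts(tagged):
    """Count the nouns and verbs in one list of (word, tag) tuples."""
    return {
        'noun_count': sum(tag in NOUN_TAGS for _, tag in tagged),
        'verb_count': sum(tag in VERB_TAGS for _, tag in tagged),
    }

# Row 4 of the tagged column, copied from the output earlier in this tutorial.
row_4 = [('how', 'WRB'), ('to', 'TO'), ('use', 'VB'), ('mean', 'JJ'),
         ('encoding', 'VBG'), ('in', 'IN'), ('your', 'PRP$'),
         ('machine', 'NN'), ('learning', 'NN'), ('models', 'NNS')]

counts = pos_counts(row_4)
print(counts)
```

These counts agree with the nouns and verbs columns shown above for the same title, which list three nouns and two verbs.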
Another extremely useful feature of POS tagging is the ability to extract numbers, which NLTK tags as CD for “cardinal number”, from text. In machine learning these often make powerful features, so they are often worth extracting and storing. The neat thing about using NLTK for this is that it even works on text representations of numbers, so “three” and “four” are both identified as cardinal numbers, even though they’re non-numeric.
df['cardinal_digits'] = df['tagged'].apply(lambda x: [word for word, tag in x if tag in ['CD']])
df[['title', 'cardinal_digits']].sort_values(by='cardinal_digits', ascending=False).head()
| | title | cardinal_digits |
---|---|---|
175 | How to scrape Google results in three lines of Python code | [three] |
93 | How to write better code using DRY and Do One Thing | [one] |
114 | How to preprocess text for NLP in four easy steps | [four] |
81 | The four Python data science libraries you need to learn | [four] |
27 | Dell Precision 7750 mobile data science workstation review | [7750] |
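To use these values as numeric features, the extracted tokens still need converting to numbers, which the tagger does not do for you. Here is a rough sketch of one way to handle that. The NUMBER_WORDS lookup and the cardinal_to_int function are hypothetical, and the lookup only covers a handful of number words, so you would need to extend it for your own data.

```python
# Small illustrative lookup; extend it for the number words in your own data.
NUMBER_WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}

def cardinal_to_int(token):
    """Return an int for a CD token, or None if it can't be converted."""
    token = token.lower()
    if token.isdecimal():
        return int(token)
    return NUMBER_WORDS.get(token)

# Tokens taken from the cardinal_digits column shown above.
for token in ['three', 'one', 'four', '7750']:
    print(token, cardinal_to_int(token))
```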
Matt Clarke, Tuesday, September 27, 2022