How to use Spacy for POS tagging in Pandas

Spacy is one of the most popular Python packages for Natural Language Processing. Alongside the Natural Language Toolkit (NLTK), Spacy provides a huge range of functionality for a wide variety of NLP tasks. It supports all common tasks out of the box, and is also highly extensible.

In this simple tutorial, we’ll use Spacy for Parts of Speech tagging (or POS tagging), and NLP text preprocessing. We’ll tokenize the words in a sentence, tokenize the sentences in a paragraph, use lemmatization, detect stopwords, and extract parts of speech and their tags to a Pandas dataframe.

Install the packages

To get started, open a Jupyter notebook and install the Spacy package via the Pip Python package management system using !pip3 install spacy. Once this is installed, you’ll need to download a Spacy model.

The most commonly used one is en_core_web_sm, a small English model, but larger and more accurate models are available. To install it, run !python3 -m spacy download en_core_web_sm and wait a couple of minutes for everything to install.

!pip3 install spacy

!python3 -m spacy download en_core_web_sm

Load the packages

Next you’ll need to import the Spacy and Pandas packages and load the Spacy en_core_web_sm model using nlp = spacy.load('en_core_web_sm'). Once that’s done, you can start using Spacy.

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')
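
If the model has not been downloaded yet, spacy.load() raises an OSError. As a minimal sketch, the loading step can be wrapped in a small helper that prints the download command and falls back to a blank English pipeline. The helper name load_pipeline() is hypothetical and not part of Spacy, and the fallback assumes a tokenizer-only pipeline is good enough to experiment with: a blank pipeline has no tagger or parser, so token.pos_ and token.tag_ will be empty.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load a Spacy model, or fall back to a blank pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load() raises OSError when the model package is not installed.
        print(f"Model '{name}' not found; run: python3 -m spacy download {name}")
        # A blank pipeline can still tokenize, but it has no tagger or parser.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.pipe_names)  # lists the loaded components; empty for the blank fallback
```

Checking nlp.pipe_names is a quick way to confirm whether the full model or the blank fallback was loaded.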

Create a Spacy Doc

The first step in Spacy, after loading the model, is to create a Spacy document. For this simple example, we’ll pass a string containing a couple of sentences to nlp() and store the output in a variable called doc. If you print doc you’ll see that it displays the text you entered.

doc = nlp('Apple is seeking 5 new data scientists with skills in Python, Pandas, and Spacy. The roles will specialise in NLP.')

doc

Apple is seeking 5 new data scientists with skills in Python, Pandas, and Spacy. The roles will specialise in NLP.

Tokenize the words and punctuation in the document

Since Spacy and other NLP systems, such as NLTK, all work with individual words and punctuation, the initial step is to convert the document text to “tokens”. Spacy does this for you when you call nlp(), so if you loop over doc and print each item you’ll get these tokens back. With a few lines of extra code, you can do the same thing but output the token.text values to a Pandas dataframe.

for token in doc: 
    print(token)

Apple
is
seeking
5
new
data
scientists
with
skills
in
Python
,
Pandas
,
and
Spacy
.
The
roles
will
specialise
in
NLP
.

# Collect the text of every token, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame({'token': [token.text for token in doc]})
df

	token
0	Apple
1	is
2	seeking
3	5
4	new
5	data
6	scientists
7	with
8	skills
9	in
10	Python
11	,
12	Pandas
13	,
14	and
15	Spacy
16	.
17	The
18	roles
19	will
20	specialise
21	in
22	NLP
23	.

Tokenize the sentences in the document

Another technique that’s sometimes useful is sentence tokenization. Instead of breaking up the document into individual word and punctuation tokens, this uses the built-in doc.sents property to break up the document into individual sentences.

for sent in doc.sents:
    print(sent)

Apple is seeking 5 new data scientists with skills in Python, Pandas, and Spacy.
The roles will specialise in NLP.
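
The sentences can be stored in a Pandas dataframe in the same way as the tokens. The sketch below is an illustration rather than the exact setup used above: to stay runnable without a model download it assumes Spacy v3 and swaps en_core_web_sm for a blank English pipeline with the rule-based sentencizer component, which splits on sentence-ending punctuation. With en_core_web_sm loaded, the same loop over doc.sents should work unchanged. The column names sentence and n_tokens are illustrative.

```python
import pandas as pd
import spacy

# Assumption: a blank pipeline plus the rule-based "sentencizer" component
# stands in for en_core_web_sm, so no model download is needed here.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("Apple is seeking 5 new data scientists with skills in Python, "
          "Pandas, and Spacy. The roles will specialise in NLP.")

# One row per sentence, holding its text and its number of tokens
df_sents = pd.DataFrame(
    [{"sentence": sent.text, "n_tokens": len(sent)} for sent in doc.sents]
)
print(df_sents)
```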

Extract Parts of Speech tags

We’ve already seen that the token returned by Spacy contains the text, such as the word, number, or punctuation, within the token.text attribute. However, there is a wide range of other token attributes you can also extract with Spacy. The code below will extract some of the most widely used Spacy token attributes and put them in a Pandas dataframe.

These include Parts of Speech or POS tags, stored in token.pos_, which contain a coarse-grained value such as NUM or NOUN to indicate what Spacy detected. They’re usually used in conjunction with token.tag_, which holds a fine-grained tag, such as NNS for a plural noun or VBG for a verb ending in -ing.

You can also see things like the shape of the word (its pattern of upper case letters, lower case letters, and digits, with runs of the same character type cut off after four), whether the token is alphabetic, and whether the word is a commonly used stop word, such as “is”, “with”, or “in”. Stopwords rarely add much to models so often get stripped out to make models quicker and more effective.

# Build one dictionary per token, then create the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = [{'text': token.text,
         'lemma': token.lemma_,
         'pos_': token.pos_,
         'tag_': token.tag_,
         'dep_': token.dep_,
         'shape_': token.shape_,
         'is_alpha': token.is_alpha,
         'is_stop': token.is_stop} for token in doc]

df = pd.DataFrame(rows)
df

	text	lemma	pos_	tag_	dep_	shape_	is_alpha	is_stop
0	Apple	Apple	PROPN	NNP	nsubj	Xxxxx	True	False
1	is	be	AUX	VBZ	aux	xx	True	True
2	seeking	seek	VERB	VBG	ROOT	xxxx	True	False
3	5	5	NUM	CD	nummod	d	False	False
4	new	new	ADJ	JJ	amod	xxx	True	False
5	data	data	NOUN	NN	compound	xxxx	True	False
6	scientists	scientist	NOUN	NNS	dobj	xxxx	True	False
7	with	with	ADP	IN	prep	xxxx	True	True
8	skills	skill	NOUN	NNS	pobj	xxxx	True	False
9	in	in	ADP	IN	prep	xx	True	True
10	Python	Python	PROPN	NNP	pobj	Xxxxx	True	False
11	,	,	PUNCT	,	punct	,	False	False
12	SQL	SQL	PROPN	NNP	conj	XXX	True	False
13	,	,	PUNCT	,	punct	,	False	False
14	and	and	CCONJ	CC	cc	xxx	True	True
15	Spacy	Spacy	PROPN	NNP	conj	Xxxxx	True	False
16	.	.	PUNCT	.	punct	.	False	False
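
Two common follow-up steps are decoding the tag abbreviations and stripping out stop words and punctuation. The sketch below uses spacy.explain(), which returns a short description of a label, and a boolean mask on the dataframe. It assumes a handful of rows copied by hand from the table above so that it runs without a model, and the column names pos_meaning and tag_meaning are illustrative.

```python
import pandas as pd
import spacy

# A few rows copied from the table above, so the example runs without a model
df = pd.DataFrame(
    [
        ("Apple", "PROPN", "NNP", True, False),
        ("is", "AUX", "VBZ", True, True),
        ("seeking", "VERB", "VBG", True, False),
        ("5", "NUM", "CD", False, False),
        ("scientists", "NOUN", "NNS", True, False),
        ("with", "ADP", "IN", True, True),
        (",", "PUNCT", ",", False, False),
    ],
    columns=["text", "pos_", "tag_", "is_alpha", "is_stop"],
)

# spacy.explain() returns a description of a label, or None if it is unknown
df["pos_meaning"] = df["pos_"].map(spacy.explain)
df["tag_meaning"] = df["tag_"].map(spacy.explain)

# Keep alphabetic tokens that are not stop words
content_words = df[df["is_alpha"] & ~df["is_stop"]]
print(content_words[["text", "pos_meaning", "tag_meaning"]])
```

The same two steps should apply to the full dataframe built from doc, since it has the same pos_, tag_, is_alpha, and is_stop columns.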

Visualise POS tagging data with Displacy

Another neat thing you can do with Spacy is use the built-in Displacy module to visualise what the model detected. The Displacy visualizer works inside a Jupyter notebook. It takes the Spacy document and a style option, and returns a visualisation showing the tagged text. The ent style in Displacy labels any named entities identified.

In the example below, it picks out Apple, Spacy, and NLP as ORG entities or organisations, Python as a GPE or geopolitical entity, and 5 as a CARDINAL or number. As you can see, it doesn’t always detect entities correctly when they’re a bit obscure like the ones in our test sentence. I’ll explain how you can improve and extend Spacy’s Named Entity Recognition or NER in another tutorial.

from spacy import displacy

displacy.render(doc, style="ent")
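
Outside a notebook, displacy.render() can return the markup as a string instead of displaying it, which makes it possible to save the visualisation to a file. The sketch below is an illustration under a few assumptions: it uses Displacy's manual mode with two entity spans written by hand (taken from the labels described above) so that it runs without a model, and the output file name entities.html is hypothetical. With a real document you would pass doc and leave out manual=True.

```python
from pathlib import Path

from spacy import displacy

# Assumption: the entity spans are written by hand (manual mode) instead of
# coming from a model; start and end are character offsets into the text.
example = {
    "text": "Apple is seeking 5 new data scientists.",
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 17, "end": 18, "label": "CARDINAL"},
    ],
    "title": None,
}

# jupyter=False makes render() return the HTML instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)

# Hypothetical output path; open the file in a browser to view the result
Path("entities.html").write_text(html, encoding="utf-8")
```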

The other useful visualizer for Spacy is the Displacy dependency or dep visualizer. This shows more of the POS tagging data than the ent visualizer above, including the Part of Speech tag for each token and the dependency arcs between tokens.

displacy.render(doc, style="dep", jupyter=True)

Matt Clarke, Friday, October 07, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.