How to use Spacy for noun phrase extraction

Picture by Bich Tran, Pexels.

Noun phrase extraction is a Natural Language Processing technique that can be used to identify and extract noun phrases from text. Noun phrases are phrases that function grammatically as nouns in a sentence, and usually include a noun or pronoun as the headword, as well as any associated determiners, adjectives, and modifiers.

For example, given the sentence “The quick brown fox jumps over the lazy dog”, the noun phrases would be “The quick brown fox” and “the lazy dog.”

Noun phrase extraction can be very useful when analysing customer review data during review mining, since it reveals more than the nouns alone. It can be achieved using a range of NLP techniques, including dependency parsing, part-of-speech tagging, and shallow parsing, as well as Large Language Models and transformers.
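
As a rough illustration of the review mining use case, the sketch below tallies noun phrases across several reviews using Python's collections.Counter. The phrase lists are made up for illustration and stand in for whatever an extractor would return, and count_phrases is a hypothetical helper name rather than part of any library.

```python
from collections import Counter


def count_phrases(phrase_lists):
    """Tally how often each noun phrase appears across many reviews."""
    counts = Counter()
    for phrases in phrase_lists:
        # Normalise case and whitespace so "The battery life" and
        # "the battery life" are counted as the same phrase.
        counts.update(p.strip().lower() for p in phrases)
    return counts


# Made-up phrase lists standing in for three reviews (not real extractor output).
reviews = [
    ["The battery life", "the screen"],
    ["the battery life", "a great price"],
    ["the screen", "the battery life"],
]

print(count_phrases(reviews).most_common(2))
# [('the battery life', 3), ('the screen', 2)]
```

Counting whole phrases rather than single nouns keeps "battery life" together, which is the kind of detail a noun-only count would lose.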

In this project we’ll use the Spacy natural language processing library to extract some noun phrases from some text to show how easily it can be achieved.

Install the packages

To get started, open a Jupyter notebook and install the Spacy library using !pip install -U spacy, then use Spacy to download the en_core_web_sm model. There are various models you can use but the small model works fine for basic tasks like noun phrase extraction.

!pip install -U spacy
!python -m spacy download en_core_web_sm

Import the packages

Next, import Spacy then load the en_core_web_sm model using spacy.load(). Assign the model object to a variable called nlp. We’ll now be able to pass data to the nlp model and perform various Natural Language Processing tasks.

import spacy

nlp = spacy.load("en_core_web_sm")

Create a document to analyse

Now we need to create a variable or document that contains the text we want Spacy to analyse. We’ll store a sentence containing some noun phrases in a variable called text.

text = """
The data scientist hurriedly wrote some code on their Linux workstation to get everything completed before the deadline. 
"""

Pass the text to Spacy

Next we need to pass our text variable to the nlp() model and assign the output to a variable so we can parse the results returned. We can do this by entering doc = nlp(text).

doc = nlp(text)
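
Before extracting anything, it can help to look at what the model attached to each token, since Spacy derives noun chunks from the part-of-speech tags and the dependency parse. The helper below is a minimal sketch: token_rows is a hypothetical name, and it only assumes that each token exposes the text, pos_, dep_, head and is_space attributes that Spacy tokens provide.

```python
def token_rows(doc):
    """Return (text, part of speech, dependency label, head text) per token."""
    return [
        (token.text, token.pos_, token.dep_, token.head.text)
        for token in doc
        if not token.is_space  # skip whitespace-only tokens such as "\n"
    ]


# In the notebook, with doc = nlp(text) already defined:
# for row in token_rows(doc):
#     print(row)
```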

Extract nouns and noun phrases

To extract nouns and noun phrases from the doc returned by Spacy we can use a couple of list comprehensions. The first one keeps every token whose token.pos_ value is NOUN and returns its token.lemma_, the base form of the word, which gives us a list of the nouns in our text.

print("Nouns:", [token.lemma_ for token in doc if token.pos_ == "NOUN"])

Nouns: ['data', 'scientist', 'code', 'workstation', 'deadline']
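
Linux is missing from this list, most likely because the model tags it as a proper noun (PROPN) rather than a common noun (NOUN). If you want both, one option is to check for either tag. The function below is a sketch with a hypothetical name, extract_nouns, and it assumes only that each token has the lemma_ and pos_ attributes used above.

```python
def extract_nouns(doc, include_proper=False):
    """Return the lemmas of nouns in a doc, optionally with proper nouns."""
    wanted = {"NOUN", "PROPN"} if include_proper else {"NOUN"}
    return [token.lemma_ for token in doc if token.pos_ in wanted]


# In the notebook, with doc = nlp(text) already defined:
# print("Nouns:", extract_nouns(doc, include_proper=True))
```

Whether Linux then appears depends on how the model actually tagged it in your run.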

The second list comprehension loops over doc.noun_chunks, Spacy's built-in noun phrase iterator, and reads the chunk.text of each one. This returns a list containing all the noun phrases Spacy extracted from the text.

print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

Noun phrases: ['\nThe data scientist', 'some code', 'their Linux workstation', 'everything', 'the deadline']
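
The first phrase starts with \n because the text variable opens with a line break straight after the triple quotes. If you plan to count or compare phrases, a small clean-up step helps. The sketch below strips stray whitespace and, optionally, drops a leading determiner. Both clean_phrase and the LEADING_WORDS set are hypothetical, hand-picked for this example rather than taken from Spacy.

```python
# Hypothetical, hand-picked list; extend it to suit your own data.
LEADING_WORDS = {"a", "an", "the", "some"}


def clean_phrase(phrase, drop_leading=True):
    """Tidy whitespace and optionally remove a leading determiner."""
    words = phrase.split()  # split() with no arguments also discards "\n"
    if drop_leading and len(words) > 1 and words[0].lower() in LEADING_WORDS:
        words = words[1:]
    return " ".join(words)


# The list printed above, copied from the article's output.
phrases = ['\nThe data scientist', 'some code', 'their Linux workstation',
           'everything', 'the deadline']

print([clean_phrase(p) for p in phrases])
# ['data scientist', 'code', 'their Linux workstation', 'everything', 'deadline']
```

Passing drop_leading=False keeps the determiners and only tidies the whitespace.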

Matt Clarke, Tuesday, January 17, 2023
