Noun phrase extraction is a Natural Language Processing (NLP) technique for identifying and extracting noun phrases from text. A noun phrase is a phrase that functions grammatically as a noun in a sentence. It usually has a noun or pronoun as its headword, along with any associated determiners, adjectives, and other modifiers.
For example, given the sentence “The quick brown fox jumps over the lazy dog”, the noun phrases would be “The quick brown fox” and “the lazy dog.”
Noun phrase extraction can be very useful when analysing customer review data during review mining, since it reveals more than the nouns alone. It can be achieved using a range of NLP techniques, including dependency parsing, part-of-speech tagging, and shallow parsing, as well as Large Language Models and transformers.
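To make the shallow parsing idea concrete, here is a minimal sketch in plain Python. It assumes the part-of-speech tags have already been assigned (they are written by hand here, and the function name chunk_noun_phrases is purely illustrative), and it groups tokens that match a simple "optional determiner, any adjectives, one or more nouns" pattern. Real chunkers handle far more patterns than this.

```python
# Hand-assigned (word, tag) pairs for the example sentence.
# In a real pipeline these tags would come from a part-of-speech tagger.
TAGGED = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]


def chunk_noun_phrases(tagged):
    """Group tokens matching DET? ADJ* NOUN+ into noun phrases."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner
        if tagged[j][1] == "DET":
            j += 1
        # Any number of adjectives
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        # One or more nouns (common or proper)
        noun_start = j
        while j < n and tagged[j][1] in ("NOUN", "PROPN"):
            j += 1
        if j > noun_start:
            # At least one noun was found, so keep the whole span
            phrases.append(" ".join(word for word, _ in tagged[i:j]))
            i = j
        else:
            # No noun followed, so move on by one token
            i += 1
    return phrases


noun_phrases = chunk_noun_phrases(TAGGED)
print(noun_phrases)
# ['The quick brown fox', 'the lazy dog']
```

This recovers the two phrases from the example sentence, but only because the tags were supplied by hand. A library takes care of both the tagging and the chunking.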
In this project we’ll use the spaCy natural language processing library to extract noun phrases from a short piece of text, to show how easily it can be done.
To get started, open a Jupyter notebook and install the spaCy library using !pip install -U spacy, then use spaCy to download the en_core_web_sm model. There are various models you can use, but the small model works fine for basic tasks like noun phrase extraction.
!pip install -U spacy
!python -m spacy download en_core_web_sm
Next, import spaCy, then load the en_core_web_sm model using spacy.load(). Assign the model object to a variable called nlp. We’ll now be able to pass text to nlp and perform various Natural Language Processing tasks.
import spacy
nlp = spacy.load("en_core_web_sm")
Now we need to create a variable that contains the text we want spaCy to analyse. We’ll store a sentence containing some noun phrases in a variable called text.
text = """
The data scientist hurriedly wrote some code on their Linux workstation to get everything completed before the deadline.
"""
Next we need to pass our text variable to nlp() and assign the output to a variable so we can work with the results returned. We can do this by entering doc = nlp(text).
doc = nlp(text)
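One detail worth noting: the triple-quoted string assigned to text begins and ends with a newline, and those characters are passed to nlp() along with the sentence. A minimal sketch of trimming them first, assuming you don't need the surrounding whitespace (clean_text is an illustrative name, not part of the original code):

```python
text = """
The data scientist hurriedly wrote some code on their Linux workstation to get everything completed before the deadline.
"""

# str.strip() removes the leading and trailing newlines that the
# triple-quoted layout adds around the sentence.
clean_text = text.strip()

print(repr(text[:5]))        # '\nThe '
print(repr(clean_text[:5]))  # 'The d'
```

Passing clean_text to nlp() instead of text should keep the stray newline out of the first noun phrase in the results.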
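To extract nouns and noun phrases from the doc returned by spaCy we can use a couple of list comprehensions. The first one keeps only the tokens whose token.pos_ value is NOUN, which gives us a list of the nouns in our text. Because it returns token.lemma_ rather than the raw text, each noun appears in its base form. Linux is missing from the output, presumably because it is tagged as a proper noun (PROPN) rather than NOUN.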
print("Nouns:", [token.lemma_ for token in doc if token.pos_ == "NOUN"])
Nouns: ['data', 'scientist', 'code', 'workstation', 'deadline']
The second list comprehension loops over the noun_chunks property of the doc and takes the chunk.text of each chunk. This returns a list containing all the noun phrases spaCy extracted from the text.
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
Noun phrases: ['\nThe data scientist', 'some code', 'their Linux workstation', 'everything', 'the deadline']
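For review mining you will usually want to tidy these phrases before counting them. The sketch below is one possible approach in plain Python, not part of spaCy. It takes the list printed above, strips surrounding whitespace (including the leading newline on the first phrase), lowercases each phrase, and drops a leading determiner or possessive. The LEADING_WORDS set and the normalise function are assumptions made for illustration and would need adjusting for real data.

```python
from collections import Counter

# Noun phrases as printed above for the example sentence.
noun_phrases = ['\nThe data scientist', 'some code', 'their Linux workstation',
                'everything', 'the deadline']

# Hypothetical set of leading words to drop; extend it for your own data.
LEADING_WORDS = {"the", "a", "an", "some", "their"}


def normalise(phrase):
    """Strip whitespace, lowercase, and drop one leading determiner."""
    words = phrase.strip().lower().split()
    # Only drop the first word if something is left afterwards
    if len(words) > 1 and words[0] in LEADING_WORDS:
        words = words[1:]
    return " ".join(words)


normalised = [normalise(phrase) for phrase in noun_phrases]
print(normalised)
# ['data scientist', 'code', 'linux workstation', 'everything', 'deadline']

# With many reviews, a Counter shows which phrases come up most often.
counts = Counter(normalised)
```

With a single sentence every count is 1, but across a whole set of reviews counts.most_common() would surface the phrases customers mention most.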
Matt Clarke, Tuesday, January 17, 2023