Noun phrase extraction is a Natural Language Processing technique that can be used to identify and extract noun phrases from text. Noun phrases are phrases that function grammatically as nouns in a sentence, and usually include a noun or pronoun as the headword, as well as any associated determiners, adjectives, and modifiers.
For example, given the sentence “The quick brown fox jumps over the lazy dog”, the noun phrases would be “The quick brown fox” and “the lazy dog.”
Noun phrase extraction can be very useful when analysing customer review data during review mining, since a phrase keeps the determiners and modifiers that the bare nouns lose. It can be achieved using a range of NLP techniques, including dependency parsing, part-of-speech tagging, and shallow parsing, as well as Large Language Models and transformers.
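To make the part-of-speech and shallow parsing approach concrete, here is a minimal sketch of a rule-based chunker. It is not how Spacy works internally. It assumes the tokens have already been tagged (the tags below are written by hand for the fox sentence), and the function name chunk_noun_phrases and the simple pattern of an optional determiner, any adjectives, then one or more nouns are illustrative choices only.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs matching DET? ADJ* NOUN+ into noun phrases."""
    phrases = []
    current = []       # words collected for the candidate phrase
    has_noun = False   # whether the candidate already contains a noun

    def flush():
        # Keep the candidate only if it contains at least one noun.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
            has_noun = True
        elif tag in ("DET", "ADJ"):
            # A determiner, or any modifier after a noun, starts a new phrase.
            if has_noun or tag == "DET":
                flush()
            current.append(word)
        else:
            # Any other tag (verb, preposition, ...) ends the phrase.
            flush()
    flush()
    return phrases


# Hand-tagged tokens (an assumption; a real tagger would supply these).
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
fox_phrases = chunk_noun_phrases(tagged)
print(fox_phrases)
```

A pattern this simple misses pronouns, proper nouns, and possessives, which is one reason to rely on a library with a trained model instead.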
In this project we’ll use the Spacy natural language processing library to extract some noun phrases from some text to show how easily it can be achieved.
To get started, open a Jupyter notebook and install the Spacy library using !pip install -U spacy, then use Spacy to download the en_core_web_sm model. There are various models you can use, but the small model works fine for basic tasks like noun phrase extraction.
!pip install -U spacy
!python -m spacy download en_core_web_sm
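If you want to confirm that both installs worked before going further, one option is to check that the packages are importable. This is a small sketch using only the Python standard library; the helper name is_installed is hypothetical and not part of Spacy.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return find_spec(package_name) is not None


# The downloaded model is installed as a regular Python package,
# so it can be checked in the same way as the library itself.
status = {name: is_installed(name) for name in ("spacy", "en_core_web_sm")}
print(status)
```

If either value is False, re-run the matching install command above and restart the notebook kernel.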
Next, import Spacy, then load the en_core_web_sm model using spacy.load(). Assign the model object to a variable called nlp. We’ll now be able to pass data to the nlp model and perform various Natural Language Processing tasks.
import spacy

# Load the small English pipeline downloaded earlier
nlp = spacy.load("en_core_web_sm")
Now we need to create a variable that contains the text we want Spacy to analyse. We’ll store a sentence containing some noun phrases in a variable called text.
text = """
The data scientist hurriedly wrote some code on their Linux workstation to get everything completed before the deadline.
"""
Next we need to pass our text variable to the nlp() model and assign the output to a variable so we can parse the results returned. We can do this by entering doc = nlp(text).
doc = nlp(text)
To extract nouns and noun phrases from the doc returned by Spacy we can use a couple of list comprehensions. The first one returns the token.lemma_ (the base form) of every token whose token.pos_ value is NOUN, which gives us a list of the nouns in our text.
print("Nouns:", [token.lemma_ for token in doc if token.pos_ == "NOUN"])
Nouns: ['data', 'scientist', 'code', 'workstation', 'deadline']
The second list comprehension loops over the noun_chunks feature of the doc and takes the chunk.text of each chunk. This returns a list containing all the noun phrases Spacy extracted from the text.
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
Noun phrases: ['\nThe data scientist', 'some code', 'their Linux workstation', 'everything', 'the deadline']
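The leading \n in the first phrase comes from the newline that opens the triple-quoted string. For review mining you would usually tidy the phrases before counting them. The sketch below does this in plain Python, using the list printed above as its input; the helper name clean_phrases is hypothetical, and lowercasing is an assumption that may not suit every dataset.

```python
from collections import Counter


def clean_phrases(phrases):
    """Strip surrounding whitespace and lowercase each phrase, dropping empties."""
    cleaned = (phrase.strip().lower() for phrase in phrases)
    return [phrase for phrase in cleaned if phrase]


# The noun phrases copied from the output above
noun_phrases = ['\nThe data scientist', 'some code', 'their Linux workstation',
                'everything', 'the deadline']

cleaned = clean_phrases(noun_phrases)
print(cleaned)

# Counting becomes useful once phrases from many reviews are combined
phrase_counts = Counter(cleaned)
print(phrase_counts.most_common(3))
```

Alternatively, calling text.strip() before passing the text to nlp() avoids the stray newline in the first place.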
Matt Clarke, Tuesday, January 17, 2023