How to use Spacy EntityRuler for custom Named Entity Recognition

Learn how to use the Spacy EntityRuler for custom Named Entity Recognition, or custom NER, by extracting skills from job advertisements in Python.

Spacy’s EntityRuler component is one of several rule-based matcher components that can be used to extend the core functionality of the package. It’s particularly useful for creating custom named entity recognition (NER) pipelines.

EntityRuler allows you to search for two main types of pattern and define the entity type to which each corresponds. Both types of pattern are defined as dictionaries containing a label, which holds the entity type, and a pattern, which describes what to look for.

Phrase patterns

EntityRuler phrase patterns look for exact string matches within a piece of text. For example, the entity rule below will look for the exact string Trump and assign it the label MORON.

{"label": "MORON", "pattern": "Trump"}

Token patterns

EntityRuler token patterns are similar, but describe a phrase as a list of dictionaries, one per token. For example, the rule below will look for the phrase “Donald Trump” (in any combination of upper and lower case) and assign it the label MORON.

{"label": "MORON", "pattern": [{"LOWER": "donald"}, {"LOWER": "trump"}]}

In this simple project I’ll go over the basics of how you can use the Spacy EntityRuler to create a custom named entity recognition pipeline that lets you extract the skills listed in job advertisements. Let’s get started.

Import the packages

To get started, open a Jupyter notebook and import Spacy. If you don’t have Spacy installed, you can install it via the Pip package management system by entering pip3 install spacy.

import spacy

Use Spacy load() to import a model

Spacy comes with various pre-trained Natural Language Processing models you can use in your projects. For basic projects you can use the en_core_web_sm model. If you want greater accuracy, you’re better off with the much larger en_core_web_trf model.

To download the model you first need to run the download command from your terminal (or prefix it with ! in a notebook cell), and then use the Spacy load() method to load it. We’ll use en_core_web_sm throughout this project.

python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

Use basic Named Entity Recognition

To understand why we need a custom Named Entity Recognition model, we’ll first use the standard NER model built into Spacy. We’ll create a variable containing the text we want to parse, based on an excerpt from a job advertisement, with the aim of extracting the skills required for the role.

We’ll pass the text variable to the Spacy nlp() method and return a variable called doc. We’ll then loop over the entities Spacy recognises and return a list of tuples containing the entity text and the entity label detected.

text = """
We are looking for a data scientist with knowledge of Python and MySQL. 
The role will involve working with Pandas, scikit-learn, and Spacy.
Knowledge of Tensor Flow would be advantageous.
"""
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
entities
[('Python', 'GPE'), ('Pandas', 'PERSON'), ('Spacy', 'PERSON')]

As you can see, Spacy matches a selection of the named entities in our document text, but it doesn’t get their context right: it thinks Python is a GPE (geopolitical entity) and that Pandas and Spacy are each a PERSON. Since we want to extract the job skills from the job ad, we’re going to need a custom named entity recognition model.

Create a Custom Named Entity Recognition pipeline

There are several ways to extract custom named entities with Spacy. The simplest is the EntityRuler, a pipeline component that lets you add custom entities to the model. You add a list of patterns to the EntityRuler, and the model will then extract matching entities from text.

EntityRuler is essentially a pattern matching algorithm. It’s pretty simple stuff, but it’s still very powerful and can be used to help make sense of massive datasets containing unstructured text data.

text = """
We are looking for a data scientist with knowledge of Python and MySQL. 
The role will involve working with Pandas, scikit-learn, and Spacy.
Knowledge of Tensor Flow would be advantageous.
"""

To use EntityRuler, we first need to load the Spacy model and then create a specially formatted list of rules to pass to the base model. We’ll call this skills, and we’ll assign each detected skill (e.g. Python, SQL, Pandas) a label called SKILL.

When looking for a match we’ll use the LOWER attribute, so each token is compared on its lowercase form and the match isn’t case-sensitive, and then we’ll additionally assign an id. You don’t always need to do this, but it can be useful for dealing with synonyms.

For example, different job ads might refer to scikit-learn as “sklearn”, “scikitlearn”, or “scikit learn”, so we can create a pattern to detect each version and assign them all a shared id so the data can be more easily grouped together.

nlp = spacy.load('en_core_web_sm')
skills = [
    {'label': 'SKILL', 'pattern': [{"LOWER": "python"}], 'id': 'python'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sql"}], 'id': 'sql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mysql"}], 'id': 'mysql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "pandas"}], 'id': 'pandas'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "spacy"}], 'id': 'spacy'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scikit"}, {"ORTH": "-"}, {"LOWER": "learn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scikit"}, {"LOWER": "learn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sklearn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "tensor"}, {"LOWER": "flow"}], 'id': 'tensorflow'},
]

Note that Spacy’s tokenizer splits hyphenated words, so scikit-learn becomes the three tokens scikit, -, and learn, and the pattern needs a dictionary for each one.

Next, we’ll use add_pipe() to add an entity_ruler with the before='ner' option, so our custom entities take precedence over the default named entities Spacy recognises. Then we’ll use add_patterns() to add our list of skill patterns to the model.

ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns(skills)

Finally, we can pass our text to the updated nlp() model and use our custom named entity recognition code to detect the job skills in the job ad. We’ll extract three values from each entity: ent.text containing the skill as it appears in the job ad (e.g. Python), ent.label_ containing the named entity label (e.g. SKILL), and ent.ent_id_ containing the ID of the skill, e.g. tensorflow.

doc = nlp(text)
entities = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
entities
[('Python', 'SKILL', 'python'),
 ('MySQL', 'SKILL', 'mysql'),
 ('Pandas', 'SKILL', 'pandas'),
 ('scikit-learn', 'SKILL', 'scikit-learn'),
 ('Spacy', 'SKILL', 'spacy'),
 ('Tensor Flow', 'SKILL', 'tensorflow')]
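As a brief illustration of why the shared id matters (a sketch using hypothetical job-ad snippets and a blank pipeline, not the model above), synonyms for the same skill can be tallied under a single key:

```python
from collections import Counter

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# "sklearn" and "scikit-learn" share an id, so their counts are grouped
ruler.add_patterns([
    {"label": "SKILL", "pattern": [{"LOWER": "sklearn"}], "id": "scikit-learn"},
    {"label": "SKILL", "pattern": [{"LOWER": "scikit"}, {"ORTH": "-"}, {"LOWER": "learn"}], "id": "scikit-learn"},
    {"label": "SKILL", "pattern": [{"LOWER": "python"}], "id": "python"},
])

# Hypothetical job-ad snippets naming the same library differently
ads = [
    "Experience with sklearn is essential.",
    "You will build models in scikit-learn and Python.",
]

# Tally entities by their shared id rather than their surface text
counts = Counter(
    ent.ent_id_
    for doc in nlp.pipe(ads)
    for ent in doc.ents
)
print(counts)
# Counter({'scikit-learn': 2, 'python': 1})
```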

Matt Clarke, Friday, December 02, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.