How to do custom Named Entity Recognition in Pandas using Spacy

Learn how to perform custom named entity recognition in Pandas with Spacy by analysing the skills employers are looking for in job advertisements.

Picture by This Is Engineering, Pexels.

As I showed in my previous tutorial on named entity recognition in Spacy, the EntityRuler lets you customise Spacy’s default NER pipeline and create your own named entities based on pattern matching.

In this tutorial I’ll show you how to create a custom NER model to extract the skills from data science jobs using a dataset I scraped from all the UK’s current data science roles.

We’ll import the dataset into Pandas, create a custom NER model to extract the skills from the job ads, use the named entities to clean the dataset, and then analyse the skills employers are currently looking for and the salaries they’re paying.

Import the packages

To get started, open a Jupyter notebook and import Spacy. If you don’t have Spacy installed, you can install it with the Pip package manager by entering pip3 install spacy. We’ll also use the Pandas set_option() method to increase the maximum number of rows displayed, so we can scan more data.

import pandas as pd
import spacy
pd.set_option('display.max_rows', 200)

Load the data

Next, we’ll load my data science jobs dataset. I created this using a custom web scraper written in Requests-HTML that scraped all the current data science-related jobs from the Reed jobs website. The dataset includes the key details for each job, including the job title, salary, location, and the job description itself.

df = pd.read_csv('https://raw.githubusercontent.com/flyandlure/datasets/master/data_science_jobs.csv')
df.sample(1).T
398
reference 41642346
title Data Engineering Tech Lead
date_posted 2020-12-27
date_ending 2021-02-07T23:55:00.0000000
advertiser Definitive Recruitment Ltd
location Middlesex
city Uxbridge
country GB
salary 75000.0
salary_min 75000.0
salary_max 80000.0
salary_frequency YEAR
salary_currency NaN
description Apply now\nThe role involves technical leaders...

Use Spacy load() to import a model

Next, we need to load a model into Spacy. There are various models you can use; I’m using the small English model, en_core_web_sm. To download it, first run the spacy download command from your terminal (or prefix it with ! in a notebook), then use the Spacy load() method to load it.

python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
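
If you’d rather stay inside the notebook, a minimal alternative sketch (assuming the model may not yet be installed) is to catch the OSError that spacy.load() raises for a missing model and download it with spacy.cli.download() before loading it.

try:
    # Load the small English model if it's already installed
    nlp = spacy.load('en_core_web_sm')
except OSError:
    # Download the model on first use, then load it
    spacy.cli.download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')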

Create EntityRuler pattern matching rules

Now we need to create the rules to pass to Spacy’s EntityRuler. These will power our custom named entity recognition system: they look for pattern matches in the job description, assign them the label SKILL, and allow us to extract the named entities.

To create the list of patterns to match, I scoured the job ads for skills mentioned in the posts until I stopped seeing much improvement in the number of skills extracted.

This likely doesn’t cover every skill mentioned in data science job ads, but it definitely catches the bulk of the technical ones. You can easily extend it to cover other areas, or soft skills, such as “stakeholder management”.

skills = [
    {'label': 'SKILL', 'pattern': [{"LOWER": "python"}], 'id': 'python'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "r"}], 'id': 'r'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sas"}], 'id': 'sas'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "java"}], 'id': 'java'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "c++"}], 'id': 'c++'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "c#"}], 'id': 'c#'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "c"}], 'id': 'c'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "javascript"}], 'id': 'javascript'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "html"}], 'id': 'html'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "css"}], 'id': 'css'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "php"}], 'id': 'php'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "ruby"}], 'id': 'ruby'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scala"}], 'id': 'scala'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "perl"}], 'id': 'perl'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "matlab"}], 'id': 'matlab'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "hadoop"}], 'id': 'hadoop'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "spark"}], 'id': 'spark'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "hive"}], 'id': 'hive'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "pig"}], 'id': 'pig'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "shark"}], 'id': 'shark'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "oozie"}], 'id': 'oozie'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "zookeeper"}], 'id': 'zookeeper'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "flume"}], 'id': 'flume'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mahout"}], 'id': 'mahout'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sqoop"}], 'id': 'sqoop'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "storm"}], 'id': 'storm'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "kafka"}], 'id': 'kafka'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "cassandra"}], 'id': 'cassandra'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mongodb"}], 'id': 'mongodb'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "redis"}], 'id': 'redis'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "elasticsearch"}], 'id': 'elasticsearch'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "neo4j"}], 'id': 'neo4j'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sql"}], 'id': 'sql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "nosql"}], 'id': 'nosql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "postgresql"}], 'id': 'postgresql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "oracle"}], 'id': 'oracle'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mysql"}], 'id': 'mysql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sqlite"}], 'id': 'sqlite'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mariadb"}], 'id': 'mariadb'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "mssql"}], 'id': 'mssql'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "db2"}], 'id': 'db2'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "pandas"}], 'id': 'pandas'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "spacy"}], 'id': 'spacy'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "nltk"}], 'id': 'nltk'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "gensim"}], 'id': 'gensim'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "huggingface"}], 'id': 'huggingface'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "transformers"}], 'id': 'transformers'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scikit-learn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scikit"}, {"LOWER": "learn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "sklearn"}], 'id': 'scikit-learn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "tensor"}, {"LOWER": "flow"}], 'id': 'tensorflow'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "tensorflow"}], 'id': 'tensorflow'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "keras"}], 'id': 'keras'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "pytorch"}], 'id': 'pytorch'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "numpy"}], 'id': 'numpy'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "scipy"}], 'id': 'scipy'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "matplotlib"}], 'id': 'matplotlib'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "seaborn"}], 'id': 'seaborn'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "plotly"}], 'id': 'plotly'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "bokeh"}], 'id': 'bokeh'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "d3"}], 'id': 'd3'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "airflow"}], 'id': 'airflow'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "docker"}], 'id': 'docker'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "kubernetes"}], 'id': 'kubernetes'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "aws"}], 'id': 'aws'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "amazon"}, {"LOWER": "web"}, {"LOWER": "services"}], 'id': 'aws'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "gcp"}], 'id': 'gcp'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "google"}, {"LOWER": "cloud"}, {"LOWER": "platform"}], 'id': 'gcp'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "azure"}], 'id': 'azure'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "machine learning"}], 'id': 'machine learning'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "ml"}], 'id': 'machine learning'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "deep"}, {"LOWER": "learning"}], 'id': 'deep learning'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dl"}], 'id': 'deep learning'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "natural"}, {"LOWER": "language"}, {"LOWER": "processing"}], 'id': 'nlp'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "nlp"}], 'id': 'nlp'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "computer"}, {"LOWER": "vision"}], 'id': 'computer vision'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "cv"}], 'id': 'computer vision'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "science"}], 'id': 'data science'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "ds"}], 'id': 'data science'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "analysis"}], 'id': 'data analysis'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "da"}], 'id': 'data analysis'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "visualisation"}], 'id': 'data visualisation'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "visualization"}], 'id': 'data visualization'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dv"}], 'id': 'data visualisation'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "mining"}], 'id': 'data mining'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dm"}], 'id': 'data mining'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "engineering"}], 'id': 'data engineering'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "de"}], 'id': 'data engineering'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "analytics"}], 'id': 'data analytics'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "da"}], 'id': 'data analytics'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "warehouse"}], 'id': 'data warehouse'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dw"}], 'id': 'data warehouse'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "pipelines"}], 'id': 'data pipelines'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dp"}], 'id': 'data pipelines'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "munging"}], 'id': 'data munging'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dm"}], 'id': 'data munging'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "preparation"}], 'id': 'data preparation'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dp"}], 'id': 'data preparation'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "wrangling"}], 'id': 'data wrangling'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dw"}], 'id': 'data wrangling'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "data"}, {"LOWER": "cleaning"}], 'id': 'data cleaning'},
    {'label': 'SKILL', 'pattern': [{"LOWER": "dc"}], 'id': 'data cleaning'},
]

Next, add an entity_ruler component to the pipeline with add_pipe(), using before='ner' so the ruler runs before Spacy’s statistical NER and its SKILL entities take precedence over the default named entities. The list of patterns is then added to the ruler using add_patterns().

ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns(skills)
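
Before running the ruler over the whole dataframe, it’s worth a quick sanity check on a single made-up sentence (the snippet below is just an example, not taken from the dataset) to confirm the SKILL entities and their IDs are being picked up.

doc = nlp("We're looking for someone with strong Python, SQL and scikit learn experience.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.ent_id_)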

Extract the skills to a new column

To run the custom named entity recognition model over our entire Pandas dataframe we can use the Pandas apply() method with a lambda function. Inside it, a list comprehension loops over each document’s entities and returns those where the label_ is SKILL.

The ent.text value contains the matched words, e.g. Python or MySQL, but we actually want the ent.ent_id_ value, since this normalises the variants so we don’t have to deal with “scikit-learn”, “scikit learn” and “sklearn” separately when analysing the results.

Finally, we can use another lambda function with set() to de-duplicate the values and return only the unique matches as a Python list. We get back a new column in our dataframe containing the extracted SKILL named entities.

df['skills'] = df['description'].apply(lambda x: [ent.ent_id_ for ent in nlp(x).ents if ent.label_ == 'SKILL'])
df['skills'] = df['skills'].apply(lambda x: list(set(x)))
df[['title', 'skills']].sample(5)
title skills
406 Data Science Consultant [sql, hadoop, python, scala, spark, data scien...
47 Senior Data Scientist - 5 Months FTC [python, spark, data science, computer vision,...
117 Data Scientist - Life Sciences [data science]
298 Backend Developer - Remote [sql, nlp, python, computer vision, data scien...
3 Data Science Consultant [data science, r, python]
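
Calling nlp() on each row via apply() is fine for a few hundred job ads, but it can be slow on bigger datasets. As an aside, here’s a minimal sketch that batches the documents through Spacy’s nlp.pipe() instead; it should produce the same skills column.

# Process the descriptions in batches with nlp.pipe() for speed
docs = nlp.pipe(df['description'])
df['skills'] = [
    list(set(ent.ent_id_ for ent in doc.ents if ent.label_ == 'SKILL'))
    for doc in docs
]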

Use the named entities to clean the dataset

Now we’ve analysed each job description and extracted the skill named entities to a separate column, we need to check the cleanliness of the data. Since recruiters often post job ads in the wrong categories, some jobs filed under “data science” might not truly be data science roles, so we can use the skills column to identify them.

df[['title', 'skills']].sort_values('skills', key=lambda x: x.str.len(), ascending=True).head(100)
title skills
251 Risk and Pricing Quant Lead []
186 Data Scientist InsurTech [data science]
35 Data Science and Analytics Specialist - Data S... [data science]
39 Data Scientist / Statistician (PhD level) [data science]
41 Lead Data Scientist Manager Reading, Halifax o... [data science]
340 UI Lead Designer / UX Lead Designer [data science]
381 Account Manager - FinTech [data science]
383 Graduate Recruitment Consultant [data science]
384 Data Quality Manager [data science]
312 Recruitment Consultant [data science]
387 Data Scientist [data science]
593 Principal UX Designer [data engineering]
66 Senior Data Scientist [data science]
268 Recruitment Consultant [data science]
72 Lead Data Scientist Manager Reading, Halifax o... [data science]
267 Credit Risk Manager - Credit Cards [data science]
260 Security Architect - Active SC clearance required [data science]
494 Sales Development Representative - SaaS Solutions [data engineering]
535 Project Systems Engineer [data engineering]
100 Sales Development Representative in Gaming [data science]
196 Trainee Recruitment Consultant - Gaming & Data [data science]
184 Recruitment Consultant - Twickenham [data science]
177 Senior Pricing Price Analyst [data science]
105 Technology Recruitment Consultant [data science]
107 Insights Consultancy Director / Applied Scienc... [data science]
113 Product Manager [data science]
117 Data Scientist - Life Sciences [data science]
159 Recruitment Consultant - Life Sciences [data science]
158 Recruitment Consultant - Risk & Quantitative F... [data science]
157 Product Manager - B2B Software [data science]
143 Account Manager [data science]
34 Data Science and Analytics Specialist - Data S... [data science]
33 Senior Data Science Recruiter - Team Leader Op... [data science]
140 Director - Customer Marketing Analytics & Stra... [data science]
378 Senior Digital Strategist [data science]
1 Data Science Recruiter [data science]
590 Data Platforms Lead [data engineering]
352 Enterprise Business Development Manager - Benelux [data science]
592 Business Analyst [data engineering]
376 Senior Data Strategist - CRM & Digital [data science]
27 Team Leader - Data Science Recruitment [data science]
364 Senior Product Manager [data science]
31 Business Development Manager - Data Science & ... [data science, c]
92 Data Scientist All Levels [data science, computer vision]
218 Software Engineer - Java [data science, java]
219 BDR Manager/Sales Manager [data science, data analytics]
32 Business Development Manager - Data Science & ... [data science, c]
208 Senior Data Scientist [sql, computer vision]
375 Commercial Finance Manager / FP&A Analyst SaaS... [data science, c]
247 Senior Data Analyst [data science, data mining]
15 Data Science Lead (Pricing) [data science, computer vision]
252 Agile Business Analyst [data science, c]
16 Data Science Lead (Pricing) [data science, computer vision]
87 Insight Analyst [data science, sql]
206 KDB Developer [data science, java]
374 Population Health Consultant Lead [sql, data science]
198 Director of Data [data science, computer vision]
11 Data Science Lead (Pricing) [data science, computer vision]
10 Data Science Lead (Pricing) [data science, computer vision]
591 JavaScript Developer - True Microservices/Node... [data engineering, python]
174 Sales Account Manager - Data & Analytics [data science, c]
7 Head of Data Science [data science, computer vision]
110 Senior Insight Analyst [sql, data science]
111 Senior Insight Analyst [sql, data science]
170 Senior Business Development Manager [data science, c]
168 Product Manager [data science, computer vision]
118 Senior Product Manager - B2C [data science, data analysis]
149 Senior Data Scientist [computer vision, python]
577 Marketing Solutions Senior IT Project Manager ... [data engineering, c]
203 Data Scientist [spark, data science]
361 Head of Analytics [data science, python]
536 Senior Data Analyst - Oxford [data engineering, data analytics]
279 Senior Recruitment Consultant: Technology [data science, computer vision]
345 Senior BI Analyst [data science, data analysis]
29 Resourcer - Data, Analytics & Data Science [data science, computer vision]
28 Product Manager - Data Science Routing [data science, computer vision]
336 Agile Business Analyst - Public Sector [data science, data visualisation]
26 Director of Data Science and Decision Science [data science, computer vision]
330 Data Engineer [data science, python]
19 Head of Data Science and Analytics [data science, computer vision]
320 Senior Agile Business Analyst - Public Sector [data science, data visualisation]
319 Agile Business Analyst - Public Sector [data science, data visualisation]
316 Senior Data Analytics Recruitment Consultant [data science, data analytics]
589 Mid Level JavaScript Developer -Microservices/... [data engineering, python]
385 Head of Strategic Environment [data science, data analysis]
22 Head of Data Science - FinTech [data science, computer vision]
355 Platform Technical Lead [data science, azure]
282 Infrastructure Engineer - Active DV Clearance ... [data science, data visualisation]
359 Population Health Consultant Lead [sql, data science]
67 Data Scientist/ Senior Data Scientist [data science, computer vision]
474 Head of Software Development [data engineering, azure]
294 Technical Business Analyst / Scrum Master - Te... [data science, data analysis]
388 Corporate Information Analyst [data science, r, sql]
367 Oracle DBA [data science, data visualisation, oracle]
413 Lead Data Engineer - SSIS, ETL, SQL [data engineering, sql, oracle]
453 Product Analyst [data engineering, sql, computer vision]
455 Senior Data Engineer [data engineering, kafka, data pipelines]
341 Junior Software Developer [data science, c, javascript]
401 Data Engineering Lead - Azure [data engineering, azure, sql]
457 Lead Data Engineer - SSIS, ETL, SQL [data engineering, sql, oracle]

For a start, we can drop the very few rows that don’t have any skills detected, since they’re unlikely to be data science roles. We can also eyeball the list above for other keywords that might denote non-data science roles, such as “recruiter” or “dba”.

# Drop rows with no skills
df = df[df['skills'].str.len() > 0]

We’ll create a list of keywords that appear in the non-data science roles and filter the dataframe to exclude rows whose job title contains one of those words. This cleans up a good chunk of the less relevant role titles that could have added noise to our data.

# Drop rows with one of the following strings in the job title
titles = ['recruit', 'resourcer', 'software engineer', 'health', 'product manager',
          'ux', 'sales', 'business development', 'account manager', 'risk manager', 'kdb',
          'business analyst', 'javascript', 'security', 'data quality', 'finance manager',
          'software development', 'project manager', 'dba', 'product analyst', 'pricing analyst',
          'crm', 'java', 'agile', 'product owner', 'it director', 'software developer']
df = df[~df['title'].str.contains('|'.join(titles), case=False)]

Re-querying the dataframe to show the rows with the fewest matched skills first shows that we’ve now removed most of the less relevant roles, leaving only the true data science positions behind. Of course, if you wanted to put this into production, you could fully automate this step by training a Naive Bayes text classification model to classify each role type; a rough sketch of that approach follows the output below.

df[['title', 'skills']].sort_values('skills', key=lambda x: x.str.len(), ascending=True).head(100)
title skills
140 Director - Customer Marketing Analytics & Stra... [data science]
66 Senior Data Scientist [data science]
387 Data Scientist [data science]
186 Data Scientist InsurTech [data science]
378 Senior Digital Strategist [data science]
590 Data Platforms Lead [data engineering]
107 Insights Consultancy Director / Applied Scienc... [data science]
39 Data Scientist / Statistician (PhD level) [data science]
35 Data Science and Analytics Specialist - Data S... [data science]
34 Data Science and Analytics Specialist - Data S... [data science]
535 Project Systems Engineer [data engineering]
117 Data Scientist - Life Sciences [data science]
177 Senior Pricing Price Analyst [data science]
41 Lead Data Scientist Manager Reading, Halifax o... [data science]
72 Lead Data Scientist Manager Reading, Halifax o... [data science]
67 Data Scientist/ Senior Data Scientist [data science, computer vision]
247 Senior Data Analyst [data science, data mining]
208 Senior Data Scientist [sql, computer vision]
87 Insight Analyst [data science, sql]
203 Data Scientist [spark, data science]
149 Senior Data Scientist [computer vision, python]
92 Data Scientist All Levels [data science, computer vision]
385 Head of Strategic Environment [data science, data analysis]
198 Director of Data [data science, computer vision]
536 Senior Data Analyst - Oxford [data engineering, data analytics]
15 Data Science Lead (Pricing) [data science, computer vision]
345 Senior BI Analyst [data science, data analysis]
330 Data Engineer [data science, python]
7 Head of Data Science [data science, computer vision]
282 Infrastructure Engineer - Active DV Clearance ... [data science, data visualisation]
16 Data Science Lead (Pricing) [data science, computer vision]
22 Head of Data Science - FinTech [data science, computer vision]
355 Platform Technical Lead [data science, azure]
361 Head of Analytics [data science, python]
11 Data Science Lead (Pricing) [data science, computer vision]
111 Senior Insight Analyst [sql, data science]
10 Data Science Lead (Pricing) [data science, computer vision]
26 Director of Data Science and Decision Science [data science, computer vision]
19 Head of Data Science and Analytics [data science, computer vision]
110 Senior Insight Analyst [sql, data science]
560 Data and Analytics Delivery Manager [data engineering, data warehouse, data analysis]
209 Data Engineer [data science, java, python]
214 Senior Research Scientist [machine learning, data science, computer vision]
450 Data Engineer Azure, SQL Nottingham Circa £60k [data engineering, azure, sql]
172 Data Scientist [data science, nosql, python]
79 Junior Data Scientist [data science, python, sql]
401 Data Engineering Lead - Azure [data engineering, azure, sql]
74 Junior Data Scientist [data science, python, sql]
215 Senior Research Scientist [machine learning, data science, computer vision]
123 Lead Data Scientist [machine learning, data science, computer vision]
73 Data Scientist [data science, python, sql]
114 Customer Analytics Manager Remote [data science, python, sql]
413 Lead Data Engineer - SSIS, ETL, SQL [data engineering, sql, oracle]
414 Clojure Developer Remote [data engineering, javascript, aws]
423 Mid Level Data Engineer [data engineering, sql, python]
457 Lead Data Engineer - SSIS, ETL, SQL [data engineering, sql, oracle]
152 Senior Data Scientist [data science, r, python]
104 Senior Data Scientist [data science, python, sql]
125 Customer Analytics Manager Remote [data science, python, sql]
572 Snowflake Data Engineer [data engineering, sql, data science]
531 Data Architect - Remote [data engineering, data science, aws]
88 Data Scientist Investment [machine learning, data science, computer vision]
455 Senior Data Engineer [data engineering, kafka, data pipelines]
343 Senior Digital Media Insight Analyst [data science, data visualisation, sql]
58 Data & Decision Science Director [data science, c, computer vision]
388 Corporate Information Analyst [data science, r, sql]
486 Cloud Data Developer [data engineering, sql, azure]
497 Digital Analytics Lead - Fully Remote Working [data engineering, data science, computer vision]
3 Data Science Consultant [data science, r, python]
65 Junior Python Developer [data science, python, computer vision]
4 Data Science Manager [data science, data analytics, computer vision]
332 Technical Pricing and Data Analyst London Market [sql, data science, data visualisation]
500 SQL Consultant [data engineering, sql, azure]
14 Lead Data Science Consultant [machine learning, data science, computer vision]
180 Senior Data Scientist - Quant [data science, r, python, sql]
307 Head of Engineering. Global Startup. Big Data [spark, data science, hadoop, airflow]
303 Scala Data Engineer [aws, java, data science, scala]
179 Senior Data Scientist [data science, python, computer vision, sql]
310 Bioinformatician [data science, r, python, aws]
311 Computational Biologist [data science, r, python, aws]
302 Senior C++ Developer - 12 Mth FTC [c++, c, data science, python]
176 C#/.NET Developer [data science, c, azure, javascript]
317 BI Analyst [sql, data science, data visualisation, comput...
228 Senior Research Data Scientist [data science, python, computer vision, sql]
325 Data Visualisation Analyst [data science, data visualisation, computer vi...
301 Senior C++ Developer 12 month FTC [c++, c, data science, python]
326 Senior Media Effectiveness Analyst [sql, data visualisation, data science, python]
434 Data Engineer [data engineering, sql, data pipelines, comput...
438 Data Engineer [data engineering, sql, data science, azure]
444 Data Engineer [data engineering, sql, data pipelines, comput...
146 Senior Analyst - Strategy £40-55k + Bens [data science, r, python, sql]
145 Market Research Data Scientist [data science, r, python, sql]
144 Data Scientist - Gaming/ Betting [data science, azure, python, computer vision]
142 Senior DevOps Engineer [data science, aws, python, computer vision]
324 Lead Developer, Python, Data Scientist, AI, CN... [computer vision, pytorch, data science, python]
353 Senior Analyst - Marketing Effectiveness / Eco... [data science, r, python, sql]
274 Research Data Scientist [data science, python, computer vision, sql]
377 Data Scientist - Renewable Energy / Financial ... [sql, data science, r, python]
225 Front End Developer [data science, css, javascript, azure]
224 Head of Data / Data Lead - Well-funded Scaleup [data science, r, python, sql]
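
As mentioned above, a more robust way to automate this cleaning step would be to train a small text classifier on the job titles. The sketch below is purely illustrative: the hand-labelled examples are hypothetical, and it assumes scikit-learn’s CountVectorizer and MultinomialNB as a simple bag-of-words Naive Bayes pipeline.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled examples: 1 = genuine data role, 0 = not
labelled_titles = ['Senior Data Scientist', 'Data Engineer',
                   'Recruitment Consultant', 'Account Manager']
labels = [1, 1, 0, 0]

# Bag-of-words features feeding a Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(labelled_titles, labels)

# Flag which of the scraped job titles look like genuine data roles
df['is_data_role'] = model.predict(df['title'])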

Remove outliers

Before we analyse the data, we need to check for potential outliers caused by job posters entering the wrong salary for their positions. We can do this easily using the Pandas head() and tail() functions after sorting the dataframe by salary. This shows that one row needs to be corrected because it’s missing a zero: the salary for the role should actually be £100,000, not £10,000.

df[['title', 'salary']].sort_values('salary', ascending=True).head()
title salary
132 Senior Customer Insight Analytics Manager 10000.0
362 Graduate Analytics Positions 20000.0
164 Data Analyst Apprentice 22000.0
79 Junior Data Scientist 22000.0
357 Junior Developer 23000.0
df[['title', 'salary']].sort_values('salary', ascending=False).head()
title salary
26 Director of Data Science and Decision Science 150000.0
6 Director of Data Science 150000.0
68 Director of Data 150000.0
394 Director of Data Engineering 150000.0
58 Data & Decision Science Director 150000.0
df.loc[132, 'salary'] = 100000
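
If you want a more systematic check than eyeballing head() and tail(), a short sketch like the one below flags salaries falling outside 1.5 times the interquartile range for manual review; the 1.5 multiplier is just the usual rule of thumb, not something the dataset dictates.

# Flag salaries more than 1.5 x the interquartile range beyond the quartiles
q1, q3 = df['salary'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['salary'] < q1 - 1.5 * iqr) | (df['salary'] > q3 + 1.5 * iqr)]
outliers[['title', 'salary']]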

Analyse the distribution of named entities

Our dataset now contains a column called skills that holds a list of the unique named entities recognised by our custom named entity recognition system. To analyse these skills individually we can use the Pandas explode() function to split each list over multiple rows, giving one row per skill, and then return just the skill and the salary for each one.

df_skills = df.explode('skills')
df_skills[['skills', 'salary']].sample(5)
skills salary
166 r 25000.0
547 aws 45000.0
506 scala 75000.0
563 azure 45000.0
293 deep learning 80000.0

To understand which skills are most sought after by UK data science employers we can use the Pandas groupby() function with agg(). We’ll group the data by individual skill and calculate the number of roles in which it appears, plus the minimum, maximum, mean, and median salary for each skill.

Roles that mention “data science” have an average salary of £58,828 and a median of £55,000, compared with those that mention “data engineering”, which carry an average salary of £60,221 and a median of £60,000.

There’s not a huge difference between the salaries of data scientists and data engineers in the UK, and demand for data scientists is higher than that for data engineers. Similarly, almost three times as many roles mention Python as mention R, with salaries for Python roles being higher on average.

df_summary = df_skills.groupby('skills').agg(
    roles=('title', 'count'),
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).sort_values('roles', ascending=False)
df_summary
roles min_salary max_salary avg_salary median_salary
skills
data science 348 20000.0 150000.0 58828.077586 55000.0
python 344 20000.0 150000.0 58382.267442 55000.0
sql 270 22000.0 150000.0 56075.425926 50000.0
computer vision 212 20000.0 150000.0 64617.924528 60000.0
data engineering 173 25000.0 150000.0 60221.410405 60000.0
aws 151 23000.0 150000.0 65152.317881 65000.0
r 117 20000.0 150000.0 51152.692308 45000.0
azure 82 23000.0 125000.0 53826.268293 50000.0
spark 81 30000.0 125000.0 68774.740741 65000.0
gcp 54 25000.0 150000.0 63777.777778 65000.0
data warehouse 51 28000.0 85000.0 57254.901961 55000.0
data pipelines 45 34000.0 110000.0 58851.200000 60000.0
kafka 44 30000.0 125000.0 74035.227273 75000.0
data analytics 44 22000.0 120000.0 54500.000000 50000.0
machine learning 42 32000.0 110000.0 62309.523810 60000.0
java 40 23000.0 110000.0 59600.000000 57500.0
docker 38 25000.0 125000.0 67315.789474 70000.0
kubernetes 36 25000.0 125000.0 70416.666667 70000.0
data analysis 36 22000.0 90000.0 45803.222222 45000.0
airflow 36 50000.0 110000.0 68611.111111 70000.0
deep learning 35 32000.0 110000.0 66942.857143 65000.0
scala 33 40000.0 110000.0 65757.575758 60000.0
data visualisation 32 25000.0 100000.0 42984.375000 45000.0
nlp 28 32000.0 85000.0 55428.571429 55000.0
hadoop 26 40000.0 95000.0 64230.769231 60000.0
c 23 22000.0 150000.0 62434.782609 60000.0
sas 20 27000.0 70000.0 45200.000000 40000.0
javascript 17 23000.0 75000.0 51000.000000 50000.0
nosql 17 23000.0 85000.0 55941.176471 55000.0
tensorflow 17 40000.0 100000.0 68235.294118 70000.0
mysql 15 25000.0 75000.0 57000.000000 60000.0
postgresql 15 25000.0 75000.0 59866.666667 70000.0
c++ 13 20000.0 85000.0 52384.615385 50000.0
elasticsearch 12 45000.0 75000.0 69166.666667 75000.0
matlab 10 25000.0 70000.0 43900.000000 42500.0
numpy 10 35000.0 70000.0 56000.000000 57500.0
pytorch 10 45000.0 90000.0 71500.000000 70000.0
data mining 8 45000.0 80000.0 60000.000000 60000.0
pandas 8 35000.0 70000.0 50000.000000 47500.0
keras 6 40000.0 90000.0 64166.666667 70000.0
mongodb 6 55000.0 100000.0 70000.000000 65000.0
storm 6 50000.0 75000.0 56666.666667 52500.0
hive 6 45000.0 70000.0 58333.333333 60000.0
css 5 23000.0 50000.0 35600.000000 35000.0
oracle 5 25000.0 45000.0 37600.000000 38000.0
scipy 3 40000.0 70000.0 58333.333333 65000.0
data visualization 3 22000.0 40000.0 31333.333333 32000.0
php 3 28000.0 60000.0 40666.666667 34000.0
plotly 3 35000.0 70000.0 56666.666667 65000.0
neo4j 3 70000.0 75000.0 73333.333333 75000.0
nltk 2 50000.0 50000.0 50000.000000 50000.0
mssql 2 45000.0 45000.0 45000.000000 45000.0
html 2 23000.0 50000.0 36500.000000 36500.0
data wrangling 1 50000.0 50000.0 50000.000000 50000.0
data preparation 1 65000.0 65000.0 65000.000000 65000.0
ruby 1 80000.0 80000.0 80000.000000 80000.0
db2 1 25000.0 25000.0 25000.000000 25000.0
scikit-learn 1 45000.0 45000.0 45000.000000 45000.0
seaborn 1 65000.0 65000.0 65000.000000 65000.0
cassandra 1 60000.0 60000.0 60000.000000 60000.0
matplotlib 1 45000.0 45000.0 45000.000000 45000.0
flume 1 50000.0 50000.0 50000.000000 50000.0
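
Since skills mentioned in only a handful of ads can skew the averages, you might also want to rank skills by pay rather than demand. The snippet below is a rough sketch that assumes an arbitrary cut-off of at least 10 roles per skill.

# Rank the better-represented skills by median salary
df_summary[df_summary['roles'] >= 10].sort_values('median_salary', ascending=False).head(10)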

Create box plots for each skill

To understand the distribution of salaries for each skill identified by our custom named entity recognition system we can use Seaborn and Matplotlib box plots. First, we’ll examine the overall distribution of salaries across the whole dataset.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.boxplot(y=df['salary'])
plt.title('Distribution of Salaries')
plt.show()

[Box plot: distribution of salaries across the whole dataset]

Data science roles in the UK range from £20,000 at the very bottom to £150,000 at the top for director-level roles. The overall mean salary is £58,941 and the median is £55,000. This varies according to the nature of the role (i.e. data science versus data engineering), the skills required, and the seniority of the position.

df['salary'].min()
20000.0
df['salary'].max()
150000.0
df['salary'].mean()
58941.05995717345
df['salary'].median()
55000.0
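
As an aside, the Pandas describe() method returns the same summary statistics, plus the quartiles, in a single call.

df['salary'].describe()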

Visualising the salary by skill

Next, we’ll use a Seaborn boxplot to visualise the salary range for each individual skill. Boxplots, or box and whisker plots, show a coloured box representing the interquartile range, a middle line representing the median value, and whiskers representing the rest of the distribution, typically extending up to 1.5 times the interquartile range beyond the box. Outliers are represented as dots.

plt.figure(figsize=(10, 40))
sns.boxplot(y='skills', x='salary', data=df_skills, orient='h')
plt.title('Distribution of Salaries by Skill')
plt.show()

[Box plots: distribution of salaries by skill]

Broad areas

Broadly speaking, the data suggests that median salaries for data engineering roles tend to be a bit higher than those for data science roles, although both areas top out at £150K for director-level positions.

broad_skills = ['data science', 'data engineering', 'data visualisation',
                'data analytics', 'data warehouse', 'data pipelines', 'data munging',
                'data preparation', 'data wrangling', 'data cleaning']

df_subset = df_skills[df_skills['skills'].isin(broad_skills)]
df_summary = df_subset.groupby('skills').agg(
    roles=('title', 'count'),
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).sort_values(['roles', 'median_salary'], ascending=False)
df_summary.head(10)
roles min_salary max_salary avg_salary median_salary
skills
data science 348 20000.0 150000.0 58828.077586 55000.0
data engineering 173 25000.0 150000.0 60221.410405 60000.0
data warehouse 51 28000.0 85000.0 57254.901961 55000.0
data pipelines 45 34000.0 110000.0 58851.200000 60000.0
data analytics 44 22000.0 120000.0 54500.000000 50000.0
data visualisation 32 25000.0 100000.0 42984.375000 45000.0
data preparation 1 65000.0 65000.0 65000.000000 65000.0
data wrangling 1 50000.0 50000.0 50000.000000 50000.0
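
To see the spread as well as the averages for these broad areas, you can reuse the same Seaborn box plot approach on just this subset; the sketch below assumes the df_subset dataframe created above.

# Box plot of salary distributions for the broad skill areas only
plt.figure(figsize=(10, 6))
sns.boxplot(y='skills', x='salary', data=df_subset, orient='h')
plt.title('Distribution of Salaries by Broad Skill Area')
plt.show()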

Data engineering skills

Looking in detail at some of the data engineering-specific skills, we can see that they tend to have higher average salaries than many of the purely data science skills. Average salaries for most data engineering skills range from roughly £53K to £74K.

data_engineering_skills = ['spark', 'scala', 'kafka', 'kubernetes',
                           'docker', 'aws', 'gcp', 'azure', 'nosql', 'hadoop', 'hive',
                           'pig', 'flume', 'airflow', 'etl', 'data lake',
                           'data warehouse', 'data pipelines']

df_subset = df_skills[df_skills['skills'].isin(data_engineering_skills)]
df_summary = df_subset.groupby('skills').agg(
    roles=('title', 'count'),
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).sort_values(['roles', 'median_salary'], ascending=False)
df_summary.head(10)
roles min_salary max_salary avg_salary median_salary
skills
aws 151 23000.0 150000.0 65152.317881 65000.0
azure 82 23000.0 125000.0 53826.268293 50000.0
spark 81 30000.0 125000.0 68774.740741 65000.0
gcp 54 25000.0 150000.0 63777.777778 65000.0
data warehouse 51 28000.0 85000.0 57254.901961 55000.0
data pipelines 45 34000.0 110000.0 58851.200000 60000.0
kafka 44 30000.0 125000.0 74035.227273 75000.0
docker 38 25000.0 125000.0 67315.789474 70000.0
airflow 36 50000.0 110000.0 68611.111111 70000.0
kubernetes 36 25000.0 125000.0 70416.666667 70000.0

Data science skills

There are more data science roles available than data engineering roles, and the spread of salaries is broader. The average salaries per skill in this subset range from around £43K for data visualisation up to £71,500 for PyTorch, and other deep learning-specific skills also tend to carry higher salaries.

data_science_skills = ['python', 'r', 'machine learning', 'deep learning',
    'neural networks', 'tensorflow', 'pytorch', 'scikit-learn',
    'pandas', 'numpy', 'scipy', 'matplotlib', 'seaborn', 'plotly',
    'bokeh', 'd3', 'jupyter', 'jupyterlab', 'rstudio', 'tableau',
    'power bi', 'data visualisation', 'data analytics', 'data munging',
    'data preparation', 'data wrangling', 'data cleaning']

df_subset = df_skills[df_skills['skills'].isin(data_science_skills)]
df_summary = df_subset.groupby('skills').agg(
    roles=('title', 'count'),
    min_salary=('salary', 'min'),
    max_salary=('salary', 'max'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
).sort_values(['roles', 'median_salary'], ascending=False)
df_summary.head(10)
roles min_salary max_salary avg_salary median_salary
skills
python 344 20000.0 150000.0 58382.267442 55000.0
r 117 20000.0 150000.0 51152.692308 45000.0
data analytics 44 22000.0 120000.0 54500.000000 50000.0
machine learning 42 32000.0 110000.0 62309.523810 60000.0
deep learning 35 32000.0 110000.0 66942.857143 65000.0
data visualisation 32 25000.0 100000.0 42984.375000 45000.0
tensorflow 17 40000.0 100000.0 68235.294118 70000.0
pytorch 10 45000.0 90000.0 71500.000000 70000.0
numpy 10 35000.0 70000.0 56000.000000 57500.0
pandas 8 35000.0 70000.0 50000.000000 47500.0

Matt Clarke, Friday, December 02, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.