How to create a UK data science jobs dataset

Want to analyse the data science and data engineering job market? Here's a quick guide to building a dataset of jobs and job requirements using Python.


According to the Harvard Business Review, data scientist is “the sexiest job of the 21st century”. Data science and data engineering skills are in greater demand than ever before, and the relatively small number of candidates on the market with the right skills and experience can command six-figure salaries.

To better understand the market, see what employers are seeking in candidates, and identify the skills most in demand and those associated with the best salaries, you’ll need a dataset to analyse. In this project, I’ll show you step-by-step how to build a dataset covering all the UK’s currently advertised data science and data engineering roles.

Load the packages

First, open a Jupyter notebook and import the pandas, math, requests, requests_html and extruct packages. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal.

import pandas as pd
import math
import requests
from requests_html import HTML
from requests_html import HTMLSession
import extruct
pd.set_option('display.max_rows', 1000)

Create the scraper

Next we’ll create a series of functions that we can use within our scraper. This modularises the code and makes it easier to explain and read, and allows us to re-use specific parts in other projects.

The first step is to fetch the source code of the target page, for which we’ll use the Requests-HTML package. This combines Requests with lxml-based HTML parsing and makes web scraping quick and easy.

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)

Next, some Reed-specific code is required to fetch the job ad links from each page of Reed search results. I used XPath to find the href attribute of every link with the CSS class gtmJobTitleClickResponsive and then prepended the domain to create absolute URLs.

def get_links(response):
    """Return a list containing the absolute URLs of each job ad from the Reed search results.

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        prefixed_links (list): List of absolute URLs of job ads.
    """

    links = response.html.xpath("//a[@class='gtmJobTitleClickResponsive']/@href")
    prefixed_links = [] 
    for link in links:
        prefixed_links.append('https://www.reed.co.uk' + link)
    return prefixed_links  
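Prepending the domain works here because Reed’s hrefs are root-relative. If you adapt this scraper for a site with mixed link styles, urljoin from the standard library is a safer way to build absolute URLs (a sketch, not part of the original scraper; the example paths are invented):

```python
from urllib.parse import urljoin

base = 'https://www.reed.co.uk/jobs/data-science-jobs'

# urljoin resolves a href against the base URL, whether the href
# is root-relative or already absolute
print(urljoin(base, '/jobs/data-scientist/12345678'))
# https://www.reed.co.uk/jobs/data-scientist/12345678

print(urljoin(base, 'https://example.com/elsewhere'))
# https://example.com/elsewhere
```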

To paginate through the multiple pages of search results, I extracted the count of results stored in an element with the CSS class .count. I then stripped the comma, cast the value to an integer, divided it by 25 (the number of results per page), and rounded up, which gives the total number of pages in the result set.

def get_total_pages(response):
    """Return the total number of pages of results found. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        pages (int): Integer containing the number of pages of search results.
    """

    total_results = int(response.html.find('.count', first=True).text.replace(',', ''))
    return math.ceil(total_results / 25)
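The arithmetic is simple enough to verify by hand: with 25 results per page, math.ceil() rounds any partial final page up to a whole page (pages_for below is a hypothetical helper to illustrate, not part of the scraper):

```python
import math

def pages_for(total_results, per_page=25):
    # Divide the result count by the page size and round up,
    # so a partially filled final page still counts as a page
    return math.ceil(total_results / per_page)

print(pages_for(781))  # 32 pages
print(pages_for(25))   # 1 page: exactly one full page
print(pages_for(26))   # 2 pages: one result spills onto a second page
```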

To create a crawl list containing the link to every job ad for our target search, I created a function called get_all_links() which utilises the functions above. We pass it a search results URL and it parses the source, determines the number of pages, loops through them, and returns a list of all the job ad URLs.

def get_all_links(url):
    """Return the crawl list of URLs to scrape for a given search term. 

    Args:
        url (string): URL of job ad search on Reed, 
        i.e. https://www.reed.co.uk/jobs/data-science-jobs?perm=True&fulltime=True

    Returns: 
        all_links (list): List of all URLs of job ads from each page of results.
    """

    all_links = []
    response = get_source(url)
    total_pages = get_total_pages(response)

    page = 1
    while page <= total_pages:    
        paginated_url = url + "&pageno=" + str(page)
        response = get_source(paginated_url)
        all_links = all_links + get_links(response)
        page = page+1

    return all_links

Next I created a function called get_metadata() which uses Extruct to extract any schema.org metadata present in the page, irrespective of whether it’s in JSON-LD, microdata, or OpenGraph format. This returns a big nested dictionary of metadata for us to parse.

def get_metadata(response):
    """Return a list of dictionaries containing schema.org metadata. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        metadata (list): List of dictionaries of schema.org metadata. 
    """

    metadata = extruct.extract(response.text, 
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata

To extract specific elements from the nested dictionary of metadata, I created a function called get_dictionary_by_key_value(). This takes the dictionary of metadata, the target schema.org key, and the target value, and returns the specific metadata item if present.

def get_dictionary_by_key_value(dictionary, target_key, target_value):
    """Return a dictionary that contains a target key value pair. 

    Args:
        dictionary: Metadata dictionary containing lists of other dictionaries.
        target_key: Target key to search for within a dictionary inside a list. 
        target_value: Target value to search for within a dictionary inside a list. 

    Returns:
        target_dictionary: Target dictionary that contains target key value pair. 
    """

    for key in dictionary:
        if len(dictionary[key]) > 0:
            for item in dictionary[key]:
                # Use .get() so items without the target key don't raise a KeyError
                if item.get(target_key) == target_value:
                    return item
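To see how the traversal works, here’s the same logic run against a miniature, invented metadata structure (the recruiter name and values are made up for illustration):

```python
# A miniature, invented version of the structure extruct returns:
# a dict keyed by syntax, each holding a list of schema.org items
metadata = {
    'json-ld': [
        {'@type': 'Organization', 'name': 'Example Recruiter Ltd'},
        {'@type': 'MonetaryAmount', 'currency': 'GBP'},
    ],
    'microdata': [],
}

def get_dictionary_by_key_value(dictionary, target_key, target_value):
    """Return the first inner dictionary with the target key/value pair."""
    for key in dictionary:
        for item in dictionary[key]:
            if item.get(target_key) == target_value:
                return item

print(get_dictionary_by_key_value(metadata, '@type', 'MonetaryAmount'))
# {'@type': 'MonetaryAmount', 'currency': 'GBP'}

print(get_dictionary_by_key_value(metadata, '@type', 'Place'))
# None: there is no Place item in this metadata
```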

Next, I extracted the page title containing the job role name, i.e. Data Scientist, and the job advertisement’s unique reference number.

def get_title(response):
    """Get the title of the job ad. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        title (string): Title of job ad, i.e. Data Scientist.

    """

    return response.html.find('title', first=True).text.replace(' - reed.co.uk', '')

def get_reference(response):
    """Get the reference number of the job ad. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        reference (int): Unique reference number for the ad. 

    """

    return int(response.html.find('.reference', first=True).text.replace('Reference: ', ''))

I also found the date the job was added and the date it expires tucked away in the page’s meta tags, so I used XPath to extract them.

def get_date_posted(response):
    """Get the date the job ad was posted. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        date_posted (string): The date the ad was posted. 

    """ 

    date_posted = response.html.xpath('//meta[@itemprop="datePosted"]/@content')
    return date_posted[0]

def get_date_ending(response):
    """Get the date the job ad expires. 

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        date_ending (string): The date the ad expires. 

    """ 

    date_ending = response.html.xpath('//meta[@itemprop="validThrough"]/@content')
    return date_ending[0]
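The dates come back as plain strings like 2021-03-09T23:55:00.0000000. If you later want proper datetime columns for analysis, pandas can parse this format, seven-digit fractional seconds included (a sketch, not part of the original pipeline):

```python
import pandas as pd

# pd.to_datetime copes with the seven-digit fractional seconds
# Reed uses in its validThrough timestamps
ending = pd.to_datetime('2021-03-09T23:55:00.0000000')
print(ending.date())  # 2021-03-09
```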

Finally, I extracted the div that contains the main text of the job advertisement.

def get_description(response):
    """Get the date the job ad description.

    Args:
        response (object): Response object from Requests-HTML containing source. 

    Returns:
        description (string): The main ad text. 

    """ 
    if response.html.find('.description', first=True):
        return response.html.find('.description', first=True).text
    else:
        return ''

To make this cleaner to run, I created a function called get_job() which takes a URL, scrapes the HTML, parses the metadata and page source, and returns a dictionary of neatly formatted data.

def get_job(url):
    """Return a dictionary containing the data scraped from the job listing on Reed.

    Args:
        url (string): URL of the job ad page to scrape. 

    Returns: 
        job (dictionary): Dictionary containing scraped data. 
    """

    response = get_source(url)
    metadata = get_metadata(response)
    description = get_description(response)
    title = get_title(response)
    reference = get_reference(response)
    date_posted = get_date_posted(response)
    date_ending = get_date_ending(response)

    # Fall back to an empty dict when a metadata block is missing,
    # so the .get() chains below don't raise an AttributeError
    categories = get_dictionary_by_key_value(metadata, "@type", "BreadcrumbList") or {}
    salary = get_dictionary_by_key_value(metadata, "@type", "MonetaryAmount") or {}
    advertiser = get_dictionary_by_key_value(metadata, "@type", "Organization") or {}
    location = get_dictionary_by_key_value(metadata, "@type", "Place") or {}

    job = {
        'reference': reference,
        'title': title,
        'date_posted': date_posted,
        'date_ending': date_ending,
        'advertiser': advertiser.get('name', ''),
        'location': location.get('address', {}).get('addressRegion', ''),
        'city': location.get('address', {}).get('addressLocality', ''),
        'country': location.get('address', {}).get('addressCountry', ''),
        'salary': float(salary.get('value', {}).get('value', '0.00')),
        'salary_min': float(salary.get('value', {}).get('minValue', '0.00')),
        'salary_max': float(salary.get('value', {}).get('maxValue', '0.00')),
        'salary_frequency': salary.get('value', {}).get('unitText', ''),
        'salary_currency': salary.get('value', {}).get('currency', ''),
        'description': description,
    }

    return job    
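The chained .get() calls with empty-dict defaults are what let get_job() cope with ads that are missing a level of metadata. In miniature (toy data, not from a real ad):

```python
# With a {} default at each level, a missing key yields '' rather
# than an AttributeError part-way down the chain
location = {'address': {'addressLocality': 'London'}}
print(location.get('address', {}).get('addressLocality', ''))  # London

location = {}
print(location.get('address', {}).get('addressLocality', ''))  # '' (no error)
```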

The final step, which brings everything above together, is the scrape_jobs() function. This takes a single URL for a set of Reed search results and then scrapes every single job ad present and returns the output in a Pandas dataframe.

def scrape_jobs(url):
    """Run a search on Reed, scrape all the jobs, and return a Pandas dataframe. 

    Args:
        url (string): URL of the job ad page to scrape, 
        i.e. https://www.reed.co.uk/jobs/data-science-jobs?perm=True&fulltime=True

    Returns: 
        df (dataframe): Pandas dataframe of job ads. 
    """

    links = get_all_links(url)

    # DataFrame.append() was removed in pandas 2.0, so collect each
    # job's dictionary in a list and build the dataframe in one step
    jobs = [get_job(link) for link in links]

    df = pd.DataFrame(jobs, columns=['reference', 'title', 'date_posted', 'date_ending',
                                     'advertiser', 'location', 'city', 'country',
                                     'salary', 'salary_min', 'salary_max', 'salary_frequency',
                                     'salary_currency', 'description'])

    return df

Run the scraper

To run the scraper, all we need to do is head over to Reed.co.uk, run a search, select our filters, and then paste the search URL into the argument of the scrape_jobs() function.

I’ve run two searches, one for “data science” and one for “data engineering”, with the filters set to “permanent” and “full-time”. Some roles may appear in both datasets, but we’ll de-dupe these in a later step.

url = "https://www.reed.co.uk/jobs?keywords=%22data%20science%22&perm=True&fulltime=True"
df_data_science = scrape_jobs(url)
df_data_science.to_csv('data-science-jobs.csv')
df_data_science.head()
reference title date_posted date_ending advertiser location city country salary salary_min salary_max salary_frequency salary_currency description
0 41642281 Data Science Consultant 2020-12-27 2021-02-07T23:55:00.0000000 Harnham South East England London GB 40000.0 40000.0 80000.0 YEAR Apply now\nSENIOR DATA SCIENCE CONSULTANT\nUP ...
1 41857664 Data Science Manager 2021-01-26 2021-03-09T23:55:00.0000000 Charles Simon Associates Ltd London Camden GB 70000.0 70000.0 80000.0 YEAR Apply now\nData Science Manager (Data, Python,...
2 41924233 Data Science Recruiter 2021-02-03 2021-03-03T23:55:00.0000000 Crone Corkill South East England London GB 20000.0 20000.0 26000.0 YEAR Apply now\nData Science Recruiter - £20,000-£2...
3 41658856 Data Science Consultant 2021-01-03 2021-02-14T23:55:00.0000000 Harnham South East England London GB 95000.0 95000.0 100000.0 YEAR Apply now\nData Science Consultant\nLondon\n£9...
4 41752222 Data Science Lead 2021-01-14 2021-02-25T23:55:00.0000000 Harnham South East England London GB 75000.0 75000.0 80000.0 YEAR Apply now\nDATA SCIENCE LEAD\nUP TO £80,000 + ...
url = "https://www.reed.co.uk/jobs?keywords=%22data%20engineering%22&perm=True&fulltime=True"
df_data_engineering = scrape_jobs(url)
df_data_engineering.to_csv('data-engineering-jobs.csv')
df_data_engineering.head()
reference title date_posted date_ending advertiser location city country salary salary_min salary_max salary_frequency salary_currency description
0 41889206 Data Engineering Manager 2021-01-29 2021-03-12T23:55:00.0000000 Noir Nottinghamshire Nottingham GB 75000.0 75000.0 85000.0 YEAR Apply now\nData Engineering Manager\n(Tech sta...
1 41775666 Data Engineering Manager 2021-01-18 2021-03-01T23:55:00.0000000 Harnham South East England London GB 100000.0 100000.0 105000.0 YEAR Apply now\nDATA ENGINEERING MANAGER\nCENTRAL L...
2 41894116 Data Engineering Manager 2021-01-31 2021-03-14T23:55:00.0000000 Harnham South East England London GB 100000.0 100000.0 105000.0 YEAR Apply now\nData Engineering Manager - Scala\nF...
3 41909302 Data Engineering Manager 2021-02-02 2021-03-02T23:55:00.0000000 Tesco Underwriting Surrey Reigate GB 0.0 0.0 0.0 Apply now\nData Engineering Manager\nReigate, ...
4 41477266 Data Engineering Lead 2021-01-04 2021-02-08T23:55:00.0000000 Identify Solutions South East England London GB 60000.0 60000.0 70000.0 YEAR Apply now\nData Engineering Lead (Python & PHP...

Merge the datasets

Since it’s probable that the “data science” and “data engineering” datasets will include some overlap, I’ve concatenated the two dataframes to create a single dataframe containing the inevitable duplicates.

df = pd.concat([df_data_science, df_data_engineering])
df.head()
reference title date_posted date_ending advertiser location city country salary salary_min salary_max salary_frequency salary_currency description
0 41642281 Data Science Consultant 2020-12-27 2021-02-07T23:55:00.0000000 Harnham South East England London GB 40000.0 40000.0 80000.0 YEAR Apply now\nSENIOR DATA SCIENCE CONSULTANT\nUP ...
1 41857664 Data Science Manager 2021-01-26 2021-03-09T23:55:00.0000000 Charles Simon Associates Ltd London Camden GB 70000.0 70000.0 80000.0 YEAR Apply now\nData Science Manager (Data, Python,...
2 41924233 Data Science Recruiter 2021-02-03 2021-03-03T23:55:00.0000000 Crone Corkill South East England London GB 20000.0 20000.0 26000.0 YEAR Apply now\nData Science Recruiter - £20,000-£2...
3 41658856 Data Science Consultant 2021-01-03 2021-02-14T23:55:00.0000000 Harnham South East England London GB 95000.0 95000.0 100000.0 YEAR Apply now\nData Science Consultant\nLondon\n£9...
4 41752222 Data Science Lead 2021-01-14 2021-02-25T23:55:00.0000000 Harnham South East England London GB 75000.0 75000.0 80000.0 YEAR Apply now\nDATA SCIENCE LEAD\nUP TO £80,000 + ...

Cleanse the data

The next step was to cleanse the scraped data. First, I removed the handful of non-UK roles present, so we get a clearer picture of the UK data science jobs market. Then, I removed the few roles that didn’t quote an annual salary.

df = df[df['country']=='GB']
df = df[df['salary_frequency']=='YEAR']

That gave me 781 roles from the two original “data science” and “data engineering” datasets, which held 759 and 279 roles respectively. Note that if you search on Reed without the double quotes, the search engine will return any roles containing the words “data” or “science”, giving you a false impression of the size of the market.

df.shape
(781, 14)

Finally, I used drop_duplicates() to remove any roles with the same reference number, which had been present in both datasets. This left 595 “unique” full-time, permanent data science and engineering roles for the whole of the UK.

df = df.drop_duplicates(subset='reference', keep='last')
df.shape
(595, 14)
df.to_csv('deduped-jobs.csv', index=False)
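The merge-and-dedupe logic is easy to check on toy data (the references and titles here are invented):

```python
import pandas as pd

# Reference 2 appears in both searches, mimicking a job ad that
# matched both "data science" and "data engineering"
df_ds = pd.DataFrame({'reference': [1, 2],
                      'title': ['Data Scientist', 'ML Engineer']})
df_de = pd.DataFrame({'reference': [2, 3],
                      'title': ['ML Engineer', 'Data Engineer']})

df = pd.concat([df_ds, df_de], ignore_index=True)
print(len(df))  # 4 rows, with one duplicated reference

deduped = df.drop_duplicates(subset='reference', keep='last')
print(len(deduped))  # 3 unique roles
```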

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
