How to create a dataset containing all UK companies

B2B ecommerce retailers spend large amounts acquiring the addresses of potential customers to call. Here’s how to build a dataset like that yourself, for free.

Picture by Dimitry Anikin, Unsplash.

In B2B ecommerce, there are two main approaches to new customer acquisition: you either rely on your website to acquire customers for you, or you target specific customers through sales calls or marketing, typically using data acquisition or LinkedIn.

B2B sales teams generally manage the larger accounts, which spend more, place more frequent orders, and have specific requirements for account management. Sales directors want help acquiring more of these, but they’re few and far between, and they take time to grow.

B2B ecommerce websites generate high volumes of acquisitions, but they do so in an uncontrolled manner. It’s rare to acquire a customer who is already large and will spend heavily from day one, so ecommerce acquisitions tend to be smaller, and data acquisition is used to target the larger businesses that aren’t being picked up online.

Lookalike models for data acquisition

To identify the specific clients for their sales or marketing teams to contact, B2B ecommerce retailers use either intuition or, better, the results of lookalike modeling. Lookalike modeling examines the targetable traits of existing high-value customers and then seeks to acquire lists of further companies with similar traits.

For example, lookalike modeling might show a correlation between a company’s sector or location and the likelihood of it becoming a major client.
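
As a minimal sketch of that idea, assuming you hold a customer file with hypothetical sector and annual_spend columns (neither is part of the Companies House data we’ll download below), you could compare the share of high-value customers across sectors with Pandas:

import pandas as pd

# Hypothetical customer file with 'sector' and 'annual_spend' columns
customers = pd.read_csv('customers.csv')

# Flag high-value customers using an example threshold of £10,000 a year
customers['high_value'] = customers['annual_spend'] > 10000

# The share of high-value customers in each sector hints at which traits to target
trait_scores = (customers.groupby('sector')['high_value']
                .mean()
                .sort_values(ascending=False))
print(trait_scores.head())

In practice you’d want to test several candidate traits and check they hold up, but a simple groupby like this is often enough to spot the obvious ones.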

Once you’ve identified the correlated traits, you can provide your list of requirements to a data provider, such as Experian, along with your customer database, and a hefty fee. They’ll return the list of companies you don’t have, so your sales team can make contact with them.

However, you can construct a dataset yourself that contains every company registered in the UK, potentially saving you thousands. While it doesn’t include every branch, it should include all the parent companies and head office addresses to get you started. Here’s how it’s done.

Picture by Adrian Agawin, Pexels.

Load the packages

Open up a Python script or Jupyter notebook and import os, os.path, urllib.request, zipfile, glob, and pandas. Most of these modules are built into Python, so you shouldn’t need to install them, and you’ll likely already have Pandas!

from os.path import basename
import os
import urllib.request
import zipfile
import glob
import pandas as pd

Download the raw data

Next we’re going to download the public Basic Company Data snapshot published by Companies House and listed on data.gov.uk. This data is updated monthly and is available for download within five days of the previous month end.

There’s well over 1 GB of raw data here, currently split into six separate zip files, each containing around 70 MB of compressed data that expands to about 410 MB when uncompressed. We’ll grab the URLs to each of these zip files and place them in a list called urls.

urls = ['http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip']

Next, we’ll create a for loop and loop through each of the URLs. For each URL, we’ll make an HTTP request using urllib.request.urlopen() and parse response.url to obtain the filename component. Then, we’ll use urllib.request.urlretrieve() to download each file to a directory called data, preserving the original filename. We’ll create the data directory first so the downloads have somewhere to go.

# Create the download directory if it doesn't already exist
os.makedirs('data', exist_ok=True)

for url in urls:
    print('Downloading:', url)
    # Parse the filename from the (possibly redirected) response URL
    response = urllib.request.urlopen(url)
    filename = basename(response.url)
    urllib.request.urlretrieve(url, 'data/' + filename)
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip

Unzip the files

The files are currently compressed with zip, so we need to uncompress them and extract the CSV file held within each one. First, we’ll use glob to search the data directory and return a list containing all of the file names that have a .zip suffix.

files = glob.glob('data/*.zip')
files
['data/BasicCompanyData-part1.zip',
 'data/BasicCompanyData-part2.zip',
 'data/BasicCompanyData-part4.zip',
 'data/BasicCompanyData-part3.zip',
 'data/BasicCompanyData-part5.zip',
 'data/BasicCompanyData-part6.zip']

Next, we’ll create a for loop and loop through the files, first opening the file using ZipFile() and then extracting its contents with extractall(), which we’ll save to a new directory called data/raw.

for file in files:
    print('Unzipping:',file)
    
    with zipfile.ZipFile(file, 'r') as unzip:
        unzip.extractall('data/raw')
Unzipping: data/BasicCompanyData-part1.zip
Unzipping: data/BasicCompanyData-part2.zip
Unzipping: data/BasicCompanyData-part4.zip
Unzipping: data/BasicCompanyData-part3.zip
Unzipping: data/BasicCompanyData-part5.zip
Unzipping: data/BasicCompanyData-part6.zip

Merge the files

Next we can use glob() again to obtain a list of all the CSV files in the data/raw directory and assign their file paths to a list called csv_files.

csv_files = glob.glob('data/raw/*.csv')
csv_files
['data/raw/BasicCompanyData-2021-01-01-part2_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part1_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part3_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part4_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part5_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part6_6.csv']

Finally, we’ll create a new list called all_files, then we’ll loop through each CSV file, read it into a Pandas dataframe, and append the dataframe to all_files.

To put everything in one huge dataframe, all we need to do is pass the all_files list to pd.concat() and everything is merged together.

all_files = []

for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    all_files.append(df)

companies = pd.concat(all_files, axis=0, ignore_index=True)
companies.head(1).T
0
CompanyName CDS LABOUR (PRESTON) LTD
CompanyNumber 11995337
RegAddress.CareOf NaN
RegAddress.POBox NaN
RegAddress.AddressLine1 EVENTUS
RegAddress.AddressLine2 SUNDERLAND ROAD
RegAddress.PostTown PETERBOROUGH
RegAddress.County CAMBRIDGESHIRE
RegAddress.Country UNITED KINGDOM
RegAddress.PostCode PE6 8FD
CompanyCategory Private Limited Company
CompanyStatus Active - Proposal to Strike off
CountryOfOrigin United Kingdom
DissolutionDate NaN
IncorporationDate 14/05/2019
Accounts.AccountRefDay 31
Accounts.AccountRefMonth 5
Accounts.NextDueDate 14/05/2021
Accounts.LastMadeUpDate NaN
Accounts.AccountCategory NO ACCOUNTS FILED
Returns.NextDueDate 11/06/2020
Returns.LastMadeUpDate NaN
Mortgages.NumMortCharges 0
Mortgages.NumMortOutstanding 0
Mortgages.NumMortPartSatisfied 0
Mortgages.NumMortSatisfied 0
SICCode.SicText_1 78200 - Temporary employment agency activities
SICCode.SicText_2 NaN
SICCode.SicText_3 NaN
SICCode.SicText_4 NaN
LimitedPartnerships.NumGenPartners 0
LimitedPartnerships.NumLimPartners 0
URI http://business.data.gov.uk/id/company/11995337
PreviousName_1.CONDATE 01/11/2019
PreviousName_1.CompanyName CL MS (BOSTON) LTD
PreviousName_2.CONDATE NaN
PreviousName_2.CompanyName NaN
PreviousName_3.CONDATE NaN
PreviousName_3.CompanyName NaN
PreviousName_4.CONDATE NaN
PreviousName_4.CompanyName NaN
PreviousName_5.CONDATE NaN
PreviousName_5.CompanyName NaN
PreviousName_6.CONDATE NaN
PreviousName_6.CompanyName NaN
PreviousName_7.CONDATE NaN
PreviousName_7.CompanyName NaN
PreviousName_8.CONDATE NaN
PreviousName_8.CompanyName NaN
PreviousName_9.CONDATE NaN
PreviousName_9.CompanyName NaN
PreviousName_10.CONDATE NaN
PreviousName_10.CompanyName NaN
ConfStmtNextDueDate 27/05/2020
ConfStmtLastMadeUpDate NaN

This gives us a dataset containing 4.837 million companies, which is every registered company head office held by Companies House. You can now use your data science skills to see which companies you haven’t got, and which ones your sales and marketing teams should be contacting.

companies.shape
(4837425, 55)
companies.to_csv('companies.csv', index=False)
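
As a rough sketch of that next step, and assuming your own customer file is keyed on a CompanyNumber column (a hypothetical layout; adapt it to whatever key you actually hold), you could filter the dataset down to active companies in a target SIC code and drop the ones you already serve:

# Keep only companies whose status is 'Active'
active = companies[companies['CompanyStatus'] == 'Active']

# Example: target temporary employment agencies (SIC 78200) - substitute the
# SIC codes your lookalike model identified
targets = active[active['SICCode.SicText_1'].str.startswith('78200', na=False)]

# Hypothetical customer file keyed on CompanyNumber; exclude companies you already have
existing = pd.read_csv('customers.csv', dtype={'CompanyNumber': str})
prospects = targets[~targets['CompanyNumber'].isin(existing['CompanyNumber'])]

prospects.to_csv('prospects.csv', index=False)

Whether you match on company number, name, or postcode will depend on what your CRM holds, but the same approach applies.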

The data in Companies House is supplied by the companies themselves, so you’ll typically get the registered head office address (not the branches), and the SIC codes are those chosen by the company, which may not match the codes you’d assign if you classified the business manually.

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.