How to create a dataset containing all UK companies

B2B ecommerce retailers spend large amounts on acquiring the addresses of potential customers to call. Here’s how you build a dataset on your own for free.


In B2B ecommerce, there are two main approaches to new customer acquisition: you either rely on your website to acquire customers for you, or you target specific customers through sales calls or marketing, typically using acquired data or LinkedIn.

B2B sales teams generally manage the larger accounts, which spend more, place more frequent orders, and have specific account management requirements. Sales directors want help acquiring more of these, but they're few and far between and take time to grow.

B2B ecommerce websites generate high volumes of acquisitions, but do so in an uncontrolled manner. It's rare to acquire a customer who is already large and will spend heavily from day one, so B2B ecommerce acquisitions tend to be smaller, and data acquisition is used to target larger businesses that aren't being picked up online.

Lookalike models for data acquisition

To identify the specific clients for their sales or marketing team to contact, B2B ecommerce retailers use either intuition or, better, the results of lookalike modeling. Lookalike modeling examines the targetable traits of existing high value customers, then seeks to acquire lists of companies with similar traits.

For example, lookalike modeling might show a correlation between the company sector or location and the likelihood for a business to become a major client.

Once you’ve identified the correlated traits, you can provide your list of requirements to a data provider, such as Experian, along with your customer database, and a hefty fee. They’ll return the list of companies you don’t have, so your sales team can make contact with them.
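To make the first step of that process concrete, here's a minimal sketch of trait analysis in pandas, using a small mock customer table. The sic_code and high_value columns are hypothetical; your own CRM export will have different names.

```python
import pandas as pd

# Mock customer data: each row is a customer with a sector (SIC code)
# and a flag marking whether they are a high value account.
customers = pd.DataFrame({
    'sic_code': ['78200', '78200', '47910', '78200', '47910', '62020'],
    'high_value': [True, True, False, True, False, False],
})

# Share of high value customers overall, and per sector.
base_rate = customers['high_value'].mean()
by_sector = customers.groupby('sic_code')['high_value'].mean()

# Sectors where high value customers are overrepresented relative to the
# base rate are candidate traits for a lookalike list.
lookalike_sectors = by_sector[by_sector > base_rate].index.tolist()
print(lookalike_sectors)  # → ['78200']
```

In practice you'd test many traits (sector, location, company age, size) and check that the correlations are statistically meaningful, but the idea is the same: find the traits that distinguish your best customers, then go looking for more companies that share them.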

However, you can actually construct a dataset yourself that contains every business in the UK, potentially saving you thousands. While it doesn’t include every branch, it should include all the parent companies and head office addresses to get you started. Here’s how it’s done.


Load the packages

Open up a Python script or Jupyter notebook and import os.path, urllib.request, zipfile, glob, and pandas. Most of these modules are built into Python, so you shouldn’t need to install them, and you’ll likely already have Pandas!

from os.path import basename
import urllib.request
import zipfile
import glob
import pandas as pd

Download the raw data

Next we’re going to download the public Basic Company Data from the UK government’s Companies House website at data.gov.uk. This data is updated monthly and is available for download within five days of the previous month end.

There's well over 1 GB of raw data here, currently split into six separate zip files, each containing around 70 MB of compressed data and about 410 MB when uncompressed. We'll grab the URLs to each of these zip files and place them in a list called urls.

urls = ['http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip',
        'http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip']

Next, we'll create a for loop and loop through the URLs. For each URL, we'll make an HTTP request using urllib.request.urlopen() and parse response.url to obtain the filename component. Then we'll use urllib.request.urlretrieve() to download each file to a directory called data, preserving the original filename.

import os

os.makedirs('data', exist_ok=True)  # ensure the download directory exists

for url in urls:
    print('Downloading:', url)
    response = urllib.request.urlopen(url)
    filename = basename(response.url)  # filename from the final (redirected) URL
    urllib.request.urlretrieve(url, 'data/' + filename)
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip

Unzip the files

The files are currently compressed with zip, so we need to uncompress them and extract the CSV file held within each one. First, we’ll use glob to search the data directory and return a list containing all of the file names that have a .zip suffix.

files = glob.glob('data/*.zip')
files
['data/BasicCompanyData-part1.zip',
 'data/BasicCompanyData-part2.zip',
 'data/BasicCompanyData-part4.zip',
 'data/BasicCompanyData-part3.zip',
 'data/BasicCompanyData-part5.zip',
 'data/BasicCompanyData-part6.zip']

Next, we’ll create a for loop and loop through the files, first opening the file using ZipFile() and then extracting its contents with extractall(), which we’ll save to a new directory called data/raw.

for file in files:
    print('Unzipping:',file)
    
    with zipfile.ZipFile(file, 'r') as unzip:
        unzip.extractall('data/raw')
Unzipping: data/BasicCompanyData-part1.zip
Unzipping: data/BasicCompanyData-part2.zip
Unzipping: data/BasicCompanyData-part4.zip
Unzipping: data/BasicCompanyData-part3.zip
Unzipping: data/BasicCompanyData-part5.zip
Unzipping: data/BasicCompanyData-part6.zip

Merge the files

Next, we can use glob.glob() again to obtain a list of all the CSV files in the data/raw directory and assign their file paths to a list called csv_files.

csv_files = glob.glob('data/raw/*.csv')
csv_files
['data/raw/BasicCompanyData-2021-01-01-part2_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part1_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part3_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part4_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part5_6.csv',
 'data/raw/BasicCompanyData-2021-01-01-part6_6.csv']

Finally, we'll create an empty list called all_files, loop through each CSV file, read it into a Pandas dataframe, and append the dataframe to all_files.

Then, all we need to do to put these in one huge dataframe is pass the all_files list to pd.concat(), which merges everything together.

all_files = []

for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    all_files.append(df)

companies = pd.concat(all_files, axis=0, ignore_index=True)
companies.head(1).T
0
CompanyName CDS LABOUR (PRESTON) LTD
CompanyNumber 11995337
RegAddress.CareOf NaN
RegAddress.POBox NaN
RegAddress.AddressLine1 EVENTUS
RegAddress.AddressLine2 SUNDERLAND ROAD
RegAddress.PostTown PETERBOROUGH
RegAddress.County CAMBRIDGESHIRE
RegAddress.Country UNITED KINGDOM
RegAddress.PostCode PE6 8FD
CompanyCategory Private Limited Company
CompanyStatus Active - Proposal to Strike off
CountryOfOrigin United Kingdom
DissolutionDate NaN
IncorporationDate 14/05/2019
Accounts.AccountRefDay 31
Accounts.AccountRefMonth 5
Accounts.NextDueDate 14/05/2021
Accounts.LastMadeUpDate NaN
Accounts.AccountCategory NO ACCOUNTS FILED
Returns.NextDueDate 11/06/2020
Returns.LastMadeUpDate NaN
Mortgages.NumMortCharges 0
Mortgages.NumMortOutstanding 0
Mortgages.NumMortPartSatisfied 0
Mortgages.NumMortSatisfied 0
SICCode.SicText_1 78200 - Temporary employment agency activities
SICCode.SicText_2 NaN
SICCode.SicText_3 NaN
SICCode.SicText_4 NaN
LimitedPartnerships.NumGenPartners 0
LimitedPartnerships.NumLimPartners 0
URI http://business.data.gov.uk/id/company/11995337
PreviousName_1.CONDATE 01/11/2019
PreviousName_1.CompanyName CL MS (BOSTON) LTD
PreviousName_2.CONDATE NaN
PreviousName_2.CompanyName NaN
PreviousName_3.CONDATE NaN
PreviousName_3.CompanyName NaN
PreviousName_4.CONDATE NaN
PreviousName_4.CompanyName NaN
PreviousName_5.CONDATE NaN
PreviousName_5.CompanyName NaN
PreviousName_6.CONDATE NaN
PreviousName_6.CompanyName NaN
PreviousName_7.CONDATE NaN
PreviousName_7.CompanyName NaN
PreviousName_8.CONDATE NaN
PreviousName_8.CompanyName NaN
PreviousName_9.CONDATE NaN
PreviousName_9.CompanyName NaN
PreviousName_10.CONDATE NaN
PreviousName_10.CompanyName NaN
ConfStmtNextDueDate 27/05/2020
ConfStmtLastMadeUpDate NaN

This gives us a dataset of 4.837 million companies, covering every registered head office held at Companies House. You can now use your data science skills to identify which companies you don't already have, and which ones your sales and marketing teams should be contacting.

companies.shape
(4837425, 55)
companies.to_csv('companies.csv', index=False)
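One way to find the companies you don't already have is a left anti-join against your own customer database, keyed on the company number. Here's a hedged sketch with mock data; the crm dataframe is a hypothetical stand-in for an export from your CRM, and in practice you'd join the full companies dataframe built above.

```python
import pandas as pd

# Mock slice of the Companies House extract (same column names as the
# merged dataset built above).
companies = pd.DataFrame({
    'CompanyNumber': ['11995337', '00000006', '00000007'],
    'CompanyName': ['CDS LABOUR (PRESTON) LTD', 'EXAMPLE ONE LTD', 'EXAMPLE TWO LTD'],
})

# Hypothetical export from your CRM, keyed on company number.
crm = pd.DataFrame({'CompanyNumber': ['11995337']})

# Left anti-join: keep only companies that do not appear in the CRM.
merged = companies.merge(crm, on='CompanyNumber', how='left', indicator=True)
prospects = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(len(prospects))  # → 2
```

The indicator=True argument adds a _merge column marking where each row came from, so filtering on 'left_only' leaves exactly the companies missing from your customer base.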

The data in Companies House is supplied by the companies themselves, so you'll typically get the head office address rather than the branches, and the SIC codes are self-assigned, so they may not match the codes you would assign manually.
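With those caveats in mind, the dataset is still easy to segment. For example, you could keep only active companies in a target sector by matching on the SIC code prefix. This is a minimal sketch with mock rows; the column names match the merged dataset, but the '78200' sector is just an illustrative choice.

```python
import pandas as pd

# Mock rows with the status and SIC code columns from the merged dataset.
companies = pd.DataFrame({
    'CompanyName': ['A LTD', 'B LTD', 'C LTD'],
    'CompanyStatus': ['Active', 'Liquidation', 'Active'],
    'SICCode.SicText_1': ['78200 - Temporary employment agency activities',
                          '78200 - Temporary employment agency activities',
                          '62020 - Information technology consultancy activities'],
})

# Keep active companies whose primary SIC code starts with the target prefix.
mask = (companies['CompanyStatus'] == 'Active') & \
       companies['SICCode.SicText_1'].str.startswith('78200')
targets = companies[mask]
print(targets['CompanyName'].tolist())  # → ['A LTD']
```

Since companies can list up to four SIC codes (SicText_1 to SicText_4), a thorough filter would check all four columns rather than just the first.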

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
