In B2B ecommerce, there are two main approaches to new customer acquisition: you either rely on your website to bring in customers for you, or you target specific companies through sales calls or marketing, typically using data acquisition or LinkedIn.
B2B sales teams generally manage the larger accounts, which spend more, place more frequent orders, and have specific account management requirements. Sales directors want help acquiring more of these accounts, but they're few and far between and take time to grow.
B2B ecommerce websites generate high volumes of acquisitions, but in an uncontrolled manner. Since it's rare to acquire a customer who is already large and will spend heavily from day one, B2B ecommerce acquisitions tend to be smaller, so data acquisition is used to target larger businesses that aren't being picked up online.
To identify the specific clients for their sales or marketing team to contact, B2B ecommerce retailers will use either intuition or, better, the results of lookalike modeling. Lookalike modeling examines the targetable traits of existing high-value customers and then seeks to acquire lists containing more companies with similar traits.
For example, lookalike modeling might show a correlation between the company sector or location and the likelihood for a business to become a major client.
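As a minimal sketch of this idea in Pandas (the traits, column names, and data here are made up for illustration), you can compute the share of high-value customers for each value of a trait; trait values with a markedly higher rate become candidates for your targeting criteria:

```python
import pandas as pd

# Hypothetical customer data: sector and region traits, plus a label
# marking which existing customers are high value
customers = pd.DataFrame({
    'sector': ['retail', 'construction', 'retail', 'logistics', 'retail'],
    'region': ['London', 'Leeds', 'London', 'Leeds', 'London'],
    'high_value': [1, 0, 1, 0, 1],
})

# Share of high-value customers for each sector, highest first
rates = customers.groupby('sector')['high_value'].mean().sort_values(ascending=False)
print(rates)
```

In practice you'd repeat this for each targetable trait (and ideally use a proper model rather than single-trait rates), but the principle is the same: find the traits that correlate with high value, then buy lists of companies sharing them.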
Once you’ve identified the correlated traits, you can provide your list of requirements to a data provider, such as Experian, along with your customer database and a hefty fee. They’ll return a list of the companies you don’t already have, so your sales team can make contact with them.
However, you can actually construct a dataset yourself that contains every business in the UK, potentially saving you thousands. While it doesn’t include every branch, it should include all the parent companies and head office addresses to get you started. Here’s how it’s done.
Picture by Adrian Agawin, Pexels.
Open up a Python script or Jupyter notebook and import os.path, urllib.request, zipfile, glob, and pandas. Most of these modules are built into Python, so you shouldn’t need to install them, and you’ll likely already have Pandas!
from os.path import basename
import urllib.request
import zipfile
import glob
import pandas as pd
Next we’re going to download the public Basic Company Data from the UK government’s Companies House website at data.gov.uk. This data is updated monthly and is available for download within five days of the previous month end.

There’s well over 1 GB of raw data here, which is currently split up into six separate zip files, each containing around 70 MB of compressed data and about 410 MB when uncompressed. We’ll grab the URLs of each of these zip files and place them in a list called urls.
urls = ['http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip',
'http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip',
'http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip',
'http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip',
'http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip',
'http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip']
Next, we’ll create a for loop and loop through each of the URLs. For each URL, we’ll make an HTTP request using urllib.request.urlopen() and parse response.url to obtain the filename component. Then, we’ll use urllib.request.urlretrieve() to download each file to a directory called data, preserving the original filename.
import os
os.makedirs('data', exist_ok=True)  # urlretrieve fails if the directory doesn't exist

for url in urls:
    print('Downloading:', url)
    response = urllib.request.urlopen(url)
    filename = basename(response.url)
    urllib.request.urlretrieve(url, 'data/' + filename)
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part2.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part3.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part4.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part5.zip
Downloading: http://download.companieshouse.gov.uk/BasicCompanyData-part6.zip
The files are currently compressed with zip, so we need to uncompress them and extract the CSV file held within each one. First, we’ll use glob to search the data directory and return a list containing all of the file names that have a .zip suffix.
files = glob.glob('data/*.zip')
files
['data/BasicCompanyData-part1.zip',
'data/BasicCompanyData-part2.zip',
'data/BasicCompanyData-part4.zip',
'data/BasicCompanyData-part3.zip',
'data/BasicCompanyData-part5.zip',
'data/BasicCompanyData-part6.zip']
Next, we’ll create a for loop and loop through the files, first opening each file using ZipFile() and then extracting its contents with extractall(), which we’ll save to a new directory called data/raw.
for file in files:
    print('Unzipping:', file)
    with zipfile.ZipFile(file, 'r') as unzip:
        unzip.extractall('data/raw')
Unzipping: data/BasicCompanyData-part1.zip
Unzipping: data/BasicCompanyData-part2.zip
Unzipping: data/BasicCompanyData-part4.zip
Unzipping: data/BasicCompanyData-part3.zip
Unzipping: data/BasicCompanyData-part5.zip
Unzipping: data/BasicCompanyData-part6.zip
Next we can use glob() again to obtain a list of all the CSV files in the data/raw directory and assign their file paths to a list called csv_files.
csv_files = glob.glob('data/raw/*.csv')
csv_files
['data/raw/BasicCompanyData-2021-01-01-part2_6.csv',
'data/raw/BasicCompanyData-2021-01-01-part1_6.csv',
'data/raw/BasicCompanyData-2021-01-01-part3_6.csv',
'data/raw/BasicCompanyData-2021-01-01-part4_6.csv',
'data/raw/BasicCompanyData-2021-01-01-part5_6.csv',
'data/raw/BasicCompanyData-2021-01-01-part6_6.csv']
Finally, we’ll create an empty list called all_files, then loop through each CSV file, read it into a Pandas dataframe, and append the dataframe to all_files. All we then need to do to put these into one huge dataframe is pass the all_files list to pd.concat(), and everything is merged together.
all_files = []

for filename in csv_files:
    df = pd.read_csv(filename, index_col=None, header=0, low_memory=False)
    all_files.append(df)

companies = pd.concat(all_files, axis=0, ignore_index=True)
companies.head(1).T
Field | 0 |
---|---|
CompanyName | CDS LABOUR (PRESTON) LTD |
CompanyNumber | 11995337 |
RegAddress.CareOf | NaN |
RegAddress.POBox | NaN |
RegAddress.AddressLine1 | EVENTUS |
RegAddress.AddressLine2 | SUNDERLAND ROAD |
RegAddress.PostTown | PETERBOROUGH |
RegAddress.County | CAMBRIDGESHIRE |
RegAddress.Country | UNITED KINGDOM |
RegAddress.PostCode | PE6 8FD |
CompanyCategory | Private Limited Company |
CompanyStatus | Active - Proposal to Strike off |
CountryOfOrigin | United Kingdom |
DissolutionDate | NaN |
IncorporationDate | 14/05/2019 |
Accounts.AccountRefDay | 31 |
Accounts.AccountRefMonth | 5 |
Accounts.NextDueDate | 14/05/2021 |
Accounts.LastMadeUpDate | NaN |
Accounts.AccountCategory | NO ACCOUNTS FILED |
Returns.NextDueDate | 11/06/2020 |
Returns.LastMadeUpDate | NaN |
Mortgages.NumMortCharges | 0 |
Mortgages.NumMortOutstanding | 0 |
Mortgages.NumMortPartSatisfied | 0 |
Mortgages.NumMortSatisfied | 0 |
SICCode.SicText_1 | 78200 - Temporary employment agency activities |
SICCode.SicText_2 | NaN |
SICCode.SicText_3 | NaN |
SICCode.SicText_4 | NaN |
LimitedPartnerships.NumGenPartners | 0 |
LimitedPartnerships.NumLimPartners | 0 |
URI | http://business.data.gov.uk/id/company/11995337 |
PreviousName_1.CONDATE | 01/11/2019 |
PreviousName_1.CompanyName | CL MS (BOSTON) LTD |
PreviousName_2.CONDATE | NaN |
PreviousName_2.CompanyName | NaN |
PreviousName_3.CONDATE | NaN |
PreviousName_3.CompanyName | NaN |
PreviousName_4.CONDATE | NaN |
PreviousName_4.CompanyName | NaN |
PreviousName_5.CONDATE | NaN |
PreviousName_5.CompanyName | NaN |
PreviousName_6.CONDATE | NaN |
PreviousName_6.CompanyName | NaN |
PreviousName_7.CONDATE | NaN |
PreviousName_7.CompanyName | NaN |
PreviousName_8.CONDATE | NaN |
PreviousName_8.CompanyName | NaN |
PreviousName_9.CONDATE | NaN |
PreviousName_9.CompanyName | NaN |
PreviousName_10.CONDATE | NaN |
PreviousName_10.CompanyName | NaN |
ConfStmtNextDueDate | 27/05/2020 |
ConfStmtLastMadeUpDate | NaN |
This gives us a dataset containing 4.837 million companies, covering every registered head office held at Companies House. You can now use your data science skills to see which companies you don’t yet have, and which ones your sales and marketing teams should be contacting.
companies.shape
(4837425, 55)
companies.to_csv('companies.csv', index=False)
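With the full dataset saved, identifying the companies you don’t yet have can be done with a left anti-join against your own customer list. Here’s a sketch using toy stand-in dataframes, assuming (hypothetically) that your customer file is keyed on CompanyNumber; in practice you’d use the companies dataframe built above:

```python
import pandas as pd

# Toy stand-ins for the Companies House data and your own customer list
companies = pd.DataFrame({
    'CompanyNumber': ['11995337', '00000001', '00000002'],
    'CompanyName': ['CDS LABOUR (PRESTON) LTD', 'A LTD', 'B LTD'],
})
customers = pd.DataFrame({'CompanyNumber': ['11995337']})

# Left anti-join: keep only the companies absent from the customer list
merged = companies.merge(customers, on='CompanyNumber', how='left', indicator=True)
prospects = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```

The indicator column added by merge() marks each row as 'both' or 'left_only', so filtering on 'left_only' leaves the companies your teams haven’t contacted yet.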
The data in Companies House is submitted by the companies themselves, so you’ll typically get the head office address rather than the branch addresses, and the SIC codes are those chosen by the company, which may not match the codes you would assign manually.
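If you do want to segment prospects by sector despite that caveat, the numeric code can be split out of the SicText columns, which combine code and description in a 'code - description' format like the output shown above. A quick sketch:

```python
import pandas as pd

# SIC text fields combine the numeric code and its description
sic = pd.Series(['78200 - Temporary employment agency activities',
                 '47910 - Retail sale via mail order houses or via Internet'])

# Take the numeric code before the first ' - ' separator
codes = sic.str.split(' - ', n=1).str[0]
```

The resulting codes can then be matched against the sectors your lookalike modeling flagged as promising.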
Matt Clarke, Saturday, March 13, 2021