How to create targeted B2B company sector datasets

Learn how to create targeted B2B company datasets for free using Python, Pandas, and Companies House data.

How to create targeted B2B company sector datasets
Picture by Douglas Sheppard, Unsplash.
20 minutes to read

As I explained in my previous post, many B2B ecommerce businesses spend huge amounts on procuring third-party data for companies they wish to target. However, with some data science skills and a bit of ingenuity you can also produce a dataset of your own - for free.

In this project, I’ll show you can use this data to target specific companies. We’ll see what companies are included, identify the features we want to target and exclude, and use Pandas to create a dataset of estate agents we can approach. Here’s how it’s done.

Load the packages

First, open up a Jupyter notebook and import pandas. As the dataset we’re using contains many columns, you may also want to use set_option() to increase the number of rows and columns shown. Needless to say, as this is a massive dataset, you’ll need a powerful workstation!

import pandas as pd
pd.set_option('max_columns', 100)
pd.set_option('max_rows', 100)

Load the data

I’ll assume you’ve followed my post How to create a dataset containing all UK companies explaining how you can build the required dataset. Once you’ve followed this, load the data into Pandas using the low_memory=False argument. It currently comes in at around 1.8 GB.

df = pd.read_csv('companies.csv', low_memory=False)
df.head()
CompanyName CompanyNumber RegAddress.CareOf RegAddress.POBox RegAddress.AddressLine1 RegAddress.AddressLine2 RegAddress.PostTown RegAddress.County RegAddress.Country RegAddress.PostCode CompanyCategory CompanyStatus CountryOfOrigin DissolutionDate IncorporationDate Accounts.AccountRefDay Accounts.AccountRefMonth Accounts.NextDueDate Accounts.LastMadeUpDate Accounts.AccountCategory Returns.NextDueDate Returns.LastMadeUpDate Mortgages.NumMortCharges Mortgages.NumMortOutstanding Mortgages.NumMortPartSatisfied Mortgages.NumMortSatisfied SICCode.SicText_1 SICCode.SicText_2 SICCode.SicText_3 SICCode.SicText_4 LimitedPartnerships.NumGenPartners LimitedPartnerships.NumLimPartners URI PreviousName_1.CONDATE PreviousName_1.CompanyName PreviousName_2.CONDATE PreviousName_2.CompanyName PreviousName_3.CONDATE PreviousName_3.CompanyName PreviousName_4.CONDATE PreviousName_4.CompanyName PreviousName_5.CONDATE PreviousName_5.CompanyName PreviousName_6.CONDATE PreviousName_6.CompanyName PreviousName_7.CONDATE PreviousName_7.CompanyName PreviousName_8.CONDATE PreviousName_8.CompanyName PreviousName_9.CONDATE PreviousName_9.CompanyName PreviousName_10.CONDATE PreviousName_10.CompanyName ConfStmtNextDueDate ConfStmtLastMadeUpDate
0 CDS LABOUR (PRESTON) LTD 11995337 NaN NaN EVENTUS SUNDERLAND ROAD PETERBOROUGH CAMBRIDGESHIRE UNITED KINGDOM PE6 8FD Private Limited Company Active - Proposal to Strike off United Kingdom NaN 14/05/2019 31.0 5.0 14/05/2021 NaN NO ACCOUNTS FILED 11/06/2020 NaN 0 0 0 0 78200 - Temporary employment agency activities NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/11995337 01/11/2019 CL MS (BOSTON) LTD NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27/05/2020 NaN
1 CDS LABOUR (SOFIA) LTD 11995382 NaN NaN EVENTUS SUNDERLAND ROAD PETERBOROUGH CAMBRIDGESHIRE UNITED KINGDOM PE6 8FD Private Limited Company Active - Proposal to Strike off United Kingdom NaN 14/05/2019 31.0 5.0 14/05/2021 NaN NO ACCOUNTS FILED 11/06/2020 NaN 0 0 0 0 78200 - Temporary employment agency activities NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/11995382 01/11/2019 CL MC (SPALDING) LTD NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27/05/2020 NaN
2 CDS LABOUR (WISBECH) LTD 11543682 NaN NaN EVENTUS SUNDERLAND ROAD MARKET DEEPING LINCOLNSHIRE ENGLAND PE6 8FD Private Limited Company Active United Kingdom NaN 30/08/2018 31.0 8.0 31/05/2021 31/08/2019 DORMANT 27/09/2019 NaN 0 0 0 0 78200 - Temporary employment agency activities NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/11543682 06/09/2019 CL MC (WISBECH) LTD 14/05/2019 CDSL(WISBECH)FBS LTD NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13/09/2021 30/08/2020
3 CDS LEISURE LIMITED 06910514 TOPPING PARTNERSHIP NaN INCOM HOUSE WATERSIDE TRAFFORD PARK MANCHESTER NaN NaN M17 1WD Private Limited Company Active United Kingdom NaN 19/05/2009 31.0 5.0 31/05/2021 31/05/2019 TOTAL EXEMPTION FULL 16/06/2017 19/05/2016 0 0 0 0 46420 - Wholesale of clothing and footwear NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/06910514 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 02/06/2021 19/05/2020
4 CDS LOCUM LIMITED 12546137 NaN NaN OFFICE 27 8-9 RODNEY ROAD PORTSMOUTH NaN UNITED KINGDOM PO4 8BF Private Limited Company Active United Kingdom NaN 03/04/2020 30.0 4.0 03/01/2022 NaN NO ACCOUNTS FILED 01/05/2021 NaN 0 0 0 0 96090 - Other service activities n.e.c. NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/12546137 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16/04/2021 NaN

Company status

Now that the data are loaded up, we’ll take a look at the value_counts() for various key columns in the dataset, so we can identify what we need to include and exclude to give us a higher quality final dataset.

First, we’ll check out the CompanyStatus column. 4.514 million of the companies in the dataset are currently marked as “Active”, so we’ll want to include these and exclude the rest. It would be a waste of time and resources to go after the ones who are about to go tits-up.

df['CompanyStatus'].value_counts()
Active                                              4514967
Active - Proposal to Strike off                      220903
Liquidation                                           95396
In Administration                                      3113
Live but Receiver Manager on at least one charge       1315
Voluntary Arrangement                                   893
In Administration/Administrative Receiver               336
RECEIVERSHIP                                            194
ADMINISTRATION ORDER                                    130
ADMINISTRATIVE RECEIVER                                 116
In Administration/Receiver Manager                       44
RECEIVER MANAGER / ADMINISTRATIVE RECEIVER               16
VOLUNTARY ARRANGEMENT / ADMINISTRATIVE RECEIVER           1
VOLUNTARY ARRANGEMENT / RECEIVER MANAGER                  1
Name: CompanyStatus, dtype: int64

Standard Industry Classification (SIC) codes

The Standard Industry Classification or SIC code is a (usually) four digit code used to classify economic activities or industries. SIC codes are used by Companies House and all the third party data re-sellers, such as Experian, to let customers target businesses in specific sectors.

However, importantly, the SIC codes stored in Companies House are added by the company when they file their accounts. If Maggie in accounts at Tesco thinks her office falls under “70.1 Activities of head offices”, she’s free to assign it there, rather than to “47110 Retail sale in non-specialised stores with food, beverages or tobacco predominating”.

The SIC codes have various levels, allowing you to drill down your search to increasingly more specific sectors. Simply find the SIC code you think matches your target sector most closely and you can search for all possible matches.

df['SICCode.SicText_1'].value_counts()
82990 - Other business support service activities n.e.c.                     245210
70229 - Management consultancy activities other than financial management    205617
None Supplied                                                                176707
68209 - Other letting and operating of own or leased real estate             163242
96090 - Other service activities n.e.c.                                      147043
                                                                              ...  
7121 - Rent other land transport equipment                                        1
5181 - Wholesale of machine tools                                                 1
1320 - Mining non-ferrous metal ores                                              1
1422 - Mining of clays and kaolin                                                 1
3350 - Manufacture of watches and clocks                                          1
Name: SICCode.SicText_1, Length: 1177, dtype: int64
df['SICCode.SicText_2'].value_counts()
68209 - Other letting and operating of own or leased real estate             53167
70229 - Management consultancy activities other than financial management    23765
62020 - Information technology consultancy activities                        18598
41202 - Construction of domestic buildings                                   16992
82990 - Other business support service activities n.e.c.                     16266
                                                                             ...  
7413 - Market research, opinion polling                                          1
05200 - Mining of lignite                                                        1
1583 - Manufacture of sugar                                                      1
2221 - Printing of newspapers                                                    1
7482 - Packaging activities                                                      1
Name: SICCode.SicText_2, Length: 1012, dtype: int64
df['SICCode.SicText_3'].value_counts()
68320 - Management of real estate on a fee or contract basis                 13878
68209 - Other letting and operating of own or leased real estate             10902
70229 - Management consultancy activities other than financial management     8608
62090 - Other information technology service activities                       7347
82990 - Other business support service activities n.e.c.                      6874
                                                                             ...  
24460 - Processing of nuclear fuel                                               1
2416 - Manufacture of plastics in primary forms                                  1
2874 - Manufacture fasteners, screw, chains etc.                                 1
9302 - Hairdressing & other beauty treatment                                     1
13940 - Manufacture of cordage, rope, twine and netting                          1
Name: SICCode.SicText_3, Length: 871, dtype: int64
df['SICCode.SicText_4'].value_counts()
68320 - Management of real estate on a fee or contract basis                 8272
96090 - Other service activities n.e.c.                                      4314
82990 - Other business support service activities n.e.c.                     3959
70229 - Management consultancy activities other than financial management    3866
85600 - Educational support services                                         2747
                                                                             ... 
13922 - manufacture of canvas goods, sacks, etc.                                1
2225 - Ancillary printing operations                                            1
5020 - Maintenance & repair of motors                                           1
7012 - Buying & sell own real estate                                            1
9212 - Motion picture & video distribution                                      1
Name: SICCode.SicText_4, Length: 719, dtype: int64

Search the Companies House data

Next, we’ll create a helper function to search the Companies House data. The major caveat here is that the data aren’t that clean. If you restrict your search to a county, country, or other parameter, you may find it misses companies, as some records are incomplete, or inconsistent, so check first.

def get_matches(df, sic_code, county=None):
    """Return a dataframe of active SIC code matches. 

    Args:
        df (dataframe): Pandas dataframe containing Companies House data
        sic_code (string): SIC code to find, i.e. 68320
        county (string, optional): Optional county (default = None)
    """

    if county:
        return df[ (df['CompanyStatus']=='Active') \
                 & (df['SICCode.SicText_1'].str.startswith(sic_code + ' -')) \
                 & (df['RegAddress.County']==str.upper(county))
                 ]
    else:
        return df[ (df['CompanyStatus']=='Active') & (df['SICCode.SicText_1'].str.startswith(sic_code + ' -') )]
Find all active estate agents in the UK

To find all the active estate agents in the UK, all we need to do now is pass the df and the SIC code to get_matches() and we get back a Pandas dataframe containing our results. Running .shape on this shows that we get back 159,188 companies currently marked active in Companies House.

df_estate_agents = get_matches(df, sic_code='68209', )
df_estate_agents.head()
CompanyName CompanyNumber RegAddress.CareOf RegAddress.POBox RegAddress.AddressLine1 RegAddress.AddressLine2 RegAddress.PostTown RegAddress.County RegAddress.Country RegAddress.PostCode CompanyCategory CompanyStatus CountryOfOrigin DissolutionDate IncorporationDate Accounts.AccountRefDay Accounts.AccountRefMonth Accounts.NextDueDate Accounts.LastMadeUpDate Accounts.AccountCategory Returns.NextDueDate Returns.LastMadeUpDate Mortgages.NumMortCharges Mortgages.NumMortOutstanding Mortgages.NumMortPartSatisfied Mortgages.NumMortSatisfied SICCode.SicText_1 SICCode.SicText_2 SICCode.SicText_3 SICCode.SicText_4 LimitedPartnerships.NumGenPartners LimitedPartnerships.NumLimPartners URI PreviousName_1.CONDATE PreviousName_1.CompanyName PreviousName_2.CONDATE PreviousName_2.CompanyName PreviousName_3.CONDATE PreviousName_3.CompanyName PreviousName_4.CONDATE PreviousName_4.CompanyName PreviousName_5.CONDATE PreviousName_5.CompanyName PreviousName_6.CONDATE PreviousName_6.CompanyName PreviousName_7.CONDATE PreviousName_7.CompanyName PreviousName_8.CONDATE PreviousName_8.CompanyName PreviousName_9.CONDATE PreviousName_9.CompanyName PreviousName_10.CONDATE PreviousName_10.CompanyName ConfStmtNextDueDate ConfStmtLastMadeUpDate
30 CDS PROPERTIES (NORTHERN) LIMITED SC537855 NaN NaN C/O PROPERTY ACCOUNTS LIMITED 27 LAURISTON STREET EDINBURGH NaN NaN EH3 9DQ Private Limited Company Active United Kingdom NaN 14/06/2016 30.0 6.0 30/06/2021 30/06/2019 NaN 12/07/2017 NaN 11 9 0 2 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/SC537855 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 07/06/2021 24/05/2020
61 CDS TRADING UK LIMITED 04636487 NaN NaN 8 HEMMELLS NaN BASILDON ESSEX ENGLAND SS15 6ED Private Limited Company Active United Kingdom NaN 14/01/2003 31.0 3.0 31/03/2021 31/03/2019 TOTAL EXEMPTION FULL 11/02/2017 14/01/2016 3 2 0 1 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/04636487 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25/02/2021 14/01/2020
74 CDSB LIMITED 12916865 NaN NaN 16 EURO BUSINESS PARK NEW ROAD NEWHAVEN EAST SUSSEX UNITED KINGDOM BN9 0DQ Private Limited Company Active United Kingdom NaN 30/09/2020 30.0 9.0 30/06/2022 NaN NO ACCOUNTS FILED 28/10/2021 NaN 0 0 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/12916865 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13/10/2021 NaN
146 CDT PROPERTY INVESTMENTS LIMITED 10695269 NaN NaN SUITE 1 SECOND FLOOR EVERDENE HOUSE DEANSLEIGH ROAD BOURNEMOUTH NaN UNITED KINGDOM BH7 7DU Private Limited Company Active United Kingdom NaN 28/03/2017 31.0 3.0 31/12/2021 31/03/2020 TOTAL EXEMPTION FULL 25/04/2018 NaN 2 2 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/10695269 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10/04/2021 27/03/2020
200 CDVRE HOLDINGS LTD 09232551 NaN NaN 8B ACCOMMODATION ROAD GOLDERS GREEN LONDON NaN NaN NW11 8ED Private Limited Company Active United Kingdom NaN 24/09/2014 30.0 9.0 30/06/2022 30/09/2020 UNAUDITED ABRIDGED 22/10/2017 24/06/2016 3 3 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/09232551 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 27/09/2021 13/09/2020
df_estate_agents.shape
(159188, 55)
Find all active estate agents in Norfolk

To find only the active estate agents in Norfolk, we can pass in optional county argument. Running this gives us 834 estate agents currently active in Norfolk. And that’s it.

You can, of course, come up with your own funky filters or regexes to find businesses that haven’t been categorised properly, but this is surprisingly effective for free! Note that it may not give you permission to market to them, but this may fall under legitimate interest GDPR wise.

df_estate_agents_norfolk = get_matches(df, sic_code='68209', county='Norfolk')
df_estate_agents_norfolk.head()
CompanyName CompanyNumber RegAddress.CareOf RegAddress.POBox RegAddress.AddressLine1 RegAddress.AddressLine2 RegAddress.PostTown RegAddress.County RegAddress.Country RegAddress.PostCode CompanyCategory CompanyStatus CountryOfOrigin DissolutionDate IncorporationDate Accounts.AccountRefDay Accounts.AccountRefMonth Accounts.NextDueDate Accounts.LastMadeUpDate Accounts.AccountCategory Returns.NextDueDate Returns.LastMadeUpDate Mortgages.NumMortCharges Mortgages.NumMortOutstanding Mortgages.NumMortPartSatisfied Mortgages.NumMortSatisfied SICCode.SicText_1 SICCode.SicText_2 SICCode.SicText_3 SICCode.SicText_4 LimitedPartnerships.NumGenPartners LimitedPartnerships.NumLimPartners URI PreviousName_1.CONDATE PreviousName_1.CompanyName PreviousName_2.CONDATE PreviousName_2.CompanyName PreviousName_3.CONDATE PreviousName_3.CompanyName PreviousName_4.CONDATE PreviousName_4.CompanyName PreviousName_5.CONDATE PreviousName_5.CompanyName PreviousName_6.CONDATE PreviousName_6.CompanyName PreviousName_7.CONDATE PreviousName_7.CompanyName PreviousName_8.CONDATE PreviousName_8.CompanyName PreviousName_9.CONDATE PreviousName_9.CompanyName PreviousName_10.CONDATE PreviousName_10.CompanyName ConfStmtNextDueDate ConfStmtLastMadeUpDate
25564 CHAND PROPERTIES LIMITED 09622267 NaN NaN THE OFFICE, 4A ALLENDALE ROAD CAISTER-ON-SEA GREAT YARMOUTH NORFOLK ENGLAND NR30 5ES Private Limited Company Active United Kingdom NaN 03/06/2015 30.0 6.0 30/06/2021 30/06/2019 NaN 01/07/2017 03/06/2016 0 0 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/09622267 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 17/06/2021 03/06/2020
29464 CHAPMANS,NORWICH.LIMITED 00294792 NaN NaN UNIT 1, BENNETT HOUSE NORWICH STREET DEREHAM NORFOLK NaN NR19 1BX Private Limited Company Active United Kingdom NaN 07/12/1934 31.0 10.0 31/07/2021 31/10/2019 TOTAL EXEMPTION FULL 27/09/2016 30/08/2015 1 0 0 1 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/00294792 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 13/09/2021 30/08/2020
35229 CHARNDATE LIMITED 05176236 NaN NaN YEW TREE FARM PULHAM MARKET DISS NORFOLK NaN IP21 4XN Private Limited Company Active United Kingdom NaN 12/07/2004 31.0 3.0 31/12/2021 31/03/2020 TOTAL EXEMPTION FULL 09/08/2016 12/07/2015 0 0 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/05176236 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 21/07/2021 07/07/2020
40120 CHEALL & COE PROPERTIES LTD 04956854 NaN NaN 58B SHAKESPEARE WAY TAVERHAM NORWICH NORFOLK NaN NR8 6SL Private Limited Company Active United Kingdom NaN 07/11/2003 31.0 12.0 30/09/2021 31/12/2019 TOTAL EXEMPTION FULL 05/12/2016 07/11/2015 2 2 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/04956854 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 20/12/2020 08/11/2019
43431 CHELSEA INVESTMENT PROPERTIES LTD 11646725 NaN NaN WATERLOO HOUSE 17 WATERLOO ROAD NORWICH NORFOLK UNITED KINGDOM NR3 1EH Private Limited Company Active United Kingdom NaN 29/10/2018 31.0 1.0 31/10/2021 31/01/2020 TOTAL EXEMPTION FULL 26/11/2019 NaN 11 11 0 0 68209 - Other letting and operating of own or ... NaN NaN NaN 0 0 http://business.data.gov.uk/id/company/11646725 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 11/11/2021 28/10/2020
df_estate_agents_norfolk.shape
(834, 55)

Lookalike modeling

The logical progression to this project is to use lookalike modeling. To do this, you’d first need to augment your customer data by adding in the SIC code, then examine the differences in customer performance and value by SIC code, and then use the dataset to identify more customers who look like your best ones…

While not particularly difficult, it does require some complicated fuzzy matching to allow you to connect the SIC codes of your customers (which may not be ordering via their head office address) with the addresses held in the Companies House dataset.

Matt Clarke, Saturday, March 13, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.