How to scrape competitor technology data in Python

Learn how to automate the collection of website technology data from your competitors using Builtwith and Python.


In ecommerce, it pays to watch what your competitors are doing, so over the past decade or so in which I’ve managed ecommerce businesses, I’ve regularly undertaken competitor analyses. They’re a great way to see how your site compares to others as they allow you to identify how you can make further improvements to move (or stay) ahead of rivals.

Obviously, competitor analyses can encompass a huge range of different factors, and some of the data can be complex and time consuming to extract, which can make the process a little dull. Thankfully, some aspects of building a dataset for an ecommerce competitor analysis project can be automated. Here’s how you can quickly generate some data on the site technologies used by your competitors using Python, so you can spend more time doing the things that code can’t easily do.

Load the libraries

Rather than writing this from scratch, we’ll be using a Python package called “Builtwith”. This is based on a script which fetches the source of a given URL and then uses a series of Python regular expressions to detect the client-side technologies the site is using. It doesn’t cover everything, but it’s pretty good and way faster than checking manually or reinventing the wheel yourself. WAD is a similar package, based on Wappalyzer’s detection rules, but it doesn’t categorise the technologies it finds as Builtwith does.
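To make the idea of regex-based detection concrete, here’s a minimal sketch of the general approach. The two signatures below are hypothetical and hugely simplified, and this is not Builtwith’s actual rule set or code; it just shows how matching patterns against page source can be mapped to categorised technologies.

```python
import re

# Hypothetical, simplified signatures for illustration only. The real
# package ships a much larger rule set of its own.
SIGNATURES = {
    'javascript-frameworks': {
        'jQuery': re.compile(r'jquery[\w.-]*\.js', re.I),
    },
    'tag-managers': {
        'Google Tag Manager': re.compile(r'googletagmanager\.com/gtm\.js', re.I),
    },
}


def detect(html):
    """Return {category: [technology, ...]} for every signature found in html."""
    found = {}
    for category, patterns in SIGNATURES.items():
        for name, pattern in patterns.items():
            if pattern.search(html):
                found.setdefault(category, []).append(name)
    return found


# A made-up page fragment to run the detector against.
sample_html = """
<script src="/static/js/jquery-3.5.1.min.js"></script>
<script src="https://www.googletagmanager.com/gtm.js?id=GTM-XXXX"></script>
"""

print(detect(sample_html))
```

Anything that never appears in the page source, such as a server-side search engine, can’t be matched this way, which is why client-side detection has blind spots.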

To get started, load up Pandas and Builtwith, which can both be installed from PyPI with pip3 install pandas and pip3 install builtwith. Additionally, so that we can see all of the rows Pandas returns, I’ve used pd.set_option('display.max_rows', 100) to display up to 100 rows, and pd.set_option('display.max_colwidth', 700) to make the columns wide enough to show any long strings returned by Builtwith.

import pandas as pd
from builtwith import builtwith

# Display up to 100 rows and up to 700 characters per column.
# The full 'display.' prefix avoids ambiguous option names in newer Pandas versions.
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 700)

Create your competitive set

If you work in ecommerce you’ll likely already have a “competitive set” or list of competitors you watch more closely than others. These are typically the ones against whom you compare and benchmark your own business, perhaps because they compete with you most strongly on product range, organic search, or paid search. Add your competitive set of URLs to a Python list.

When undertaking ecommerce competitor analysis, one really important thing you should do is look at sites outside your industry, as well as inside. This is particularly important when you work in a niche or specialist ecommerce sector. Here, it’s quite common for the businesses to be smaller, so they have less knowledgeable staff and less money to throw at technology, development and customer service.

In these markets, it might not take very much to become the best in the sector, because everyone else is below the usual average. However, don’t fall into the trap of thinking you’re doing brilliantly because you’re the best of a bad bunch. When customers are judging your site, they won’t be subliminally comparing it to the other specialists in your sector; they’ll be comparing it to the sites they use every day, whether it’s Amazon, Asos, or Etsy. Comparing yourself and your rivals against major ecommerce sites can therefore be a useful benchmark.

sites = [
    'https://www.missguided.co.uk',
    'https://www.prettylittlething.com',
    'https://www.asos.com',    
    'https://www.topshop.com',    
    'https://www.missselfridge.com',    
    'https://www.newlook.com',
    'https://www.dorothyperkins.com',
    'https://www.riverisland.com',    
    'https://www.allsaints.com',     
]

Create a function to parse the Builtwith data

By default, Builtwith returns a large Python dictionary containing all the client-side technologies it has been able to identify. Since this isn’t particularly easy for other humans to read, we’ll write a little function to loop over each of the sites in the competitive set, identify the given technology in which we’re interested, and add the data to a Pandas DataFrame. I added a verbose flag to mine so it prints its progress, as it can take some time to fetch all of the data if you have a large list.

def create_dataframe(sites, technology, verbose=False):
    """Return a Pandas DataFrame showing the specific technology used on each site.

    :param sites: Python list of site URLs to check
    :param technology: Technology to check, e.g. cdn, ecommerce, web-servers (from Builtwith)
    :param verbose: Print each site URL as it is checked
    :return: Pandas DataFrame containing data on the specific feature
    """

    rows = []

    for site in sites:

        if verbose:
            print('Checking', site)

        # builtwith() returns a dictionary keyed by technology category
        data = builtwith(site)
        result = data.get(technology)

        if not result:
            result = 'Not detected'

        rows.append({
            'site': site,
            technology: result,
        })

    # DataFrame.append() was removed in Pandas 2.0, so build from a list of rows
    return pd.DataFrame(rows, columns=['site', technology])
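One thing to be aware of is that create_dataframe() fetches every site again for each technology you check, so ten categories means ten requests per site. A possible alternative, sketched below, is to fetch each site once and pull out several categories at the same time. The collect_rows() function and its fetch parameter are my own assumptions rather than part of Builtwith, and fake_fetch() is a stand-in that returns a few values from the results in this article so the code runs without network access.

```python
def collect_rows(sites, technologies, fetch, verbose=False):
    """Fetch each site once and return one row dictionary per site.

    fetch is any callable mapping a URL to a {category: [names]} dictionary,
    such as the builtwith function. It is a parameter here (an assumption
    of this sketch) so the logic can be tried without network access.
    """
    rows = []
    for site in sites:
        if verbose:
            print('Checking', site)
        data = fetch(site) or {}
        row = {'site': site}
        for technology in technologies:
            row[technology] = data.get(technology) or 'Not detected'
        rows.append(row)
    return rows


def fake_fetch(url):
    """Stand-in for builtwith(), using values reported in this article."""
    sample = {
        'https://www.asos.com': {'cdn': ['Akamai']},
        'https://www.prettylittlething.com': {
            'web-frameworks': ['Twitter Bootstrap'],
            'tag-managers': ['Google Tag Manager'],
        },
    }
    return sample.get(url, {})


rows = collect_rows(
    ['https://www.asos.com', 'https://www.prettylittlething.com'],
    ['cdn', 'tag-managers'],
    fetch=fake_fetch,
)
print(rows)
```

With the real library, passing fetch=builtwith and wrapping the result in pd.DataFrame(rows) should give one wide DataFrame with a column per technology, with no merging needed afterwards.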

Fetch the data

Now we can identify the technologies found by Builtwith and print the outputs to a series of DataFrames to examine what our rivals are using. Obviously, while these might be interesting to know, there might not be much competitive advantage in knowing some of them. However, some of the data are certainly of use for competitor monitoring and they will show trends in your market and perhaps identify areas where you should be looking to improve your site functionality to get closer to the competition or help you stay ahead.

Available technologies are: cms, message-boards, database-managers, documentation-tools, widgets, ecommerce, photo-galleries, wikis, hosting-panels, analytics, blogs, javascript-frameworks, issue-trackers, video-players, comment-systems, captchas, font-scripts, web-frameworks, miscellaneous, editors, lms, web-servers, cache-tools, rich-text-editors, javascript-graphics, mobile-frameworks, programming-languages, operating-systems, search-engines, web-mail, cdn, marketing-automation, web-server-extensions, databases, maps, advertising-networks, network-devices, media-servers, webcams, printers, payment-processors, tag-managers, paywalls, build-ci-systems, control-systems, remote-access, dev-tools, network-storage, feed-readers, document-management-systems, and landing-page-builders.
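Because the lookup uses the dictionary’s get() method, a mistyped category name such as web-server won’t raise an error; it will just come back as “Not detected” for every site. A small guard like the sketch below can catch that. The check_technology() helper is hypothetical, and the set only includes a handful of the categories listed above to keep it short.

```python
from difflib import get_close_matches

# A subset of the category names listed above, for brevity.
KNOWN_TECHNOLOGIES = {
    'cms', 'ecommerce', 'analytics', 'blogs', 'javascript-frameworks',
    'web-frameworks', 'web-servers', 'search-engines', 'cdn',
    'marketing-automation', 'advertising-networks', 'payment-processors',
    'tag-managers',
}


def check_technology(technology):
    """Return the name unchanged, or raise ValueError if it is not recognised."""
    if technology in KNOWN_TECHNOLOGIES:
        return technology
    # Suggest the closest known names to help spot typos.
    suggestions = get_close_matches(technology, sorted(KNOWN_TECHNOLOGIES), n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ''
    raise ValueError(f'Unknown technology category {technology!r}.{hint}')


print(check_technology('cdn'))
```

Calling check_technology('web-server') raises a ValueError rather than silently producing a column of “Not detected” values.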

Unfortunately, Builtwith is currently missing the required regular expressions to detect CRMs, chat systems, and mail systems such as Mailchimp, which could be useful to know. Of course, it may also fail to detect any system in which the code isn’t served to the URL you provide. If the blog sits on a subdomain, Builtwith won’t go and look for it. It also doesn’t look for server-side technologies. However, it’s still faster than poring through the code by eye.
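If you want to check areas of a site that live elsewhere, you need to supply those URLs yourself. The sketch below generates a couple of guessed blog locations for a site. The candidate_blog_urls() helper and the two URL patterns are purely assumptions, so many of the guesses won’t exist on a real site and you should expect some failed requests.

```python
from urllib.parse import urlsplit


def candidate_blog_urls(site):
    """Return guessed blog locations for a site URL (guesses, not known addresses)."""
    parts = urlsplit(site)
    host = parts.netloc
    # Drop a leading 'www.' so the subdomain guess reads blog.example.com
    bare = host[4:] if host.startswith('www.') else host
    return [
        f'{parts.scheme}://blog.{bare}',
        f'{parts.scheme}://{host}/blog',
    ]


print(candidate_blog_urls('https://www.example.com'))
```

Each candidate could then be added to the list of sites to check, ideally after confirming by eye that the page really exists.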

Ecommerce

Unfortunately, Builtwith failed to detect the ecommerce platform upon which the sites were running. That might be because they’re custom platforms or because any client-side code exposing the platform has been removed or obfuscated. When I tried this on smaller sites it correctly detected common platforms such as WooCommerce, Magento, and Shopify, but these huge sites are unlikely to be running these smaller platforms.

df_ecommerce = create_dataframe(sites, 'ecommerce', verbose=True)
df_ecommerce
Checking https://www.missguided.co.uk
Checking https://www.prettylittlething.com
Checking https://www.asos.com
Checking https://www.topshop.com
Checking https://www.missselfridge.com
Checking https://www.newlook.com
Checking https://www.dorothyperkins.com
Checking https://www.riverisland.com
Checking https://www.allsaints.com
site ecommerce
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected

Content Delivery Networks (CDNs)

The Akamai content delivery network seems to be the most popular choice in the UK women’s fashion market. Elsewhere, CloudFlare seems to be used quite widely.

df_cdn = create_dataframe(sites, 'cdn', verbose=False)
df_cdn
site cdn
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com [Akamai]
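To put numbers on how widely each technology is used, you can count the values in a column. Here’s a minimal sketch that uses the CDN results for all nine sites, copied from the combined table in the “Bringing it all together” section. The count_technologies() helper is my own, and it treats the “Not detected” placeholder as a value in its own right.

```python
from collections import Counter

# CDN values for the nine sites, in the order they appear in the combined table.
cdn_results = [
    'Not detected', 'Not detected', ['Akamai'], ['Akamai'], ['Akamai'],
    'Not detected', ['Akamai'], 'Not detected', ['CloudFlare'],
]


def count_technologies(results):
    """Count how many sites use each technology in a column of results."""
    counts = Counter()
    for result in results:
        if isinstance(result, str):
            # The 'Not detected' placeholder is a plain string
            counts[result] += 1
        else:
            # Detected technologies arrive as a list of names
            counts.update(result)
    return counts


cdn_counts = count_technologies(cdn_results)
print(cdn_counts.most_common())
```

On a DataFrame column, something like df_cdn['cdn'].explode().value_counts() should give a similar tally.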

Web frameworks

More than half of the sites examined used Twitter Bootstrap as their front-end framework, which is unsurprising given how widespread it has become on the web in recent years.

df_web_frameworks = create_dataframe(sites, 'web-frameworks', verbose=False)
df_web_frameworks
site web-frameworks
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com [Twitter Bootstrap]
2 https://www.asos.com Not detected
3 https://www.topshop.com [Twitter Bootstrap]
4 https://www.missselfridge.com [Twitter Bootstrap]

JavaScript frameworks

A range of JavaScript frameworks were in use. Pretty Little Thing was unusual in using Handlebars, but most sites used Prototype and jQuery, with several also using RequireJS. All Saints used a plethora of frameworks.

df_javascript_frameworks = create_dataframe(sites, 'javascript-frameworks', verbose=False)
df_javascript_frameworks
site javascript-frameworks
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com [Handlebars]
2 https://www.asos.com Not detected
3 https://www.topshop.com [Prototype, React, RequireJS, basket.js, jQuery]
4 https://www.missselfridge.com [Prototype, React, RequireJS, Vue.js, basket.js]

Blogs

As I didn’t provide the URLs for any blog pages, Builtwith wasn’t able to detect them. Chances are, for security reasons, the blog will be running on a separate server and served up using a subdomain or proxy, instead of being physically present on the main ecommerce platform.

df_blogs = create_dataframe(sites, 'blogs', verbose=False)
df_blogs
site blogs
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected
4 https://www.missselfridge.com Not detected
5 https://www.newlook.com Not detected
6 https://www.dorothyperkins.com Not detected
7 https://www.riverisland.com Not detected
8 https://www.allsaints.com Not detected

Search engines

As with the blog, the search engine technologies used were not detected by Builtwith. I suspect most sites will be using Elasticsearch or Solr, maybe even with a more sophisticated machine learning backend powered by a click model using the Learning to Rank modeling technique. As these technologies are server-side, they’re unlikely to be detected on the client side, but smaller sites may well expose their technologies in their source code.

df_search_engines = create_dataframe(sites, 'search-engines', verbose=False)
df_search_engines
site search-engines
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected
4 https://www.missselfridge.com Not detected
5 https://www.newlook.com Not detected
6 https://www.dorothyperkins.com Not detected
7 https://www.riverisland.com Not detected
8 https://www.allsaints.com Not detected

Marketing automation

Builtwith is capable of detecting Eloqua, HubSpot, Jirafe, Marketo, and Pardot marketing automation platforms, so it’s a little limited and didn’t detect any of these on our sample competitor list. You will likely be able to determine these technologies by examining the code and URLs used in their marketing emails instead.

df_marketing_automation = create_dataframe(sites, 'marketing-automation', verbose=False)
df_marketing_automation
site marketing-automation
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected
4 https://www.missselfridge.com Not detected
5 https://www.newlook.com Not detected
6 https://www.dorothyperkins.com Not detected
7 https://www.riverisland.com Not detected
8 https://www.allsaints.com Not detected
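If you do want to examine marketing emails, a sensible first step is to list the hostnames used in their links and images, since tracking and image domains often hint at the platform sending them. The sketch below does this with the standard library. The LinkHostCounter class is my own, the email HTML is made up, and it uses placeholder example.com domains, so it doesn’t map hostnames to any specific vendor.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkHostCounter(HTMLParser):
    """Count the hostnames that appear in href and src attributes."""

    def __init__(self):
        super().__init__()
        self.hosts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                host = urlsplit(value).hostname
                if host:
                    self.hosts[host] += 1


# Made-up email markup using placeholder domains.
email_html = """
<a href="https://click.mail.example.com/track?id=1">Shop now</a>
<a href="https://click.mail.example.com/track?id=2">New in</a>
<img src="https://images.example.net/open.gif">
"""

parser = LinkHostCounter()
parser.feed(email_html)
print(parser.hosts.most_common())
```

You would still need to look up the most common hostnames by hand to work out which platform they belong to.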

Payment processors

PayPal, Stripe, and Google Wallet are the only payment processors Builtwith can detect on the client side, and none of them were found on these sites.

df_payment_processors = create_dataframe(sites, 'payment-processors', verbose=False)
df_payment_processors
site payment-processors
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected

Tag managers

Google Tag Manager is the dominant tag management system in every market I’ve examined, with more than half of all sites incorporating this technology. This may also explain why some other tags are harder to identify within the pages, as their actual code is injected via Tag Manager.

df_tag_managers = create_dataframe(sites, 'tag-managers', verbose=False)
df_tag_managers
site tag-managers
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com [Google Tag Manager]
2 https://www.asos.com Not detected
3 https://www.topshop.com [Google Tag Manager]
4 https://www.missselfridge.com [Google Tag Manager]

Analytics

It’s probable that at least half of these sites are using Google Analytics or Adobe Analytics. However, they’ll likely be serving these tags via a tag manager, so the code wasn’t detectable.

df_analytics = create_dataframe(sites, 'analytics', verbose=False)
df_analytics
site analytics
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected
2 https://www.asos.com Not detected
3 https://www.topshop.com Not detected

Advertising networks

As with analytics tags, it’s probable that advertising network tags are going to be served up via a tag management platform and may be present only on specific pages, so Builtwith failed to detect them on these very large sites.

df_advertising_networks = create_dataframe(sites, 'advertising-networks', verbose=False)
df_advertising_networks
site advertising-networks
0 https://www.missguided.co.uk Not detected
1 https://www.prettylittlething.com Not detected

Bringing it all together

Finally, now that we’ve collected all of the data we want, we can pull it all together into a single DataFrame and save it to a CSV file. Builtwith collected quite a bit of data from these sites, but their large size and use of server-side technologies and tag management meant that some of it wasn’t easily detectable.

As a result, it turns out that there’s not a huge amount of actionable data in here, but it’s still interesting. Collecting it in this way was hundreds of times faster than checking each site by hand, and the code can quickly be re-run in future to look for any changes. If you work in a niche sector, with smaller, less technologically advanced sites, you’ll likely get better results.

df = df_ecommerce.merge(df_cdn, how='left', on='site')
df = df.merge(df_web_frameworks, how='left', on='site')
df = df.merge(df_javascript_frameworks, how='left', on='site')
df = df.merge(df_blogs, how='left', on='site')
df = df.merge(df_search_engines, how='left', on='site')
df = df.merge(df_marketing_automation, how='left', on='site')
df = df.merge(df_payment_processors, how='left', on='site')
df = df.merge(df_analytics, how='left', on='site')
df = df.merge(df_advertising_networks, how='left', on='site')
df = df.merge(df_tag_managers, how='left', on='site')
df = df.drop_duplicates('site', keep='first')
# Assign the result, as set_index() returns a new DataFrame rather than modifying df
df = df.set_index('site')
df
ecommerce cdn web-frameworks javascript-frameworks analytics advertising-networks tag-managers
site
https://www.missguided.co.uk Not detected Not detected Not detected Not detected Not detected Not detected Not detected
https://www.prettylittlething.com Not detected Not detected [Twitter Bootstrap] [Handlebars] Not detected Not detected [Google Tag Manager]
https://www.asos.com Not detected [Akamai] Not detected Not detected Not detected Not detected Not detected
https://www.topshop.com Not detected [Akamai] [Twitter Bootstrap] [Prototype, React, RequireJS, basket.js, jQuery] Not detected Not detected [Google Tag Manager]
https://www.missselfridge.com Not detected [Akamai] [Twitter Bootstrap] [Prototype, React, RequireJS, Vue.js, basket.js] Not detected Not detected [Google Tag Manager]
https://www.newlook.com Not detected Not detected Not detected [jQuery, jQuery UI] Not detected Not detected Not detected
https://www.dorothyperkins.com Not detected [Akamai] [Twitter Bootstrap] [Prototype, React, RequireJS, basket.js] Not detected Not detected [Google Tag Manager]
https://www.riverisland.com Not detected Not detected Not detected [Prototype, RequireJS, jQuery] Not detected Not detected [Google Tag Manager]
https://www.allsaints.com Not detected [CloudFlare] [Twitter Bootstrap] [Hammer.js, Prototype, RequireJS, Select2, jQuery, Twitter typeahead.js, Underscore.js, spin.js] Not detected Not detected Not detected

Note: DataFrame truncated to fit.

df.to_csv('competitor-technology.csv')
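Since the main benefit of automating this is being able to re-run it later, it’s worth thinking about how you would compare two runs. The sketch below compares two snapshots of a single column and reports what was added or removed for each site. The diff_snapshots() and as_set() helpers are hypothetical, and the old and new values are invented purely to show the output format.

```python
def as_set(result):
    """Normalise a cell to a set of names (the 'Not detected' string becomes empty)."""
    if isinstance(result, str):
        return set()
    return set(result)


def diff_snapshots(old, new):
    """Compare {site: result} mappings and return {site: (added, removed)} for changed sites."""
    changes = {}
    for site in sorted(set(old) | set(new)):
        before = as_set(old.get(site, 'Not detected'))
        after = as_set(new.get(site, 'Not detected'))
        if before != after:
            changes[site] = (sorted(after - before), sorted(before - after))
    return changes


# Invented values for a placeholder site, purely to show the output format.
old = {'https://www.example.com': ['Prototype', 'jQuery']}
new = {'https://www.example.com': ['React', 'jQuery']}

print(diff_snapshots(old, new))
```

Bear in mind that lists saved to CSV are read back as plain strings, so values loaded from an earlier file would need converting back to lists (for example with ast.literal_eval) before being compared like this.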

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
