How to scrape JSON-LD competitor reviews using Extruct

Here's how you can use Python, Selenium, and Extruct to create a headless web browser and scrape your competitors' reviews to analyse.

Which Land Rover parts suppliers have the best customer reviews?

In the ecommerce sector, you can learn a lot about your competitors and the expectations of your customers by analysing the reviews their customers leave for products and service on platforms such as Google Reviews, Trustpilot and Feefo, and comparing them to your own.

Where are your competitors going wrong? Why do they get praised? How does their service compare to yours? What products do potential customers in your market love or hate? What products are your competitors selling lots of? By understanding these data, you can learn useful things that can be used to shape your business, whether it’s from an operations, category management or customer service perspective.

Here, we’re going to write a scraper to fetch Trustpilot reviews from a list of Land Rover parts retailers and create a dataset to analyse. If you want to fetch Feefo reviews, you can fetch them directly using the Feefo Python API.

The problem with scrapers…

Ordinarily, when you’re scraping content you’ll use a system such as Selenium or Beautiful Soup to scrape the HTML of the page and then parse and extract the content you need. However, this has two massive drawbacks. Firstly, every scraper you write needs to be specific to the site you’re scraping, and secondly, if the site changes its HTML, which is inevitable, your scraper will break.

To work around this, you can take advantage of a feature that many sites add to their pages to help search engines avoid this exact problem. Instead of scraping and parsing the HTML of the page, we’re instead going to scrape and parse the page’s Schema.org JSON-LD markup.

Quite a lot of sites add these pieces of code to their pages, so if you write a scraper to handle one, you could apply it to multiple sites. As it’s a common standard, there’s a greater likelihood that the code will parse consistently and your scraper is far less likely to need rewriting in the future.

This technique is becoming much more widespread in data science, with researchers using schema.org scraping to create a wide range of standardised datasets for various machine learning problems, such as product matching and Product Attribute Extraction (PAE).

Install the packages

We’re using three core libraries for this project: Selenium to scrape the content, Extruct to parse the JSON-LD, and Pandas to manipulate and display the data. If you don’t have these installed, you can install them via PyPI - the Python Package Index - using the commands below.

pip3 install pandas
pip3 install selenium
pip3 install extruct

For Selenium to work, you will also need to install the ChromeDriver application on your machine. On Ubuntu you can enter whereis chromedriver to determine whether ChromeDriver is installed, and if so, where it is. If this returns a blank value, you can install the package using sudo apt install chromium-chromedriver. If that works, you should see an output like this when you run whereis chromedriver:

whereis chromedriver
chromedriver: /usr/bin/chromedriver
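
If you’re not running Ubuntu, one alternative worth considering is the webdriver-manager package from PyPI, which downloads a ChromeDriver matching your installed version of Chrome. This is a minimal sketch rather than a required step - recent versions of Selenium can also fetch drivers themselves - so treat it as an optional convenience.

pip3 install webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download a ChromeDriver matching the installed Chrome version and
# point Selenium at it via a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))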

Load the libraries

Now you have the packages installed, you can import Pandas, Extruct, and the Selenium webdriver, then import Options from the selenium.webdriver.chrome.options component and By from selenium.webdriver.common.by. The Options component allows us to pass extra arguments to Selenium that can be useful when making a headless scraper, while By lets us locate elements on the page.

import pandas as pd
import extruct as ex
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

Create a list of URLs to scrape

Next we’re going to create a list of URLs for the competitors whose reviews we want to scrape. As I spend too much time and money doing up my Land Rover Defender, I’ve picked a small selection of Defender parts suppliers to scrape, which are all on the UK Trustpilot reviews website.

Go to Trustpilot (or another reviews site which uses the JSON-LD reviews schema in its pages) and note down the URL for the competitors whose reviews you want to scrape. For test purposes, I’d recommend picking one or two who have relatively small volumes of reviews.

urls = [
    'https://uk.trustpilot.com/review/www.mudstuff.co.uk',
    'https://uk.trustpilot.com/review/landroverdefendersecurity.com',
    'https://uk.trustpilot.com/review/famousfour.co.uk',
    'https://uk.trustpilot.com/review/www.bearmach.com',    
    'https://uk.trustpilot.com/review/lrparts.net',
    'https://uk.trustpilot.com/review/www.johncraddockltd.co.uk',
    'https://uk.trustpilot.com/review/www.paddockspares.com',
]

Identify the JSON-LD reviews to scrape

If you view the source code of a company’s Trustpilot reviews page and then search for the phrase json-ld, you should find some Schema.org markup like the block of code below. You’ll find one of these scripts containing the JSON content on each page of a company’s reviews. Therefore, if you scrape this block from each page, then go through all the paginated results for the company, you’ll be able to extract their reviews one page at a time.

JSON-LD is quite horrible to read, but it’s essentially a Python dictionary-like syntax containing the details of the business and each of the reviews its customers have posted on the page you’re currently viewing. That includes the reviewer’s name, the date of the review, their rating, and their comments. Everything we need is here, making this much easier than scraping everything out of the HTML.

<script type="application/ld+json" data-business-unit-json-ld>
            [{"@context":"http://schema.org","@type":"LocalBusiness","@id":"https://uk.trustpilot.com/review/www.mudstuff.co.uk","url":"http://www.mudstuff.co.uk","name":"MUD-UK","aggregateRating":{"@type":"AggregateRating","bestRating":"5","worstRating":"1","ratingValue":"4","reviewCount":"3"},"address":{"@type":"PostalAddress"},"review":[{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"jools","url":"https://uk.trustpilot.com/users/57c560330000ff000a3f509f"},"datePublished":"2020-07-27T17:26:08Z","headline":"Great products and service","reviewBody":"If only all companies were as good as Mud UK. I\u0027ve ordered a bunch of bits from them over the last year, everything has been processed efficiently and delivered on time - even through the pandemic.\nWell done Mud.","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"},{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"Theo Merchant","url":"https://uk.trustpilot.com/users/5dadfba861f8ee83db36556a"},"datePublished":"2019-10-21T18:46:38Z","headline":"Ordered a few bits from the website…","reviewBody":"Ordered a few bits from the website which came very quickly and were as described. There was an issue with PayPal where it took the payment twice but mudstuff were quick to rectify this. Great service and very helpful.","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"},{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"Christian Østerbye","url":"https://uk.trustpilot.com/users/589dba600000ff000a75ff95","image":"https://user-images.trustpilot.com/589dba600000ff000a75ff95/73x73.png"},"datePublished":"2017-02-10T13:04:42Z","headline":"Absolutely stellar customer service","reviewBody":"Always very swift at shipping the orders. Got a wrong item in the last order but the \u0027issue\u0027 was VERY efficiently and professionally resolved!","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"}]},{"@context":"http://schema.org","@type":"Dataset","name":"MUD-UK","description":"Bar chart review and ratings distribution for MUD-UK","publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[{"csvw:name":"1 star","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"2 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"3 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"4 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"5 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"3","csvw:notes":["100%"]}]},{"csvw:name":"Total","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"3","csvw:notes":["100%"]}]}]}}}]

</script>
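
Despite appearances, once this block is parsed it becomes plain nested Python dictionaries and lists. Here’s a minimal sketch using Python’s built-in json module on a trimmed-down, illustrative copy of the markup above, just to show the shape of the data we’ll be working with.

import json

# A trimmed-down, illustrative copy of the JSON-LD block above
block = '''[{"@type": "LocalBusiness", "name": "MUD-UK",
             "review": [{"@type": "Review",
                         "author": {"@type": "Person", "name": "jools"},
                         "headline": "Great products and service",
                         "reviewRating": {"@type": "Rating", "ratingValue": "5"}}]}]'''

# Once parsed, it's just dictionaries and lists all the way down
data = json.loads(block)
first_review = data[0]['review'][0]
print(first_review['author']['name'])               # jools
print(first_review['reviewRating']['ratingValue'])  # 5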

Identify the paginated review pages to scrape

Trustpilot paginates its reviews, showing 20 or so per page, so if you only scrape the first page of results you’ll get a small number of reviews that may not give a true representation of the business you’re trying to analyse.

The most common way to find the next page of reviews would be to locate the block of pagination links at the bottom and get Selenium to click the “next” one once it’s parsed each block of JSON-LD. However, there’s a much neater way to do this, which again avoids the need to rewrite your scraper if the page HTML changes.

Providing you’re looking at a business with more than one page of reviews, if you view the source again and search for the phrase rel="next" you’ll find a line of HTML added to the page to help search engines crawl the site and identify the next page.

<link rel="next" href="https://uk.trustpilot.com/review/www.bearmach.com?page=2" />

Creating your JSON-LD reviews scraper

Now we have identified the two elements we want to scrape - the JSON-LD review and the URL of the next page in the results set - we can create our reviews scraper. To make this a bit easier to interpret, I’ve written some little functions to handle each bit, so we’ll go through these first and then pull it all together.

Get the driver

First, I’ve written a function called get_driver() which creates a headless Chrome web browser and returns the Selenium driver object. Passing the --headless argument to options.add_argument() stops Selenium from spawning a new browser window every time it opens a URL. If you want to watch Selenium running, just comment that line out.

def get_driver():
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver

Get the page source

Now we have created the driver, we need to grab the page source code HTML which contains our JSON-LD and the link to the next page. The get_source() function I created takes the driver object from get_driver() along with the URL of the page of reviews, and returns the HTML source code which we can parse in the next steps.

def get_source(driver, url):
    driver.get(url)
    return driver.page_source

Scrape the JSON-LD with Extruct

To extract the JSON-LD from the page source, we pass the source code output from get_source() to the get_json() function and tell it to look for code in the json-ld syntax. There’s only one of these on the Trustpilot pages, so we don’t need to do anything else.

def get_json(source):
    return ex.extract(source, syntaxes=['json-ld'])
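
Before moving on, it’s worth checking what Extruct hands back. Assuming the urls list and the functions defined above, the result is a dictionary keyed by syntax, each value being a list of the JSON-LD blocks found on the page.

# Fetch the first page and inspect the structure Extruct returns
driver = get_driver()
source = get_source(driver, urls[0])
data = get_json(source)

print(list(data.keys()))     # ['json-ld']
print(len(data['json-ld']))  # the number of JSON-LD blocks found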

Scrape the next page URL

Next we are going to use Selenium in the more conventional manner and scrape the URL of the next page using its XPath. You can find the XPath of an element by inspecting the HTML for a given element in Chrome using the “inspect element” feature. The get_next_page() function below takes the Selenium driver object and the source from get_source() and then finds all elements with the XPath //link[@rel="next"]. If it finds any elements, it uses get_attribute('href') to return just the URL of the page.

def get_next_page(driver, source):
    """Parse the page source and return the URL for the next page of results.

    :param driver: Selenium webdriver
    :param source: Page source code from Selenium

    :return
        URL of next paginated page
    """

    # find_elements_by_xpath() was removed in Selenium 4.3, so we use
    # find_elements() with By.XPATH to locate the rel="next" link
    elements = driver.find_elements(By.XPATH, '//link[@rel="next"]')
    if elements:
        return elements[0].get_attribute('href')
    return ''

Parse the JSON-LD schema.org review

Now we’ve scraped the JSON-LD out of the page and parsed out the URL for the next page, we need to write a function to extract the reviews from the JSON-LD tag. This is arguably the hardest bit and did take some trial and error to get right. The save_reviews() function I created takes two arguments: data (which is the JSON-LD schema.org review code) and df, a Pandas DataFrame into which we’ll store the data collected. The DataFrame looks like this:

df = pd.DataFrame(columns = ['author', 'headline', 'body', 'rating', 
                             'item_reviewed', 'publisher', 'date_published'])

The save_reviews() function first finds the JSON-LD element in the code, checks to see if a review is present, and then loops through the reviews. The get() function is used to extract each element from the review. The un-nested elements can be accessed with review.get('reviewBody'), where reviewBody is the name of the element, while the nested ones need a two-layered approach, such as review.get('reviewRating', {}).get('ratingValue'). Once the elements have been extracted from the JSON, the row of data is appended to the Pandas DataFrame using pd.concat(), since the old DataFrame.append() method was removed in Pandas 2.0.

def save_reviews(data, df):
    """Scrape the individual reviews from a schema.org JSON-LD tag and
    save the contents in the given Pandas dataframe.

    :param data: JSON-LD source containing schema.org review markup
    :param df: Name of Pandas dataframe to which to append reviews

    :return
        df with reviews appended
    """

    for item in data['json-ld']:
        if "review" in item:
            for review in item['review']:

                row = {
                    'author': review.get('author', {}).get('name'),
                    'headline': review.get('headline'),
                    'body': review.get('reviewBody'),
                    'rating': review.get('reviewRating', {}).get('ratingValue'),
                    'item_reviewed': review.get('itemReviewed', {}).get('name'),
                    'publisher': review.get('publisher', {}).get('name'),
                    'date_published': review.get('datePublished')
                }

                # DataFrame.append() was removed in Pandas 2.0, so append
                # the row using pd.concat() instead
                df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

    return df
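
Before building the full crawler, you can check the whole pipeline works on a single page. This quick sketch uses a throwaway test_df so it doesn’t interfere with the df we created above.

# Run one page of reviews through the full pipeline as a test
test_df = pd.DataFrame(columns=['author', 'headline', 'body', 'rating',
                                'item_reviewed', 'publisher', 'date_published'])

driver = get_driver()
source = get_source(driver, urls[0])
test_df = save_reviews(get_json(source), test_df)
print(test_df.head())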

Create the crawler

The final step is to create the crawler or spider. This takes the list of review page URLs and loops through them, parsing and saving the reviews as it goes via the functions we created above. Selenium is quite quick at scraping the reviews from the site, but this will obviously take a long time to run if you pick a competitor with lots of reviews. You may want to add a time.sleep() call to get your crawler to pause between pages so it doesn’t hammer Trustpilot’s servers too much.

for url in urls:

    print(url)

    # Save the reviews from the first page
    driver = get_driver()
    source = get_source(driver, url)
    data = get_json(source)
    df = save_reviews(data, df)

    # Get the reviews on each paginated page by following each
    # rel="next" link until there are no more pages left
    next_page = get_next_page(driver, source)
    paginated_urls = [next_page]

    for url in paginated_urls:

        if url:

            print(url)
            driver = get_driver()
            source = get_source(driver, url)
            data = get_json(source)
            df = save_reviews(data, df)
            next_page = get_next_page(driver, source)
            paginated_urls.append(next_page)

https://uk.trustpilot.com/review/www.mudstuff.co.uk
https://uk.trustpilot.com/review/landroverdefendersecurity.com
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=2
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=3
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=4
https://uk.trustpilot.com/review/famousfour.co.uk

Check the results

If you put this all together and run it, then go away for a cup of tea, it should have scraped and parsed all the review content and placed it neatly into a Pandas DataFrame for you to analyse. If you use the to_csv() function, you can write the output to a file and avoid the hassle of scraping it again in future.

df.to_csv('reviews.csv')
df.head(1000)
author headline body rating item_reviewed publisher date_published
0 jools Great products and service If only all companies were as good as Mud UK. ... 5 MUD-UK Trustpilot 2020-07-27T17:26:08Z
1 Theo Merchant Ordered a few bits from the website… Ordered a few bits from the website which came... 5 MUD-UK Trustpilot 2019-10-21T18:46:38Z
2 Christian Østerbye Absolutely stellar customer service Always very swift at shipping the orders. Got ... 5 MUD-UK Trustpilot 2017-02-10T13:04:42Z
3 Dominic Ferrar Great customer service When I called to discuss my potential order th... 5 LRD Security Trustpilot 2020-03-19T20:00:52Z
4 Mr Ian Winskill Happy customer Promt and professional service 5 LRD Security Trustpilot 2020-03-19T16:25:34Z
... ... ... ... ... ... ... ...
995 Paul Callow Easy to deal with and very quick Easy to deal with and very quick delivery (3 d... 5 Famous Four Trustpilot 2018-05-08T20:01:02Z
996 morten lund Everything I need !! Everything I need !! 5 Famous Four Trustpilot 2018-05-08T17:17:43Z
997 CustomerM Williams brilliant service couldnt ask for better brilliant service couldnt ask for better 5 Famous Four Trustpilot 2018-05-08T16:33:29Z
998 Dag Lislerud Midtfjeld Fast and reliable Fast and reliable 5 Famous Four Trustpilot 2018-05-08T15:47:23Z
999 mohammed alsheheri saudi shipping Excellent with DHL saudi shipping Excellent with DHL 5 Famous Four Trustpilot 2018-05-08T14:21:28Z
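
As a quick example of the sort of analysis you can now run, the sketch below converts the rating column to a number and compares each retailer’s review volume and average score. Depending on what you scraped, your data may need a little extra cleaning first.

# Convert the rating from a string to an integer
df['rating'] = df['rating'].astype(int)

# Compare review volumes and average ratings by retailer
summary = df.groupby('item_reviewed')['rating'].agg(['count', 'mean'])
print(summary.sort_values(by='mean', ascending=False))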

Further reading

In 2020, researchers from the University of Mannheim used schema.org annotations to train a machine learning model for product matching and achieved a state-of-the-art F1 score of 0.95, demonstrating the power that these data can bring to your models.

  • Peeters, R., Primpeli, A., Wichtlhuber, B. and Bizer, C., 2020, June. Using schema.org annotations for training and maintaining product matchers. In Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics (pp. 195-204).

Matt Clarke, Tuesday, March 02, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.