In the ecommerce sector, you can learn a lot about your competitors and the expectations of your customers by analysing the reviews their customers leave for products and services on platforms such as Google Reviews, Trustpilot and Feefo, and comparing them to your own.
Where are your competitors going wrong? Why do they get praised? How does their service compare to yours? What products do potential customers in your market love or hate? Which products are your competitors selling in volume? By understanding these data, you can learn things that will help shape your business, whether from an operations, category management or customer service perspective.
Here, we’re going to write a scraper to fetch Trustpilot reviews from a list of Land Rover parts retailers and create a dataset to analyse. If you want Feefo reviews instead, you can retrieve them directly using the Feefo Python API.
Ordinarily, when you’re scraping content you’ll use a tool such as Selenium or Beautiful Soup to fetch the HTML of the page and then parse and extract the content you need. However, this approach has two major drawbacks. Firstly, every scraper you write needs to be specific to the site you’re scraping, and secondly, if the site changes its HTML, which is inevitable, your scraper will break.
To work around this, you can take advantage of a feature that many sites add to their pages to help search engines avoid this exact problem. Instead of scraping and parsing the HTML of the page, we’re instead going to scrape and parse the page’s Schema.org JSON-LD markup.
Quite a lot of sites add these pieces of code to their pages, so if you write a scraper to handle one, you could apply it to multiple sites. As it’s a common standard, there’s a greater likelihood that the code will parse consistently and your scraper is far less likely to need rewriting in the future.
This technique is becoming much more widespread in data science, with researchers using Schema.org scraping to create a wide range of standardised datasets for various machine learning problems, such as product matching and product attribute extraction (PAE).
We’re using three core libraries for this project: Selenium to scrape the content, Extruct to parse the JSON-LD, and Pandas to manipulate and display the data. If you don’t have these installed, you can install them from PyPI, the Python Package Index, using the commands below.
pip3 install pandas
pip3 install selenium
pip3 install extruct
For Selenium to work, you will also need the ChromeDriver application installed on your machine. On Ubuntu, you can run whereis chromedriver to determine whether ChromeDriver is installed and, if so, where it is. If this returns a blank value, you can install the package using sudo apt install chromium-chromedriver. Once installed, whereis chromedriver should return an output like this:
whereis chromedriver
chromedriver: /usr/bin/chromedriver
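You can also check for ChromeDriver from Python itself. A minimal sketch using the standard library’s shutil.which(), which looks the binary up on your PATH:

```python
import shutil

# Check whether the chromedriver binary is on the PATH before running Selenium.
path = shutil.which('chromedriver')
print(path or 'chromedriver not found - install it before continuing')
```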
Now you have the packages installed, you can import Pandas, Extruct, and the Selenium webdriver, then import Options from the selenium.webdriver.chrome.options module. The Options component allows us to pass extra arguments to Selenium that are useful when making a headless scraper.
import pandas as pd
import extruct as ex
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
Next we’re going to create a list of URLs for the competitors whose reviews we want to scrape. As I spend too much time and money doing up my Land Rover Defender, I’ve picked a small selection of Defender parts suppliers to scrape, which are all on the UK Trustpilot reviews website.
Go to Trustpilot (or another reviews site which uses the JSON-LD reviews schema in its pages) and note down the URL for the competitors whose reviews you want to scrape. For test purposes, I’d recommend picking one or two who have relatively small volumes of reviews.
urls = [
'https://uk.trustpilot.com/review/www.mudstuff.co.uk',
'https://uk.trustpilot.com/review/landroverdefendersecurity.com',
'https://uk.trustpilot.com/review/famousfour.co.uk',
'https://uk.trustpilot.com/review/www.bearmach.com',
'https://uk.trustpilot.com/review/lrparts.net',
'https://uk.trustpilot.com/review/www.johncraddockltd.co.uk',
'https://uk.trustpilot.com/review/www.paddockspares.com',
]
If you view the source code of a company’s Trustpilot reviews page and search for the phrase json-ld, you should find some Schema.org markup like the block of code below. One of these scripts containing the JSON content appears on each page of a company’s reviews, so if you scrape this block from each page and work through all the paginated results for the company, you can extract their reviews one page at a time.
JSON-LD is quite hard to read, but it uses a Python-dictionary-like syntax to hold the details of the business and each of the reviews its customers have posted on the page you’re currently viewing: the reviewer’s name, the date of the review, their rating, and their comments. Everything we need is here, making this much easier than scraping it all out of the HTML.
<script type="application/ld+json" data-business-unit-json-ld>
[{"@context":"http://schema.org","@type":"LocalBusiness","@id":"https://uk.trustpilot.com/review/www.mudstuff.co.uk","url":"http://www.mudstuff.co.uk","name":"MUD-UK","aggregateRating":{"@type":"AggregateRating","bestRating":"5","worstRating":"1","ratingValue":"4","reviewCount":"3"},"address":{"@type":"PostalAddress"},"review":[{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"jools","url":"https://uk.trustpilot.com/users/57c560330000ff000a3f509f"},"datePublished":"2020-07-27T17:26:08Z","headline":"Great products and service","reviewBody":"If only all companies were as good as Mud UK. I\u0027ve ordered a bunch of bits from them over the last year, everything has been processed efficiently and delivered on time - even through the pandemic.\nWell done Mud.","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"},{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"Theo Merchant","url":"https://uk.trustpilot.com/users/5dadfba861f8ee83db36556a"},"datePublished":"2019-10-21T18:46:38Z","headline":"Ordered a few bits from the website…","reviewBody":"Ordered a few bits from the website which came very quickly and were as described. There was an issue with PayPal where it took the payment twice but mudstuff were quick to rectify this. 
Great service and very helpful.","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"},{"@type":"Review","itemReviewed":{"@type":"Thing","name":"MUD-UK"},"author":{"@type":"Person","name":"Christian Østerbye","url":"https://uk.trustpilot.com/users/589dba600000ff000a75ff95","image":"https://user-images.trustpilot.com/589dba600000ff000a75ff95/73x73.png"},"datePublished":"2017-02-10T13:04:42Z","headline":"Absolutely stellar customer service","reviewBody":"Always very swift at shipping the orders. Got a wrong item in the last order but the \u0027issue\u0027 was VERY efficiently and professionally resolved!","reviewRating":{"@type":"Rating","bestRating":"5","worstRating":"1","ratingValue":"5"},"publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"inLanguage":"en"}]},{"@context":"http://schema.org","@type":"Dataset","name":"MUD-UK","description":"Bar chart review and ratings distribution for MUD-UK","publisher":{"@type":"Organization","name":"Trustpilot","sameAs":"https://uk.trustpilot.com"},"mainEntity":{"@type":"csvw:Table","csvw:tableSchema":{"csvw:columns":[{"csvw:name":"1 star","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"2 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"3 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"4 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"0","csvw:notes":["0%"]}]},{"csvw:name":"5 stars","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"3","csvw:notes":["100%"]}]},{"csvw:name":"Total","csvw:datatype":"integer","csvw:cells":[{"csvw:value":"3","csvw:notes":["100%"]}]}]}}}]
</script>
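To get a feel for the structure before writing the scraper, here’s a stdlib-only sketch that parses a trimmed-down version of the markup above and pulls out the review fields. The markup string is abridged from the example, not fetched live:

```python
import json

# An abridged version of the JSON-LD block above (one review only).
markup = '''
[{"@context": "http://schema.org",
  "@type": "LocalBusiness",
  "name": "MUD-UK",
  "review": [{"@type": "Review",
    "author": {"@type": "Person", "name": "jools"},
    "datePublished": "2020-07-27T17:26:08Z",
    "headline": "Great products and service",
    "reviewBody": "If only all companies were as good as Mud UK.",
    "reviewRating": {"@type": "Rating", "ratingValue": "5"}}]}]
'''

# JSON-LD parses to an ordinary list of dicts, so we can walk it directly.
data = json.loads(markup)
for item in data:
    for review in item.get('review', []):
        print(review['author']['name'], review['reviewRating']['ratingValue'])
```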
Trustpilot paginates its reviews, showing 20 or so per page, so if you only scrape the first page of results you’ll get a small number of reviews that may not give a true representation of the business you’re trying to analyse.
The most common way to find the next page of reviews would be to find the block of pagination links at the bottom and get Selenium to click the “next” one once it has parsed each block of JSON-LD. However, there’s a much neater way to do this, which again avoids the need to rewrite your scraper if the page HTML changes.
Providing you’re looking at a business with more than one page of reviews, if you view the source again and search for the phrase rel="next", you’ll find a line of HTML added to the page to help search engines crawl the site and identify the next page.
<link rel="next" href="https://uk.trustpilot.com/review/www.bearmach.com?page=2" />
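As an aside, you don’t strictly need a browser to pull this link out of the source. A minimal sketch with the standard library’s html.parser, fed the example line above as a string:

```python
from html.parser import HTMLParser

class NextLinkParser(HTMLParser):
    """Find the href of a <link rel="next"> tag in page source."""

    def __init__(self):
        super().__init__()
        self.next_url = ''

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'link' and attributes.get('rel') == 'next':
            self.next_url = attributes.get('href', '')

parser = NextLinkParser()
parser.feed('<link rel="next" href="https://uk.trustpilot.com/review/www.bearmach.com?page=2" />')
print(parser.next_url)
```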
Now that we have identified the two elements we want to scrape - the JSON-LD reviews and the URL of the next page in the results set - we can create our reviews scraper. To make this a bit easier to interpret, I’ve written some little functions to handle each bit, so we’ll go through these first and then pull it all together.
First, I’ve written a function called get_driver() which creates a headless Chrome web browser and returns the Selenium driver object. Passing the --headless argument to options.add_argument() stops Selenium from spawning a new browser window every time it opens a URL. If you want to watch Selenium running, just comment that line out.
def get_driver():
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver
Next, we need to grab the page source HTML, which contains our JSON-LD and the link to the next page. The get_source() function takes the driver object from get_driver() along with the URL of a page of reviews, and returns the HTML source code, which we can parse in the next steps.
def get_source(driver, url):
    driver.get(url)
    return driver.page_source
To extract the JSON-LD from the page source, we pass the source code returned by get_source() to the get_json() function and tell Extruct to look only for code in the json-ld syntax. There’s only one of these blocks on each Trustpilot page, so we don’t need to do anything else.
def get_json(source):
    return ex.extract(source, syntaxes=['json-ld'])
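If you’d rather not depend on Extruct, the same result can be approximated for this one syntax with the standard library. This is a hedged sketch, not a replacement for Extruct - it only handles well-formed pages and the JsonLdParser name is my own:

```python
import json
from html.parser import HTMLParser

class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_json_ld = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_json_ld = False

    def handle_data(self, data):
        # Script contents arrive here as raw text, so parse them as JSON.
        if self._in_json_ld:
            self.blocks.append(json.loads(data))

parser = JsonLdParser()
parser.feed('<html><script type="application/ld+json">[{"@type": "LocalBusiness"}]</script></html>')
print(parser.blocks[0])
```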
Next, we’ll use Selenium in the more conventional manner and scrape the URL of the next page using its XPath. You can find the XPath of an element by inspecting the HTML in Chrome using the “inspect element” feature. The get_next_page() function below takes the Selenium driver object and the source from get_source(), and finds all elements matching the XPath //link[@rel="next"]. If it finds any, it uses get_attribute('href') to return just the URL of the next page.
def get_next_page(driver, source):
    """Parse the page source and return the URL for the next page of results.

    :param driver: Selenium webdriver
    :param source: Page source code from Selenium
    :return: URL of the next paginated page, or '' if there isn't one
    """
    elements = driver.find_elements_by_xpath('//link[@rel="next"]')
    if elements:
        return elements[0].get_attribute('href')
    return ''
Now we have scraped the JSON-LD out of the page and parsed out the URL for the next page, we need to write a function to scrape the reviews from the JSON-LD tag. This is arguably the hardest bit and took some trial and error to get right. The save_reviews() function I created takes two arguments: data (the JSON-LD Schema.org review code) and df, the Pandas DataFrame into which we’ll store the data collected. The DataFrame looks like this:
df = pd.DataFrame(columns=['author', 'headline', 'body', 'rating',
                           'item_reviewed', 'publisher', 'date_published'])
The save_reviews() function first finds the JSON-LD element in the code, checks whether a review is present, and then loops through the reviews. The get() function is used to extract each element from the review. Un-nested elements can be accessed with review.get('reviewBody'), where reviewBody is the name of the element, while nested ones need a two-layered approach, such as review.get('reviewRating', {}).get('ratingValue'). Once the elements have been extracted from the JSON, the row of data is appended to the Pandas DataFrame.
def save_reviews(data, df):
    """Scrape the individual reviews from a schema.org JSON-LD tag and
    append their contents to the df Pandas dataframe.

    :param data: JSON-LD source containing schema.org review markup
    :param df: Pandas dataframe to which to append reviews
    :return: df with reviews appended
    """
    for item in data['json-ld']:
        if "review" in item:
            for review in item['review']:
                row = {
                    'author': review.get('author', {}).get('name'),
                    'headline': review.get('headline'),
                    'body': review.get('reviewBody'),
                    'rating': review.get('reviewRating', {}).get('ratingValue'),
                    'item_reviewed': review.get('itemReviewed', {}).get('name'),
                    'publisher': review.get('publisher', {}).get('name'),
                    'date_published': review.get('datePublished')
                }
                df = df.append(row, ignore_index=True)
    return df
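The reason for the {} default in the two-layered get() calls is that any of these keys can be absent from a given review. A quick sketch with a hypothetical, partial review dict shows why this matters:

```python
# A hypothetical review with the 'reviewRating' key deliberately missing.
review = {
    'author': {'name': 'jools'},
    'reviewBody': 'Great service',
}

body = review.get('reviewBody')
rating = review.get('reviewRating', {}).get('ratingValue')

print(body)    # the top-level field is present
print(rating)  # the nested field is absent, so we get None instead of a KeyError
```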
The final step is to create the crawler or spider. This is given the list of review page URLs and loops through them, parsing and saving the reviews as it goes via the functions we created above. Selenium is quite quick at scraping the reviews, but this will obviously take a long time to run if you pick a competitor with lots of reviews. You may want to add a sleep() command to pause your crawler so it doesn’t hammer Trustpilot’s servers.
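For example, a small helper that pauses for a random interval between requests might look like this (the timings are arbitrary and the pause() name is my own):

```python
import random
import time

def pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random interval to space out requests to the server."""
    time.sleep(random.uniform(min_seconds, max_seconds))
```

Calling pause() after each page fetch in the loop spaces out your requests.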
for url in urls:
    driver = get_driver()

    # Save the reviews from the first page, then follow the rel="next"
    # link until there are no more paginated pages for this company.
    while url:
        print(url)
        source = get_source(driver, url)
        data = get_json(source)
        df = save_reviews(data, df)
        url = get_next_page(driver, source)
https://uk.trustpilot.com/review/www.mudstuff.co.uk
https://uk.trustpilot.com/review/landroverdefendersecurity.com
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=2
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=3
https://uk.trustpilot.com/review/landroverdefendersecurity.com?page=4
https://uk.trustpilot.com/review/famousfour.co.uk
If you put this all together and run it, then go away for a cup of tea, it should scrape and parse all the review content and place it neatly into a Pandas DataFrame for you to analyse. If you use the to_csv() function, you can save the output to a file and avoid the hassle of scraping it again in future.
df.to_csv('reviews.csv')
df.head(1000)
|     | author | headline | body | rating | item_reviewed | publisher | date_published |
|-----|--------|----------|------|--------|---------------|-----------|----------------|
| 0   | jools | Great products and service | If only all companies were as good as Mud UK. ... | 5 | MUD-UK | Trustpilot | 2020-07-27T17:26:08Z |
| 1   | Theo Merchant | Ordered a few bits from the website… | Ordered a few bits from the website which came... | 5 | MUD-UK | Trustpilot | 2019-10-21T18:46:38Z |
| 2   | Christian Østerbye | Absolutely stellar customer service | Always very swift at shipping the orders. Got ... | 5 | MUD-UK | Trustpilot | 2017-02-10T13:04:42Z |
| 3   | Dominic Ferrar | Great customer service | When I called to discuss my potential order th... | 5 | LRD Security | Trustpilot | 2020-03-19T20:00:52Z |
| 4   | Mr Ian Winskill | Happy customer | Promt and professional service | 5 | LRD Security | Trustpilot | 2020-03-19T16:25:34Z |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Paul Callow | Easy to deal with and very quick | Easy to deal with and very quick delivery (3 d... | 5 | Famous Four | Trustpilot | 2018-05-08T20:01:02Z |
| 996 | morten lund | Everything I need !! | Everything I need !! | 5 | Famous Four | Trustpilot | 2018-05-08T17:17:43Z |
| 997 | CustomerM Williams | brilliant service couldnt ask for better | brilliant service couldnt ask for better | 5 | Famous Four | Trustpilot | 2018-05-08T16:33:29Z |
| 998 | Dag Lislerud Midtfjeld | Fast and reliable | Fast and reliable | 5 | Famous Four | Trustpilot | 2018-05-08T15:47:23Z |
| 999 | mohammed alsheheri | saudi shipping Excellent with DHL | saudi shipping Excellent with DHL | 5 | Famous Four | Trustpilot | 2018-05-08T14:21:28Z |
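Once you have the DataFrame, analysis is straightforward. As a sketch, here’s how you might compare average ratings per retailer, using dummy data rather than the scraped set:

```python
import pandas as pd

# Dummy stand-in for the scraped reviews dataset.
reviews = pd.DataFrame({
    'item_reviewed': ['MUD-UK', 'MUD-UK', 'Famous Four'],
    'rating': [5, 4, 5],
})

# Average rating per retailer - a simple starting point for benchmarking.
summary = reviews.groupby('item_reviewed')['rating'].mean()
print(summary)
```

Note that the ratingValue field in the JSON-LD is a string, so on real scraped data you’d want to cast the rating column first, for example with df['rating'].astype(int).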
Earlier this year, researchers from the University of Mannheim used Schema.org data to train a machine learning model for product matching, achieving a state-of-the-art F1 score of 0.95 and demonstrating the power these data can bring to your models.
Matt Clarke, Tuesday, March 02, 2021