The downside to building datasets using web scraping is that every site has custom HTML. If you scrape sites in this way, you’ll forever be building bespoke scrapers, and they’ll be fragile and easily broken whenever a site changes its HTML markup.
The sensible and efficient solution to this problem is to scrape a site’s Schema.org metadata instead. Schema.org metadata is included on most well-built websites, and is designed to make it easier for search engines to parse and understand page content.
Since Schema.org markup is important to search performance, it tends to be well-structured and standardised, and unlike HTML markup, it rarely changes. This means you can build a single scraper and reuse it across multiple sites, cutting down custom scraping requirements dramatically.
The small drawback is that Schema.org metadata comes in various forms, such as JSON-LD, microdata, and RDFa, and its usage differs between sites and even between pages on the same site. Thankfully, Extruct makes it much easier to parse.
While it isn’t present on every site, in my experience, you can generally cut your scraping workload by at least half by scraping structured data. In this project, we’ll be doing the groundwork for a larger scraping project by first identifying what metadata implementations are used on our target sites.
To get started, open a Jupyter notebook and import the packages below. I’m using Pandas for manipulating the data, Requests for fetching the HTML source, urllib for parsing URL structures, and Extruct for parsing the metadata. Of these, only urllib is built into Python; the others can be installed by entering pip3 install package-name in your terminal.
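For reference, the PyPI package names for the third-party imports below are pandas, requests, extruct, and w3lib, so they can all be installed in one line:

```shell
# Install the third-party packages used in this project (PyPI names)
pip3 install pandas requests extruct w3lib
```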
import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse
Next, we’ll create a list of URLs to scrape. Importantly, since Schema.org metadata differs on a page-by-page basis, you’ll want to include URLs that are representative of the content you wish to scrape. For example, if you’re aiming to build a Schema.org product metadata scraper, you’ll need to include product page URLs, as the markup isn’t going to be present on the root URL.
sites = ['https://www.ebuyer.com/883790-amd-ryzen-7-3700x-am4-cpu-processor-with-wraith-prism-rgb-cooler-100-100000071box',
'https://www.scan.co.uk/products/amd-ryzen-7-3700x-am4-zen-2-8-core-16-thread-36ghz-44ghz-turbo-32mb-l3-pcie-40-65w-cpu-pluswraith-pr',
'https://www.novatech.co.uk/products/amd-ryzen-7-3700x-eight-core-processorcpu-with-wraith-prism-rgb-led-cooler/100-100000071box.html',
'https://www.overclockers.co.uk/amd-ryzen-7-3700x-eight-core-4.4ghz-socket-am4-processor-retail-cp-3b7-am.html',
'https://www.cclonline.com/product/287403/100-100000071BOX/CPU-Processors/AMD-Ryzen-7-3700X-3-6GHz-Octa-Core-Processor-with-8-Cores-16-Threads-65W-TDP-36MB-Cache-4-4GHz-Turbo-Wraith-Prism-Cooler/CPU0603/',
'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
'http://www.pcupgrade.co.uk/productdetails.asp?ProductID=40655&categoryid=731',
'https://www.box.co.uk/100-100000071BOX-AMD-Ryzen-7-3700X-Gen3-(Socket-AM4)-Proc_2606583.html',
'https://business.currys.co.uk/catalogue/computing/components-upgrades/processors/amd-ryzen-7-3700x-processor/N266498W',
'https://www.aria.co.uk/SuperSpecials/Other+products/AMD+Ryzen+7+3700X+Gen3+8+Core+AM4+CPU%2FProcessor+with+Wraith+Prism+RGB+Cooler+?productId=71490',
]
Now we’ll create a simple web scraper. This uses requests to grab the HTML source for a given URL, gets the base_url, then passes the text from the requests response and the base_url to the Extruct extract() function. I’m specifically telling Extruct to return only json-ld, microdata, and opengraph metadata.
def extract_metadata(url):
    """Scrape a URL and return its structured metadata via Extruct."""
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
Here’s an example of the function in action. As you can see, by scraping the URL https://www.aria.co.uk/ we get back a dictionary containing a list for each of the metadata types we requested. On the home page there’s no microdata or json-ld, but we do get a number of opengraph attributes returned.
metadata = extract_metadata('https://www.aria.co.uk/')
metadata
{'microdata': [],
'json-ld': [],
'opengraph': [{'og:title': 'Aria PC - Computer Hardware, Components, Monitors.. at lowest prices',
'og:url': 'https://www.aria.co.uk/',
'og:image': 'https://www.aria.co.uk/static/images/supportAboutUsFooterNew.jpg',
'og:locale': 'en_GB',
'og:site_name': 'Aria PC',
'@type': 'website',
'@context': {'og': 'http://ogp.me/ns#'}}]}
To understand which types of metadata are used on each of the sites in our list, we’ll create a little function called uses_metadata_type(). This takes the metadata dictionary above and checks whether the dictionary key for a given metadata type contains any data.
def uses_metadata_type(metadata, metadata_type):
    """Return True if the given metadata type is present and non-empty."""
    return metadata_type in metadata and len(metadata[metadata_type]) > 0
uses_metadata_type(metadata, 'opengraph')
True
uses_metadata_type(metadata, 'rdfa')
False
To get an overview of the metadata implementations on each of the sites in our dataset, we can loop over the sites, scrape each one, and use uses_metadata_type() to check whether each metadata type is used on the representative page examined.
rows = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)
    rows.append({
        'url': urldata.netloc,
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
    })

# Build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=['url', 'microdata', 'json-ld', 'opengraph'])
df.head(10).sort_values(by='microdata', ascending=False)
The dataframe created gives us a True or False value for each of the three metadata types we’ve examined. As you can see, even on this small set of websites in a similar niche, there’s quite a bit of variety in the metadata implementations used.
| | url | microdata | json-ld | opengraph |
|---|---|---|---|---|
| 2 | www.novatech.co.uk | True | True | True |
| 3 | www.overclockers.co.uk | True | False | True |
| 4 | www.cclonline.com | True | True | True |
| 0 | www.ebuyer.com | False | True | False |
| 1 | www.scan.co.uk | False | False | False |
| 5 | www.lambda-tek.com | False | True | True |
| 6 | www.pcupgrade.co.uk | False | False | False |
| 7 | www.box.co.uk | False | True | False |
| 8 | business.currys.co.uk | False | False | False |
| 9 | www.aria.co.uk | False | False | True |
The Schema.org system includes schemas (or schemata) for a whole raft of different things, from products and reviews to recipes and books. In the ecommerce sector, only a handful of these get used. The Product schema holds parent product or range data; Offer holds pricing, stock, and child SKU level data; Review holds reviews; Organization holds data on the company itself; BreadcrumbList holds the information architecture of the page; and AggregateRating holds the overall product rating scores.
Each of these names appears as the @type value of a dictionary in the metadata lists, so we can create another function called key_exists() to check whether a given type is present in any of the dictionaries embedded in the metadata lists returned by Extruct.
def key_exists(metadata_list, key):
    """Return True if any entry's top-level '@type' matches the given key."""
    return any(item.get('@type') == key for item in metadata_list)
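To see how this behaves before hitting a live site, here’s a quick run against hand-made data in the uniform list-of-dicts shape Extruct returns. The function is restated (with dict.get, so entries lacking an @type key don’t raise) to keep the snippet self-contained. Note that only each entry’s top-level @type is inspected, so a type nested inside another entry, such as an Offer embedded in a Product’s offers value, is not matched.

```python
def key_exists(metadata_list, key):
    """Return True if any entry's top-level '@type' matches the given key."""
    return any(item.get('@type') == key for item in metadata_list)

# Hand-made sample mirroring the uniform shape Extruct returns
sample = [{'@type': 'Product',
           'name': 'AMD Ryzen 7 3700X',
           'offers': {'@type': 'Offer', 'price': '247.63'}}]

print(key_exists(sample, 'Product'))  # True
print(key_exists(sample, 'Offer'))    # False: nested inside the Product entry
```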
To watch the function in action, we’ll first scrape the metadata for a page on the Lambda Tek website. Then we can check whether a given type, such as Product, exists for a given metadata syntax.
metadata = extract_metadata('https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756')
metadata
{'microdata': [],
'json-ld': [{'@context': 'http://schema.org/',
'@type': 'Product',
'name': 'AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box ',
'image': ['images/imgB43482756.jpg'],
'description': '<b>AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box</b>Ryzen 7 3700X, 3.6GHz (4.4GHz), 8C/16T, 32MB L3, AM4, 65W + Wraith Prism',
'mpn': '100-100000071BOX',
'sku': 'B43482756',
'brand': {'@type': 'Thing', 'name': 'AMD'},
'offers': {'@type': 'Offer',
'priceCurrency': 'GBP',
'price': '247.63',
'availability': 'http://schema.org/InStock',
'url': 'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
'seller': {'@type': 'Organization', 'name': 'Lambda-Tek'}}}],
'opengraph': [{'og:locale': 'en_US',
'og:title': '100-100000071BOX AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box (AMD RYZEN 7 3700X AM4 RET WRAITH PRISM)',
'og:description': '<b>AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box</b>Ryzen 7 3700X, 3.6GHz (4.4GHz), 8C/16T, 32MB L3, AM4, 65W + Wraith Prism',
'og:url': 'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
'og:site_name': 'LambdaTek',
'og:image': 'https://www.lambda-tek.com/componentshop/images/imgB43482756.jpg',
'@type': 'product',
'@context': {'og': 'http://ogp.me/ns#'}}]}
microdata_product = key_exists(metadata['microdata'], 'Product')
microdata_product
False
jsonld_product = key_exists(metadata['json-ld'], 'Product')
jsonld_product
True
Finally, we can put that all together and re-check the sites to find out what specific metadata is implemented on the product pages in our list. The technique is the same as in the previous steps: we loop over the URLs, scrape the HTML, extract the metadata, and then check each schema type to see whether it is implemented in a given metadata syntax.
columns = ['url',
           'organization-json-ld',
           'organization-microdata',
           'product-json-ld',
           'product-microdata',
           'offer-json-ld',
           'offer-microdata',
           'review-json-ld',
           'review-microdata',
           'aggregaterating-json-ld',
           'aggregaterating-microdata',
           'breadcrumblist-json-ld',
           'breadcrumblist-microdata',
           ]
rows = []

for url in sites:
    metadata = extract_metadata(url)
    urldata = urlparse(url)
    rows.append({
        'url': urldata.netloc,
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    })

# Build the dataframe in one go (DataFrame.append was removed in pandas 2.0)
df_specific = pd.DataFrame(rows, columns=columns)
df_specific.sort_values(by='url', ascending=False).head(3).T
Now we’ve got this dataset, we have a better idea of the work that would be required to scrape product data from these sites. We can use the Pandas value_counts() function to calculate the numbers. Out of 10 sites, 6 include the Product schema in either JSON-LD or microdata form, so we can scrape those with a single universal scraper and only need custom code for the other four!
df_specific['product-microdata'].value_counts()
False 8
True 2
Name: product-microdata, dtype: int64
df_specific['product-json-ld'].value_counts()
False 6
True 4
Name: product-json-ld, dtype: int64
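The “six out of ten” figure combines the two counts above, since a site only needs one of the two implementations. As a sanity check, we can re-enter the two Product columns from the results (values as reported in the table below) and OR them together:

```python
import pandas as pd

# The two Product columns from the run above, one value per site,
# in the same order as the sites list
results = pd.DataFrame({
    'product-json-ld':   [True, False, False, False, True, True, False, True, False, False],
    'product-microdata': [False, False, True, True, False, False, False, False, False, False],
})

# A site is coverable by the universal scraper if either form is present
either = results['product-json-ld'] | results['product-microdata']
print(either.sum())  # 6 of the 10 sites carry Product markup in some form
```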
| | 1 | 6 | 3 |
|---|---|---|---|
| url | www.scan.co.uk | www.pcupgrade.co.uk | www.overclockers.co.uk |
| organization-json-ld | False | False | False |
| organization-microdata | False | False | False |
| product-json-ld | False | False | False |
| product-microdata | False | False | True |
| offer-json-ld | False | False | False |
| offer-microdata | False | False | False |
| review-json-ld | False | False | False |
| review-microdata | False | False | False |
| aggregaterating-json-ld | False | False | False |
| aggregaterating-microdata | False | False | False |
| breadcrumblist-json-ld | False | False | False |
| breadcrumblist-microdata | False | False | False |
df_specific.head(20)
| | url | organization-json-ld | organization-microdata | product-json-ld | product-microdata | offer-json-ld | offer-microdata | review-json-ld | review-microdata | aggregaterating-json-ld | aggregaterating-microdata | breadcrumblist-json-ld | breadcrumblist-microdata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | www.ebuyer.com | False | False | True | False | False | False | False | False | False | False | True | False |
| 1 | www.scan.co.uk | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | www.novatech.co.uk | True | False | False | True | False | False | False | False | False | False | False | True |
| 3 | www.overclockers.co.uk | False | False | False | True | False | False | False | False | False | False | False | False |
| 4 | www.cclonline.com | False | True | True | False | False | False | False | False | False | False | False | True |
| 5 | www.lambda-tek.com | False | False | True | False | False | False | False | False | False | False | False | False |
| 6 | www.pcupgrade.co.uk | False | False | False | False | False | False | False | False | False | False | False | False |
| 7 | www.box.co.uk | False | False | True | False | False | False | False | False | False | False | True | False |
| 8 | business.currys.co.uk | False | False | False | False | False | False | False | False | False | False | False | False |
| 9 | www.aria.co.uk | False | False | False | False | False | False | False | False | False | False | False | False |
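As a hint of where the larger project goes next, here’s a minimal sketch (my own addition, not part of the original code) of what the universal side of such a scraper could look like: pulling the core fields out of a JSON-LD Product entry. It assumes the uniform shape returned by Extruct and, as in the Lambda Tek example above, a single offers dictionary; some sites publish offers as a list, which this sketch does not handle.

```python
def parse_jsonld_product(metadata):
    """Pull core fields from the first JSON-LD Product entry, or None.

    Assumes the uniform list-of-dicts shape returned by Extruct and a
    single 'offers' dict (some sites use a list of offers instead).
    """
    for item in metadata.get('json-ld', []):
        if item.get('@type') == 'Product':
            offer = item.get('offers') or {}
            brand = item.get('brand') or {}
            return {
                'name': item.get('name'),
                'sku': item.get('sku'),
                'brand': brand.get('name') if isinstance(brand, dict) else brand,
                'price': offer.get('price'),
                'currency': offer.get('priceCurrency'),
            }
    return None

# Using the Lambda Tek structure shown earlier as sample input
sample = {'json-ld': [{'@type': 'Product',
                       'name': 'AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box ',
                       'sku': 'B43482756',
                       'brand': {'@type': 'Thing', 'name': 'AMD'},
                       'offers': {'@type': 'Offer',
                                  'priceCurrency': 'GBP',
                                  'price': '247.63'}}]}

print(parse_jsonld_product(sample))
```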
Matt Clarke, Friday, March 12, 2021