How to use Extruct to identify Schema.org metadata usage

Extruct allows you to reveal a site's Schema.org metadata implementation, so you can build a more stable and efficient web scraper. Here's how to use it.


The downside to building datasets using web scraping is that every site has custom HTML. If you scrape sites in this way, you’ll forever be building bespoke scrapers, and they’ll be fragile and easily broken whenever a site changes its HTML markup.

The sensible and efficient solution to this problem is to scrape a site's Schema.org metadata instead. Schema.org metadata is included in most decent websites, and is designed to make it easier for search engines to scrape and parse the content.

Why use Schema.org markup?

Since Schema.org markup is important to a site's search performance, it tends to be well-structured and standardised, and unlike a site's HTML, it rarely changes. This means you can build a single scraper and use it across multiple sites, cutting down custom scraping requirements dramatically.

The small drawback is that Schema.org metadata comes in various forms, and its usage differs between sites and even within sites. Thankfully, Extruct makes it much easier to parse.

While it isn’t present on every site, in my experience, you can generally cut your scraping workload by at least half by scraping structured data. In this project, we’ll be doing the groundwork for a larger scraping project by first identifying what metadata implementations are used on our target sites.

Load the packages

To get started, open a Jupyter notebook and import the packages below. I’m using Pandas for manipulating the results, Requests for fetching HTML source, w3lib for extracting the base URL, urllib for parsing URL structures, and Extruct for parsing the metadata. Only urllib is built into Python; the others can be installed by entering pip3 install package-name in your terminal.

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

Create a list of URLs to scrape

Next, we’ll create a list of URLs to scrape. Importantly, since Schema.org metadata differs on a page-by-page basis, you’ll want to include URLs that are representative of the content you wish to scrape. For example, if you’re aiming to build a Schema.org product metadata scraper, you’ll need to include product page URLs, as the markup isn’t going to be present on the root URL.

sites = ['https://www.ebuyer.com/883790-amd-ryzen-7-3700x-am4-cpu-processor-with-wraith-prism-rgb-cooler-100-100000071box', 
         'https://www.scan.co.uk/products/amd-ryzen-7-3700x-am4-zen-2-8-core-16-thread-36ghz-44ghz-turbo-32mb-l3-pcie-40-65w-cpu-pluswraith-pr',
         'https://www.novatech.co.uk/products/amd-ryzen-7-3700x-eight-core-processorcpu-with-wraith-prism-rgb-led-cooler/100-100000071box.html',
         'https://www.overclockers.co.uk/amd-ryzen-7-3700x-eight-core-4.4ghz-socket-am4-processor-retail-cp-3b7-am.html',
         'https://www.cclonline.com/product/287403/100-100000071BOX/CPU-Processors/AMD-Ryzen-7-3700X-3-6GHz-Octa-Core-Processor-with-8-Cores-16-Threads-65W-TDP-36MB-Cache-4-4GHz-Turbo-Wraith-Prism-Cooler/CPU0603/',
         'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
         'http://www.pcupgrade.co.uk/productdetails.asp?ProductID=40655&categoryid=731',
         'https://www.box.co.uk/100-100000071BOX-AMD-Ryzen-7-3700X-Gen3-(Socket-AM4)-Proc_2606583.html',
         'https://business.currys.co.uk/catalogue/computing/components-upgrades/processors/amd-ryzen-7-3700x-processor/N266498W',
         'https://www.aria.co.uk/SuperSpecials/Other+products/AMD+Ryzen+7+3700X+Gen3+8+Core+AM4+CPU%2FProcessor+with+Wraith+Prism+RGB+Cooler+?productId=71490',
         ]

Extract the metadata

Now we’ll create a simple web scraper. This uses Requests to fetch the HTML source for a given URL, obtains the base_url, then passes the text from the response and the base_url to the Extruct extract() function. I’m specifically telling Extruct to return only JSON-LD, microdata, and Open Graph metadata.

def extract_metadata(url):

    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text, 
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata

Here’s an example of the function in action. As you can see, by scraping the URL https://www.aria.co.uk/ we get back a dictionary containing a list for each of the metadata types we requested. On the home page there’s no microdata or json-ld, but we do get a number of opengraph attributes returned.

metadata = extract_metadata('https://www.aria.co.uk/')
metadata
{'microdata': [],
 'json-ld': [],
 'opengraph': [{'og:title': 'Aria PC - Computer Hardware, Components, Monitors.. at lowest prices',
   'og:url': 'https://www.aria.co.uk/',
   'og:image': 'https://www.aria.co.uk/static/images/supportAboutUsFooterNew.jpg',
   'og:locale': 'en_GB',
   'og:site_name': 'Aria PC',
   '@type': 'website',
   '@context': {'og': 'http://ogp.me/ns#'}}]}

Identify usage of specific metadata

To understand what types of metadata are used on each of the sites in our list, we’ll create a little function called uses_metadata_type. This takes the metadata dictionary above and checks to see whether the dictionary key contains data or not.

def uses_metadata_type(metadata, metadata_type):
    # True if the key is present and its list contains data
    return metadata_type in metadata and len(metadata[metadata_type]) > 0

uses_metadata_type(metadata, 'opengraph')
True
uses_metadata_type(metadata, 'rdfa')
False

Scrape the metadata usage for each site

To get an overview of the metadata implementations on each of the sites in our dataset, we can loop over the sites, scrape each one, and use uses_metadata_type() to check whether each metadata type is used on the representative page examined.

rows = []

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    rows.append({
        'url': urldata.netloc, 
        'microdata': uses_metadata_type(metadata, 'microdata'),
        'json-ld': uses_metadata_type(metadata, 'json-ld'),
        'opengraph': uses_metadata_type(metadata, 'opengraph'),
    })

# DataFrame.append() was removed in recent versions of Pandas, so we
# build a list of row dicts and construct the dataframe in one go
df = pd.DataFrame(rows, columns=['url', 'microdata', 'json-ld', 'opengraph'])

df.sort_values(by='microdata', ascending=False).head(10)

The dataframe created gives us a True or False value for each of the three metadata types we’ve examined. As you can see, even on this small set of websites in a similar niche, there’s quite a bit of variety in the metadata implementations used.

                      url  microdata  json-ld  opengraph
2      www.novatech.co.uk       True     True       True
3  www.overclockers.co.uk       True    False       True
4       www.cclonline.com       True     True       True
0          www.ebuyer.com      False     True      False
1          www.scan.co.uk      False    False      False
5      www.lambda-tek.com      False     True       True
6     www.pcupgrade.co.uk      False    False      False
7           www.box.co.uk      False     True      False
8   business.currys.co.uk      False    False      False
9          www.aria.co.uk      False    False       True

Examine the specific metadata used

The Schema.org system includes schemas (or schemata) for a whole raft of different things, from products and reviews to recipes and books. In the ecommerce sector, only a handful of these get used.

The Product schema is used to hold parent product or range data; Offers is used to hold pricing, stock, and child SKU level data; Review holds reviews; Organization holds data on the company itself; BreadcrumbList holds the information architecture of the page, and AggregateRating holds the overall product rating scores.

Each of these values is the key for a dictionary in the list, so we can create another function called key_exists() to check whether a given key is in each of the dictionaries embedded in the metadata lists returned by Extruct.

def key_exists(items, key):
    # .get() avoids a KeyError when an item has no @type key, and
    # naming the parameter `items` avoids shadowing the built-in dict
    return any(item.get('@type') == key for item in items)

To watch the function in action, we’ll first scrape the metadata for a page on the Lambda Tek website. Then we can check to see whether a given key, such as Product, exists for a given metadata type.

metadata = extract_metadata('https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756')
metadata
{'microdata': [],
 'json-ld': [{'@context': 'http://schema.org/',
   '@type': 'Product',
   'name': 'AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box ',
   'image': ['images/imgB43482756.jpg'],
   'description': '<b>AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box</b>Ryzen 7 3700X, 3.6GHz (4.4GHz), 8C/16T, 32MB L3, AM4, 65W + Wraith Prism',
   'mpn': '100-100000071BOX',
   'sku': 'B43482756',
   'brand': {'@type': 'Thing', 'name': 'AMD'},
   'offers': {'@type': 'Offer',
    'priceCurrency': 'GBP',
    'price': '247.63',
    'availability': 'http://schema.org/InStock',
    'url': 'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
    'seller': {'@type': 'Organization', 'name': 'Lambda-Tek'}}}],
 'opengraph': [{'og:locale': 'en_US',
   'og:title': '100-100000071BOX AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box (AMD RYZEN 7 3700X AM4 RET WRAITH PRISM)',
   'og:description': '<b>AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box</b>Ryzen 7 3700X, 3.6GHz (4.4GHz), 8C/16T, 32MB L3, AM4, 65W + Wraith Prism',
   'og:url': 'https://www.lambda-tek.com/AMD-100-100000071BOX~sh/B43482756',
   'og:site_name': 'LambdaTek',
   'og:image': 'https://www.lambda-tek.com/componentshop/images/imgB43482756.jpg',
   '@type': 'product',
   '@context': {'og': 'http://ogp.me/ns#'}}]}
microdata_product = key_exists(metadata['microdata'], 'Product')
microdata_product
False
jsonld_product = key_exists(metadata['json-ld'], 'Product')
jsonld_product
True
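With the Product dict confirmed, the next step in a real scraper is usually to walk into its nested Offer for pricing. Here's a minimal sketch using a hand-copied subset of the Lambda-Tek JSON-LD above; the get_price() helper is my own, not part of Extruct, and it allows for offers being either a single dict or a list of dicts:

```python
# Hand-copied subset of the Lambda-Tek JSON-LD Product shown above
product = {
    '@type': 'Product',
    'name': 'AMD Ryzen 7 3700X processor 3.6 GHz 32 MB L3 Box ',
    'brand': {'@type': 'Thing', 'name': 'AMD'},
    'offers': {'@type': 'Offer',
               'priceCurrency': 'GBP',
               'price': '247.63',
               'availability': 'http://schema.org/InStock'},
}

def get_price(product):
    """Return (currency, price) pairs from a Product's nested Offer data.

    offers may be a single dict or a list of Offer dicts, so we
    normalise it to a list before iterating.
    """
    offers = product.get('offers', [])
    if isinstance(offers, dict):
        offers = [offers]
    return [(o.get('priceCurrency'), o.get('price')) for o in offers]

print(get_price(product))  # [('GBP', '247.63')]
```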

Scrape the specific metadata usage per site

Finally, we can put that all together and re-check the sites to find out what specific metadata is implemented on the product pages in our list. The technique is the same as in the previous steps. We’re looping over the URLs, scraping the HTML, extracting the metadata, and then checking each key to see whether it is implemented by a given metadata type.

columns = ['url', 
           'organization-json-ld', 
           'organization-microdata', 
           'product-json-ld', 
           'product-microdata', 
           'offer-json-ld', 
           'offer-microdata', 
           'review-json-ld', 
           'review-microdata', 
           'aggregaterating-json-ld', 
           'aggregaterating-microdata', 
           'breadcrumblist-json-ld', 
           'breadcrumblist-microdata']

rows = []

for url in sites:    
    metadata = extract_metadata(url)
    urldata = urlparse(url)

    rows.append({
        'url': urldata.netloc, 
        'organization-json-ld': key_exists(metadata['json-ld'], 'Organization'),
        'organization-microdata': key_exists(metadata['microdata'], 'Organization'),
        'product-json-ld': key_exists(metadata['json-ld'], 'Product'),
        'product-microdata': key_exists(metadata['microdata'], 'Product'),
        'offer-json-ld': key_exists(metadata['json-ld'], 'Offer'),
        'offer-microdata': key_exists(metadata['microdata'], 'Offer'),
        'review-json-ld': key_exists(metadata['json-ld'], 'Review'),
        'review-microdata': key_exists(metadata['microdata'], 'Review'),
        'aggregaterating-json-ld': key_exists(metadata['json-ld'], 'AggregateRating'),
        'aggregaterating-microdata': key_exists(metadata['microdata'], 'AggregateRating'),
        'breadcrumblist-json-ld': key_exists(metadata['json-ld'], 'BreadcrumbList'),
        'breadcrumblist-microdata': key_exists(metadata['microdata'], 'BreadcrumbList'),
    })

# As above, we build the dataframe from a list of rows rather than
# using the removed DataFrame.append() method
df_specific = pd.DataFrame(rows, columns=columns)

df_specific.sort_values(by='url', ascending=False).head(3).T

Now that we’ve got this dataset, we have a better idea of the work required to scrape product data from these sites. We can use the Pandas value_counts() function to tally the numbers. Out of the 10 sites, six include Product schema in either JSON-LD or microdata, so we can scrape those with a single universal scraper and only need custom code for the other four.

df_specific['product-microdata'].value_counts()
False    8
True     2
Name: product-microdata, dtype: int64
df_specific['product-json-ld'].value_counts()
False    6
True     4
Name: product-json-ld, dtype: int64
Here’s the transposed output from the sort_values() call above, showing three of the sites:

                                        1                    6                       3
url                        www.scan.co.uk  www.pcupgrade.co.uk  www.overclockers.co.uk
organization-json-ld                False                False                   False
organization-microdata              False                False                   False
product-json-ld                     False                False                   False
product-microdata                   False                False                    True
offer-json-ld                       False                False                   False
offer-microdata                     False                False                   False
review-json-ld                      False                False                   False
review-microdata                    False                False                   False
aggregaterating-json-ld             False                False                   False
aggregaterating-microdata           False                False                   False
breadcrumblist-json-ld              False                False                   False
breadcrumblist-microdata            False                False                   False
df_specific.head(20)
url organization-json-ld organization-microdata product-json-ld product-microdata offer-json-ld offer-microdata review-json-ld review-microdata aggregaterating-json-ld aggregaterating-microdata breadcrumblist-json-ld breadcrumblist-microdata
0 www.ebuyer.com False False True False False False False False False False True False
1 www.scan.co.uk False False False False False False False False False False False False
2 www.novatech.co.uk True False False True False False False False False False False True
3 www.overclockers.co.uk False False False True False False False False False False False False
4 www.cclonline.com False True True False False False False False False False False True
5 www.lambda-tek.com False False True False False False False False False False False False
6 www.pcupgrade.co.uk False False False False False False False False False False False False
7 www.box.co.uk False False True False False False False False False False True False
8 business.currys.co.uk False False False False False False False False False False False False
9 www.aria.co.uk False False False False False False False False False False False False
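That "either JSON-LD or microdata" count can be reproduced with a boolean OR across the two Product columns. Here's a minimal sketch, with the True/False values re-typed from the table above rather than freshly scraped:

```python
import pandas as pd

# Product schema results re-typed from the results table (illustrative,
# not a fresh scrape)
df_specific = pd.DataFrame({
    'url': ['www.ebuyer.com', 'www.scan.co.uk', 'www.novatech.co.uk',
            'www.overclockers.co.uk', 'www.cclonline.com',
            'www.lambda-tek.com', 'www.pcupgrade.co.uk', 'www.box.co.uk',
            'business.currys.co.uk', 'www.aria.co.uk'],
    'product-json-ld': [True, False, False, False, True,
                        True, False, True, False, False],
    'product-microdata': [False, False, True, True, False,
                          False, False, False, False, False],
})

# A site suits the universal scraper if either syntax carries Product markup
scrapable = df_specific['product-json-ld'] | df_specific['product-microdata']

print(int(scrapable.sum()))  # 6
# The remaining sites are the ones needing custom scrapers
print(df_specific.loc[~scrapable, 'url'].tolist())
```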

Matt Clarke, Friday, March 12, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
