How to scrape schema.org metadata using Python

Learn to scrape more efficiently by extracting Schema.org metadata in JSON-LD, Microdata, and OpenGraph formats using Extruct and Requests.


As I’ve mentioned in previous posts on web scraping, the most efficient way to scrape data is to identify what Schema.org metadata is in use and then create a microdata web scraper. This saves time and effort in creating custom scrapers, and allows a single scraper to be used on multiple sites.

In this project, I’ll show you another useful web scraping technique that you can utilise when scraping Schema.org markup. Instead of creating one scraper for each type of metadata you wish to extract, it’s possible to create a single function that can handle Schema.org markup in any of the allowed dialects. Here’s how it’s done.

Load the packages

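Open up a Jupyter notebook and import the packages below. Of these, only urllib is part of the Python standard library. Pandas and Requests are often already present in data science environments, and w3lib is normally installed as a dependency of Extruct, so you may only need to install Extruct itself, which does the metadata parsing. You can do this by entering pip3 install extruct.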

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
from urllib.parse import urlparse

Extract the metadata

First, we’ll create a helper function called extract_metadata(). This takes the URL as its only argument and uses Requests to get the page source. We then pass the response text to the extract() function in the Extruct package and tell it to return any markup it finds in json-ld, microdata, or opengraph format. Strictly speaking, Open Graph is a separate protocol rather than a Schema.org syntax, but with uniform=True Extruct returns it in the same dictionary layout as the others. Extruct supports other syntaxes too, but they’re less commonly used, so I’ve left them out.

def extract_metadata(url):
    """Extract all metadata present in the page and return a dictionary of metadata lists. 
    
    Args:
        url (string): URL of page from which to extract metadata. 
    
    Returns: 
        metadata (dict): Dictionary of json-ld, microdata, and opengraph lists. 
        Each of the lists present within the dictionary contains multiple dictionaries.
    """
    
    # Fetch the page; the timeout stops the request from hanging indefinitely.
    r = requests.get(url, timeout=30)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text, 
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
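
To make the json-ld syntax more concrete, here is a minimal standard-library sketch of what a JSON-LD block looks like in a page and how it could be pulled out by hand. It is only an illustration and not a description of how Extruct works internally; the JsonLdParser class and the sample HTML are hypothetical and do not come from either of the sites used in this post.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.in_json_ld = False
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_json_ld = True
            self.buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self.in_json_ld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_json_ld:
            self.items.append(json.loads("".join(self.buffer)))
            self.in_json_ld = False


# Hypothetical page source, for illustration only.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Product", "name": "Example laptop"}
</script>
</head><body></body></html>
"""

parser = JsonLdParser()
parser.feed(sample_html)
json_ld_items = parser.items
print(json_ld_items)
```

In practice Extruct is the better choice, since it also handles microdata and opengraph and copes with messier markup than this sketch does.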

Examine some metadata

To understand what sort of data we’re dealing with, let’s pass in a couple of URLs from ecommerce websites. First, we’ll check out a product page from the Box website.

The extract_metadata() function returns a dictionary with three keys: json-ld, microdata, and opengraph. Depending on the site’s schema.org metadata usage, none, some, or all of these could be populated.

metadata_box = extract_metadata('https://www.box.co.uk/20TQ0048UK-Lenovo-ThinkPad-P15v-Gen-1_3199775.html')
metadata_box

On the Box site, the microdata value contains an empty list, as does the opengraph value. However, the json-ld list is populated. Inside the list we have a number of dictionaries, each containing schema.org markup. We have both BreadcrumbList and Product schema metadata.

{'microdata': [],
 'json-ld': [{'@context': 'https://schema.org/',
   '@type': 'BreadcrumbList',
   'itemListElement': [{'@type': 'ListItem',
     'position': 1,
     'name': 'Computing',
     'item': 'https://www.box.co.uk/computing'},
    {'@type': 'ListItem',
     'position': 2,
     'name': 'Laptops',
     'item': 'https://www.box.co.uk/laptops'},
    {'@type': 'ListItem',
     'position': 3,
     'name': 'Lenovo',
     'item': 'https://www.box.co.uk/lenovo-laptops'},
    {'@type': 'ListItem',
     'position': 4,
     'name': 'ThinkPad P Series',
     'item': 'https://www.box.co.uk/thinkpad+p+series.htm'},
    {'@type': 'ListItem',
     'position': 5,
     'name': '20TQ0048UK',
     'item': 'https://www.box.co.uk/20TQ0048UK-Lenovo-ThinkPad-P15v-Gen-1_3199775.html'}]},
  {'@context': 'https://schema.org/',
   '@type': 'Product',
   'name': 'Lenovo ThinkPad P15v Gen 1',
   'image': ['http://www.box.co.uk/image?id=4514680&quality=90',
    'http://www.box.co.uk/image?id=4514681&quality=90',
    'http://www.box.co.uk/image?id=4514682&quality=90',
    'http://www.box.co.uk/image?id=4514683&quality=90',
    'http://www.box.co.uk/image?id=4514684&quality=90',
    'http://www.box.co.uk/image?id=4514685&quality=90',
    'http://www.box.co.uk/image?id=4514686&quality=90',
    'http://www.box.co.uk/image?id=4514687&quality=90'],
   'gtin12': '195235161272',
   'category': 'Computing\\Laptops\\Laptops',
   'description': 'Lenovo ThinkPad P15v Gen 1, Intel Core i7-10750H Hexa core Processor Processor, 15.6" Full HD IPS Anti-Glare Screen, Microsoft Windows 10 Professional 64bit, 32GB RAM, 512GB SSD, NVIDIA Quadro P620 4GB Graphics, USB3 | HDMI | Bluetooth, Discrete TPM 2.0, 20TQ0048UK',
   'sku': 'W770544',
   'mpn': '20TQ0048UK',
   'brand': {'@type': 'Thing', 'name': 'Lenovo'},
   'aggregateRating': {'@type': 'AggregateRating',
    'ratingValue': 9.0,
    'bestRating': '10',
    'ratingCount': 4,
    'worstRating': '1'},
   'offers': {'@type': 'Offer',
    'url': 'https://www.box.co.uk/20TQ0048UK-Lenovo-ThinkPad-P15v-Gen-1_3199775.html',
    'priceCurrency': 'GBP',
    'price': 1229.49,
    'itemCondition': 'http://schema.org/NewCondition',
    'availability': 'http://schema.org/InStock'}}],
 'opengraph': []}
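
When the output is this long, it can help to get a quick overview of which syntaxes are populated and which @type values each one contains. The sketch below is one possible way to do that. The summarise_types() helper is hypothetical, and sample_metadata is a trimmed-down stand-in for the dictionary returned by extract_metadata(), not real scraped data.

```python
def summarise_types(metadata):
    """Map each syntax name to the list of @type values it contains."""
    return {
        syntax: [item.get("@type") for item in items]
        for syntax, items in metadata.items()
    }


# Trimmed-down stand-in for the dictionary returned by extract_metadata().
sample_metadata = {
    "microdata": [],
    "json-ld": [
        {"@type": "BreadcrumbList", "itemListElement": []},
        {"@type": "Product", "name": "Lenovo ThinkPad P15v Gen 1"},
    ],
    "opengraph": [],
}

summary = summarise_types(sample_metadata)
print(summary)
```

An empty list in the summary means the site doesn’t use that syntax on the page, which matches what we see for microdata and opengraph on the Box site.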

The John Lewis website doesn’t have any microdata schema, but the json-ld and opengraph lists are both populated. The json-ld list includes the Product schema (note that John Lewis has used a lowercase product rather than the correct capitalisation); the Organization schema giving details about the John Lewis chain; the WebSite schema; and a WebPage item that holds the BreadcrumbList schema nested under its breadcrumb key.

metadata_john_lewis = extract_metadata('https://www.johnlewis.com/google-pixelbook-go-ga00523-uk-laptop-intel-core-i5-processor-16gb-ram-128gb-13-3-full-hd-just-black/p4868201')
metadata_john_lewis
{'microdata': [],
 'json-ld': [{'@context': 'https://schema.org/',
   '@type': 'product',
   'offers': {'@type': 'Offer',
    'seller': {'@type': 'Organization', 'name': 'John Lewis & Partners'},
    'availability': 'InStock',
    'url': 'https://www.johnlewis.com/google-pixelbook-go-ga00523-uk-laptop-intel-core-i5-processor-16gb-ram-128gb-13-3-full-hd-just-black/p4868201',
    'priceCurrency': 'GBP',
    'price': '949.00'},
   'productId': '4868201',
   'sku': '238471012',
   'url': 'https://www.johnlewis.com/google-pixelbook-go-ga00523-uk-laptop-intel-core-i5-processor-16gb-ram-128gb-13-3-full-hd-just-black/p4868201',
   'name': 'Google Pixelbook Go GA00523-UK Laptop, Intel Core i5 Processor, 16GB RAM, 128GB, 13.3” Full HD, Just Black',
   'description': '<p>The portable Google Pixelbook Go laptop has been created with a battery that will last up to 12 hours, an 8th generation Intel Core i5 processor, a 13.3” Full HD touch display, a 2MP web cam, along with microphones that offer improved noise cancellation and dual front-firing speakers for immersive sound. This device uses Google’s Chrome OS (Operating System), which means access to your favourite Android apps via the Google Play Store.</p> \r\r<p><strong>Intel Core i5</strong><br>\rWith an Intel Core i5 processor, you\'ll easily handle day-to-day tasks like word processing, image editing, web browsing and casual gaming.</p> \r\r<p><strong>Stay safe</strong><br>\rBuilt-in anti-virus software, along with a Titan C security chip are ready to help protect your data, so you can play and work without worry.</p>\r\r<p><strong>What is RAM?</strong><br>\rRAM (Random Access Memory) is different to the permanent storage provided by hard disk drives (HDD), solid state drives (SSD) or memory cards in your equipment. RAM is used by your device to temporarily store data to carry out everyday operations. The more RAM your machine has, the faster you can expect it to open and run programs.</p> \r\r<p><strong>RAM packed</strong><br>\rFeaturing a massive 16GB of RAM, this computer is comfortable running lots of complex tasks at once, and you won\'t see any obvious lag or slowdown.</p> \r\r<p><strong>Great entertainment</strong><br>\rThanks to its crisp 13.3”, Full HD touch screen, viewing your high-resolution photos, films and videos will be a pleasure. It has been designed to convey stunning images and it makes for a wonderful multimedia experience.</p>\r\r<p><strong>Wireless communication</strong><br>  \rBuilt-in Wi-Fi makes it easy to connect this laptop to the internet, whether that\'s through your own home network, Wi-Fi at work or a public connection. 
Bluetooth will let you send files and folders wirelessly between Bluetooth devices, or even stream music to compatible speakers or headphones. </p>\r\r<p><strong>Download MS Office</strong><br />\rAt the time of writing (Q1 2020) Microsoft enables you to download their Office apps, like Word or Excel for free when using an Android device (such as this one). Please search on the Google Play Store for further details.</p> \r\r<br>\r\r<p><strong>We can help you get set up</strong><br>\rOur specialist in-store Partners can help you to get set up on your new device right away. We can help with anything from transferring personal data from your old device to setting up any essential features.</p>\r\r<p>Visit your local shop with your new device, or simply Click & Collect from a John Lewis & Partners when purchasing and an available Technical Support Partner can set it up for you.</p>\r\r<p><a href="https://www.johnlewis.com/our-services/computer-installation">Find out more details and pricing for our tech support services.</a></p>\r\r<br> \r',
   'image': 'https://johnlewis.scene7.com/is/image/JohnLewis/238471012?',
   'brand': {'@type': 'Brand', 'name': 'Google', 'logo': 'https:null'},
   'aggregateRating': {'@type': 'AggregateRating',
    'reviewCount': 2,
    'ratingValue': 5}},
  {'@context': 'http://schema.org',
   '@type': 'Organization',
   'name': 'John Lewis & Partners',
   'url': 'http://www.johnlewis.com',
   'sameAs': ['https://www.facebook.com/JohnLewisandPartners',
    'https://twitter.com/JLandPartners',
    'https://www.youtube.com/JohnLewisandPartners',
    'https://www.pinterest.co.uk/johnlewisandpartners',
    'https://www.instagram.com/JohnLewisandPartners',
    'https://www.linkedin.com/company/johnlewisandpartners'],
   'logo': 'https://www.johnlewis.com/static/assets/logo/john-lewis-logo.png',
   'contactPoint': [{'@type': 'ContactPoint',
     'telephone': '+44-3456-100-329',
     'contactType': 'customer service',
     'areaServed': 'GB'},
    {'@type': 'ContactPoint',
     'telephone': '+44-1698-54-54-54',
     'contactType': 'customer service'},
    {'@type': 'ContactPoint',
     'telephone': '+44-3301-230-106',
     'contactType': 'technical support'}]},
  {'@context': 'https://schema.org',
   '@type': 'WebSite',
   'url': 'https://www.johnlewis.com/',
   'potentialAction': {'@type': 'SearchAction',
    'target': 'https://www.johnlewis.com/search?search-term={search_term_string}',
    'query-input': 'required name=search_term_string'}},
  {'breadcrumb': {'itemListElement': [{'item': {'name': 'Homepage',
       '@id': 'https://www.johnlewis.com/'},
      '@type': 'ListItem',
      'position': '1'},
     {'item': {'name': 'Electricals',
       '@id': 'https://www.johnlewis.com/electricals/c500001'},
      '@type': 'ListItem',
      'position': '2'},
     {'item': {'name': 'Laptops & MacBooks',
       '@id': 'https://www.johnlewis.com/electricals/laptops-macbooks/c60000876'},
      '@type': 'ListItem',
      'position': '3'},
     {'item': {'name': 'View All Laptops & MacBooks',
       '@id': 'https://www.johnlewis.com/browse/electricals/laptops-macbooks/view-all-laptops-macbooks/_/N-a8f'},
      '@type': 'ListItem',
      'position': '4'}],
    '@type': 'BreadcrumbList'},
   '@type': 'WebPage',
   'name': '',
   'description': '',
   '@context': 'https://schema.org'}],
 'opengraph': [{'og:title': 'Google Pixelbook Go GA00523-UK Laptop, Intel Core i5 Processor, 16GB RAM, 128GB, 13.3” Full HD, Just Black',
   'og:image': 'https://johnlewis.scene7.com/is/image/JohnLewis/238471012?$rsp-pdp-port-320$',
   'og:image:type': 'image/jpeg',
   'og:site_name': 'John Lewis',
   'og:description': 'Buy Google Pixelbook Go GA00523-UK Laptop, Intel Core i5 Processor, 16GB RAM, 128GB, 13.3” Full HD, Just Black from our View All Laptops & MacBooks range at John Lewis & Partners. Free Delivery on orders over £50.',
   'og:url': 'https://www.johnlewis.com/google-pixelbook-go-ga00523-uk-laptop-intel-core-i5-processor-16gb-ram-128gb-13-3-full-hd-just-black/p4868201',
   'og:locale': 'en_GB',
   '@type': 'product',
   '@context': {'og': 'http://ogp.me/ns#'}}]}
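
Comparing the two outputs shows that sites don’t always use the same value formats. Box gives the offer price as a float and the availability as a schema.org URL, while John Lewis gives the price as a string and the availability as a bare word. The sketch below shows one way these could be brought into line. The normalise_offer() helper is hypothetical, and it assumes prices are plain numbers without currency symbols or thousands separators.

```python
def normalise_offer(offer):
    """Return a copy of an Offer dictionary with a float price and a short availability."""
    normalised = dict(offer)
    if "price" in normalised:
        # Assumes a plain numeric value such as 1229.49 or '949.00'.
        normalised["price"] = float(normalised["price"])
    if "availability" in normalised:
        # Keep only the last path segment, e.g. 'InStock'.
        normalised["availability"] = (
            normalised["availability"].rstrip("/").split("/")[-1]
        )
    return normalised


# Offer values copied from the Box and John Lewis outputs above.
box_offer = {"price": 1229.49, "availability": "http://schema.org/InStock"}
john_lewis_offer = {"price": "949.00", "availability": "InStock"}

normalised_box = normalise_offer(box_offer)
normalised_john_lewis = normalise_offer(john_lewis_offer)
print(normalised_box)
print(normalised_john_lewis)
```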

How to extract specific metadata

Sites are free to present their schema in microdata, json-ld, or opengraph, so you never really know which will be used until you parse the metadata and check. To handle this step automatically, I’ve created a helper function called get_dictionary_by_key_value(). This takes the dictionary containing the metadata, the target key to look for (e.g. @type) and the target value (e.g. BreadcrumbList), and then returns the first matching item it finds, whichever syntax it came from.

def get_dictionary_by_key_value(dictionary, target_key, target_value):
    """Return a dictionary that contains a target key value pair. 
    
    Args:
        dictionary: Metadata dictionary containing lists of other dictionaries.
        target_key: Target key to search for within a dictionary inside a list. 
        target_value: Target value to search for within a dictionary inside a list. 
    
    Returns:
        target_dictionary: Target dictionary that contains target key value pair. 
    """
    
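    for key in dictionary:
        # Each value is a list of dictionaries (possibly empty) for one syntax.
        for item in dictionary[key]:
            # Use .get() so items without the target key don't raise a KeyError.
            if item.get(target_key) == target_value:
                return item

    # No matching dictionary was found in any syntax.
    return None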
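
One limitation of the function above is that it only looks at the top-level items in each list. In the John Lewis output, the BreadcrumbList sits inside a WebPage item, so it would not be found. A recursive search is one possible way around this. The sketch below uses a hypothetical helper called find_dictionaries_by_key_value() and a cut-down sample dictionary shaped like the John Lewis metadata.

```python
def find_dictionaries_by_key_value(node, target_key, target_value):
    """Recursively collect every dictionary where node[target_key] == target_value."""
    matches = []
    if isinstance(node, dict):
        if node.get(target_key) == target_value:
            matches.append(node)
        # Keep searching inside the values, as matches may be nested.
        for value in node.values():
            matches.extend(
                find_dictionaries_by_key_value(value, target_key, target_value)
            )
    elif isinstance(node, list):
        for element in node:
            matches.extend(
                find_dictionaries_by_key_value(element, target_key, target_value)
            )
    return matches


# Cut-down stand-in for the John Lewis metadata, for illustration only.
sample_metadata = {
    "microdata": [],
    "json-ld": [
        {
            "@type": "WebPage",
            "breadcrumb": {
                "@type": "BreadcrumbList",
                "itemListElement": [
                    {"@type": "ListItem", "position": "1"},
                    {"@type": "ListItem", "position": "2"},
                ],
            },
        }
    ],
    "opengraph": [],
}

breadcrumbs = find_dictionaries_by_key_value(sample_metadata, "@type", "BreadcrumbList")
list_items = find_dictionaries_by_key_value(sample_metadata, "@type", "ListItem")
print(len(breadcrumbs), len(list_items))
```

Unlike the original helper, this version returns a list of every match rather than the first one, so an empty list means nothing was found.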

Give it a try…

Now that’s written, we can try it out on some real data using the metadata we collected from John Lewis and Box. This works, but you’ll need to ensure the target value provided (e.g. Organization, Product, or BreadcrumbList) matches the capitalisation used in the site’s metadata. Searching the John Lewis metadata for Product, for example, finds nothing because the site uses a lowercase product. You could change the function to make the comparison case insensitive.

Organization = get_dictionary_by_key_value(metadata_john_lewis, "@type", "Organization")
Organization
{'@context': 'http://schema.org',
 '@type': 'Organization',
 'name': 'John Lewis & Partners',
 'url': 'http://www.johnlewis.com',
 'sameAs': ['https://www.facebook.com/JohnLewisandPartners',
  'https://twitter.com/JLandPartners',
  'https://www.youtube.com/JohnLewisandPartners',
  'https://www.pinterest.co.uk/johnlewisandpartners',
  'https://www.instagram.com/JohnLewisandPartners',
  'https://www.linkedin.com/company/johnlewisandpartners'],
 'logo': 'https://www.johnlewis.com/static/assets/logo/john-lewis-logo.png',
 'contactPoint': [{'@type': 'ContactPoint',
   'telephone': '+44-3456-100-329',
   'contactType': 'customer service',
   'areaServed': 'GB'},
  {'@type': 'ContactPoint',
   'telephone': '+44-1698-54-54-54',
   'contactType': 'customer service'},
  {'@type': 'ContactPoint',
   'telephone': '+44-3301-230-106',
   'contactType': 'technical support'}]}
Product = get_dictionary_by_key_value(metadata_box, "@type", "Product")
Product
{'@context': 'https://schema.org/',
 '@type': 'Product',
 'name': 'Lenovo ThinkPad P15v Gen 1',
 'image': ['http://www.box.co.uk/image?id=4514680&quality=90',
  'http://www.box.co.uk/image?id=4514681&quality=90',
  'http://www.box.co.uk/image?id=4514682&quality=90',
  'http://www.box.co.uk/image?id=4514683&quality=90',
  'http://www.box.co.uk/image?id=4514684&quality=90',
  'http://www.box.co.uk/image?id=4514685&quality=90',
  'http://www.box.co.uk/image?id=4514686&quality=90',
  'http://www.box.co.uk/image?id=4514687&quality=90'],
 'gtin12': '195235161272',
 'category': 'Computing\\Laptops\\Laptops',
 'description': 'Lenovo ThinkPad P15v Gen 1, Intel Core i7-10750H Hexa core Processor Processor, 15.6" Full HD IPS Anti-Glare Screen, Microsoft Windows 10 Professional 64bit, 32GB RAM, 512GB SSD, NVIDIA Quadro P620 4GB Graphics, USB3 | HDMI | Bluetooth, Discrete TPM 2.0, 20TQ0048UK',
 'sku': 'W770544',
 'mpn': '20TQ0048UK',
 'brand': {'@type': 'Thing', 'name': 'Lenovo'},
 'aggregateRating': {'@type': 'AggregateRating',
  'ratingValue': 9.0,
  'bestRating': '10',
  'ratingCount': 4,
  'worstRating': '1'},
 'offers': {'@type': 'Offer',
  'url': 'https://www.box.co.uk/20TQ0048UK-Lenovo-ThinkPad-P15v-Gen-1_3199775.html',
  'priceCurrency': 'GBP',
  'price': 1229.49,
  'itemCondition': 'http://schema.org/NewCondition',
  'availability': 'http://schema.org/InStock'}}
BreadcrumbList = get_dictionary_by_key_value(metadata_box, "@type", "BreadcrumbList")
BreadcrumbList
{'@context': 'https://schema.org/',
 '@type': 'BreadcrumbList',
 'itemListElement': [{'@type': 'ListItem',
   'position': 1,
   'name': 'Computing',
   'item': 'https://www.box.co.uk/computing'},
  {'@type': 'ListItem',
   'position': 2,
   'name': 'Laptops',
   'item': 'https://www.box.co.uk/laptops'},
  {'@type': 'ListItem',
   'position': 3,
   'name': 'Lenovo',
   'item': 'https://www.box.co.uk/lenovo-laptops'},
  {'@type': 'ListItem',
   'position': 4,
   'name': 'ThinkPad P Series',
   'item': 'https://www.box.co.uk/thinkpad+p+series.htm'},
  {'@type': 'ListItem',
   'position': 5,
   'name': '20TQ0048UK',
   'item': 'https://www.box.co.uk/20TQ0048UK-Lenovo-ThinkPad-P15v-Gen-1_3199775.html'}]}
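
The capitalisation issue mentioned earlier could be handled by lowercasing both sides of the comparison. The sketch below is one way to do it, using a hypothetical helper called get_dictionary_by_type() and a small sample dictionary that mimics the lowercase product type used by John Lewis.

```python
def get_dictionary_by_type(metadata, target_type):
    """Return the first top-level item whose @type matches, ignoring case."""
    wanted = target_type.lower()
    for items in metadata.values():
        for item in items:
            item_type = item.get("@type")
            # @type may be missing or not a string; only compare strings here.
            if isinstance(item_type, str) and item_type.lower() == wanted:
                return item
    return None


# Small stand-in for the John Lewis metadata, for illustration only.
sample_metadata = {
    "microdata": [],
    "json-ld": [{"@type": "product", "sku": "238471012"}],
    "opengraph": [],
}

match = get_dictionary_by_type(sample_metadata, "Product")
missing = get_dictionary_by_type(sample_metadata, "Organization")
print(match)
```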

Since the dictionaries returned are all in a standard format (apart from the sometimes non-standard capitalisation used for @type values), they can be parsed easily in Python and Pandas, allowing you to manipulate, analyse, or store the data with ease.
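
As a rough illustration of that last step, nested items such as brand and offers can be flattened into a single-level dictionary that maps neatly onto one row of a table. The flatten() helper below is hypothetical, and sample_product is a shortened copy of the Box Product item shown above.

```python
def flatten(dictionary, parent_key="", separator="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in dictionary.items():
        new_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, separator))
        else:
            # Lists and scalar values are kept as they are.
            flat[new_key] = value
    return flat


# Shortened copy of the Box Product item, for illustration only.
sample_product = {
    "@type": "Product",
    "name": "Lenovo ThinkPad P15v Gen 1",
    "brand": {"@type": "Thing", "name": "Lenovo"},
    "offers": {"@type": "Offer", "priceCurrency": "GBP", "price": 1229.49},
}

row = flatten(sample_product)
print(row)
```

A list of rows like this could then be passed to pd.DataFrame(). Pandas also provides pd.json_normalize(), which does a similar job on nested dictionaries.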

Matt Clarke, Saturday, March 13, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
