How to create a product and price metadata scraper

Learn how to keep tabs on your competitors' pricing by building an ecommerce price scraper that uses schema.org metadata instead of custom HTML parsing.

Picture by Freddie Marriage, Unsplash.

In ecommerce, price monitoring is an important consideration. If you price your products too high for the market, you may lose sales to rivals and see your customers defect to those offering keener pricing.

Conversely, if you’re the only seller of a particular line and your customers have few alternatives to switch to, you may get away with increasing your prices a little. Either way, price scraping is usually well worth the effort and quickly pays for itself.

In this project, I’ll show you how you can create a metadata scraper to build a dataset of products and prices. Unlike standard web scrapers, this one utilises schema.org markup, so theoretically, a single scraper can work on almost any site. Here’s how it’s done.

Load the packages

Open a Jupyter notebook and import the packages below. We’ll be using Pandas for data manipulation, Extruct for parsing metadata, BeautifulSoup for sitemap parsing, plus a few other packages from the Python core.

import pandas as pd
import extruct
import requests
from w3lib.html import get_base_url
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup

Although I’m using Requests for the scraping here, I’d highly recommend Scrapy for larger projects. It’s far better suited to large crawls and can run concurrent threads, so it will be much quicker if you need to scrape several sites. Requests is best kept for small tasks.
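Whichever client you use, it pays to identify your crawler with a descriptive User-Agent and throttle your requests so you don’t hammer the server. Here’s a minimal sketch using the standard library; the delay and User-Agent string are assumptions, not values from this project.

```python
import time
import urllib.request

# Assumed crawl delay in seconds between requests - tune for the target site
CRAWL_DELAY = 1

def polite_request(url, user_agent="price-monitor-bot/0.1"):
    """Return a urllib Request carrying a descriptive User-Agent header.

    The user_agent value here is a hypothetical example - use one that
    identifies you and, ideally, links to a contact page.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

# Usage sketch: sleep between fetches when looping over many URLs
# for u in urls:
#     response = urllib.request.urlopen(polite_request(u))
#     time.sleep(CRAWL_DELAY)
```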

Scrape the sitemap

The first step is to scrape the sitemap to obtain our list of URLs. The site we’re scraping in this project only has one sitemap, so there’s no need for a more complex solution. If you need to parse a sitemapindex instead of a urlset, please check out my guide to sitemap scraping.
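A quick way to tell the two formats apart before parsing is to inspect the sitemap’s root element: a urlset lists pages directly, while a sitemapindex lists further sitemaps. Here’s a minimal sketch using the standard library, assuming the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# The standard sitemaps.org namespace prefix ElementTree attaches to tags
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_type(xml_source):
    """Return 'urlset' or 'sitemapindex' depending on the root element."""
    root = ET.fromstring(xml_source)
    return root.tag.replace(SITEMAP_NS, "")
```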

def get_sitemap(url):
    """Scrapes an XML sitemap from the provided URL and returns XML source.

    Args:
        url (string): Fully qualified URL pointing to XML sitemap.

    Returns:
        xml (string): XML source of scraped sitemap.
    """

    response = urllib.request.urlopen(url)
    xml = BeautifulSoup(response, 
                         'lxml-xml', 
                         from_encoding=response.info().get_param('charset'))

    return xml

For this demonstration I’m scraping the Patch Plants website. This online retailer of house plants is a really nicely executed ecommerce site with excellent content and photography. They do lots of things very well indeed and have a really strong brand as a result.

url = "https://www.patchplants.com/sitemap_gb.xml"
xml = get_sitemap(url)

Once we’ve got back the raw XML of our sitemap, we can use Beautiful Soup to parse the XML and extract the elements we need. We’ll put each page into a dictionary and collect the dictionaries into a Pandas dataframe. This will take a few minutes on most sites.

def sitemap_to_dataframe(xml, name=None, verbose=False):
    """Read an XML sitemap into a Pandas dataframe. 

    Args:
        xml (string): XML source of sitemap. 
        name (string, optional): Optional name for the sitemap parsed.
        verbose (boolean, optional): Set to True to monitor progress.

    Returns:
        dataframe: Pandas dataframe of XML sitemap content. 
    """

    rows = []

    for url in xml.find_all("url"):

        # Each child element may be absent, so check it exists before reading
        loc = url.find("loc").text if url.find("loc") else ''
        domain = urlparse(loc).netloc if loc else ''
        changefreq = url.find("changefreq").text if url.find("changefreq") else ''
        priority = url.find("priority").text if url.find("priority") else ''

        row = {
            'domain': domain,
            'loc': loc,
            'changefreq': changefreq,
            'priority': priority,
            'sitemap_name': name if name else '',
        }

        if verbose:
            print(row)

        rows.append(row)

    # Build the dataframe in one go, which avoids the deprecated
    # DataFrame.append() and is much faster on large sitemaps
    return pd.DataFrame(rows, columns=['loc', 'changefreq', 'priority',
                                       'domain', 'sitemap_name'])
df = sitemap_to_dataframe(xml)
df.head()
loc changefreq priority domain sitemap_name
0 https://www.patchplants.com/gb/en/accessories/... daily 0.5 www.patchplants.com
1 https://www.patchplants.com/gb/en/bundles/bund... daily 0.5 www.patchplants.com
2 https://www.patchplants.com/gb/en/bundles/conc... daily 0.5 www.patchplants.com
3 https://www.patchplants.com/gb/en/bundles/dipp... daily 0.5 www.patchplants.com
4 https://www.patchplants.com/gb/en/bundles/frac... daily 0.5 www.patchplants.com
df.shape
(844, 5)
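The sitemap also lists non-product pages, which the scraper will simply visit and skip. If you’d rather trim the crawl first, you could filter the URL list by path. Here’s a sketch, assuming product URLs share path prefixes like those visible in the output above; inspect your own sitemap to find the right ones.

```python
from urllib.parse import urlparse

# Hypothetical path prefixes - adjust these for the site you're scraping
PRODUCT_PATHS = ('/gb/en/plants/', '/gb/en/bundles/', '/gb/en/accessories/')

def is_product_url(url):
    """Return True if the URL path starts with a known product path prefix."""
    return urlparse(url).path.startswith(PRODUCT_PATHS)

urls = [
    'https://www.patchplants.com/gb/en/plants/kentia-palm-17/',
    'https://www.patchplants.com/gb/en/about-us/',
]
product_urls = [u for u in urls if is_product_url(u)]
```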

Extract the metadata

Rather than writing a custom HTML scraper, we’ll take advantage of the schema.org microdata present on most ecommerce websites. The function below takes a URL from the sitemap and uses Extruct to extract any metadata present, whatever the format.

def extract_metadata(url):
    """Extract all metadata present in the page and return a dictionary of metadata lists. 
    
    Args:
        url (string): URL of page from which to extract metadata. 
    
    Returns: 
        metadata (dict): Dictionary of json-ld, microdata, and opengraph lists. 
        Each of the lists present within the dictionary contains multiple dictionaries.
    """
    
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text, 
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
metadata = extract_metadata('https://www.patchplants.com/gb/en/plants/kentia-palm-17/')

Extract the Product schema

Now we have the metadata from a product page, we can write another helper function to extract the data we want. This loops over the metadata and returns the first embedded dictionary that contains a target key-value pair within the schema.org markup. It should work on any schema.org metadata, irrespective of the syntax used.

def get_dictionary_by_key_value(dictionary, target_key, target_value):
    """Return a dictionary that contains a target key value pair. 
    
    Args:
        dictionary: Metadata dictionary containing lists of other dictionaries.
        target_key: Target key to search for within a dictionary inside a list. 
        target_value: Target value to search for within a dictionary inside a list. 
    
    Returns:
        target_dictionary: Target dictionary that contains target key value pair. 
    """
    
    for key in dictionary:
        if len(dictionary[key]) > 0:
            for item in dictionary[key]:
                # Use .get() so items lacking the target key don't raise a KeyError
                if item.get(target_key) == target_value:
                    return item
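One caveat: some sites nest their JSON-LD entities inside a '@graph' list, which the flat loop above won’t reach. A recursive variant could handle that. This is a sketch for that situation, not something this site’s markup requires.

```python
def find_in_graph(metadata, target_key, target_value):
    """Recursively search nested lists and dicts for a target key-value pair.

    A sketch for handling sites that nest JSON-LD entities under '@graph';
    returns the first matching dictionary, or None.
    """
    if isinstance(metadata, dict):
        if metadata.get(target_key) == target_value:
            return metadata
        # Descend into every value, since '@graph' can appear at any level
        for value in metadata.values():
            found = find_in_graph(value, target_key, target_value)
            if found:
                return found
    elif isinstance(metadata, list):
        for item in metadata:
            found = find_in_graph(item, target_key, target_value)
            if found:
                return found
    return None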

If we run the above function on a sample product page, we can examine the dictionary returned. From the below metadata we want to scrape the name, and the data for each of the child SKUs stored within the offers dictionary. The aim is to get a single row for each product variant or child SKU, which also references the parent SKU.

Product = get_dictionary_by_key_value(metadata, "@type", "Product")
Product
{'@context': 'http://schema.org/',
 '@type': 'Product',
 'name': 'Kentia Palm',
 'image': 'https://res.cloudinary.com/patch-gardens/image/upload/c_fill,f_auto,h_400,q_auto:good,w_400/v1602863316/idemraxp5a0pmwam0gos.jpg',
 'description': '<h2><iframe frameborder="0" height="315" scrolling="no" src="https://www.youtube.com/embed/5c1232GDvXg?rel=0" width="560"></iframe></h2>\r\n\r\n<h2>Details</h2>\r\n\r\n<p>Our Big Kens come in four different heights:</p>\r\n\r\n<p style="margin-left:0px; margin-right:0px">Plant height: 90-100cm; Nursery pot width: 19cm<br />\r\nPlant height: 120-130cm; Nursery pot width: 21cm<br />\r\nPlant height:140-150cm; Nursery pot width: 24cm<br />\r\nPlant height: 190-200cm; Nursery pot width: 27cm</p>\r\n\r\n<p style="margin-left:0px; margin-right:0px">(All plant heights are measured from the base of the pot)</p>\r\n\r\n<p style="margin-left:0px; margin-right:0px">All of our Kens needs a 1m-diameter site to make space for his soft, arching fronds. While this spread shouldn\'t increase much with his growth, he may grow by 2ft in height before you need to repot him.</p>\r\n\r\n<p style="margin-left:0px; margin-right:0px">Give our 120-130cm & 140-150cm Big Kens a boost with our lovely new 25cm <a href="https://patch.garden/accessories/plant-stand-261/">plant stand</a>!</p>\r\n\r\n<h2>Likes</h2>\r\n\r\n<p>Copes with medium to low light but happiest in a well-lit spot. During Winter, it\'s best to move him nearer windows to maximise light levels.</p>\r\n\r\n<h2>Dislikes</h2>\r\n\r\n<p>Ken isn\'t the biggest fan of dust collecting on his leaves. It won\'t cause damage but you will notice he doesn\'t look his best. So, occasionally wipe any dust of the leaves with a damp cloth. It\'s a little time-consuming but the results are worth it.</p>\r\n\r\n<h2>Health benefits</h2>\r\n\r\n<p>Found to remove volatile organic compounds (VOCs) that emit from synthetic building materials and furnishings. Also cleanses the air of Carbon Dioxide. Win win.</p>\r\n\r\n<h2>Watering and feeding</h2>\r\n\r\n<p>Water once per 15 days approximately in the Winter and once per week in the Spring and Summer. 
Always keep in mind that the first 2 inches of soil need to be dry before watering thoroughly. He\'ll like a monthly feed during the growing season diluted at half strength, but not during Winter.</p>\r\n\r\n<h2>Living with pets and children</h2>\r\n\r\n<p>Nothing to worry about here. That said, we wouldn\'t recommend eating his leaves!</p>',
 'offers': {'@type': 'AggregateOffer',
  'itemCondition': 'http://schema.org/AggregateOffer',
  'offers': [{'@type': 'Offer',
    'itemCondition': 'http://schema.org/offers',
    'name': 'Big Ken - 90-100cm',
    'sku': 'PLANT-1-001-001',
    'url': '/gb/en/plants/kentia-palm-17/?variant=146',
    'availability': 'http://schema.org/InStock',
    'price': '50.00',
    'priceCurrency': 'GBP'},
   {'@type': 'Offer',
    'itemCondition': 'http://schema.org/offers',
    'name': 'Big Ken - 150-160cm',
    'sku': 'PLANT-1-001-003',
    'url': '/gb/en/plants/kentia-palm-17/?variant=145',
    'availability': 'http://schema.org/InStock',
    'price': '95.00',
    'priceCurrency': 'GBP'},
   {'@type': 'Offer',
    'itemCondition': 'http://schema.org/offers',
    'name': 'Big Ken - 120-130cm',
    'sku': 'PLANT-1-001-002',
    'url': '/gb/en/plants/kentia-palm-17/?variant=80',
    'availability': 'http://schema.org/InStock',
    'price': '70.00',
    'priceCurrency': 'GBP'},
   {'@type': 'Offer',
    'itemCondition': 'http://schema.org/offers',
    'name': 'Big Ken - 170-180cm',
    'sku': 'PLANT-1-001-004',
    'url': '/gb/en/plants/kentia-palm-17/?variant=147',
    'availability': 'http://schema.org/InStock',
    'price': '140.00',
    'priceCurrency': 'GBP'}],
  'lowPrice': '50.00',
  'priceCurrency': 'GBP',
  'highPrice': '140.00'},
 'brand': {'@type': 'Thing', 'name': 'Patch'},
 'alternateName': ['Big Ken', 'Kentia Palm', 'Thatch Palm', 'Flat Palm'],
 'sameAs': 'https://en.wikipedia.org/wiki/Howea_forsteriana'}

Extract the products

We can now scrape the sitemap, extract the URLs, fetch the microdata from each page, and extract the Product schema if it’s present. Since most product pages include multiple child products or product variants, we need to write another little function to extract these.

Child products are typically stored in an element called offers, so we’ll look for that and then loop over its contents. On each iteration, we create a new dictionary called product which contains a row to add to our dataframe.

This contains the parent_name for the product range, and the name, sku, url, availability, price, and priceCurrency for each child product found. Each dictionary is added to the products list returned.

def get_products(metadata):
    """Return a list containing the products found on a product page. 
    
    Args:
        metadata (string): Schema.org metadata to parse. 
    
    Returns: 
        products (list): A list of dictionaries containing products found. 
    """
    
    Product = get_dictionary_by_key_value(metadata, "@type", "Product")

    if Product:
    
        products = []

        # Use .get() so pages without nested offers don't raise a KeyError
        for offer in Product.get('offers', {}).get('offers', []):
            product = {
                'parent_name': Product.get('name', ''),
                'name': offer.get('name', ''),
                'sku': offer.get('sku', ''),
                'url': offer.get('url', ''),
                'availability': offer.get('availability', ''),
                'price': offer.get('price', ''),
                'priceCurrency': offer.get('priceCurrency', ''),
            }

            products.append(product)

        return products
products = get_products(metadata)
products
[{'parent_name': 'Kentia Palm',
  'name': 'Big Ken - 90-100cm',
  'sku': 'PLANT-1-001-001',
  'url': '/gb/en/plants/kentia-palm-17/?variant=146',
  'availability': 'http://schema.org/InStock',
  'price': '50.00',
  'priceCurrency': 'GBP'},
 {'parent_name': 'Kentia Palm',
  'name': 'Big Ken - 150-160cm',
  'sku': 'PLANT-1-001-003',
  'url': '/gb/en/plants/kentia-palm-17/?variant=145',
  'availability': 'http://schema.org/InStock',
  'price': '95.00',
  'priceCurrency': 'GBP'},
 {'parent_name': 'Kentia Palm',
  'name': 'Big Ken - 120-130cm',
  'sku': 'PLANT-1-001-002',
  'url': '/gb/en/plants/kentia-palm-17/?variant=80',
  'availability': 'http://schema.org/InStock',
  'price': '70.00',
  'priceCurrency': 'GBP'},
 {'parent_name': 'Kentia Palm',
  'name': 'Big Ken - 170-180cm',
  'sku': 'PLANT-1-001-004',
  'url': '/gb/en/plants/kentia-palm-17/?variant=147',
  'availability': 'http://schema.org/InStock',
  'price': '140.00',
  'priceCurrency': 'GBP'}]
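This works because Patch marks each product up as an AggregateOffer with a nested offers list. On other sites, a Product’s offers value may instead be a single Offer dictionary or a plain list, so a small normaliser helps if you point the scraper elsewhere. Here’s a sketch, assuming those common shapes; it isn’t needed for this site.

```python
def normalise_offers(product):
    """Return a list of offer dicts, whatever shape the markup uses.

    A sketch covering three common schema.org patterns: a plain list of
    Offers, an AggregateOffer wrapping a nested list, and a single Offer.
    """
    offers = product.get('offers', {})
    if isinstance(offers, list):
        return offers                    # already a list of Offers
    if offers.get('@type') == 'AggregateOffer':
        return offers.get('offers', [])  # unwrap the nested list
    if offers:
        return [offers]                  # wrap a single Offer dict
    return []
```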

Crawl the site and store the products

The final step is to crawl the site, scrape the products, and return a Pandas dataframe containing all of the products and prices found. This will take a while, so you’ll likely want to go away and have a cup of tea or take the dog for a walk while it’s running. At the end, it will spit out a dataframe of scraped product, stock, and price data.

def scrape_products(df, url='url'):
    """Scrapes every page in a Pandas dataframe column and returns products found. 

    Args:
        df: Pandas dataframe containing the URL list.
        url (string, optional): Optional name of URL column, if not 'url'.

    Returns:
        df: Pandas dataframe containing all products found. 
    """

    rows = []

    for index, row in df.iterrows(): 
        
        metadata = extract_metadata(row[url])          
        products = get_products(metadata)
        
        if products is not None:
            rows.extend(products)

    # Build the dataframe in one go, which avoids the deprecated
    # DataFrame.append() and is much faster on large crawls
    return pd.DataFrame(rows, columns=['parent_name', 'name', 'sku', 'url',
                                       'availability', 'price', 'priceCurrency'])
df_products = scrape_products(df, url='loc')
df_products.to_csv('patch.csv', index=False)
df_products.head()
parent_name name sku url availability price priceCurrency
0 Bundle of love Bundle of love - Pots included BUNDLE-1-020-002 /gb/en/bundles/bundle-of-love-664/?variant=1400 http://schema.org/InStock 95.00 GBP
1 Bundle of love Bundle of love - Pots not included BUNDLE-1-020-001 /gb/en/bundles/bundle-of-love-664/?variant=1399 http://schema.org/InStock 35.00 GBP
2 Concrete pot in stand Concrete pot in stand - 22cm Light BUNDLE-1-018-002 /gb/en/bundles/concrete-pot-in-stand-663/?vari... http://schema.org/OutOfStock 79.00 GBP
3 Concrete pot in stand Concrete pot in stand - 22cm Dark BUNDLE-1-018-001 /gb/en/bundles/concrete-pot-in-stand-663/?vari... http://schema.org/OutOfStock 79.00 GBP
4 Dipped pot in stand Dipped pot in stand - 25cm Pink BUNDLE-1-017-003 /gb/en/bundles/dipped-pot-in-stand-662/?varian... http://schema.org/InStock 105.00 GBP
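To turn this output into a price monitor, re-run the scraper on a schedule and diff consecutive snapshots by SKU. Here’s a minimal sketch of that comparison step; the snapshot shape is an assumption, and you could build it from the saved CSVs with Pandas.

```python
def price_changes(old, new):
    """Compare two {sku: price} snapshots and return the SKUs that changed.

    A sketch of the monitoring step: assumes each snapshot is a dict
    mapping SKU to price, e.g. built from two runs of the scraper.
    """
    changes = {}
    for sku, new_price in new.items():
        old_price = old.get(sku)
        # Only report SKUs present in both snapshots whose price differs
        if old_price is not None and old_price != new_price:
            changes[sku] = (old_price, new_price)
    return changes

yesterday = {'PLANT-1-001-001': '50.00', 'PLANT-1-001-002': '70.00'}
today = {'PLANT-1-001-001': '55.00', 'PLANT-1-001-002': '70.00'}
```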

Matt Clarke, Sunday, March 14, 2021

Matt is a Digital Director who uses data science in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
