Since many modern websites use JavaScript and JSON to build their pages, you can sometimes find public-facing APIs buried in the page code that serve structured data you can scrape easily, without writing a custom HTML scraper for every site you encounter.
Shopify ecommerce sites are a great example of this web development technique. Shopify builds its front-end pages using JavaScript and accesses the data it needs via a public-facing API that serves up structured data in JSON format. One of these APIs is called `products.json` and it's found at the document root of all Shopify sites.
The `products.json` file contains information on a Shopify site's entire catalogue. This includes every product name, ID, SKU, URL, image, price, description, and a host of other values. In this project I'll show you how to build a web scraper that accesses `products.json` and exports the entire Shopify product catalogue to a Pandas dataframe.
If you want a quicker approach, I'll also show you how to use my ShopifyScraper package to scrape an entire Shopify store in just a few lines of code, usually in under a minute.
To get started, open a Jupyter notebook and import the `json`, `requests`, and `pandas` packages. You'll likely have all of these installed already.
```python
import json
import pandas as pd
import requests
```
Our first step is to create a function that fetches the contents of the `products.json` file. This file is in JavaScript Object Notation, or JSON, format, which is very similar to a Python dictionary. The `products.json` file is found at the document root of every Shopify store, so you can view it for the site you want to scrape in your browser by visiting https://example.com/products.json.
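To get a feel for the structure before writing any code, it can help to parse a trimmed-down sample by hand. The snippet below uses an illustrative, made-up sample rather than real store data; a real response contains many more fields per product.

```python
import json

# Illustrative, trimmed-down sample of the products.json structure
# (a real response contains many more fields per product).
sample = '''
{
  "products": [
    {
      "id": 123,
      "title": "Example Product",
      "handle": "example-product",
      "variants": [{"id": 456, "price": "12.95", "available": true}]
    }
  ]
}
'''

data = json.loads(sample)
print(list(data.keys()))              # ['products']
print(data['products'][0]['handle'])  # example-product
```

The top-level `products` key holds a list of product dictionaries, which is what we'll convert to a dataframe shortly.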
The `products.json` file takes various `GET` parameters, including `limit` and `page`, which control pagination. We'll set the `limit` parameter to 250, so we get back the maximum number of products per page, but we'll make the `page` value a variable, so we can use the function to loop through paginated content. We'll use a Python try/except block to handle any errors that might occur.
```python
def get_json(url, page):
    """
    Get Shopify products.json from a store URL.

    Args:
        url (str): URL of the store.
        page (int): Page number of the products.json.

    Returns:
        products_json: Products.json from the store.
    """
    try:
        response = requests.get(f'{url}/products.json?limit=250&page={page}', timeout=5)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as error_http:
        print("HTTP Error:", error_http)
    except requests.exceptions.ConnectionError as error_connection:
        print("Connection Error:", error_connection)
    except requests.exceptions.Timeout as error_timeout:
        print("Timeout Error:", error_timeout)
    except requests.exceptions.RequestException as error:
        print("Error:", error)
```
If you pass a URL to the `get_json()` function, such as https://example.com, along with a `page` value, e.g. 1, you'll get back the raw JSON for the first page of products.
```python
products_json = get_json('https://shop.brooklynmuseum.org', 1)
```
Next, we'll create a function called `to_df()` that takes our `products_json` data in JSON format and converts it to a Python dictionary using the `json.loads()` function. It then takes the `products` element of the dictionary, which contains a list of the products in the result set, and uses the Pandas `from_dict()` function to turn that into a Pandas dataframe.
```python
def to_df(products_json):
    """
    Convert products.json to a Pandas DataFrame.

    Args:
        products_json (str): Products.json from the store.

    Returns:
        df: Pandas DataFrame of the products.json.
    """
    try:
        products_dict = json.loads(products_json)
        df = pd.DataFrame.from_dict(products_dict['products'])
        return df
    except Exception as e:
        print(e)
```
The `products.json` file we retrieved is set to show the maximum of 250 product ranges per page. If there are more than 250 products in the catalogue, we'll need to fetch each page of results and append the data in order to build up a dataset that contains every product in the Shopify product catalogue.
To do this we'll create a function called `get_products()` to which we'll pass our Shopify store `url`. We'll create a variable called `results` which we'll set to `True`, set the `page` value to `1`, and create an empty Pandas dataframe in which to store the product range data.
We'll then create a `while` loop that runs `get_json()` to retrieve the `products.json` data, then uses `to_df()` to turn it into a Pandas dataframe. If the `products_dict` dataframe we get back is empty, we'll end execution and return the combined dataframe. If it's not empty, we'll append the results to the dataframe and try the next page.
```python
def get_products(url):
    """
    Get all products from a store.

    Args:
        url (str): URL of the store.

    Returns:
        df: Pandas DataFrame of the products.json.
    """
    results = True
    page = 1
    df = pd.DataFrame()
    while results:
        products_json = get_json(url, page)
        products_dict = to_df(products_json)
        if len(products_dict) == 0:
            break
        else:
            df = pd.concat([df, products_dict], ignore_index=True)
            page += 1
    df['url'] = f"{url}/products/" + df['handle']
    return df
```
If you run the `get_products()` function on the Shopify store URL you want to scrape, in a few seconds you'll get back a Pandas dataframe containing all the product ranges in the catalogue.
This includes: the `id` of the product; the `title` of the product; the `handle` or URL key of the product; the `body_html` of the page (which contains the product description); the `published_at`, `created_at`, and `updated_at` dates; the `vendor`; the `product_type` value; the `tags` the merchant assigned to the product; and the full `url` of the product page.
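One thing worth knowing: the `published_at`, `created_at`, and `updated_at` columns come back as ISO 8601 strings, so if you want to sort or filter by date you'll need to convert them first. The dataframe below is a small mock of that structure, not real scraped data.

```python
import pandas as pd

# Mock of the date columns as they come back from products.json
# (illustrative values, not real store data).
df = pd.DataFrame({
    'title': ['Product A', 'Product B'],
    'created_at': ['2022-07-21T12:54:00-04:00', '2022-01-05T09:00:00-05:00'],
})

# Convert the ISO 8601 strings to timezone-aware datetimes.
# utc=True normalises the differing offsets to a single timezone.
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)

# Now the column sorts chronologically rather than alphabetically.
newest = df.sort_values('created_at', ascending=False).iloc[0]
print(newest['title'])  # Product A
```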
There are also three lists stored in `variants`, `images`, and `options`. The `variants` column is the most interesting as it contains the product variants, or children, of the parent product range. This includes the price of each product, so we'll extract this next.
```python
products = get_products('https://shop.brooklynmuseum.org')
products.head(1).T
```
|   | 0 |
|---|---|
| id | 7195243380932 |
| title | A Team With No Sport: Virgil Abloh Pyrex Visio... |
| handle | a-team-with-no-sport-virgil-abloh-pyrex-vision... |
| body_html | `<meta charset="utf-8"><span data-mce-fragment=...` |
| published_at | 2022-08-22T16:00:15-04:00 |
| created_at | 2022-07-21T12:54:00-04:00 |
| updated_at | 2022-08-22T16:00:15-04:00 |
| vendor | Prestel Art Books |
| product_type | Books-General-1402 |
| tags | [Brand_Prestel Art Books, Type_Books] |
| variants | [{'id': 41694576607428, 'title': 'Default Titl... |
| images | [{'id': 32501394276548, 'created_at': '2022-07... |
| options | [{'name': 'Title', 'position': 1, 'values': ['... |
| url | https://shop.brooklynmuseum.org/products/a-tea... |
Extracting the product variants, or children of a parent, requires us to parse the list stored in the `variants` column of the products dataframe and iterate through the results. This isn't very efficient code-wise, but it's still not too slow, and it's certainly much more efficient than scraping and parsing every page, so I wouldn't lose too much sleep over it.
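As an aside, if you'd rather avoid nested loops, pandas can flatten a column of lists of dicts with `explode()` and `json_normalize()`. The sketch below uses a small mock dataframe with the same shape as the scraped products data, so treat it as an alternative approach rather than the method used in this project.

```python
import pandas as pd

# Mock products dataframe: each row holds a list of variant dicts,
# mirroring the shape of the real 'variants' column.
products = pd.DataFrame({
    'id': [1, 2],
    'title': ['Product A', 'Product B'],
    'variants': [
        [{'id': 10, 'price': '12.95'}, {'id': 11, 'price': '14.95'}],
        [{'id': 20, 'price': '9.95'}],
    ],
})

# One row per variant, then expand each variant dict into columns.
exploded = products.explode('variants', ignore_index=True)
variants = pd.json_normalize(exploded['variants'].tolist())

# Re-attach the parent id and title alongside the variant columns.
variants['parent_id'] = exploded['id']
variants['parent_title'] = exploded['title']
print(len(variants))  # 3
```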
To extract the product variants we'll create a function called `get_variants()` to which we'll pass the Pandas dataframe of products returned from the `get_products()` function. First, to facilitate a join, we need to cast the `id` column to `int`, then create an empty dataframe called `df_variants` in which to store the data we retrieve.
We'll then use `itertuples()` to iterate over the rows in our dataframe using a `for` loop, with a nested `for` loop to extract each individual product variant from the list stored in the `variants` column. Each variant we find will be appended to `df_variants` with its data.
We'll then extract some other useful data from the `products` dataframe, such as the `id`, `title`, and `vendor` of the product range, then join a subset dataframe of these values to our dataframe of variants using `merge()`, using the Pandas `rename()` function to change the column names.
```python
def get_variants(products):
    """Get variants from a dataframe of products.

    Args:
        products (pd.DataFrame): Pandas dataframe of products from get_products()

    Returns:
        df_variants (pd.DataFrame): Pandas dataframe of variants
    """
    products['id'] = products['id'].astype(int)
    df_variants = pd.DataFrame()
    for row in products.itertuples():
        for variant in getattr(row, 'variants'):
            df_variants = pd.concat([df_variants, pd.DataFrame([variant])], ignore_index=True)
    df_variants['id'] = df_variants['id'].astype(int)
    df_variants['product_id'] = df_variants['product_id'].astype(int)
    df_parent_data = products[['id', 'title', 'vendor']]
    df_parent_data = df_parent_data.rename(columns={'title': 'parent_title', 'id': 'parent_id'})
    df_variants = df_variants.merge(df_parent_data, left_on='product_id', right_on='parent_id')
    return df_variants
```
To get back our variants dataframe, which includes the price of every child product, or variant, in the Shopify store, we can simply run the function below. Since it iterates over every row, and then again over each variant found, this process can take a minute or so to run, depending on the size of the Shopify store.

If you run the function, you'll see we get back a neat dataframe containing all the product details, including the `price` and whether each item is in or out of stock (`available`). These two quick functions, therefore, have allowed us to scrape an entire Shopify site in under a minute, and they'll work for any Shopify site we wish to scrape.
```python
variants = get_variants(products)
variants.head(1).T
```
|   | 0 |
|---|---|
| available | True |
| compare_at_price | None |
| created_at | 2022-07-21T12:54:00-04:00 |
| featured_image | None |
| grams | 454 |
| id | 41694576607428 |
| option1 | Default Title |
| option2 | None |
| option3 | None |
| position | 1 |
| price | 12.95 |
| product_id | 7195243380932 |
| requires_shipping | True |
| sku |  |
| taxable | True |
| title | Default Title |
| updated_at | 2022-08-18T09:10:55-04:00 |
| parent_id | 7195243380932 |
| parent_title | A Team With No Sport: Virgil Abloh Pyrex Visio... |
| vendor | Prestel Art Books |
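With the variants dataframe in hand, stock and price questions become one-liners. The example below uses a small mock with the same columns; note that Shopify serves `price` as a string, so it needs casting before any numeric comparison.

```python
import pandas as pd

# Mock variants data with the same columns as the scraped output
# (illustrative values, not real store data).
variants = pd.DataFrame({
    'sku': ['A1', 'A2', 'B1'],
    'price': ['12.95', '24.00', '9.50'],
    'available': [True, False, True],
})

# Cast the string prices to floats before comparing.
variants['price'] = variants['price'].astype(float)

# Find in-stock items priced under 15.
in_stock_cheap = variants[variants['available'] & (variants['price'] < 15)]
print(in_stock_cheap['sku'].tolist())  # ['A1', 'B1']
```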
If you want to save yourself the bother of writing out all that code, you can use my `ShopifyScraper` Python package, which wraps up everything above and lets you scrape a Shopify site to Pandas in just a few lines of code. To use it, simply install the package from GitHub using the command below:
```python
!pip3 install git+https://github.com/practical-data-science/ShopifyScraper.git
```
Next, import the `scraper` module from `shopify_scraper`, define the URL of the Shopify store you want to scrape, and run `scraper.get_products()` to get back a Pandas dataframe containing the parent product ranges, which we'll store in a dataframe called `parents`. Then pass `parents` to `scraper.get_variants()` and it will loop over the dataframe to extract the children, or product variants.
```python
from shopify_scraper import scraper

url = "https://shop.brooklynmuseum.org"
parents = scraper.get_products(url)
children = scraper.get_variants(parents)
```
Now you've got the `parents` and `children` dataframes, you can analyse them, track stock availability and prices, write the data to a database, or save it to CSV or other file formats.
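For instance, saving both dataframes to CSV takes one line each. The filenames here are arbitrary, and the tiny dataframes are mock stand-ins for the real scraped data:

```python
import pandas as pd

# Mock stand-ins for the parents and children dataframes.
parents = pd.DataFrame({'id': [1], 'title': ['Example Product']})
children = pd.DataFrame({'id': [10], 'product_id': [1], 'price': ['12.95']})

# Write each dataframe to CSV, dropping the index column.
parents.to_csv('parents.csv', index=False)
children.to_csv('children.csv', index=False)

# Round-trip check: read one file back to confirm it was written.
reloaded = pd.read_csv('parents.csv')
print(reloaded.shape)  # (1, 2)
```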
Matt Clarke, Tuesday, August 23, 2022