Since many modern websites use JavaScript and JSON to build their pages, you can sometimes find public-facing APIs buried in the page code that serve structured data you can scrape easily, without writing a custom HTML scraper for every site you encounter.
Shopify ecommerce sites are a great example of this web development technique. Shopify builds its front-end pages using JavaScript and accesses the data it needs via a public-facing API that serves up structured data in JSON format. One of these APIs is called `products.json` and it's found at the document root of all Shopify sites.
The `products.json` file contains information on a Shopify site's entire catalogue. This includes every product name, ID, SKU, URL, image, price, description, and a host of other values. In this project I'll show you how to build a web scraper that accesses `products.json` and exports the entire Shopify product catalogue to a Pandas dataframe.
If you want a quicker approach, I'll also show you how to use my ShopifyScraper package to scrape an entire Shopify store in just a few lines of code, usually in under a minute.
To get started, open a Jupyter notebook and import the `json`, `requests`, and `pandas` packages. You'll likely have all of these installed already.
```python
import json
import pandas as pd
import requests
```
Our first step is to create a function that fetches the contents of the `products.json` file. This file is in JavaScript Object Notation, or JSON, format, which is very similar to a Python dictionary. The `products.json` file is found at the document root of every Shopify store, so you can view it for the site you want to scrape in your browser by visiting https://example.com/products.json.
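To get a feel for the structure before writing any code, it can help to parse a trimmed-down sample by hand. The snippet below uses an illustrative, made-up sample rather than real store data; a real response contains many more fields per product.

```python
import json

# Illustrative, trimmed-down sample of the products.json structure
# (a real response contains many more fields per product).
sample = '''
{
  "products": [
    {
      "id": 123,
      "title": "Example Product",
      "handle": "example-product",
      "variants": [{"id": 456, "price": "12.95", "available": true}]
    }
  ]
}
'''

data = json.loads(sample)
print(list(data.keys()))              # ['products']
print(data['products'][0]['handle'])  # example-product
```

The top-level `products` key holds a list of product dictionaries, which is what we'll convert to a dataframe shortly.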
The `products.json` file takes various `GET` parameters, including `limit` and `page`, which control pagination. We'll set the `limit` parameter to 250, so we get back the maximum number of products per page, but we'll make the `page` value a variable, so we can use the function to loop through paginated content. We'll use a Python try/except block to handle any errors that might occur.
```python
def get_json(url, page):
    """
    Get Shopify products.json from a store URL.

    Args:
        url (str): URL of the store.
        page (int): Page number of the products.json.

    Returns:
        products_json: Products.json from the store.
    """
    try:
        response = requests.get(f'{url}/products.json?limit=250&page={page}', timeout=5)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as error_http:
        print("HTTP Error:", error_http)
    except requests.exceptions.ConnectionError as error_connection:
        print("Connection Error:", error_connection)
    except requests.exceptions.Timeout as error_timeout:
        print("Timeout Error:", error_timeout)
    except requests.exceptions.RequestException as error:
        print("Error:", error)
```
If you pass a URL to the `get_json()` function, such as https://example.com, along with a `page` value, e.g. 1, you'll get back the raw JSON for the first page of products.
```python
products_json = get_json('https://shop.brooklynmuseum.org', 1)
```
Next, we'll create a function called `to_df()` that takes our `products_json` data in JSON format and converts it to a Python dictionary using the `json.loads()` function. It then takes the `products` element of the dictionary, which contains a list of the products in the result set, and uses the Pandas `from_dict()` function to turn that into a Pandas dataframe.
```python
def to_df(products_json):
    """
    Convert products.json to a Pandas DataFrame.

    Args:
        products_json (str): Products.json from the store.

    Returns:
        df: Pandas DataFrame of the products.json.
    """
    try:
        products_dict = json.loads(products_json)
        df = pd.DataFrame.from_dict(products_dict['products'])
        return df
    except Exception as e:
        print(e)
```
The `products.json` file we retrieved is set to show the maximum of 250 product ranges per page. If there are more than 250 products in the catalogue, we'll need to fetch each page of results and append the data in order to build up a dataset that contains every product in the Shopify product catalogue.
To do this we'll create a function called `get_products()` to which we'll pass our Shopify store `url`. We'll create a variable called `results` which we'll set to `True`, set the `page` value to `1`, and create an empty Pandas dataframe in which to store the product range data.
We'll then create a `while` loop that runs `get_json()` to retrieve the `products.json` data, then uses `to_df()` to turn it into a Pandas dataframe. If the `products_dict` dataframe we get back is empty, we'll end execution and return the combined dataframe. If it's not empty, we'll append the results to the dataframe and try the next page.
```python
def get_products(url):
    """
    Get all products from a store.

    Args:
        url (str): URL of the store.

    Returns:
        df: Pandas DataFrame of the products.json.
    """
    results = True
    page = 1
    df = pd.DataFrame()
    while results:
        products_json = get_json(url, page)
        products_dict = to_df(products_json)
        if len(products_dict) == 0:
            break
        else:
            df = pd.concat([df, products_dict], ignore_index=True)
            page += 1
    df['url'] = f"{url}/products/" + df['handle']
    return df
```
If you run the `get_products()` function on the Shopify store URL you want to scrape, in a few seconds you'll get back a Pandas dataframe containing all the product ranges in the catalogue.
This includes: the `id` of the product; the `title` of the product; the `handle` or URL key of the product; the `body_html` of the page (which contains the product description); the `published_at`, `created_at`, and `updated_at` dates; the `vendor`; the `product_type` value; the `tags` the merchant assigned to the product; and the full `url` of the product page.
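One thing worth knowing: the `published_at`, `created_at`, and `updated_at` columns come back as ISO 8601 strings, so if you want to sort or filter by date you'll need to convert them first. The dataframe below is a small mock of that structure, not real scraped data.

```python
import pandas as pd

# Mock of the date columns as they come back from products.json
# (illustrative values, not real store data).
df = pd.DataFrame({
    'title': ['Product A', 'Product B'],
    'created_at': ['2022-07-21T12:54:00-04:00', '2022-01-05T09:00:00-05:00'],
})

# Convert the ISO 8601 strings to timezone-aware datetimes.
# utc=True normalises the differing offsets to a single timezone.
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)

# Now the column sorts chronologically rather than alphabetically.
newest = df.sort_values('created_at', ascending=False).iloc[0]
print(newest['title'])  # Product A
```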
There are also three lists stored in `variants`, `images`, and `options`. The `variants` column is the most interesting as it contains the product variants, or children, of the parent product range. This includes the price of each product, so we'll extract this next.
```python
products = get_products('https://shop.brooklynmuseum.org')
products.head(1).T
```
|   | 0 |
|---|---|
| id | 7195243380932 |
| title | A Team With No Sport: Virgil Abloh Pyrex Visio... |
| handle | a-team-with-no-sport-virgil-abloh-pyrex-vision... |
| body_html | `<meta charset="utf-8"><span data-mce-fragment=...` |
| published_at | 2022-08-22T16:00:15-04:00 |
| created_at | 2022-07-21T12:54:00-04:00 |
| updated_at | 2022-08-22T16:00:15-04:00 |
| vendor | Prestel Art Books |
| product_type | Books-General-1402 |
| tags | [Brand_Prestel Art Books, Type_Books] |
| variants | [{'id': 41694576607428, 'title': 'Default Titl... |
| images | [{'id': 32501394276548, 'created_at': '2022-07... |
| options | [{'name': 'Title', 'position': 1, 'values': ['... |
| url | https://shop.brooklynmuseum.org/products/a-tea... |
Extracting the product variants, or children of a parent, requires us to parse the list stored in the `variants` column of the products dataframe and iterate through the results. This isn't very efficient code-wise, but it's still not too slow, and it's certainly much more efficient than scraping and parsing every page, so I wouldn't lose too much sleep over it.
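As an aside, if you'd rather avoid nested loops, pandas can flatten a column of lists of dicts with `explode()` and `json_normalize()`. The sketch below uses a small mock dataframe with the same shape as the scraped products data, so treat it as an alternative approach rather than the method used in this project.

```python
import pandas as pd

# Mock products dataframe: each row holds a list of variant dicts,
# mirroring the shape of the real 'variants' column.
products = pd.DataFrame({
    'id': [1, 2],
    'title': ['Product A', 'Product B'],
    'variants': [
        [{'id': 10, 'price': '12.95'}, {'id': 11, 'price': '14.95'}],
        [{'id': 20, 'price': '9.95'}],
    ],
})

# One row per variant, then expand each variant dict into columns.
exploded = products.explode('variants', ignore_index=True)
variants = pd.json_normalize(exploded['variants'].tolist())

# Re-attach the parent id and title alongside the variant columns.
variants['parent_id'] = exploded['id']
variants['parent_title'] = exploded['title']
print(len(variants))  # 3
```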
To extract the product variants we'll create a function called `get_variants()` to which we'll pass the Pandas dataframe of products returned from the `get_products()` function. First, to facilitate a join, we need to cast the `id` column to `int`, then create an empty dataframe called `df_variants` in which to store the data we retrieve.
We'll then use `itertuples()` to iterate over the rows in our dataframe using a `for` loop, with a nested `for` loop to extract each individual product variant from the list stored in the `variants` column. Each variant we find will be appended to `df_variants` with its data.
We'll then extract some other useful data from the `products` dataframe, such as the `id`, `title`, and `vendor` of the product range, then join a subset dataframe of these values to our dataframe of variants using `merge()`, using the Pandas `rename()` function to change the column names.
```python
def get_variants(products):
    """Get variants from a dataframe of products.

    Args:
        products (pd.DataFrame): Pandas dataframe of products from get_products()

    Returns:
        df_variants (pd.DataFrame): Pandas dataframe of variants
    """
    products['id'] = products['id'].astype(int)
    df_variants = pd.DataFrame()
    for row in products.itertuples():
        for variant in getattr(row, 'variants'):
            df_variants = pd.concat([df_variants, pd.DataFrame([variant])], ignore_index=True)
    df_variants['id'] = df_variants['id'].astype(int)
    df_variants['product_id'] = df_variants['product_id'].astype(int)
    df_parent_data = products[['id', 'title', 'vendor']]
    df_parent_data = df_parent_data.rename(columns={'title': 'parent_title', 'id': 'parent_id'})
    df_variants = df_variants.merge(df_parent_data, left_on='product_id', right_on='parent_id')
    return df_variants
```
To get back our variants dataframe, which includes the price of every child product, or variant, in the Shopify store, we can simply run the function below. Since it iterates over every row, and then again over each variant found, this process can take a minute or so to run, depending on the size of the Shopify store.

If you run the function, you'll see we get back a neat dataframe containing all the product details, including the `price` and whether each item is in or out of stock (`available`). These two quick functions, therefore, have allowed us to scrape an entire Shopify site in under a minute, and they'll work for any Shopify site we wish to scrape.
```python
variants = get_variants(products)
variants.head(1).T
```
|   | 0 |
|---|---|
| available | True |
| compare_at_price | None |
| created_at | 2022-07-21T12:54:00-04:00 |
| featured_image | None |
| grams | 454 |
| id | 41694576607428 |
| option1 | Default Title |
| option2 | None |
| option3 | None |
| position | 1 |
| price | 12.95 |
| product_id | 7195243380932 |
| requires_shipping | True |
| sku |  |
| taxable | True |
| title | Default Title |
| updated_at | 2022-08-18T09:10:55-04:00 |
| parent_id | 7195243380932 |
| parent_title | A Team With No Sport: Virgil Abloh Pyrex Visio... |
| vendor | Prestel Art Books |
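With the variants dataframe in hand, stock and price questions become one-liners. The example below uses a small mock with the same columns; note that Shopify serves `price` as a string, so it needs casting before any numeric comparison.

```python
import pandas as pd

# Mock variants data with the same columns as the scraped output
# (illustrative values, not real store data).
variants = pd.DataFrame({
    'sku': ['A1', 'A2', 'B1'],
    'price': ['12.95', '24.00', '9.50'],
    'available': [True, False, True],
})

# Cast the string prices to floats before comparing.
variants['price'] = variants['price'].astype(float)

# Find in-stock items priced under 15.
in_stock_cheap = variants[variants['available'] & (variants['price'] < 15)]
print(in_stock_cheap['sku'].tolist())  # ['A1', 'B1']
```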
If you want to save yourself the bother of writing out all that code, you can use my `ShopifyScraper` Python package, which wraps up everything above and lets you scrape a Shopify site to Pandas in just a few lines of code. To use it, simply install the package from GitHub using the command below:
```python
!pip3 install git+https://github.com/practical-data-science/ShopifyScraper.git
```
Next, import the `scraper` module from `shopify_scraper`, define the URL of the Shopify store you want to scrape, and run `scraper.get_products()` to get back a Pandas dataframe containing the parent product ranges, which we'll store in a dataframe called `parents`. Then pass `parents` to `scraper.get_variants()` and it will loop over the dataframe to extract the children, or product variants.
```python
from shopify_scraper import scraper

url = "https://shop.brooklynmuseum.org"
parents = scraper.get_products(url)
children = scraper.get_variants(parents)
```
Now you've got the `parents` and `children` dataframes, you can analyse them, track stock availability and prices, write the data to a database, or save it to CSV or other file formats.
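For instance, saving both dataframes to CSV takes one line each. The filenames here are arbitrary, and the tiny dataframes are mock stand-ins for the real scraped data:

```python
import pandas as pd

# Mock stand-ins for the parents and children dataframes.
parents = pd.DataFrame({'id': [1], 'title': ['Example Product']})
children = pd.DataFrame({'id': [10], 'product_id': [1], 'price': ['12.95']})

# Write each dataframe to CSV, dropping the index column.
parents.to_csv('parents.csv', index=False)
children.to_csv('children.csv', index=False)

# Round-trip check: read one file back to confirm it was written.
reloaded = pd.read_csv('parents.csv')
print(reloaded.shape)  # (1, 2)
```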
Matt Clarke, Tuesday, August 23, 2022