XML feeds are a data format that uses Extensible Markup Language to provide structured data that can be read by search engines and online advertising providers. For example, a Google Shopping feed uses XML to provide structured data on the products a retailer sells so Google can use them to serve ads and create price comparisons.
Google Shopping feeds are critical to the performance of most ecommerce retail businesses. They need to be correctly formatted and contain all the relevant data Google requires to serve ads on Google Shopping and elsewhere. Google sets strict criteria on what Google Shopping feeds should include so ecommerce retailers need to check them regularly to ensure everything is present and correct.
The disadvantage of using XML for Google Shopping feeds is that they are inherently much harder for humans to read. While most technical people quickly pick up how to read simple markup languages like XML, reading them at scale and checking the data they contain is still quite cumbersome.
In this project I’ll show you how you can build an XML feed parser that reads an XML feed, such as your Google Shopping feed, and parses it into a neatly formatted Pandas dataframe that you can analyse or write to a CSV file. Not only is it perfect for Google Shopping, but it will also work on almost any other XML feed you throw at it. Here’s how it works.
First, fire up a new Jupyter notebook. For this project we’ll be using Pandas, a few core Python libraries, and the Beautiful Soup HTML and XML parser library. The standard Python packages will already be installed, and you’ll likely already have Pandas installed. However, you can install anything you don’t have by entering the below commands into a cell in a Jupyter notebook.
!pip3 install pandas
!pip3 install bs4
!pip3 install lxml
Requirement already satisfied: pandas in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (1.1.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (1.19.4)
Requirement already satisfied: six>=1.5 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: bs4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (0.0.1)
Requirement already satisfied: beautifulsoup4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from bs4) (4.9.3)
Requirement already satisfied: soupsieve>1.2; python_version >= "3.0" in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from beautifulsoup4->bs4) (2.0.1)
Requirement already satisfied: lxml in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (4.6.4)
Now that you have the required packages installed, you’ll need to import them using the commands below. To make the Pandas dataframe a little easier to read, I’ve passed in an additional set_option() command to increase the maximum number of rows shown.
import typing
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 10000)
Next, we need to write a little function to fetch the XML source of the remote XML feed we want to parse. We’ll do this with the urllib.request module from Python’s standard library. Depending on the server’s configuration, you may need to define a dictionary of headers that includes a User-Agent value, otherwise the server might reject your request.

Once we have the response back from urllib.request, we can use Beautiful Soup to parse what we find. We’ll use the lxml-xml XML parser for this. If you don’t have it installed, you’ll get an error that says “Couldn’t find a tree builder with the features you requested: lxml-xml. Do you need to install a parser library?”. If you missed the step above, you can install it by entering pip3 install lxml.
def get_feed(url):
    """Scrapes an XML feed from the provided URL and returns the parsed XML.

    Args:
        url (string): Fully qualified URL pointing to the XML feed.

    Returns:
        source (BeautifulSoup): Parsed XML of the scraped feed.
    """

    try:
        response = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': 'Mozilla'}))
        source = BeautifulSoup(response, 'lxml-xml', from_encoding=response.info().get_param('charset'))
        return source
    except Exception as e:
        print('Error: ' + str(e))
Next, enter the URL of the remote XML feed you want to download, pass it to the get_feed() function, and assign the result to a variable called xml. After a few seconds, depending on the size of the XML file you’re downloading, you’ll get back the raw XML of the feed. I’ve skipped out the URL here as the Google Shopping feed I’m parsing is commercially sensitive.
FEED_URL = 'https://yourdomain.com/google_shopping.xml'
xml = get_feed(FEED_URL)
To parse the XML feed we could simply define each element we want to extract, for example the title, and write specific code to handle each element. However, this is repetitive and will only work on feeds that have an identical schema. Instead, we’ll identify the elements inside the XML feed so we can dynamically extract them, whatever the feed contains.
The get_elements() function will do this for us. We’ll pass in the xml variable holding the parsed feed and define the name of the parent element. In a Google Shopping feed this is called item, so I’ve set that as the default value. Google Shopping feeds use RSS 2.0 extended with Google’s own namespace, http://base.google.com/ns/1.0.

If you look at the XML of a Google Shopping feed you’ll notice that many of the elements have a g: prefix, such as g:id. This prefix indicates that the element belongs to the Google namespace. When we use the find_all() function to extract the element names it will return them all. The namespace prefix won’t be shown, but it’s not required for parsing the elements, so there’s no need to worry about the difference.
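As a quick illustration of the prefixes being dropped, here’s a minimal item parsed in isolation (the product data is made up for this example):

```python
from bs4 import BeautifulSoup

# A minimal, made-up Google Shopping item for illustration
sample = """
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <g:id>ABC1</g:id>
      <title>Example product</title>
      <g:price>9.99 GBP</g:price>
    </item>
  </channel>
</rss>
"""

soup = BeautifulSoup(sample, 'lxml-xml')
item = soup.find('item')

# element.name returns the local name without the g: namespace prefix
print([element.name for element in item.find_all()])  # ['id', 'title', 'price']
```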
def get_elements(xml, item='item'):
    try:
        items = xml.find_all(item)
        elements = [element.name for element in items[0].find_all()]
        return elements
    except Exception as e:
        print('Error: ' + str(e))
elements = get_elements(xml)
elements
['id',
'title',
'description',
'google_product_category',
'product_type',
'link',
'image_link',
'additional_image_link',
'additional_image_link',
'condition',
'availability',
'price',
'sale_price',
'sale_price_effective_date',
'brand',
'gtin',
'mpn',
'identifier_exists',
'gender',
'age_group',
'size',
'item_group_id',
'color',
'is_bundle',
'material',
'pattern',
'shipping_weight',
'multipack',
'adult',
'adwords_grouping',
'adwords_labels',
'adwords_redirect',
'unit_pricing_measure',
'unit_pricing_base_measure',
'energy_efficiency_class',
'online_only']
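Note that additional_image_link appears twice in this list because the first item in the feed contains two of those elements. If you’d prefer one entry per unique element name, you can deduplicate the list while preserving order; a quick sketch on a shortened example list:

```python
# A shortened example of the element list above, with a duplicate entry
elements = ['id', 'title', 'additional_image_link', 'additional_image_link', 'price']

# dict.fromkeys() keeps the first occurrence of each name and preserves order
unique_elements = list(dict.fromkeys(elements))
print(unique_elements)  # ['id', 'title', 'additional_image_link', 'price']
```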
Now, we’ll create a function called feed_to_df() and pass it the URL of our XML feed and the name of the parent element we want to extract, which is item in the Google Shopping XML feed. We’ll then use get_feed() to fetch the raw XML and get_elements() to extract the list of XML element names.

We’ll then use if isinstance(elements, typing.List) to check that the elements variable contains a list of XML element names. If it does, we’ll use it to create the columns of an empty Pandas dataframe using df = pd.DataFrame(columns=elements), into which we’ll store a row for each item we find in the feed.

We’ll use the find_all() function from Beautiful Soup to parse the XML and return all the item elements, then loop over each one with a for loop. For each item, we’ll look for each element name from our list, parse its value into the row dictionary, and append each row to the df dataframe.
def feed_to_df(url, item='item'):
    xml = get_feed(url)
    elements = get_elements(xml)

    if isinstance(elements, typing.List):
        df = pd.DataFrame(columns=elements)
        items = xml.find_all(item)

        for item in items:
            row = {}
            for element in elements:
                # Store the element's text, or an empty string if the item lacks it
                if item.find(element):
                    row[element] = item.find(element).text
                else:
                    row[element] = ''
            df = df.append(row, ignore_index=True)
        return df
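To see the parsing logic in isolation, here’s a self-contained sketch that runs the same item-and-element loop over an inline two-item feed, without needing a network request (the product data is made up for illustration, and the rows are collected in a list rather than appended one at a time):

```python
import pandas as pd
from bs4 import BeautifulSoup

# A tiny, made-up feed so the loop can run without fetching a remote URL
sample = """
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item><g:id>1</g:id><title>Widget</title><g:price>9.99 GBP</g:price></item>
    <item><g:id>2</g:id><title>Gadget</title></item>
  </channel>
</rss>
"""

xml = BeautifulSoup(sample, 'lxml-xml')

# Derive the column names from the first item, as get_elements() does
elements = [element.name for element in xml.find('item').find_all()]

rows = []
for item in xml.find_all('item'):
    # Store the element's text, or an empty string if the item lacks it
    row = {element: item.find(element).text if item.find(element) else '' for element in elements}
    rows.append(row)

df = pd.DataFrame(rows, columns=elements)
print(df)
```

Note that the second item has no price, so its price column is an empty string, just as in the full function.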
If you run the feed_to_df() function you’ll see that the code fetches and parses the XML feed, dynamically generates a Pandas dataframe, puts each item in a row and each element in a column, and populates the values. If the feed changes, or if you want to parse an XML feed with a different schema, there’s no need to rewrite the code.
df = feed_to_df(FEED_URL)
df.head(1).T
| | 0 |
| --- | --- |
| id | ABC1234 |
| title | Dell Precision 7770 |
| description | Dell Precision 7770 workstation |
| google_product_category | |
| product_type | |
| link | https://www... |
| image_link | https://www... |
| additional_image_link | https://www... |
| additional_image_link | https://www... |
| condition | new |
| availability | out of stock |
| price | 2999.99 GBP |
| sale_price | 2499.99 GBP |
| sale_price_effective_date | 2021-01-01T00:00+0100/2099-01-01T00:00+0100 |
| brand | Dell |
| gtin | |
| mpn | AHBSI87388 |
| identifier_exists | TRUE |
| gender | |
| age_group | |
| size | |
| item_group_id | AHBSI87388 |
| color | |
| is_bundle | false |
| material | |
| pattern | |
| shipping_weight | |
| multipack | |
| adult | |
| adwords_grouping | |
| adwords_labels | |
| adwords_redirect | |
| unit_pricing_measure | |
| unit_pricing_base_measure | |
| energy_efficiency_class | |
| online_only | |
df.to_csv('current_feed.csv', index=False)
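With the feed in a dataframe, it’s also easy to run the kind of regular checks mentioned earlier, for example counting how many items are missing values for fields Google requires. Here’s a quick sketch on a made-up two-item extract; the required-field list is a hypothetical subset, so check Google’s product data specification for the full set that applies to your products:

```python
import pandas as pd

# A tiny made-up extract; in practice use the df returned by feed_to_df()
df = pd.DataFrame({
    'id': ['ABC1234', 'ABC1235'],
    'title': ['Dell Precision 7770', ''],
    'gtin': ['', ''],
})

# Hypothetical subset of Google's required fields
required = ['id', 'title', 'gtin']

# Count items where each field is missing or an empty string
missing = df[required].replace('', pd.NA).isna().sum()
print(missing)
```

Running this on your real feed gives you a per-field count of items that would fail Google’s validation for missing data.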
Matt Clarke, Saturday, December 11, 2021