How to read an XML feed into a Pandas dataframe

Learn how to create an XML feed parser that will read your Google Shopping feed (or any other XML file) into a Pandas dataframe whatever it contains.

XML feeds are a data format that uses Extensible Markup Language to provide structured data that can be read by search engines and online advertising providers. For example, a Google Shopping feed uses XML to provide structured data on the products a retailer sells so Google can use them to serve ads and create price comparisons.

Google Shopping feeds are critical to the performance of most ecommerce retail businesses. They need to be correctly formatted and contain all the relevant data Google requires to serve ads on Google Shopping and elsewhere. Google sets strict criteria on what Google Shopping feeds should include so ecommerce retailers need to check them regularly to ensure everything is present and correct.

The disadvantage of using XML for Google Shopping feeds is that it is much harder for humans to read than a tabular format such as CSV. While most technical people quickly pick up how to read a simple markup language like XML, reading feeds at scale and checking the data they contain is still cumbersome.

In this project I’ll show you how you can build an XML feed parser that reads an XML feed, such as your Google Shopping feed, and parses it into a neatly formatted Pandas dataframe that you can analyse or write to a CSV file. Not only is it perfect for Google Shopping, but it will also work on almost any other XML feed you throw at it. Here’s how it works.

Install the packages

First, fire up a new Jupyter notebook. For this project we’ll be using Pandas, a few core Python libraries, and the Beautiful Soup HTML and XML parser library. The standard Python packages will already be installed, and you’ll likely already have Pandas installed. However, you can install anything you don’t have by entering the below commands into a cell in a Jupyter notebook.

!pip3 install pandas
!pip3 install bs4
!pip3 install lxml
Requirement already satisfied: pandas in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (1.1.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (1.19.4)
Requirement already satisfied: six>=1.5 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: bs4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (0.0.1)
Requirement already satisfied: beautifulsoup4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from bs4) (4.9.3)
Requirement already satisfied: soupsieve>1.2; python_version >= "3.0" in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from beautifulsoup4->bs4) (2.0.1)
Requirement already satisfied: lxml in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (4.6.4)

Import the packages

Now you have the required packages installed, you’ll need to import them using the commands below. To make the Pandas dataframe a little easier to read, I’ve also added a set_option() call to increase the maximum number of rows displayed.

import typing
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 10000)

Get the XML feed source

Next, we need to write a little function to fetch the XML source of the remote XML feed we want to parse. The easiest way to do this is via urllib.request from the Python standard library. Depending on your server configuration, you may need to define a dictionary of headers that includes a User-Agent value, otherwise the server might reject your request.

Once we have the response back, we can use Beautiful Soup to parse what we find. We’ll use the lxml-xml XML parser for this. If you don’t have it installed, you’ll get an error that says “Error: Couldn’t find a tree builder with the features you requested: lxml-xml. Do you need to install a parser library?”. If you missed the step above, you can install it on an Ubuntu Linux data science workstation by entering pip3 install lxml.

def get_feed(url):
    """Scrapes an XML feed from the provided URL and returns XML source.
    
    Args:
        url (string): Fully qualified URL pointing to XML feed.

    Returns:
        source (string): XML source of scraped feed.
    """
    
    try:
        response = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': 'Mozilla'}))
        source = BeautifulSoup(response, 'lxml-xml', from_encoding=response.info().get_param('charset'))
        return source
    except Exception as e:
        print('Error: ' + str(e))

Next, enter the URL of the remote XML feed you want to download, pass it to the get_feed() function, and assign the result to a variable called xml. After a few seconds, depending on the size of the XML file you’re downloading, you’ll get back the raw XML of the feed. I’ve left out the real URL here as the Google Shopping feed I’m parsing is commercially sensitive.

FEED_URL = 'https://yourdomain.com/google_shopping.xml'
xml = get_feed(FEED_URL)
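
If you want to sanity-check what came back before going any further, you can preview a short slice of the parsed source. This is just an optional check, assuming the request above succeeded:

print(xml.prettify()[:500])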

Extract the XML elements in each item

To parse the XML feed we could simply define each element we want to extract, for example the title, and write specific code to handle each element. However, this is a bit repetitive and will only work on feeds that have an identical schema. Instead, we’ll identify the elements inside the XML feed so we can dynamically extract them, whatever the feed contains.
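
To illustrate, here’s a minimal sketch of that element-specific approach, using a simplified, made-up single-item feed. Every field needs its own find() call, which quickly becomes tedious on a real feed with dozens of elements:

sample_xml = """
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <g:id>ABC1234</g:id>
      <title>Dell Precision 7770</title>
      <g:price>2999.99 GBP</g:price>
    </item>
  </channel>
</rss>
"""

# Parse the sample and pull out each element by hand
soup = BeautifulSoup(sample_xml, 'lxml-xml')

for product in soup.find_all('item'):
    print(product.find('id').text, product.find('title').text, product.find('price').text)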

The get_elements() function will do this for us. We’ll pass in the xml variable holding the XML source and define the name of the parent element. In a Google Shopping feed this is called item, so I’ve set that as the default value. Google Shopping feeds use RSS 2.0 with the Google namespace http://base.google.com/ns/1.0.

If you look at the XML of a Google Shopping feed you’ll notice that many of the elements have a g: prefix, such as g:id. This is because they belong to the Google namespace defined in the feed. When we use the find_all() function to extract the element names it will return them all, but with the namespace prefix stripped. The prefix isn’t needed to parse the elements, so there’s no need to worry about the difference.
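
If you want to see this for yourself, you can print the prefix and name of each element in the first item. This is just a quick optional check using the xml variable returned by get_feed() above:

first_item = xml.find('item')

for element in first_item.find_all():
    # .prefix holds the namespace prefix (e.g. 'g') and .name holds the local name (e.g. 'id')
    print(element.prefix, element.name)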

def get_elements(xml, item='item'):
    """Returns the names of the child elements within the first parent item.

    Args:
        xml (BeautifulSoup): Parsed XML source returned by get_feed().
        item (string): Name of the parent element. Defaults to 'item'.

    Returns:
        elements (list): Names of the child elements found.
    """

    try:
        items = xml.find_all(item)
        elements = [element.name for element in items[0].find_all()]
        return elements
    except Exception as e:
        print('Error: ' + str(e))

elements = get_elements(xml)
elements
['id',
 'title',
 'description',
 'google_product_category',
 'product_type',
 'link',
 'image_link',
 'additional_image_link',
 'additional_image_link',
 'condition',
 'availability',
 'price',
 'sale_price',
 'sale_price_effective_date',
 'brand',
 'gtin',
 'mpn',
 'identifier_exists',
 'gender',
 'age_group',
 'size',
 'item_group_id',
 'color',
 'is_bundle',
 'material',
 'pattern',
 'shipping_weight',
 'multipack',
 'adult',
 'adwords_grouping',
 'adwords_labels',
 'adwords_redirect',
 'unit_pricing_measure',
 'unit_pricing_base_measure',
 'energy_efficiency_class',
 'online_only']
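
You may notice that additional_image_link appears twice in the list above, because the first item in this feed contains two additional images, so the resulting dataframe will also contain a duplicated column. If you’d prefer unique column names, one optional tweak is to de-duplicate the list while preserving its order (you’d need to apply the same step inside feed_to_df below for it to affect the dataframe):

unique_elements = list(dict.fromkeys(elements))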

Parse an XML feed into a Pandas dataframe

Now, we’ll create a function called feed_to_df(), pass it the URL of our XML feed, and define the parent element we want to extract, which is called item in the Google Shopping XML feed. We’ll then use get_feed() to fetch the raw XML and get_elements() to extract the list of XML element names.

We’ll then use if isinstance(elements, typing.List) to check that the elements variable contains a list of XML element names. If it does, we’ll use it to define a column for each element when we build the Pandas dataframe with pd.DataFrame(rows, columns=elements), storing a row for each item we find in the feed.

We’ll use the find_all() function from Beautiful Soup to parse the XML and return all the item elements, then loop over each one. For each item, we’ll look up each element name from our list, parse its value (or use an empty string if the item doesn’t contain it), store it in the row dictionary, and append each row to the rows list used to build the dataframe.

def feed_to_df(url, item='item'):
    """Parses an XML feed into a Pandas dataframe, one row per parent item.

    Args:
        url (string): Fully qualified URL pointing to XML feed.
        item (string): Name of the parent element. Defaults to 'item'.

    Returns:
        df (dataframe): Pandas dataframe containing one column per element.
    """

    xml = get_feed(url)
    elements = get_elements(xml)

    if isinstance(elements, typing.List):
        items = xml.find_all(item)
        rows = []

        for item in items:
            row = {}
            for element in elements:
                # Store the element's text if the item contains it, otherwise an empty string
                found = item.find(element)
                row[element] = found.text if found else ''
            rows.append(row)

        df = pd.DataFrame(rows, columns=elements)
        return df

If you run the feed_to_df() function you’ll see that the code fetches and parses the XML feed, dynamically generates a Pandas dataframe, puts each item in a row and each element in a column and populates the values. If the feed changes, or if you want to parse an XML feed with a different schema, there’s no need to rewrite the code.

df = feed_to_df(FEED_URL)
df.head(1).T
0
id ABC1234
title Dell Precision 7770
description Dell Precision 7770 workstation
google_product_category
product_type
link https://www...
image_link https://www...
additional_image_link https://www...
additional_image_link https://www...
condition new
availability out of stock
price 2999.99 GBP
sale_price 2499.99 GBP
sale_price_effective_date 2021-01-01T00:00+0100/2099-01-01T00:00+0100
brand Dell
gtin
mpn AHBSI87388
identifier_exists TRUE
gender
age_group
size
item_group_id AHBSI87388
color
is_bundle false
material
pattern
shipping_weight
multipack
adult
adwords_grouping
adwords_labels
adwords_redirect
unit_pricing_measure
unit_pricing_base_measure
energy_efficiency_class
online_only

Finally, you can write the dataframe to a CSV file so the data can be re-used or shared.

df.to_csv('current_feed.csv', index=False)
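
If you want to confirm the export, you can read the CSV straight back into a new dataframe. This is just an optional check using the filename from the line above:

pd.read_csv('current_feed.csv').head()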

Matt Clarke, Saturday, December 11, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.