How to read an XML feed into a Pandas dataframe

Learn how to create an XML feed parser that will read your Google Shopping feed (or any other XML file) into a Pandas dataframe whatever it contains.

XML feeds are a data format that uses Extensible Markup Language to provide structured data that can be read by search engines and online advertising providers. For example, a Google Shopping feed uses XML to provide structured data on the products a retailer sells so Google can use them to serve ads and create price comparisons.

Google Shopping feeds are critical to the performance of most ecommerce retail businesses. They need to be correctly formatted and contain all the relevant data Google requires to serve ads on Google Shopping and elsewhere. Google sets strict criteria on what Google Shopping feeds should include so ecommerce retailers need to check them regularly to ensure everything is present and correct.

The disadvantage of using XML for Google Shopping feeds is that it is much harder for humans to read than a tabular format such as CSV. While most technical people quickly pick up how to read a simple markup language like XML, reading feeds at scale and checking the data they contain is still cumbersome.

In this project I’ll show you how you can build an XML feed parser that reads an XML feed, such as your Google Shopping feed, and parses it into a neatly formatted Pandas dataframe that you can analyse or write to a CSV file. Not only is it perfect for Google Shopping, but it will also work on almost any other XML feed you throw at it. Here’s how it works.

Install the packages

First, fire up a new Jupyter notebook. For this project we’ll be using Pandas, a few core Python libraries, and the Beautiful Soup HTML and XML parser library. The standard Python packages will already be installed, and you’ll likely already have Pandas installed. However, you can install anything you don’t have by entering the below commands into a cell in a Jupyter notebook.

!pip3 install pandas
!pip3 install bs4
!pip3 install lxml
Requirement already satisfied: pandas in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (1.1.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from pandas) (1.19.4)
Requirement already satisfied: six>=1.5 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: bs4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (0.0.1)
Requirement already satisfied: beautifulsoup4 in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from bs4) (4.9.3)
Requirement already satisfied: soupsieve>1.2; python_version >= "3.0" in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (from beautifulsoup4->bs4) (2.0.1)
Requirement already satisfied: lxml in /conda/envs/data-science-stack-2.5.1/lib/python3.7/site-packages (4.6.4)

Import the packages

Now you have the required packages installed, you’ll need to import them using the commands below. To make the Pandas dataframe a little easier to read, I’ve also added a set_option() call to increase the maximum number of rows displayed.

import typing
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 10000)

Get the XML feed source

Next, we need to write a little function to fetch the XML source of the remote XML feed we want to parse. The easiest way to do this is via urllib.request from the Python standard library. Depending on your server configuration, you may need to define a dictionary of headers that includes a User-Agent value, otherwise the server might reject your request.

Once we have the response back, we can use Beautiful Soup to parse what we find. We’ll use the lxml-xml XML parser for this. If you don’t have it installed, you’ll get an error that says “Error: Couldn’t find a tree builder with the features you requested: lxml-xml. Do you need to install a parser library?”. If you missed the step above, you can install it on an Ubuntu Linux data science workstation by entering pip3 install lxml.

def get_feed(url):
    """Scrapes an XML feed from the provided URL and returns XML source.
    
    Args:
        url (string): Fully qualified URL pointing to XML feed.

    Returns:
        source (string): XML source of scraped feed.
    """
    
    try:
        response = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': 'Mozilla'}))
        source = BeautifulSoup(response, 'lxml-xml', from_encoding=response.info().get_param('charset'))
        return source
    except Exception as e:
        print('Error: ' + str(e))

Next, enter the URL of the remote XML feed you want to download, pass it to the get_feed() function, and assign the result to a variable called xml. After a few seconds, depending on the size of the XML file you’re downloading, you’ll get back the raw XML of the feed. I’ve left out the real URL here as the Google Shopping feed I’m parsing is commercially sensitive.

FEED_URL = 'https://yourdomain.com/google_shopping.xml'
xml = get_feed(FEED_URL)
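
If you want to sanity-check what came back before going any further, you can preview a short slice of the parsed source. This is just an optional check, assuming the request above succeeded:

print(xml.prettify()[:500])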

Extract the XML elements in each item

To parse the XML feed we could simply define each element we want to extract, for example the title, and write specific code to handle each element. However, this is a bit repetitive and will only work on feeds that have an identical schema. Instead, we’ll identify the elements inside the XML feed so we can dynamically extract them, whatever the feed contains.
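
To illustrate, here’s a minimal sketch of that element-specific approach, using a simplified, made-up single-item feed. Every field needs its own find() call, which quickly becomes tedious on a real feed with dozens of elements:

sample_xml = """
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <g:id>ABC1234</g:id>
      <title>Dell Precision 7770</title>
      <g:price>2999.99 GBP</g:price>
    </item>
  </channel>
</rss>
"""

# Parse the sample and pull out each element by hand
soup = BeautifulSoup(sample_xml, 'lxml-xml')

for product in soup.find_all('item'):
    print(product.find('id').text, product.find('title').text, product.find('price').text)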

The get_elements() function will do this for us. We’ll pass in the xml variable holding the XML source and define the name of the parent element. In a Google Shopping feed this is called item, so I’ve set that as the default value. Google Shopping feeds use RSS 2.0 with the Google namespace http://base.google.com/ns/1.0.

If you look at the XML of a Google Shopping feed you’ll notice that many of the elements have a g: prefix, such as g:id. This is because they belong to the Google namespace defined in the feed. When we use the find_all() function to extract the element names it will return them all, but with the namespace prefix stripped. The prefix isn’t needed to parse the elements, so there’s no need to worry about the difference.
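
If you want to see this for yourself, you can print the prefix and name of each element in the first item. This is just a quick optional check using the xml variable returned by get_feed() above:

first_item = xml.find('item')

for element in first_item.find_all():
    # .prefix holds the namespace prefix (e.g. 'g') and .name holds the local name (e.g. 'id')
    print(element.prefix, element.name)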

def get_elements(xml, item='item'):
    """Returns the names of the child elements within the first parent item.

    Args:
        xml (BeautifulSoup): Parsed XML source returned by get_feed().
        item (string): Name of the parent element. Defaults to 'item'.

    Returns:
        elements (list): Names of the child elements found.
    """

    try:
        items = xml.find_all(item)
        elements = [element.name for element in items[0].find_all()]
        return elements
    except Exception as e:
        print('Error: ' + str(e))

elements = get_elements(xml)
elements
['id',
 'title',
 'description',
 'google_product_category',
 'product_type',
 'link',
 'image_link',
 'additional_image_link',
 'additional_image_link',
 'condition',
 'availability',
 'price',
 'sale_price',
 'sale_price_effective_date',
 'brand',
 'gtin',
 'mpn',
 'identifier_exists',
 'gender',
 'age_group',
 'size',
 'item_group_id',
 'color',
 'is_bundle',
 'material',
 'pattern',
 'shipping_weight',
 'multipack',
 'adult',
 'adwords_grouping',
 'adwords_labels',
 'adwords_redirect',
 'unit_pricing_measure',
 'unit_pricing_base_measure',
 'energy_efficiency_class',
 'online_only']
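
You may notice that additional_image_link appears twice in the list above, because the first item in this feed contains two additional images, so the resulting dataframe will also contain a duplicated column. If you’d prefer unique column names, one optional tweak is to de-duplicate the list while preserving its order (you’d need to apply the same step inside feed_to_df below for it to affect the dataframe):

unique_elements = list(dict.fromkeys(elements))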

Parse an XML feed into a Pandas dataframe

Now, we’ll create a function called feed_to_df(), pass it the URL of our XML feed, and define the parent element we want to extract, which is called item in the Google Shopping XML feed. We’ll then use get_feed() to fetch the raw XML and get_elements() to extract the list of XML element names.

We’ll then use if isinstance(elements, typing.List) to check that the elements variable contains a list of XML element names. If it does, we’ll use it to define a column for each element when we build the Pandas dataframe with pd.DataFrame(rows, columns=elements), storing a row for each item we find in the feed.

We’ll use the find_all() function from Beautiful Soup to parse the XML and return all the item elements, then loop over each one. For each item, we’ll look up each element name from our list, parse its value (or use an empty string if the item doesn’t contain it), store it in the row dictionary, and append each row to the rows list used to build the dataframe.

def feed_to_df(url, item='item'):
    """Parses an XML feed into a Pandas dataframe, one row per parent item.

    Args:
        url (string): Fully qualified URL pointing to XML feed.
        item (string): Name of the parent element. Defaults to 'item'.

    Returns:
        df (dataframe): Pandas dataframe containing one column per element.
    """

    xml = get_feed(url)
    elements = get_elements(xml)

    if isinstance(elements, typing.List):
        items = xml.find_all(item)
        rows = []

        for item in items:
            row = {}
            for element in elements:
                # Store the element's text if the item contains it, otherwise an empty string
                found = item.find(element)
                row[element] = found.text if found else ''
            rows.append(row)

        df = pd.DataFrame(rows, columns=elements)
        return df

If you run the feed_to_df() function you’ll see that the code fetches and parses the XML feed, dynamically generates a Pandas dataframe, puts each item in a row and each element in a column and populates the values. If the feed changes, or if you want to parse an XML feed with a different schema, there’s no need to rewrite the code.

df = feed_to_df(FEED_URL)
df.head(1).T
0
id ABC1234
title Dell Precision 7770
description Dell Precision 7770 workstation
google_product_category
product_type
link https://www...
image_link https://www...
additional_image_link https://www...
additional_image_link https://www...
condition new
availability out of stock
price 2999.99 GBP
sale_price 2499.99 GBP
sale_price_effective_date 2021-01-01T00:00+0100/2099-01-01T00:00+0100
brand Dell
gtin
mpn AHBSI87388
identifier_exists TRUE
gender
age_group
size
item_group_id AHBSI87388
color
is_bundle false
material
pattern
shipping_weight
multipack
adult
adwords_grouping
adwords_labels
adwords_redirect
unit_pricing_measure
unit_pricing_base_measure
energy_efficiency_class
online_only

Finally, you can write the dataframe to a CSV file so the data can be re-used or shared.

df.to_csv('current_feed.csv', index=False)
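
If you want to confirm the export, you can read the CSV straight back into a new dataframe. This is just an optional check using the filename from the line above:

pd.read_csv('current_feed.csv').head()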

Matt Clarke, Saturday, December 11, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.