XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs. However, they’re also a useful tool in competitor analysis and allow you to quickly identify all of a site’s pages, and the level of importance the site assigns to each page.
In this web scraping project, we’ll use Python’s urllib package to fetch XML sitemaps, parse the underlying XML using Beautiful Soup’s lxml parser, and read the contents into a Pandas dataframe, so you can analyse the content of every page on a site. Here’s how it’s done.
First, open a Jupyter notebook and import the pandas, urllib.request, urllib.parse, and bs4 packages. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal (note that bs4 is distributed on PyPI as beautifulsoup4).
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
The first step is to create a simple function to fetch the raw XML of the sitemap. We’ll create a function called get_sitemap() to which we’ll pass the URL of the remote sitemap.xml file. We’ll pass this URL to urllib.request.urlopen() and store the HTTPResponse object returned.
Next, we’ll pass that response object to BeautifulSoup(), and we’ll set the parser to lxml-xml so it handles the XML source better. Finally, we’ll pass the character set information from response.info().get_param('charset') to from_encoding so the file is read correctly.
def get_sitemap(url):
    """Scrape an XML sitemap from the provided URL and return the parsed XML.

    Args:
        url (string): Fully qualified URL pointing to XML sitemap.

    Returns:
        xml (BeautifulSoup): Parsed XML source of the scraped sitemap.
    """

    response = urllib.request.urlopen(url)
    xml = BeautifulSoup(response,
                        'lxml-xml',
                        from_encoding=response.info().get_param('charset'))
    return xml
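One practical note: some servers reject requests carrying urllib’s default Python-urllib user agent. If get_sitemap() returns an HTTP error on a site you can open in a browser, a variant that sends an explicit User-Agent header often helps. This is a sketch of my own, not part of the original tutorial, and the helper name and header value are illustrative:

```python
import urllib.request

from bs4 import BeautifulSoup


def get_sitemap_with_headers(url, user_agent='Mozilla/5.0 (compatible; sitemap-scraper)'):
    """Fetch and parse an XML sitemap, sending an explicit User-Agent
    header, since some servers block urllib's default one."""

    # Build a Request object so we can attach custom headers
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    return BeautifulSoup(response,
                         'lxml-xml',
                         from_encoding=response.info().get_param('charset'))
```

The rest of the workflow is unchanged; only the fetching step differs.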
Now we can pass in a fully qualified URL pointing to the XML sitemap we want to fetch. As you’ll see from the site I’ve chosen, the xml returned is the raw source of the page and includes links to a number of other child sitemaps.
XML sitemaps are usually (but not always) called sitemap.xml and are located at the site root (i.e. /sitemap.xml). However, if you don’t find the sitemap, check the robots.txt file at /robots.txt, which should provide the alternate address if one has been used.
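That robots.txt check can itself be automated: the Sitemap: directive may appear multiple times, once per sitemap. Here’s a minimal sketch; the helper name and the sample robots.txt content are my own, and for a live site you’d fetch the file first:

```python
import urllib.request


def get_sitemaps_from_robots(robots_txt):
    """Extract sitemap URLs from the text of a robots.txt file.

    The Sitemap directive is case-insensitive and may appear more
    than once, so we collect every match.
    """

    sitemaps = []
    for line in robots_txt.splitlines():
        if line.lower().startswith('sitemap:'):
            # Split on the first colon only, so the URL's own
            # colons (https://...) are left intact
            sitemaps.append(line.split(':', 1)[1].strip())
    return sitemaps


# Example with an inline robots.txt; for a live site you would fetch
# the file first, e.g.:
#   robots_txt = urllib.request.urlopen(url).read().decode('utf-8')
robots_txt = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml"""

get_sitemaps_from_robots(robots_txt)
```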
url = "https://themarket.co.uk/sitemap.xml"
xml = get_sitemap(url)
xml
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://themarket.co.uk/themarket.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://themarket.co.uk/finished.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://themarket.co.uk/live.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
</sitemapindex>
There are two main types of XML sitemap: the sitemapindex (shown above), which includes links to child sitemaps, and the urlset, which includes direct links to all the underlying pages. Since there are no page URLs in the sitemapindex sitemap, we need another function to determine the sitemap type, so we can parse it accordingly.
def get_sitemap_type(xml):
    """Parse XML source and return the type of sitemap.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.

    Returns:
        sitemap_type (string): Type of sitemap (sitemapindex, urlset, or None).
    """

    sitemapindex = xml.find_all('sitemapindex')
    sitemap = xml.find_all('urlset')

    if sitemapindex:
        return 'sitemapindex'
    elif sitemap:
        return 'urlset'
    else:
        return
sitemap_type = get_sitemap_type(xml)
sitemap_type
'sitemapindex'
If we detect that the sitemap is of the sitemapindex type, we need another bit of code to fetch the URLs of the underlying child sitemaps. We can do that by using find_all() to detect all of the sitemap elements, and then append() the loc text element to a list.
def get_child_sitemaps(xml):
    """Return a list of child sitemaps present in an XML sitemap file.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.

    Returns:
        sitemaps (list): Python list of XML sitemap URLs.
    """

    sitemaps = xml.find_all("sitemap")

    output = []
    for sitemap in sitemaps:
        output.append(sitemap.find_next("loc").text)
    return output
child_sitemaps = get_child_sitemaps(xml)
child_sitemaps
['https://themarket.co.uk/themarket.xml',
'https://themarket.co.uk/finished.xml',
'https://themarket.co.uk/live.xml']
Finally, we can create a function called sitemap_to_dataframe() to parse the sitemap.xml file and return all of the url elements using find_all(). By looping over these we can then extract the loc (holding the URL), the changefreq indicating how frequently the page typically changes, its priority, and the domain from which the URL was scraped.
def sitemap_to_dataframe(xml, name=None, verbose=False):
    """Read an XML sitemap into a Pandas dataframe.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.
        name (optional): Optional name for sitemap parsed.
        verbose (boolean, optional): Set to True to monitor progress.

    Returns:
        dataframe: Pandas dataframe of XML sitemap content.
    """

    urls = xml.find_all("url")
    rows = []

    for url in urls:
        # Search within the current url element, not the whole document
        loc_tag = url.find("loc")
        changefreq_tag = url.find("changefreq")
        priority_tag = url.find("priority")

        loc = loc_tag.text if loc_tag is not None else ''
        domain = urlparse(loc).netloc if loc else ''

        row = {
            'domain': domain,
            'loc': loc,
            'changefreq': changefreq_tag.text if changefreq_tag is not None else '',
            'priority': priority_tag.text if priority_tag is not None else '',
            'sitemap_name': name if name else '',
        }

        if verbose:
            print(row)

        rows.append(row)

    # DataFrame.append() was removed in pandas 2.0, so we collect the
    # rows in a list and build the dataframe once at the end
    return pd.DataFrame(rows, columns=['loc', 'changefreq', 'priority',
                                       'domain', 'sitemap_name'])
Running the code on one of the URLs in the original sitemap returns a Pandas dataframe containing all the data we need to analyse this site.
url_finished = "https://themarket.co.uk/finished.xml"
xml_finished = get_sitemap(url_finished)
df = sitemap_to_dataframe(xml_finished, name='finished.xml', verbose=False)
df.head()
| | loc | changefreq | priority | domain |
|---|---|---|---|---|
| 0 | https://themarket.co.uk/listings/100-ot/cheris... | daily | 0.8 | themarket.co.uk |
| 1 | https://themarket.co.uk/listings/3-geo/cherish... | daily | 0.8 | themarket.co.uk |
| 2 | https://themarket.co.uk/listings/abarth/695c-e... | daily | 0.8 | themarket.co.uk |
| 3 | https://themarket.co.uk/listings/ac/buckland/7... | daily | 0.8 | themarket.co.uk |
| 4 | https://themarket.co.uk/listings/ac/cobra-dax-... | daily | 0.8 | themarket.co.uk |
df.shape
(1139, 4)
One final function wraps everything up. This takes a single sitemap URL, retrieves the XML source, parses it to determine the sitemap type, obtains the URLs of any child sitemaps, then loops over the sitemaps, extracts their contents, and returns a single dataframe.
def get_all_urls(url):
    """Return a dataframe containing all of the URLs from a site's XML sitemaps.

    Args:
        url (string): URL of site's XML sitemap. Usually located at /sitemap.xml

    Returns:
        df (dataframe): Pandas dataframe containing all sitemap content.
    """

    xml = get_sitemap(url)
    sitemap_type = get_sitemap_type(xml)

    if sitemap_type == 'sitemapindex':
        sitemaps = get_child_sitemaps(xml)
    else:
        sitemaps = [url]

    df = pd.DataFrame(columns=['loc', 'changefreq', 'priority', 'domain', 'sitemap_name'])

    for sitemap in sitemaps:
        sitemap_xml = get_sitemap(sitemap)
        df_sitemap = sitemap_to_dataframe(sitemap_xml, name=sitemap)
        df = pd.concat([df, df_sitemap], ignore_index=True)

    return df
df = get_all_urls(url)
df.head()
| | loc | changefreq | priority | domain | sitemap_name |
|---|---|---|---|---|---|
| 0 | https://themarket.co.uk | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 1 | https://themarket.co.uk/ | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 2 | https://themarket.co.uk/auctions/coming-soon | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 3 | https://themarket.co.uk/auctions/live | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 4 | https://themarket.co.uk/auctions/no-reserve | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
df.sitemap_name.value_counts()
https://themarket.co.uk/finished.xml 1139
https://themarket.co.uk/live.xml 27
https://themarket.co.uk/themarket.xml 14
Name: sitemap_name, dtype: int64
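With everything in one dataframe, the usual Pandas tools apply. As one example of the kind of analysis this enables, you might count URLs by their first path segment to see how a site’s content is distributed. This is a sketch of my own; the sample rows below are made up for illustration, but the column names match the scraped dataframe:

```python
import pandas as pd

from urllib.parse import urlparse

# Hypothetical sample rows in the same shape as the scraped dataframe
df = pd.DataFrame({
    'loc': [
        'https://example.com/listings/abarth/695',
        'https://example.com/listings/ac/cobra',
        'https://example.com/auctions/live',
    ],
    'priority': ['0.8', '0.8', '0.6'],
})

# Extract the first path segment of each URL, e.g. 'listings' or 'auctions'
df['section'] = df['loc'].apply(
    lambda loc: urlparse(loc).path.strip('/').split('/')[0])

df['section'].value_counts()
# listings    2
# auctions    1
```

The same approach works on the real dataframe returned by get_all_urls(), since it also carries a loc column.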
Matt Clarke, Friday, March 12, 2021