XML sitemaps are designed to make life easier for search engines by providing an index of a site’s URLs. However, they’re also a useful tool in competitor analysis and allow you to quickly identify all of a site’s pages, and the level of importance the site assigns to each page.
In this web scraping project, we’ll use Python’s urllib package to fetch XML sitemaps, parse the underlying XML using Beautiful Soup’s lxml parser, and read the contents into a Pandas dataframe, so you can analyse the content of every page on a site. Here’s how it’s done.
First, open a Jupyter notebook and import the pandas, urllib.request, urllib.parse, and bs4 packages. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal (note that bs4 is distributed on PyPI as beautifulsoup4).
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
The first step is to create a simple function to fetch the raw XML of the sitemap. We’ll create a function called get_sitemap() to which we’ll pass the URL of the remote sitemap.xml file. We’ll pass this URL to urllib.request.urlopen() and store the HTTPResponse object returned.
Next, we’ll pass that response object to BeautifulSoup(), and we’ll set the parser to lxml-xml so it handles the XML source better. Finally, we’ll pass the character set information from response.info().get_param('charset') to from_encoding so the file is read correctly.
def get_sitemap(url):
    """Scrape an XML sitemap from the provided URL and return the parsed XML.

    Args:
        url (string): Fully qualified URL pointing to XML sitemap.

    Returns:
        xml (BeautifulSoup): Parsed XML source of the scraped sitemap.
    """

    response = urllib.request.urlopen(url)
    xml = BeautifulSoup(response,
                        'lxml-xml',
                        from_encoding=response.info().get_param('charset'))
    return xml
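One practical note: some servers reject requests carrying urllib’s default Python-urllib user agent. If get_sitemap() returns an HTTP error on a site you can open in a browser, a variant that sends an explicit User-Agent header often helps. This is a sketch of my own, not part of the original tutorial, and the helper name and header value are illustrative:

```python
import urllib.request

from bs4 import BeautifulSoup


def get_sitemap_with_headers(url, user_agent='Mozilla/5.0 (compatible; sitemap-scraper)'):
    """Fetch and parse an XML sitemap, sending an explicit User-Agent
    header, since some servers block urllib's default one."""

    # Build a Request object so we can attach custom headers
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    return BeautifulSoup(response,
                         'lxml-xml',
                         from_encoding=response.info().get_param('charset'))
```

The rest of the workflow is unchanged; only the fetching step differs.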
Now we can pass in a fully qualified URL pointing to the XML sitemap we want to fetch. As you’ll see from the site I’ve chosen, the xml returned is the raw source of the page and includes links to a number of other child sitemaps.
XML sitemaps are usually (but not always) called sitemap.xml and are located at the site root (i.e. /sitemap.xml). However, if you don’t find the sitemap, check the robots.txt file at /robots.txt, which should provide the alternate address if one has been used.
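That robots.txt check can itself be automated: the Sitemap: directive may appear multiple times, once per sitemap. Here’s a minimal sketch; the helper name and the sample robots.txt content are my own, and for a live site you’d fetch the file first:

```python
import urllib.request


def get_sitemaps_from_robots(robots_txt):
    """Extract sitemap URLs from the text of a robots.txt file.

    The Sitemap directive is case-insensitive and may appear more
    than once, so we collect every match.
    """

    sitemaps = []
    for line in robots_txt.splitlines():
        if line.lower().startswith('sitemap:'):
            # Split on the first colon only, so the URL's own
            # colons (https://...) are left intact
            sitemaps.append(line.split(':', 1)[1].strip())
    return sitemaps


# Example with an inline robots.txt; for a live site you would fetch
# the file first, e.g.:
#   robots_txt = urllib.request.urlopen(url).read().decode('utf-8')
robots_txt = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml"""

get_sitemaps_from_robots(robots_txt)
```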
url = "https://themarket.co.uk/sitemap.xml"
xml = get_sitemap(url)
xml
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://themarket.co.uk/themarket.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://themarket.co.uk/finished.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://themarket.co.uk/live.xml</loc>
<lastmod>2021-01-20T14:00:10+00:00</lastmod>
</sitemap>
</sitemapindex>
There are two main types of XML sitemap: the sitemapindex (shown above), which includes links to child sitemaps, and the urlset, which includes direct links to all the underlying pages. Since there are no page URLs in the sitemapindex sitemap, we need another function to determine the sitemap type, so we can parse it accordingly.
def get_sitemap_type(xml):
    """Parse XML source and return the type of sitemap.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.

    Returns:
        sitemap_type (string): Type of sitemap (sitemapindex, urlset, or None).
    """

    sitemapindex = xml.find_all('sitemapindex')
    sitemap = xml.find_all('urlset')

    if sitemapindex:
        return 'sitemapindex'
    elif sitemap:
        return 'urlset'
    else:
        return
sitemap_type = get_sitemap_type(xml)
sitemap_type
'sitemapindex'
If we detect that the sitemap is of the sitemapindex type, we need another bit of code to fetch the URLs of the underlying child sitemaps. We can do that by using find_all() to detect all of the sitemap elements, and then append() the loc text element to a list.
def get_child_sitemaps(xml):
    """Return a list of child sitemaps present in an XML sitemap file.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.

    Returns:
        sitemaps (list): Python list of XML sitemap URLs.
    """

    sitemaps = xml.find_all("sitemap")

    output = []
    for sitemap in sitemaps:
        output.append(sitemap.find_next("loc").text)
    return output
child_sitemaps = get_child_sitemaps(xml)
child_sitemaps
['https://themarket.co.uk/themarket.xml',
'https://themarket.co.uk/finished.xml',
'https://themarket.co.uk/live.xml']
Finally, we can create a function called sitemap_to_dataframe() to parse the sitemap.xml file and return all of the url elements using find_all(). By looping over these we can then extract the loc (holding the URL), the changefreq indicating how frequently the page typically changes, its priority, and the domain from which the URL was scraped.
def sitemap_to_dataframe(xml, name=None, verbose=False):
    """Read an XML sitemap into a Pandas dataframe.

    Args:
        xml (BeautifulSoup): Parsed XML source of sitemap.
        name (optional): Optional name for sitemap parsed.
        verbose (boolean, optional): Set to True to monitor progress.

    Returns:
        dataframe: Pandas dataframe of XML sitemap content.
    """

    urls = xml.find_all("url")
    rows = []

    for url in urls:
        # Search within the current url element, not the whole document
        loc_tag = url.find("loc")
        changefreq_tag = url.find("changefreq")
        priority_tag = url.find("priority")

        loc = loc_tag.text if loc_tag is not None else ''
        domain = urlparse(loc).netloc if loc else ''

        row = {
            'domain': domain,
            'loc': loc,
            'changefreq': changefreq_tag.text if changefreq_tag is not None else '',
            'priority': priority_tag.text if priority_tag is not None else '',
            'sitemap_name': name if name else '',
        }

        if verbose:
            print(row)

        rows.append(row)

    # DataFrame.append() was removed in pandas 2.0, so we collect the
    # rows in a list and build the dataframe once at the end
    return pd.DataFrame(rows, columns=['loc', 'changefreq', 'priority',
                                       'domain', 'sitemap_name'])
Running the code on one of the URLs in the original sitemap returns a Pandas dataframe containing all the data we need to analyse this site.
url_finished = "https://themarket.co.uk/finished.xml"
xml_finished = get_sitemap(url_finished)
df = sitemap_to_dataframe(xml_finished, name='finished.xml', verbose=False)
df.head()
| | loc | changefreq | priority | domain |
|---|---|---|---|---|
| 0 | https://themarket.co.uk/listings/100-ot/cheris... | daily | 0.8 | themarket.co.uk |
| 1 | https://themarket.co.uk/listings/3-geo/cherish... | daily | 0.8 | themarket.co.uk |
| 2 | https://themarket.co.uk/listings/abarth/695c-e... | daily | 0.8 | themarket.co.uk |
| 3 | https://themarket.co.uk/listings/ac/buckland/7... | daily | 0.8 | themarket.co.uk |
| 4 | https://themarket.co.uk/listings/ac/cobra-dax-... | daily | 0.8 | themarket.co.uk |
df.shape
(1139, 4)
One final function wraps everything up. This takes a single sitemap URL, retrieves the XML source, parses it to determine the sitemap type, obtains the URLs of any child sitemaps, then loops over the sitemaps, extracts their contents, and returns a single dataframe.
def get_all_urls(url):
    """Return a dataframe containing all of the URLs from a site's XML sitemaps.

    Args:
        url (string): URL of site's XML sitemap. Usually located at /sitemap.xml

    Returns:
        df (dataframe): Pandas dataframe containing all sitemap content.
    """

    xml = get_sitemap(url)
    sitemap_type = get_sitemap_type(xml)

    if sitemap_type == 'sitemapindex':
        sitemaps = get_child_sitemaps(xml)
    else:
        sitemaps = [url]

    df = pd.DataFrame(columns=['loc', 'changefreq', 'priority', 'domain', 'sitemap_name'])

    for sitemap in sitemaps:
        sitemap_xml = get_sitemap(sitemap)
        df_sitemap = sitemap_to_dataframe(sitemap_xml, name=sitemap)
        df = pd.concat([df, df_sitemap], ignore_index=True)

    return df
df = get_all_urls(url)
df.head()
| | loc | changefreq | priority | domain | sitemap_name |
|---|---|---|---|---|---|
| 0 | https://themarket.co.uk | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 1 | https://themarket.co.uk/ | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 2 | https://themarket.co.uk/auctions/coming-soon | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 3 | https://themarket.co.uk/auctions/live | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
| 4 | https://themarket.co.uk/auctions/no-reserve | daily | 0.8 | themarket.co.uk | https://themarket.co.uk/themarket.xml |
df.sitemap_name.value_counts()
https://themarket.co.uk/finished.xml 1139
https://themarket.co.uk/live.xml 27
https://themarket.co.uk/themarket.xml 14
Name: sitemap_name, dtype: int64
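With everything in one dataframe, the usual Pandas tools apply. As one example of the kind of analysis this enables, you might count URLs by their first path segment to see how a site’s content is distributed. This is a sketch of my own; the sample rows below are made up for illustration, but the column names match the scraped dataframe:

```python
import pandas as pd

from urllib.parse import urlparse

# Hypothetical sample rows in the same shape as the scraped dataframe
df = pd.DataFrame({
    'loc': [
        'https://example.com/listings/abarth/695',
        'https://example.com/listings/ac/cobra',
        'https://example.com/auctions/live',
    ],
    'priority': ['0.8', '0.8', '0.6'],
})

# Extract the first path segment of each URL, e.g. 'listings' or 'auctions'
df['section'] = df['loc'].apply(
    lambda loc: urlparse(loc).path.strip('/').split('/')[0])

df['section'].value_counts()
# listings    2
# auctions    1
```

The same approach works on the real dataframe returned by get_all_urls(), since it also carries a loc column.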
Matt Clarke, Friday, March 12, 2021