When scraping websites, and when checking how well a site is configured for crawling, it pays to carefully check and parse the site’s robots.txt file. This file, which should be stored at the document root of every web server, contains various directives and parameters which instruct bots, spiders, and crawlers what they can and cannot view.
In this project, we’ll use the web scraping tools urllib and BeautifulSoup to fetch and parse a robots.txt file, extract the sitemap URLs from within, and write the included directives and parameters to a Pandas dataframe. Whenever you’re scraping a site, you should really be viewing the robots.txt file and adhering to the directives set.
You can also examine the directives to check that you’re not inadvertently blocking bots from accessing key parts of your site that you want search engines to index. Here’s how it’s done.
To get started, open a new Python script or Jupyter notebook and import the packages below. We’ll be using Pandas for storing the data from our robots.txt, urllib to grab the content, and BeautifulSoup for parsing. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.
import pandas as pd
from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
Next, we’ll create a little function to grab the content of a robots.txt file. I’ve used a combination of urllib.request.urlopen and urllib.request.Request for this, as we need to pass through a User-Agent string to many sites in order to avoid getting a 403 error from the server. Once we get back our response object, we’ll pass it to BeautifulSoup, use the correct character encoding, and return the output in a variable called soup.
def get_page(url):
    """Scrapes a URL and returns the HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (string): HTML source of scraped page.
    """

    # Pass a User-Agent header to avoid a 403 from servers that
    # block the default urllib user agent
    response = urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
    return soup
robots = get_page("https://moz.com/robots.txt")
robots
Sitemap: https://moz.com/sitemaps-1-sitemap.xml
Sitemap: https://moz.com/blog-sitemap.xml
User-agent: *
Allow: /researchtools/ose/$
Allow: /researchtools/ose/dotbot$
Allow: /researchtools/ose/links$
Allow: /researchtools/ose/just-discovered$
Allow: /researchtools/ose/pages$
Allow: /researchtools/ose/domains$
Allow: /researchtools/ose/anchors$
Allow: /products/
Allow: /local/
Allow: /learn/
Allow: /researchtools/ose/
Allow: /researchtools/ose/dotbot$
Disallow: /products/content/
Disallow: /local/enterprise/confirm
Disallow: /researchtools/ose/
Disallow: /page-strength/*
Disallow: /thumbs/*
Disallow: /api/user?*
Disallow: /checkout/freetrial/*
Disallow: /local/search/
Disallow: /local/details/
Disallow: /messages/
Disallow: /content/audit/*
Disallow: /content/search/*
Disallow: /marketplace/
Disallow: /cpresources/
Disallow: /vendor/
Disallow: /community/q/questions/*/view_counts
Disallow: /admin-preview/*
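As an aside, if you only need to answer the practical question of whether a given path is blocked for a given user agent, Python’s standard library includes urllib.robotparser. Here’s a minimal sketch that parses a hypothetical in-memory subset of the rules above, rather than fetching a live file:

```python
from urllib import robotparser

# Hypothetical subset of the rules above, parsed from memory so no
# network request is needed (robotparser can also fetch a live file
# via set_url() and read())
rules = """
User-agent: *
Disallow: /marketplace/
Disallow: /vendor/
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('*', '/marketplace/'))  # False - explicitly disallowed
print(rp.can_fetch('*', '/products/'))     # True - no matching rule
```

Note that the stdlib parser checks rules by simple prefix matching, so wildcard patterns like those in the Moz file above may not behave as they would for Googlebot.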
One common thing you may want to do is find the locations of any XML sitemaps on a site. These are generally stated in the robots.txt file, if they don’t exist at the default path of /sitemap.xml. The function below scans each line in the robots.txt to find the lines that start with the Sitemap: declaration, and adds each one to a list.
def get_sitemaps(robots):
    """Parse a robots.txt file and return a Python list containing any sitemap URLs found.

    Args:
        robots (string): Contents of robots.txt file.

    Returns:
        data (list): List containing each sitemap found.
    """

    data = []
    lines = str(robots).splitlines()
    for line in lines:
        if line.startswith('Sitemap:'):
            # Split on the first colon only, so the URL's own
            # colon (https://) stays intact
            split = line.split(':', maxsplit=1)
            data.append(split[1].strip())
    return data
sitemaps = get_sitemaps(robots)
sitemaps
['https://moz.com/sitemaps-1-sitemap.xml', 'https://moz.com/blog-sitemap.xml']
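Each of these sitemap URLs can then be fetched and parsed for the page URLs it contains. As an offline illustration, here’s a sketch using the standard library’s xml.etree.ElementTree on a hypothetical sitemap snippet (real sitemap files use the namespace shown):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; in practice this would come from
# fetching one of the URLs returned by get_sitemaps()
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

# Sitemap tags live in this namespace, so findall() needs it
# to locate the <loc> elements holding each page URL
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(urls)  # ['https://example.com/', 'https://example.com/blog/']
```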
Next, we can use the same technique to loop over all the lines in the robots.txt and add each directive and parameter to a column of a Pandas dataframe. This will give you all the data in an easy-to-parse format. However, as robots.txt files can be somewhat freestyle, the resulting dataframe is not always perfectly readable.
def to_dataframe(robots):
    """Parses robots.txt file contents into a Pandas dataframe.

    Args:
        robots (string): Contents of robots.txt file.

    Returns:
        df (dataframe): Pandas dataframe containing robots.txt directives and parameters.
    """

    data = []
    lines = str(robots).splitlines()
    for line in lines:
        # Skip blank lines and comments
        if line.strip():
            if not line.startswith('#'):
                split = line.split(':', maxsplit=1)
                data.append([split[0].strip(), split[1].strip()])
    return pd.DataFrame(data, columns=['directive', 'parameter'])
df = to_dataframe(robots)
df.head(10)
| | directive | parameter |
|---|---|---|
| 0 | Sitemap | https://moz.com/sitemaps-1-sitemap.xml |
| 1 | Sitemap | https://moz.com/blog-sitemap.xml |
| 2 | User-agent | * |
| 3 | Allow | /researchtools/ose/$ |
| 4 | Allow | /researchtools/ose/dotbot$ |
| 5 | Allow | /researchtools/ose/links$ |
| 6 | Allow | /researchtools/ose/just-discovered$ |
| 7 | Allow | /researchtools/ose/pages$ |
| 8 | Allow | /researchtools/ose/domains$ |
| 9 | Allow | /researchtools/ose/anchors$ |
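With the directives in a dataframe, standard Pandas filtering makes review easy. For example, to pull out just the Disallow rules and check nothing important is blocked, sketched here with a hypothetical subset of the rows above:

```python
import pandas as pd

# Hypothetical subset of the parsed robots.txt rows shown above
df = pd.DataFrame([
    ['Sitemap', 'https://moz.com/sitemaps-1-sitemap.xml'],
    ['User-agent', '*'],
    ['Allow', '/products/'],
    ['Disallow', '/marketplace/'],
    ['Disallow', '/vendor/'],
], columns=['directive', 'parameter'])

# Filter to the Disallow directives to review exactly what is blocked
blocked = df[df['directive'] == 'Disallow']['parameter'].tolist()
print(blocked)  # ['/marketplace/', '/vendor/']
```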
Matt Clarke, Friday, March 12, 2021