When scraping websites, and when checking how well a site is configured for crawling, it pays to carefully check and parse the site’s robots.txt file. This file, which should be stored at the document root of every web server, contains various directives and parameters which instruct bots, spiders, and crawlers what they can and cannot view.
In this project, we’ll use the web scraping tools urllib and BeautifulSoup to fetch and parse a robots.txt file, extract the sitemap URLs from within, and write the included directives and parameters to a Pandas dataframe. Whenever you’re scraping a site, you should really be viewing the robots.txt file and adhering to the directives set.
You can also examine the directives to check that you’re not inadvertently blocking bots from accessing key parts of your site that you want search engines to index. Here’s how it’s done.
To get started, open a new Python script or Jupyter notebook and import the packages below. We’ll be using Pandas for storing the data from our robots.txt, urllib to grab the content, and BeautifulSoup for parsing. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.
import pandas as pd
from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
Next, we’ll create a little function to grab the content of a robots.txt file. I’ve used a combination of urllib.request.urlopen and urllib.request.Request for this, as we need to pass through a User-Agent string to many sites in order to avoid getting a 403 error from the server. Once we get back our response object, we’ll pass it to BeautifulSoup, use the correct character encoding, and return the output in a variable called soup.
def get_page(url):
    """Scrapes a URL and returns the HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (string): HTML source of scraped page.
    """

    # Pass a User-Agent header to avoid a 403 from servers that
    # block the default urllib user agent
    response = urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
    return soup
robots = get_page("https://moz.com/robots.txt")
robots
Sitemap: https://moz.com/sitemaps-1-sitemap.xml
Sitemap: https://moz.com/blog-sitemap.xml
User-agent: *
Allow: /researchtools/ose/$
Allow: /researchtools/ose/dotbot$
Allow: /researchtools/ose/links$
Allow: /researchtools/ose/just-discovered$
Allow: /researchtools/ose/pages$
Allow: /researchtools/ose/domains$
Allow: /researchtools/ose/anchors$
Allow: /products/
Allow: /local/
Allow: /learn/
Allow: /researchtools/ose/
Allow: /researchtools/ose/dotbot$
Disallow: /products/content/
Disallow: /local/enterprise/confirm
Disallow: /researchtools/ose/
Disallow: /page-strength/*
Disallow: /thumbs/*
Disallow: /api/user?*
Disallow: /checkout/freetrial/*
Disallow: /local/search/
Disallow: /local/details/
Disallow: /messages/
Disallow: /content/audit/*
Disallow: /content/search/*
Disallow: /marketplace/
Disallow: /cpresources/
Disallow: /vendor/
Disallow: /community/q/questions/*/view_counts
Disallow: /admin-preview/*
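As an aside, if you only need to answer the practical question of whether a given path is blocked for a given user agent, Python’s standard library includes urllib.robotparser. Here’s a minimal sketch that parses a hypothetical in-memory subset of the rules above, rather than fetching a live file:

```python
from urllib import robotparser

# Hypothetical subset of the rules above, parsed from memory so no
# network request is needed (robotparser can also fetch a live file
# via set_url() and read())
rules = """
User-agent: *
Disallow: /marketplace/
Disallow: /vendor/
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch('*', '/marketplace/'))  # False - explicitly disallowed
print(rp.can_fetch('*', '/products/'))     # True - no matching rule
```

Note that the stdlib parser checks rules by simple prefix matching, so wildcard patterns like those in the Moz file above may not behave as they would for Googlebot.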
One common thing you may want to do is find the locations of any XML sitemaps on a site. These are generally stated in the robots.txt file, if they don’t exist at the default path of /sitemap.xml. The function below scans each line in the robots.txt to find the lines that start with the Sitemap: declaration, and adds each one to a list.
def get_sitemaps(robots):
    """Parse a robots.txt file and return a Python list containing any sitemap URLs found.

    Args:
        robots (string): Contents of robots.txt file.

    Returns:
        data (list): List containing each sitemap found.
    """

    data = []
    lines = str(robots).splitlines()
    for line in lines:
        if line.startswith('Sitemap:'):
            # Split on the first colon only, so the URL's own
            # colon (https://) stays intact
            split = line.split(':', maxsplit=1)
            data.append(split[1].strip())
    return data
sitemaps = get_sitemaps(robots)
sitemaps
['https://moz.com/sitemaps-1-sitemap.xml', 'https://moz.com/blog-sitemap.xml']
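Each of these sitemap URLs can then be fetched and parsed for the page URLs it contains. As an offline illustration, here’s a sketch using the standard library’s xml.etree.ElementTree on a hypothetical sitemap snippet (real sitemap files use the namespace shown):

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content; in practice this would come from
# fetching one of the URLs returned by get_sitemaps()
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
</urlset>"""

# Sitemap tags live in this namespace, so findall() needs it
# to locate the <loc> elements holding each page URL
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall('.//sm:loc', ns)]
print(urls)  # ['https://example.com/', 'https://example.com/blog/']
```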
Next, we can use the same technique to loop over all the lines in the robots.txt and add each directive and parameter to a column of a Pandas dataframe. This will give you all the data in an easy-to-parse format. However, as robots.txt files can be somewhat freestyle, the resulting dataframe is not always perfectly readable.
def to_dataframe(robots):
    """Parses robots.txt file contents into a Pandas dataframe.

    Args:
        robots (string): Contents of robots.txt file.

    Returns:
        df (dataframe): Pandas dataframe containing robots.txt directives and parameters.
    """

    data = []
    lines = str(robots).splitlines()
    for line in lines:
        # Skip blank lines and comments
        if line.strip():
            if not line.startswith('#'):
                split = line.split(':', maxsplit=1)
                data.append([split[0].strip(), split[1].strip()])
    return pd.DataFrame(data, columns=['directive', 'parameter'])
df = to_dataframe(robots)
df.head(10)
| | directive | parameter |
|---|---|---|
| 0 | Sitemap | https://moz.com/sitemaps-1-sitemap.xml |
| 1 | Sitemap | https://moz.com/blog-sitemap.xml |
| 2 | User-agent | * |
| 3 | Allow | /researchtools/ose/$ |
| 4 | Allow | /researchtools/ose/dotbot$ |
| 5 | Allow | /researchtools/ose/links$ |
| 6 | Allow | /researchtools/ose/just-discovered$ |
| 7 | Allow | /researchtools/ose/pages$ |
| 8 | Allow | /researchtools/ose/domains$ |
| 9 | Allow | /researchtools/ose/anchors$ |
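With the directives in a dataframe, standard Pandas filtering makes review easy. For example, to pull out just the Disallow rules and check nothing important is blocked, sketched here with a hypothetical subset of the rows above:

```python
import pandas as pd

# Hypothetical subset of the parsed robots.txt rows shown above
df = pd.DataFrame([
    ['Sitemap', 'https://moz.com/sitemaps-1-sitemap.xml'],
    ['User-agent', '*'],
    ['Allow', '/products/'],
    ['Disallow', '/marketplace/'],
    ['Disallow', '/vendor/'],
], columns=['directive', 'parameter'])

# Filter to the Disallow directives to review exactly what is blocked
blocked = df[df['directive'] == 'Disallow']['parameter'].tolist()
print(blocked)  # ['/marketplace/', '/vendor/']
```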
Matt Clarke, Friday, March 12, 2021