How to scan a site for 404 errors and 301 redirect chains

404 errors and 301 redirect chains can be damaging to the performance of a website and impact the user experience. Here’s how to find them with Python.


Both 404 page not found errors and 301 redirect chains can be costly and damaging to the performance of a website. They’re both easy to introduce, especially on ecommerce sites where content and URLs are constantly being changed, and can impact paid search costs, organic search rankings, and the user experience.

Fortunately, both 404 errors and 301 redirect chains can be identified through regular site scans and rectified by fixing broken links or adjusting redirects so that as few hops as possible are used. Here’s a quick guide to creating a web scraping script to check your site for 404s and redirect issues, so you can keep it running smoothly.

Load the packages

First, open a Python script or a Jupyter notebook and import the requests and pandas packages. Neither is part of the Python standard library, so if you don’t have them already you can install them by entering pip3 install requests pandas in your terminal.

import requests
import pandas as pd

Load the data

Next load up a dataframe containing all the URLs of the site you want to scan. Check out my guide to parsing XML sitemaps using Python for details on how you can create a dataset like this to use.

df = pd.read_csv('sitemap.csv')
df_urls = df[['loc']]
df_urls.head()
loc
0 https://themarket.co.uk
1 https://themarket.co.uk/
2 https://themarket.co.uk/auctions/coming-soon
3 https://themarket.co.uk/auctions/live
4 https://themarket.co.uk/auctions/no-reserve

Scan the pages in the sitemap

First, we’ll create an empty list in which to store our results. Then we’ll iterate over each row of our dataframe of URLs using the iterrows() function. For each row, we’ll take the URL stored in row['loc'] and use requests to fetch the page with get().

get() returns a Response object that includes the status_code (e.g. 200, 301, or 404) and the history, which holds details of any redirect chain that was followed to reach the page. We’ll grab these two values, write them into a dictionary called page, and append each dictionary to our list. Once the loop has finished, we’ll turn the list into a df_pages dataframe.

pages = []

for index, row in df_urls.iterrows():

    # Fetch the page and capture the final status code and any redirect history
    response = requests.get(row['loc'])

    page = {
        'url': row['loc'],
        'status_code': response.status_code,
        'history': response.history
    }

    pages.append(page)

df_pages = pd.DataFrame(pages)

After a few minutes, the code will have looped through all the URLs and fetched and stored the data, which we can view and manipulate using Pandas. Here are the first few rows.

df_pages.head()
url status_code history
0 https://themarket.co.uk 200 []
1 https://themarket.co.uk/ 200 []
2 https://themarket.co.uk/auctions/coming-soon 200 []
3 https://themarket.co.uk/auctions/live 200 []
4 https://themarket.co.uk/auctions/no-reserve 200 []
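
The history column is empty for these pages because none of them redirected. When a URL does redirect, requests follows the chain and records each intermediate response in history, which is how redirect chains show up in the data. Here’s a quick illustrative sketch, using the plain HTTP version of a site purely as an example of a URL that is likely to issue a 301 to HTTPS:

response = requests.get('http://github.com')

response.status_code                  # 200 - the status of the final page
response.history                      # e.g. [<Response [301]>] - one entry per redirect hop
[r.url for r in response.history]     # the URLs that redirected along the way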

We can save the output of our crawl using to_csv() to write the dataframe to a CSV file, then use the Pandas value_counts() function to see which status codes were returned. All was good on this crawl: every one of the 1180 pages checked returned a 200 (OK) status code.

df_pages.to_csv('checks.csv', index=False)
df_pages.status_code.value_counts()
200    1180
Name: status_code, dtype: int64
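
If a crawl does turn up problems, the same dataframe makes them easy to isolate. Here’s a minimal sketch of how you might filter it; the column names are the ones we created above, and the output filenames are just examples:

# Pages that didn't return a 200, including any 404s
df_errors = df_pages[df_pages['status_code'] != 200]

# Pages reached via one or more redirects
df_redirects = df_pages[df_pages['history'].apply(len) > 0]

# Redirect chains - pages that took more than one hop to resolve
df_chains = df_pages[df_pages['history'].apply(len) > 1]

df_errors.to_csv('errors.csv', index=False)
df_chains.to_csv('redirect_chains.csv', index=False)

Fixing the internal links that point at the 404s, and updating redirecting links so they point straight at their final destination, will clear most of these issues.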

Matt Clarke, Friday, March 12, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.