How to check if URLs are redirected using Requests

Learn how to use the Python Requests library to check the status code of a URL and identify whether it's being redirected to a different URL.

The requests HTTP library for Python allows you to make HTTP requests to servers and receive back HTTP status codes, site content, and other data. It’s extremely useful for building web scraping projects and is one of the most widely used Python SEO tools.

One particularly useful feature of requests for those who work in SEO, or support an SEO team using their data science skills, is that the package can return HTTP status codes and redirect information for URLs. This can be extremely useful during SEO site migrations.

What is a site migration?

A site migration is the technical name SEOs use to refer to the re-mapping of URLs from an old site to a new one, or from an old domain to a new one. During an SEO site migration, your SEO team will identify all the URLs from the old site and all the URLs from the new site, then create a URL redirection plan so that anyone visiting an old URL is redirected to a new one, without hitting a 404.

The dataset of URLs from the old site is harvested from a range of data sources, including site crawls, Google Analytics, Google Search Console, and SEO tools such as Ahrefs. It will include not only the current URLs on the site, present in the sitemap and crawl data, but also URLs that have appeared in the various systems over time.

Once the two datasets have been created, fuzzy matching (usually via fuzzywuzzy or, more recently, polyfuzz) can be applied to find the closest matching URL on the new site for each URL on the old site. This means that anyone who lands on an old URL for, say, /how-to-keep-a-hamster will be redirected to the closest matching URL on the new site, even if it’s slightly different, e.g. /everything-you-need-to-know-to-keep-a-hamster.
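Libraries like fuzzywuzzy and polyfuzz offer more sophisticated matching, but the core idea can be sketched with Python’s built-in difflib; the URLs below are hypothetical examples, not part of the original tutorial.

```python
from difflib import get_close_matches

# Hypothetical URLs from the old and new sites
old_urls = ['/how-to-keep-a-hamster', '/hamster-cages', '/contact-us']
new_urls = [
    '/everything-you-need-to-know-to-keep-a-hamster',
    '/hamster-cages-and-accessories',
    '/contact',
]

# For each old URL, find the closest-matching new URL
redirect_map = {}
for old_url in old_urls:
    matches = get_close_matches(old_url, new_urls, n=1, cutoff=0.3)
    redirect_map[old_url] = matches[0] if matches else None

print(redirect_map)
```

In a real migration you’d tune the similarity cutoff to your URL structures and manually review low-confidence matches before shipping the redirect plan.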

Why detect redirects during a site migration?

If you simply generate a list of all the old URLs and map each one to the closest URL on the new site, you’ll potentially be ignoring existing redirects that site administrators may already have put in place. For example, if the product at /super-widget-x1000 has been discontinued, your site admins might have replaced it with /mega-tool-x2000 and created a redirect. Fuzzy matching would ignore this and send the user somewhere potentially less relevant.

Relying on fuzzy matching alone could therefore redirect traffic to a less relevant page while ignoring a human-selected alternative. Next, I’ll show you how to use Requests to crawl a list of URLs and identify whether they’re redirected, and to where. This gives you extra data to ensure more accurate redirections during your site migration.

Load the packages

First, open a Jupyter notebook and import the requests and pandas packages. We’ll use Pandas to load and manipulate the data on the URLs and create an output dataframe that we can save to CSV, and we’ll use requests for checking each URL.

import requests
import pandas as pd

Load the data

Next, load up the list of URLs you want to check. I’ve included a simple list below, but you’ll probably have yours in a CSV file. If that’s the case, you can extract the column and save it to a list using code like this: urls = df['url'].tolist(). That will take the url column in your dataframe and return a list of values to check.

urls = [
    'https://bbc.co.uk/iplayer',
    'https://facebook.com/',
    'http://www.theguardian.co.uk',
    'https://practicaldatascience.co.uk'
]
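If your URLs are in a CSV file instead, the extraction described above might look like this. This is a sketch in which the url column name and the file contents are assumptions, and io.StringIO stands in for your actual file:

```python
import io

import pandas as pd

# Stand-in for your file; in practice use pd.read_csv('your-urls.csv')
csv_file = io.StringIO(
    "url\n"
    "https://example.com/page-one\n"
    "https://example.com/page-two\n"
    "https://example.com/page-one\n"
)
df = pd.read_csv(csv_file)

# Drop blanks and duplicates before checking
urls = df['url'].dropna().drop_duplicates().tolist()
print(urls)
```

Dropping blanks and duplicates first saves wasted requests, since migration URL lists harvested from multiple sources usually overlap.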

Loop over the URLs and check for redirects

Now that we have our list of URLs to check, we’ll create a for loop to check each one using requests. Before doing this, we’ll create an empty Pandas dataframe called df_output in which we’ll store the original URL and its HTTP status code (i.e. 200 for OK, 301 for a permanent redirect, 302 for a temporary redirect, or 404 for page not found). If a redirect is detected, we’ll also record the destination URL and the destination’s HTTP status code.

We’ll pass each URL to requests and make a GET request, passing in a user agent to help ensure the server returns a response (some servers won’t return one if you don’t provide a user agent string). Then we’ll create an empty dictionary and capture the values.

If the response object from requests contains a value in history, that means we have a redirect, so we can fetch the url and status_code of each step and store them. If there’s no history, no redirect was found, and we can simply store the original URL and status code alongside some blank values.

df_output = pd.DataFrame(columns=['original_url', 'original_status', 'destination_url', 'destination_status'])

for url in urls:
    
    response = requests.get(url, headers={'User-Agent': 'Google Chrome'})
    row = {}
    
    if response.history:
        # note: with multiple redirect hops, this keeps the last hop
        for step in response.history:
            row['original_url'] = step.url
            row['original_status'] = step.status_code
        row['destination_url'] = response.url
        row['destination_status'] = response.status_code        
    else:
        row['original_url'] = response.url
        row['original_status'] = response.status_code 
        row['destination_url'] = ''
        row['destination_status'] = ''
        
    print(row)
    
    # append() was removed in pandas 2.0, so use pd.concat() instead
    df_output = pd.concat([df_output, pd.DataFrame([row])], ignore_index=True)

{'original_url': 'https://bbc.co.uk/iplayer', 'original_status': 301, 'destination_url': 'https://www.bbc.co.uk/iplayer', 'destination_status': 200}
{'original_url': 'https://facebook.com/', 'original_status': 301, 'destination_url': 'https://www.facebook.com/', 'destination_status': 200}
{'original_url': 'https://www.theguardian.com/', 'original_status': 302, 'destination_url': 'https://www.theguardian.com/uk', 'destination_status': 200}
{'original_url': 'https://practicaldatascience.co.uk/', 'original_status': 200, 'destination_url': '', 'destination_status': ''}

I’ve printed out each row dictionary so I can monitor progress, then added each row to the df_output dataframe. Finally, we can view df_output to see the HTTP status code and redirect destination for each URL in our list.

df_output.head()
   original_url                         original_status  destination_url                 destination_status
0  https://bbc.co.uk/iplayer            301              https://www.bbc.co.uk/iplayer   200
1  https://facebook.com/                301              https://www.facebook.com/       200
2  https://www.theguardian.com/         302              https://www.theguardian.com/uk  200
3  https://practicaldatascience.co.uk/  200
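One caveat: real migration lists often contain dead or slow URLs, and a single connection error or timeout will crash the loop above. Here’s a sketch of a defensive wrapper; the check_url function and its error-label convention are my own, not part of requests.

```python
import requests

def check_url(url, timeout=10):
    """Check one URL, recording request errors instead of raising."""
    row = {'original_url': url, 'original_status': '',
           'destination_url': '', 'destination_status': ''}
    try:
        response = requests.get(url,
                                headers={'User-Agent': 'Google Chrome'},
                                timeout=timeout)
    except requests.exceptions.RequestException as e:
        # Covers timeouts, connection failures, invalid URLs, etc.
        row['original_status'] = type(e).__name__
        return row

    if response.history:
        # Record the first hop's status and the final destination
        row['original_status'] = response.history[0].status_code
        row['destination_url'] = response.url
        row['destination_status'] = response.status_code
    else:
        row['original_status'] = response.status_code
    return row
```

Each URL then yields a row dict whether or not the request succeeds, so the loop can carry on and you can review the failures in the output alongside the redirects.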

Check your sitemap URLs for redirects

Another useful application of this technique is to check your sitemap URLs for redirects. You’d typically expect none of the pages listed in your XML sitemap to redirect to another page; the sitemap should contain the final destination URLs.

The easiest way to do this is via my EcommerceTools package, which you can install by entering pip3 install --upgrade ecommercetools in your terminal (prefix the command with ! if you’re running it in a Jupyter notebook).

By running the seo.get_sitemap() function and passing in the URL of your XML sitemap you can create a Pandas dataframe containing all your site URLs and then check whether they redirect. If they do, you need to remove the redirects so the pages can be reached.

import requests
import pandas as pd
from ecommercetools import seo

df = seo.get_sitemap('https://www.example.com/sitemap.xml')
urls = df['loc'].tolist()

df_output = pd.DataFrame(columns=['original_url', 'original_status', 'destination_url', 'destination_status'])

for url in urls:
    
    response = requests.get(url, headers={'User-Agent': 'Google Chrome'})
    row = {}
    
    if response.history:
        # note: with multiple redirect hops, this keeps the last hop
        for step in response.history:
            row['original_url'] = step.url
            row['original_status'] = step.status_code
        row['destination_url'] = response.url
        row['destination_status'] = response.status_code        
    else:
        row['original_url'] = response.url
        row['original_status'] = response.status_code 
        row['destination_url'] = ''
        row['destination_status'] = ''
        
    print(row)
    
    # append() was removed in pandas 2.0, so use pd.concat() instead
    df_output = pd.concat([df_output, pd.DataFrame([row])], ignore_index=True)

df_output
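Finally, as mentioned at the start, you’ll probably want to share the results with your SEO team as a CSV. A minimal sketch, in which the filename and the example row are my own stand-ins for the df_output built above:

```python
import pandas as pd

# Hypothetical results; in practice this is the df_output built above
df_output = pd.DataFrame([
    {'original_url': 'https://bbc.co.uk/iplayer', 'original_status': 301,
     'destination_url': 'https://www.bbc.co.uk/iplayer', 'destination_status': 200},
])

# Save the redirect report without the index column
df_output.to_csv('redirect-check.csv', index=False)
```

Passing index=False keeps the row numbers out of the file, so the CSV contains only the four columns your team needs.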

Matt Clarke, Wednesday, February 02, 2022

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
