Both 404 page not found errors and 301 redirect chains can be costly and damaging to the performance of a website. They’re both easy to introduce, especially on ecommerce sites where content and URLs are constantly being changed, and can impact paid search costs, organic search rankings, and the user experience.
Fortunately, both 404 errors and 301 redirect chains can be identified through regular site scans and rectified by fixing broken links or adjusting redirects, so the absolute minimum number are used. Here’s a quick guide to creating a web scraping script to check your site for 404s and redirect issues, so you can keep it running smoothly.
First, open a Python script or a Jupyter notebook and import the `requests` and `pandas` packages. Neither is part of the Python standard library, so if you don't have them already you can install both by entering `pip3 install requests pandas` in your terminal.
import requests
import pandas as pd
Next, load up a dataframe containing all the URLs of the site you want to scan. Check out my guide to parsing XML sitemaps using Python for details on how you can create a dataset like this to use.
df = pd.read_csv('sitemap.csv')
df_urls = df[['loc']]
df_urls.head()
|   | loc |
|---|-----|
| 0 | https://themarket.co.uk |
| 1 | https://themarket.co.uk/ |
| 2 | https://themarket.co.uk/auctions/coming-soon |
| 3 | https://themarket.co.uk/auctions/live |
| 4 | https://themarket.co.uk/auctions/no-reserve |
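As a minimal sketch of how a dataset like this might be built, the standard library's `xml.etree.ElementTree` can pull the `<loc>` values out of a sitemap and write them to the CSV used below. The inline XML fragment here is an illustrative example, not the real sitemap:

```python
import xml.etree.ElementTree as ET

import pandas as pd

# A fragment in the standard sitemap format, for illustration only
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Sitemaps live in their own XML namespace, so register it for findall()
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap_xml)

# Collect every <loc> value into a dataframe matching the one above
locs = [loc.text for loc in root.findall('.//sm:loc', ns)]
df_urls = pd.DataFrame({'loc': locs})
df_urls.to_csv('sitemap.csv', index=False)
```

In practice you would fetch the live sitemap with `requests` and pass `response.text` to `fromstring()` instead of a hard-coded string.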
First, we’ll create an empty Pandas dataframe in which to store our results. Then we’ll iterate over each row of our dataframe of URLs using the `iterrows()` function. For each row, we’ll take the URL stored in `row['loc']` and use `requests` to fetch the page with `get()`.

The `get()` call returns a response object that includes the `status_code` (i.e. 404 or 301) and the `history`, which contains details on the redirect chain if there is one. We’ll grab each of these two values and write them into a dictionary called `page`, then add each one to our `df_pages` dataframe.
df_pages = pd.DataFrame(columns=['url', 'status_code', 'history'])

for index, row in df.iterrows():
    # Fetch each URL and record its status code and redirect history
    response = requests.get(row['loc'])
    page = {
        'url': row['loc'],
        'status_code': response.status_code,
        'history': response.history,
    }
    # DataFrame.append() was removed in pandas 2.0, so use pd.concat()
    df_pages = pd.concat([df_pages, pd.DataFrame([page])], ignore_index=True)
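One caveat: `requests.get()` will wait indefinitely on a slow page and will raise an exception if a connection fails outright, which would stop the loop partway through a large site. A hedged sketch of a safer fetch, wrapped in a helper (the `check_url` name is my own, not from the original code):

```python
import requests

def check_url(url, timeout=10):
    """Fetch a URL, returning its status code and redirect history.

    status_code is None if the request fails entirely (DNS error,
    refused connection, timeout), so the crawl can carry on.
    """
    try:
        response = requests.get(url, timeout=timeout)
        return {'url': url,
                'status_code': response.status_code,
                'history': response.history}
    except requests.RequestException:
        return {'url': url, 'status_code': None, 'history': []}
```

Calling `check_url(row['loc'])` inside the loop in place of the bare `get()` keeps a single dead URL from killing the whole scan; any rows with a `status_code` of `None` can then be retried later.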
After a few minutes, the code will have looped through all the URLs and fetched and stored the data, which we can view and manipulate using Pandas. Here are the first few rows.
df_pages.head()
|   | url | status_code | history |
|---|-----|-------------|---------|
| 0 | https://themarket.co.uk | 200 | [] |
| 1 | https://themarket.co.uk/ | 200 | [] |
| 2 | https://themarket.co.uk/auctions/coming-soon | 200 | [] |
| 3 | https://themarket.co.uk/auctions/live | 200 | [] |
| 4 | https://themarket.co.uk/auctions/no-reserve | 200 | [] |
We can save the output of our crawl using `to_csv()` to write the dataframe to a CSV file, then use the Pandas `value_counts()` function to see which status codes were returned. All was good on this crawl, with every one of the 1180 pages checked returning the 200 (OK) status code.
df_pages.to_csv('checks.csv', index=False)
df_pages.status_code.value_counts()
200 1180
Name: status_code, dtype: int64
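When a crawl does turn up problems, the same dataframe makes them easy to isolate: filter on `status_code` for 404s, and on the length of `history` for redirect chains. A sketch using made-up example rows (in real output `history` holds `Response` objects rather than the bare status codes used here as stand-ins):

```python
import pandas as pd

# Example rows standing in for real crawl output
df_pages = pd.DataFrame([
    {'url': 'https://example.com/', 'status_code': 200, 'history': []},
    {'url': 'https://example.com/old', 'status_code': 200,
     'history': [301, 301]},  # two hops before reaching the final page
    {'url': 'https://example.com/gone', 'status_code': 404, 'history': []},
])

# Pages that returned an error status
errors = df_pages[df_pages['status_code'] != 200]

# Pages reached via more than one redirect, i.e. a redirect chain
chains = df_pages[df_pages['history'].apply(len) > 1]
```

Here `errors` picks out the 404 page and `chains` the chained redirect, giving you a ready-made to-do list of broken links to fix and redirects to flatten.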
Matt Clarke, Friday, March 12, 2021