If your site’s pages aren’t indexed by Google, you’re obviously not going to generate any traffic to them, so you’ll want to check that everything you expect to be present actually appears in Google’s index and can drive new visitors to your site.
Similarly, having too many pages indexed can also be a problem. Search engines sometimes index far more pages than we expect, causing index bloat. Here, paginated content, or pages from your faceted search system, gets indexed, inflating the number of pages in the index.
In this project, we’ll use Python to solve this technical SEO problem. We’ll fetch the pages from our sitemap and compare them to the pages present in our Google Search Console data, to identify what’s missing and what shouldn’t be present. Here’s how it’s done.
Open a Jupyter notebook and import the packages below. We'll be using google.oauth2 for authenticating with the Google Search Console API and googleapiclient for accessing the data, requests and requests_html to fetch and parse our sitemap, and Pandas for data manipulation.
from google.oauth2 import service_account
from googleapiclient.discovery import build
import requests
import json
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
Our first step is to create a dataframe containing the URLs of every page in our sitemap. These represent what we think we should have in the Google search index. The below function takes the URL of our sitemap and returns its XML source code using the Requests-HTML package. Check out my full guide to Requests HTML for more details on using this.
def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)
Next, we’ll create another function called scrape_sitemap(). This takes the URL of our sitemap, fetches the source code using get_source(), and parses the XML to extract the loc elements containing the URL of each page. At the end, it spits out a Pandas dataframe containing the URL of every page in the sitemap.
def scrape_sitemap(url):
    """Scrape the contents of an XML sitemap and return the contents in a dataframe.

    Args:
        url (string): Absolute URL of urlset XML sitemap.

    Returns:
        df (dataframe): Pandas dataframe containing sitemap contents.
    """

    response = get_source(url)
    urls = response.html.find("loc", first=False)

    # Build the rows first, then construct the dataframe in one go;
    # DataFrame.append() is deprecated and removed in recent versions of Pandas.
    rows = [{'page': url.text} for url in urls]
    df = pd.DataFrame(rows, columns=['page'])
    return df
Running the scrape_sitemap() function only takes a second on a small site. We’ll save the output to a Pandas dataframe called df_sitemap and back up the data to a CSV called sitemap.csv so we can use it elsewhere if we need to.
df_sitemap = scrape_sitemap("http://flyandlure.org/sitemap.xml")
df_sitemap.to_csv("sitemap.csv", index=False)
df_sitemap.tail(10)
 | page |
---|---|
1664 | http://flyandlure.org/listings/places_to_fly_fish/wales/swansea/paper_mill_fishery |
1665 | http://flyandlure.org/listings/places_to_fly_fish/wales/swansea/shimano_felindre_big_fish_water |
1666 | http://flyandlure.org/listings/places_to_fly_fish/wales/torfaen/llandegfedd_reservoir |
1667 | http://flyandlure.org/listings/places_to_fly_fish/wales/vale_of_glamorgan/dyffryn_springs_ |
1668 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/chirk_fishery |
1669 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/llandegla_trout_fishery |
1670 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/penycae_lower_reservoir |
1671 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/penycae_upper_reservoir |
1672 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/tree_tops_fly_fishery |
1673 | http://flyandlure.org/listings/places_to_fly_fish/wales/wrexham/ty_mawr_reservoir |
Now that we’ve got the URLs of the pages we expect to be indexed, we can use the Google Search Console API to find out which pages actually generate impressions. Impressions aren’t perfectly correlated with what has been indexed, but based on my data, they look extremely close.
To create our connection to the API, we’ll create a function to pass our JSON client secrets key to the Google authentication service and get back the required service object.
def connect(key):
    """Create a connection to the Google Search Console API and return service object.

    Args:
        key (string): Google Search Console JSON client secrets path.

    Returns:
        service (object): Google Search Console service object.
    """

    scope = ['https://www.googleapis.com/auth/webmasters']
    credentials = service_account.Credentials.from_service_account_file(key,
                                                                        scopes=scope)
    service = build(
        'webmasters',
        'v3',
        credentials=credentials
    )

    return service
key = 'google-search-console.json'
service = connect(key)
Now we’re connected, we can create a function to run a search query on Google Search Console. Using this, we can create a dictionary called payload containing the API search query parameters, and pass it to the API along with the service object and the site_url of the GSC property we want to query.

The function will return a Pandas dataframe containing the search results, with the dimensions used placed in named columns so we can access them easily. You can re-use this function for any query, not just the one we’re writing below.
def query(service, site_url, payload):
    """Run a query on the Google Search Console API and return a dataframe of results.

    Args:
        service (object): Service object from connect()
        site_url (string): URL of Google Search Console property
        payload (dict): API query payload dictionary

    Returns:
        df (dataframe): Pandas dataframe containing requested data.
    """

    response = service.searchanalytics().query(siteUrl=site_url, body=payload).execute()

    results = []
    for row in response['rows']:
        data = {}

        for i in range(len(payload['dimensions'])):
            data[payload['dimensions'][i]] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)
        data['position'] = round(row['position'], 2)

        results.append(data)

    return pd.DataFrame.from_dict(results)
We can now create a payload dictionary containing our search query and pass it to the GSC API. I’ve set this to return all the metrics for 2020 grouped by the page or URL, and have told it to query my fly fishing website property in Search Console. We get back the clicks, impressions, ctr and average position for each page.
payload = {
    'startDate': "2020-01-01",
    'endDate': "2020-12-31",
    'dimensions': ["page"],
    'rowLimit': 10000,
    'startRow': 0
}
site_url = "http://flyandlure.org"
df_gsc = query(service, site_url, payload)
df_gsc.head()
 | page | clicks | impressions | ctr | position |
---|---|---|---|---|---|
0 | http://flyandlure.org/articles/fly_fishing/still_water_trout_flies_how_and_when_to_fish_them | 4713 | 69956 | 6.74 | 18.40 |
1 | http://flyandlure.org/articles/fly_fishing/how_to_fly_fish_with_a_strike_indicator | 3344 | 75859 | 4.41 | 10.93 |
2 | http://flyandlure.org/articles/fly_tying/7_of_the_best_boobies_youll_ever_see | 2994 | 47521 | 6.30 | 9.01 |
3 | http://flyandlure.org/articles/fly_fishing/how_to_fly_fish_with_droppers | 2930 | 59244 | 4.95 | 16.06 |
4 | http://flyandlure.org/articles/fly_fishing/how_to_fish_the_blob_fly | 2796 | 27285 | 10.25 | 6.25 |
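One thing to bear in mind: the payload above caps the response at 10,000 rows. For larger properties you’d need to page through the results by incrementing startRow until the API returns fewer rows than you asked for. My site doesn’t need this, but a minimal sketch of the paging logic, using a hypothetical fetch_page function in place of a real API call, looks like this:

```python
def fetch_all_rows(fetch_page, page_size=10000):
    """Collect every row from a paged API by incrementing the start row."""
    rows = []
    start_row = 0
    while True:
        batch = fetch_page(start_row, page_size)
        rows.extend(batch)
        if len(batch) < page_size:
            break  # a short (or empty) batch means we've reached the end
        start_row += page_size
    return rows

# Stand-in for a GSC query: pretend the API holds 23 rows, served 10 at a time.
data = list(range(23))
fake_fetch = lambda start, size: data[start:start + size]
print(len(fetch_all_rows(fake_fetch, page_size=10)))  # 23
```

With the real API, fetch_page would copy the payload, set its 'startRow' value, run query(), and return the rows from the response.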
Now we need to combine the two datasets so we can see what’s missing from each one. There are various ways you can do this. I went with the Pandas merge() function and used a right join on the page column, which is present in both dataframes.
df_all = pd.merge(df_gsc, df_sitemap, how='right', on=['page'])
df_all.head()
We get back a dataframe which includes all the pages present in our sitemap, and their corresponding web metrics from GSC. You could also use a different join to return all the pages in Google Search Console, including blank values for pages not found in the sitemap.
 | page | clicks | impressions | ctr | position |
---|---|---|---|---|---|
0 | http://flyandlure.org/ | 1203.0 | 15734.0 | 7.65 | 32.94 |
1 | http://flyandlure.org/about | 13.0 | 1102.0 | 1.18 | 7.30 |
2 | http://flyandlure.org/terms | 0.0 | 24.0 | 0.00 | 6.50 |
3 | http://flyandlure.org/privacy | 0.0 | 20.0 | 0.00 | 8.10 |
4 | http://flyandlure.org/copyright | 7.0 | 189.0 | 3.70 | 3.26 |
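If you’d rather see both sides at once, merge() also accepts how='outer' along with an indicator=True argument, which adds a _merge column labelling each page as both, left_only, or right_only. A quick sketch with hypothetical miniature dataframes:

```python
import pandas as pd

# Hypothetical miniature versions of df_gsc and df_sitemap.
df_gsc_mini = pd.DataFrame({'page': ['/a', '/b', '/c'], 'clicks': [10, 5, 0]})
df_sitemap_mini = pd.DataFrame({'page': ['/a', '/b', '/d']})

# indicator=True labels where each page was found.
df_outer = pd.merge(df_gsc_mini, df_sitemap_mini,
                    how='outer', on='page', indicator=True)
print(df_outer[['page', '_merge']])
```

Here /c would be flagged left_only (in GSC but not the sitemap) and /d right_only (in the sitemap but not GSC), so a single merge surfaces both the missing pages and the potential bloat.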
To confirm that the merge worked as expected, you can print out the shape of each dataframe, which will confirm the number of rows and columns present. As you can see, our new dataframe has the same number of rows as the df_sitemap dataframe, so it’s missing the extra pages present in GSC but not in the sitemap.
df_sitemap.shape, df_gsc.shape, df_all.shape
((1674, 1), (2099, 5), (1674, 5))
To have a poke around in the data, my next step was to examine the pages present in the df_all dataframe that didn’t generate any clicks. Some of these still generated impressions, so they’re clearly indexed, but the lack of clicks might point to issues with the content that could be improved.
df_no_clicks = df_all[df_all['clicks'] < 1]
df_no_clicks.sort_values(by='impressions', ascending=False).head(10)
 | page | clicks | impressions | ctr | position |
---|---|---|---|---|---|
1484 | http://flyandlure.org/listings/places_to_fly_fish/scotland/north_lanarkshire/avon_trout_fishery | 0.0 | 1777.0 | 0.0 | 15.30 |
957 | http://flyandlure.org/listings/places_to_fly_fish/england/northamptonshire/pitsford_reservoir | 0.0 | 1160.0 | 0.0 | 43.83 |
1069 | http://flyandlure.org/listings/places_to_fly_fish/england/worcestershire/lenches_lakes | 0.0 | 800.0 | 0.0 | 17.92 |
1174 | http://flyandlure.org/listings/places_to_fly_fish/scotland/angus/kingennie_fishery | 0.0 | 737.0 | 0.0 | 17.54 |
236 | http://flyandlure.org/articles/fly_fishing_gear_reviews/patagonia_simple_guide_hoody_review | 0.0 | 676.0 | 0.0 | 46.09 |
409 | http://flyandlure.org/listings/fly_fishing_clubs/england/east_sussex/the_hastings_fly_fishers_club | 0.0 | 640.0 | 0.0 | 13.16 |
566 | http://flyandlure.org/listings/fly_fishing_clubs/scotland/perth_and_kinross/perth_and_district_a... | 0.0 | 578.0 | 0.0 | 8.67 |
902 | http://flyandlure.org/listings/places_to_fly_fish/england/hampshire/woodington_trout_fishery | 0.0 | 564.0 | 0.0 | 15.93 |
495 | http://flyandlure.org/listings/fly_fishing_clubs/england/west_sussex/sussex_piscatorial_society | 0.0 | 536.0 | 0.0 | 19.41 |
1295 | http://flyandlure.org/listings/places_to_fly_fish/scotland/fife_/raith_lake_trout_fishery | 0.0 | 507.0 | 0.0 | 15.95 |
To identify the pages which are in Google Search Console but aren’t in the sitemap, I’ve first converted the page column in each dataframe to a list using tolist().
sitemap_pages = df_sitemap['page'].tolist()
gsc_pages = df_gsc['page'].tolist()
all_pages = sitemap_pages + gsc_pages
To obtain a list of all the unique pages in the all_pages list, I’ve passed it to set(), which removes any duplicates, and then converted it back to a list using list(). We get back 2145 unique pages.
unique_all_pages = list(set(all_pages))
len(unique_all_pages)
2145
To identify the pages which are shared and present in both the sitemap.xml and Google Search Console, we can use intersection(). This gives us a list of 1625 pages that are present in both places. As there are 1674 pages in the sitemap, this means some probably aren’t indexed, and since there are 2099 in Google Search Console, it looks like 474 pages are found there but are missing from the sitemap, which could point to index bloat.
shared_pages = list(set(sitemap_pages).intersection(set(gsc_pages)))
len(shared_pages)
1625
Next, we’ll identify which pages in our sitemap are completely absent from the Google Search Console data. They don’t rank for anything, which is probably an indication that they’ve not been indexed by Google. However, I suspect this might be an inexact science!
You can confirm this by entering the URL with the site: prefix into Google to check whether it returns any results. The sample of pages I checked were in the sitemap, but Google hadn’t indexed them, so it works for me. I found 46 pages in the sitemap that weren’t indexed.
non_ranking_pages = list(set(unique_all_pages).difference(set(gsc_pages)))
len(non_ranking_pages)
46
non_ranking_pages[:10]
['http://flyandlure.org/listings/places_to_fly_fish/scotland/highland/lochan_a_chlierich',
'http://flyandlure.org/listings//scotland/highland/loch_na_beinne_reidhe',
'http://flyandlure.org/listings/places_to_fly_fish/wales/rhondda_cynon_taff/lluest_wen_reservoir',
'http://flyandlure.org/listings/places_to_fly_fish/england/county_durham/jubilee_lakes',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/highland/loch_an_alltam_fhearna',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/argyll_and_bute/loch_awe',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/tayside/easter_balado_fishery',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/highland/loch_na_hi-uidhe_doimhne',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/highland/loch_droighinn',
'http://flyandlure.org/listings/places_to_fly_fish/scotland/argyll_and_bute/loch_nan_clach']
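As an alternative to the set arithmetic, you could flag the same non-ranking pages directly on a dataframe using Pandas’ isin() method, which keeps the metrics columns alongside the flag. A small sketch using made-up pages:

```python
import pandas as pd

# Made-up sitemap pages and GSC pages for illustration.
df_pages = pd.DataFrame({'page': ['/a', '/b', '/c', '/d']})
gsc_pages_mini = ['/a', '/c']

# Flag sitemap URLs that never appeared in the GSC data.
df_pages['in_gsc'] = df_pages['page'].isin(gsc_pages_mini)
non_ranking = df_pages.loc[~df_pages['in_gsc'], 'page'].tolist()
print(non_ranking)  # ['/b', '/d']
```

On the real data you’d pass gsc_pages to isin() on the df_sitemap dataframe.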
Now we’ve got the non-indexed pages sorted, we’ll look at the potential issue of search index bloat. The numbers above confirm that Google Search Console includes many more pages than we are showing in the sitemap, which means Google has found and indexed pages that are missing from it.
If a page appears in our unique_all_pages list but isn’t found in the sitemap_pages list, the page isn’t present in our sitemap. This could be because the page was omitted from the sitemap in error, or because paginated or faceted content has been indexed.

To determine which pages are in Google Search Console but not in the sitemap, I’ve taken the difference() of the set() of unique_all_pages and the set() of sitemap_pages. There are 474 pages in GSC that aren’t in the sitemap.
index_bloat_pages = list(set(unique_all_pages).difference(set(sitemap_pages)))
len(index_bloat_pages)
474
To print out the affected pages, we can display the contents of the index_bloat_pages list. This highlights several issues on my site. It looks like I inadvertently failed to add category pages (such as /articles/fly_fishing_gear_reviews) and tag pages (such as /articles/tagged/18/fly_boxes) to the sitemap, so adding these might bring some extra traffic.

However, there are also some paginated pages appearing (i.e. ?page=15). I could prevent these being indexed by disallowing these query parameters in my robots.txt file. The other new additions are URLs featuring Chrome’s text fragment deep links (the #:~:text= URLs), created where people have shared my content and linked to a highlighted term.
index_bloat_pages[:10]
['http://flyandlure.org/articles/fly_fishing_gear_reviews/simms_retractor_review#:~:text=Picture%20copyright%20%C2%A9%20Fly&Lure',
'http://flyandlure.org/articles/fly_fishing_gear_reviews/guideline_lps_euro_nymphing_fly_rod_review#:~:text=Picture%20copyright%20%C2%A9%20Fly%20and%20Lure',
'http://flyandlure.org/listings/fly_fishing_clubs?page=8',
'http://flyandlure.org/articles/tagged/26/loch_style',
'http://flyandlure.org/listings/places_to_fly_fish/england/hampshire',
'http://flyandlure.org/listings/places_to_fly_fish/wales',
'http://flyandlure.org/listings/fly_fishing_clubs?page=15',
'http://flyandlure.org/articles/tagged/22/clothing',
'http://flyandlure.org/listings/fly_fishing_clubs/scotland/highland',
'http://flyandlure.org/articles/tagged/18/fly_boxes']
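To quantify how much of the bloat comes from each source, you could roughly bucket the URLs using the standard library’s urllib.parse. This is a sketch based on the patterns above; the category labels are my own, not anything from the GSC API:

```python
from urllib.parse import urlparse, parse_qs

def classify_url(url):
    """Roughly bucket a URL by why it might be bloating the index."""
    parsed = urlparse(url)
    if 'page' in parse_qs(parsed.query):
        return 'paginated'
    if parsed.fragment.startswith(':~:text='):
        return 'text fragment deep link'
    return 'other'

urls = [
    'http://flyandlure.org/listings/fly_fishing_clubs?page=8',
    'http://flyandlure.org/articles/fly_fishing_gear_reviews/simms_retractor_review#:~:text=Picture',
    'http://flyandlure.org/articles/tagged/26/loch_style',
]
for url in urls:
    print(classify_url(url), '->', url)
```

Pairing this with collections.Counter over index_bloat_pages would tell you how many pages fall into each bucket, which helps prioritise the robots.txt and sitemap fixes.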
Matt Clarke, Friday, March 12, 2021