How to analyse non-ranking pages and search index bloat

Learn how you can use Python to identify how many non-ranking pages your site has and check whether you're a victim of search index bloat.


If your site’s pages aren’t indexed by Google, you’re obviously not going to generate any traffic to them, so you’ll want to check that everything you expect to be present actually appears in Google’s index and can drive new visitors to your site.

Similarly, having too many pages indexed can also be a problem. Sometimes, search engines index far more pages than we expect, causing index bloat. This typically happens when paginated content, or pages generated by a faceted search system, gets indexed, inflating the number of pages in the index.

In this project, we’ll use Python to solve this technical SEO problem. We’ll fetch the pages from our sitemap and compare them to the pages present in our Google Search Console data, to identify what’s missing and what shouldn’t be present. Here’s how it’s done.

Load the packages

Open a Jupyter notebook and import the packages below. We’ll be using google.oauth2 for authenticating on the Google Search Console API and googleapiclient for accessing the data, requests_html to scrape our sitemap, requests for its exception handling, and Pandas for data manipulation.

from google.oauth2 import service_account
from googleapiclient.discovery import build
import requests
import json
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession

Scrape the URLs from your sitemap

Our first step is to create a dataframe containing the URLs of every page in our sitemap. These represent what we think we should have in the Google search index. The below function takes the URL of our sitemap and returns its XML source code using the Requests-HTML package. Check out my full guide to Requests HTML for more details on using this.

def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)

Next, we’ll create another function called scrape_sitemap(). This takes the URL of our sitemap, fetches the source code using get_source(), and parses the XML to extract the loc elements containing the URL for each page. At the end, it spits out a Pandas dataframe containing the URL for every page in the sitemap.

def scrape_sitemap(url):
    """Scrape the contents of an XML sitemap and return the contents in a dataframe.

    Args:
        url (string): Absolute URL of urlset XML sitemap.

    Returns:
        df (dataframe): Pandas dataframe containing sitemap contents.
    """

    response = get_source(url)

    rows = []

    with response as r:
        urls = r.html.find("loc", first=False)

        for url in urls:
            rows.append({'page': url.text})

    return pd.DataFrame(rows, columns=['page'])

Running the scrape_sitemap() function only takes a second on a small site. We’ll save the output of this to a Pandas dataframe called df_sitemap and will back up the data to a CSV called sitemap.csv so we can use it elsewhere if we need to.

df_sitemap = scrape_sitemap("")
df_sitemap.to_csv("sitemap.csv", index=False)

Create a connection to Google Search Console

Now we’ve got the URLs of the pages we expect to be indexed, we can use the Google Search Console API to find out what pages actually generate impressions. This is probably not perfectly correlated with what has been indexed, but it looks extremely close, based on my data.

To create our connection to the API, we’ll create a function to pass our JSON client secrets key to the Google authentication service and get back the required service object.

def connect(key):
    """Create a connection to the Google Search Console API and return service object.

    Args:
        key (string): Path to the Google Search Console JSON client secrets file.

    Returns:
        service (object): Google Search Console service object.
    """

    scope = ['https://www.googleapis.com/auth/webmasters.readonly']
    credentials = service_account.Credentials.from_service_account_file(key, scopes=scope)
    service = build('webmasters', 'v3', credentials=credentials)

    return service

key = 'google-search-console.json'
service = connect(key)

Run a query on Google Search Console

Now we’re connected, we can create a function to run a search query on Google Search Console. We’ll create a dictionary called payload containing the API search query parameters, and pass it to the API along with the service object and the site_url of the GSC property we want to query.

The function will create a Pandas dataframe containing the search results, with the dimensions used placed in named columns, so we can access them easily. You can re-use this function for any query, not just the one we’re writing below.

def query(service, site_url, payload):
    """Run a query on the Google Search Console API and return a dataframe of results.

    Args:
        service (object): Service object from connect()
        site_url (string): URL of Google Search Console property
        payload (dict): API query payload dictionary

    Returns:
        df (dataframe): Pandas dataframe containing requested data.
    """

    response = service.searchanalytics().query(siteUrl=site_url, body=payload).execute()

    results = []

    for row in response['rows']:    
        data = {}

        # Map each dimension in the payload to a named column
        for i in range(len(payload['dimensions'])):
            data[payload['dimensions'][i]] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)
        data['position'] = round(row['position'], 2)        

        results.append(data)

    return pd.DataFrame.from_dict(results)

Fetch Google Search Console data on your pages

We can now create a payload dictionary containing our search query and pass it to the GSC API. I’ve set this to return all the metrics for 2020 grouped by the page or URL, and have told it to query my fly fishing website property in Search Console. We get back the clicks, impressions, ctr and average position for each page.

payload = {
    'startDate': "2020-01-01",
    'endDate': "2020-12-31",
    'dimensions': ["page"],  
    'rowLimit': 10000,
    'startRow': 0
}

site_url = ""

df_gsc = query(service, site_url, payload)
page clicks impressions ctr position
0 4713 69956 6.74 18.40
1 3344 75859 4.41 10.93
2 2994 47521 6.30 9.01
3 2930 59244 4.95 16.06
4 2796 27285 10.25 6.25
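To sanity-check the export before going further, it’s worth totalling the metrics across all pages. A quick sketch, using hypothetical data in the same shape as df_gsc (the page URLs below are made up for illustration):

```python
import pandas as pd

# Hypothetical data in the same shape as the df_gsc dataframe above
df_gsc = pd.DataFrame({
    'page': ['/a', '/b', '/c'],
    'clicks': [4713, 3344, 2994],
    'impressions': [69956, 75859, 47521],
    'ctr': [6.74, 4.41, 6.30],
    'position': [18.40, 10.93, 9.01],
})

# Site-wide totals and the mean average position
total_clicks = df_gsc['clicks'].sum()
total_impressions = df_gsc['impressions'].sum()
mean_position = df_gsc['position'].mean()

print(total_clicks, total_impressions, round(mean_position, 2))
```

If the totals look far smaller than what the Search Console UI reports, your rowLimit may be truncating the results.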

Join the Google Search Console data to your sitemap data

Now we need to combine the two datasets so we can see what’s missing from each one. There are various ways you can do this. I went with the Pandas merge() function and used a right join on the page column, which is present in both dataframes.

df_all = pd.merge(df_gsc, df_sitemap, how='right', on=['page'])

We get back a dataframe which includes all the pages present in our sitemap, and their corresponding web metrics from GSC. You could also use a different join to give you all the pages in Google Search Console, including any blank lines for pages not found in the sitemap.

page clicks impressions ctr position
0 1203.0 15734.0 7.65 32.94
1 13.0 1102.0 1.18 7.30
2 0.0 24.0 0.00 6.50
3 0.0 20.0 0.00 8.10
4 7.0 189.0 3.70 3.26

To confirm that the merge worked as expected, you can print out the shape for each dataframe, which will confirm the number of rows and columns present. As you can see, our new dataframe has the same number of rows as the df_sitemap dataframe, so it’s missing the extra pages present in GSC and not in the sitemap.

df_sitemap.shape, df_gsc.shape, df_all.shape
((1674, 1), (2099, 5), (1674, 5))
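If you want a single view that also labels where each page came from, merge() supports an indicator argument that tags each row as present in both dataframes, only the left, or only the right. A minimal sketch with toy data (the column names match those used above, but the URLs are hypothetical):

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes
df_gsc = pd.DataFrame({'page': ['/a', '/b', '/c'], 'clicks': [10, 5, 2]})
df_sitemap = pd.DataFrame({'page': ['/a', '/b', '/d']})

# An outer join with indicator=True adds a _merge column labelling each
# row as 'both', 'left_only' (GSC only) or 'right_only' (sitemap only)
df_all = pd.merge(df_gsc, df_sitemap, how='outer', on='page', indicator=True)
print(df_all['_merge'].value_counts())
```

This gives you the missing-from-sitemap and missing-from-GSC pages in one dataframe, rather than via separate set operations.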

Examine the pages with no clicks

To have a poke around in the data, my next step was to examine the pages present in the df_all dataframe that didn’t generate any impressions. This is probably a good indication that the pages haven’t been indexed, but it might also point to issues with the content that could be improved.

df_no_clicks = df_all[df_all['clicks'] < 1]
df_no_clicks.sort_values(by='impressions', ascending=False).head(10)
page clicks impressions ctr position
1484 0.0 1777.0 0.0 15.30
957 0.0 1160.0 0.0 43.83
1069 0.0 800.0 0.0 17.92
1174 0.0 737.0 0.0 17.54
236 0.0 676.0 0.0 46.09
409 0.0 640.0 0.0 13.16
566 0.0 578.0 0.0 8.67
902 0.0 564.0 0.0 15.93
495 0.0 536.0 0.0 19.41
1295 0.0 507.0 0.0 15.95

Identify which pages in Google Search Console aren’t in the sitemap

To identify the pages which are in Google Search Console but aren’t in the sitemap, I’ve first converted the page column in each dataframe to a list using tolist().

sitemap_pages = df_sitemap['page'].tolist()
gsc_pages = df_gsc['page'].tolist()
all_pages = sitemap_pages + gsc_pages

To obtain a list of all the unique pages in the all_pages list, I’ve passed it to set(), which removes any duplicates, and then converted it back to a list using list(). We get back 2148 unique pages.

unique_all_pages = list(set(all_pages))

To identify the pages which are shared and present in both the sitemap.xml and Google Search Console, we can use intersection(). This gives us a list of 1625 pages that are present in both places. As there are 1674 pages in the sitemap, 49 of them generate no impressions and probably aren’t indexed, and since there are 2099 pages in Google Search Console, 474 pages are found there but are missing from the sitemap, which could point to index bloat.

shared_pages = list(set(sitemap_pages).intersection(set(gsc_pages)))
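The counts above follow the inclusion–exclusion rule: the number of unique pages equals the sitemap count plus the GSC count minus the shared count (2148 = 1674 + 2099 − 1625). A toy example with hypothetical URLs:

```python
# Hypothetical page lists standing in for the scraped data
sitemap_pages = ['/a', '/b', '/c', '/d']
gsc_pages = ['/b', '/c', '/d', '/e', '/f']

unique_all_pages = list(set(sitemap_pages + gsc_pages))
shared_pages = list(set(sitemap_pages).intersection(set(gsc_pages)))

# |union| = |sitemap| + |gsc| - |shared|
assert len(unique_all_pages) == len(set(sitemap_pages)) + len(set(gsc_pages)) - len(shared_pages)
print(len(unique_all_pages), len(shared_pages))
```

Checking this identity against your own counts is a useful way to confirm nothing was lost in the merge.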

Identifying the non-ranking pages

Next, we’ll identify which pages in our sitemap are completely absent from the Google Search Console data. They don’t rank for anything, which is probably an indication that they’ve not been indexed by Google. However, I suspect this might be an inexact science!

You can confirm this by entering the URL with the site: prefix into Google to check whether it returns any results. The sample I checked was in the sitemap, but Google hadn’t indexed the pages, so the approach works for me. I found 49 pages in the sitemap that weren’t indexed.

non_ranking_pages = list(set(unique_all_pages).difference(set(gsc_pages)))

Analyse search index bloat

Now we’ve got the non-indexed pages sorted, we’ll look at the potential issue of search index bloat. The numbers above confirm that Google Search Console includes many more pages than are present in the sitemap, which means Google has found and indexed pages that are missing from it.

If a page appears in our unique_all_pages list, but isn’t found in the sitemap_pages list, the page isn’t present in our sitemap. This could be because the page was omitted from the sitemap in error, or because paginated or faceted content has been indexed.

To determine which pages are in Google Search Console and not in the sitemap, I’ve taken the difference() between the set() of unique_all_pages and the set() of sitemap_pages. As we spotted above, there are 474 pages in GSC that aren’t in the sitemap.

index_bloat_pages = list(set(unique_all_pages).difference(set(sitemap_pages)))

To print out the affected pages we can display the contents of the index_bloat_pages list. This highlights several issues on my site. It looks like I inadvertently failed to add category pages (such as /articles/fly_fishing_gear_reviews) and tag pages (such as /articles/tagged/18/fly_boxes) to the sitemap, so adding these might bring some extra traffic.

However, there are also some paginated pages appearing (i.e. ?page=15). I could stop these being crawled by disallowing the query parameter in my robots.txt file, or prevent them being indexed more reliably with a noindex directive. The other new addition is URLs featuring Chrome’s text fragment deep links, created where people have shared my content and linked to a highlighted term.
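To make the cleanup more systematic, you could group the stray URLs by pattern, so pagination, tag pages and text fragment links can each be handled in bulk. A rough sketch, assuming index_bloat_pages is a list of URL path strings (the example URLs and the classify() helper below are hypothetical, not part of the original code):

```python
# Hypothetical sample of pages found in GSC but not in the sitemap
index_bloat_pages = [
    '/articles/fly_fishing_gear_reviews',
    '/articles/tagged/18/fly_boxes',
    '/articles?page=15',
    '/articles/some-post#:~:text=highlighted%20term',
]

def classify(url):
    """Assign a rough category to a stray URL based on its pattern."""
    if '?page=' in url:
        return 'pagination'
    if '#:~:text=' in url:
        return 'text fragment'
    if '/tagged/' in url:
        return 'tag page'
    return 'other'

for url in index_bloat_pages:
    print(classify(url), url)
```

Anything landing in the 'other' bucket is worth reviewing by hand, as it may be a legitimate page that simply needs adding to the sitemap.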


Matt Clarke, Friday, March 12, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.