How to count indexed pages using Python

Learn how to use Python to count the number of indexed pages a website has to help you monitor its size and growth rate.


One quick and easy way to gauge the size of a website, and its growth rate, is to examine the number of its web pages Google has indexed. You can obtain this value by entering an advanced search query comprising the site: prefix followed by the URL of the site, for example site:flyandlure.org.

While this is only a rough approximation, it’s usually (but not always) relatively close to the actual value. It’s also as good a guide as any in the absence of access to your competitors’ Google Search Console accounts. In this project, we’ll use web scraping to build a simple tool to fetch these data.

Load the packages

First, open up a Jupyter notebook and load the packages below. To install Requests-HTML, you can enter pip3 install requests_html in your terminal. You’ll likely have the other packages pre-installed.

```python
import requests
import urllib.parse
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
```

Get the page source

Next, we’ll create a function called get_source(). This takes a URL and returns the raw HTML of the page for us to parse. As this is only a simple task, we’ll use Requests-HTML to handle it; internally, this uses Requests and Beautiful Soup. We’ll catch the exception if the page doesn’t load.


```python
def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
```

We’ll create another function called get_results() next. This takes the URL of the site we want to check, URL-encodes it with quote_plus(), and appends it to a Google search query that includes the all-important site: prefix. It then runs get_source() and returns the HTML of the search results page.

```python
def get_results(url):
    query = urllib.parse.quote_plus(url)
    response = get_source("https://www.google.co.uk/search?q=site%3A" + query)

    return response
```

Parse the results

To parse the results HTML we can again use Requests-HTML. This wraps Beautiful Soup, so it allows us to look for the div with the ID result-stats in the page and return the text from inside the tag.

The string will contain text in the format “About 1,710 results (0.30 seconds)”, but we only want the number of results, so we’ll use split() to break it up at the spaces and take the second element ([1]), which contains the number. Then we’ll replace the comma with nothing and cast the value to an int.

```python
def parse_results(response):
    string = response.html.find("#result-stats", first=True).text
    indexed = int(string.split(' ')[1].replace(',', ''))
    return indexed
```
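
If the indexing isn’t obvious, here’s a quick worked example using the sample string quoted above (hard-coded, not a live response) to show what each step returns.

```python
# Worked example with the sample string from above, not a live response.
string = "About 1,710 results (0.30 seconds)"
parts = string.split(' ')                  # ['About', '1,710', 'results', '(0.30', 'seconds)']
indexed = int(parts[1].replace(',', ''))   # '1,710' -> '1710' -> 1710
print(indexed)                             # 1710
```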

Count the number of indexed pages

Finally, we can wrap these up in another function called count_indexed_pages(). This takes our URL, runs get_results() to obtain the source of the search results page, then uses parse_results() to return the number of indexed pages.

```python
def count_indexed_pages(url):
    response = get_results(url)
    return parse_results(response)

count_indexed_pages("http://flyandlure.org")
```

2150

Fetch the data for multiple sites

If you have a bunch of competitors you want to monitor, all it takes is a list of URLs and a for loop: we can fetch the data for each one in a matter of seconds and return it in a neat Pandas dataframe.

```python
sites = ['http://flyandlure.org',
         'https://beardybros.co.uk',
         'https://yorkshireflyfishing.org.uk',
         'https://www.flyfishing-and-flytying.co.uk',
         'https://www.turrall.com',
         'https://dgfishing.co.uk']

data = []

for site in sites:
    site_data = {
        'url': site,
        'indexed_pages': count_indexed_pages(site)
    }

    data.append(site_data)

df = pd.DataFrame.from_records(data)
df.sort_values(by='indexed_pages')
```
```
                                         url  indexed_pages
4                    https://www.turrall.com            195
1                   https://beardybros.co.uk            227
2         https://yorkshireflyfishing.org.uk            431
5                    https://dgfishing.co.uk            524
3  https://www.flyfishing-and-flytying.co.uk           1330
0                      http://flyandlure.org           2150
```

Do bear in mind that, as with any form of Google scraping, you’re not technically supposed to do this, so I wouldn’t recommend it on larger-scale projects where you’re more likely to hit Google’s bot detection code. There are various ways to circumvent this, if you so wish, including using rotating proxies.
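
As a rough illustration of the rotating proxy approach, here’s a minimal sketch of how you might route the request through a proxy with Requests-HTML. The proxy URL and credentials shown are placeholders rather than a working endpoint, and in practice you’d cycle through a pool supplied by a proxy provider.

```python
from requests_html import HTMLSession

def get_source_via_proxy(url, proxy):
    """Fetch a page through the given proxy (illustrative sketch only).

    The proxy argument should be a full proxy URL from your provider,
    e.g. 'http://user:pass@proxy.example.com:8080' (hypothetical).
    """
    session = HTMLSession()
    # Requests (which Requests-HTML wraps) accepts a per-request proxies dict
    return session.get(url, proxies={"http": proxy, "https": proxy})

# Hypothetical usage with a placeholder proxy:
# response = get_source_via_proxy(
#     "https://www.google.co.uk/search?q=site%3Aflyandlure.org",
#     "http://user:pass@proxy.example.com:8080"
# )
```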

Matt Clarke, Saturday, March 13, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.