One quick and easy way to understand the size of a website, and its growth rate, is to examine the number of its pages Google has indexed. You can obtain this value by entering an advanced search query comprising the `site:` prefix followed by the URL of the site.
While this is only a rough approximation, it’s usually (but not always) relatively close to the actual value. It’s also as good a guide as any in the absence of access to your competitors’ Google Search Console accounts. In this project, we’ll use web scraping to build a simple tool to fetch this data.
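To illustrate what that query looks like once URL-encoded, here is a minimal sketch. The `build_search_url()` helper is hypothetical, introduced only to show the format; the tool below builds the same kind of URL inline.

```python
import urllib.parse

def build_search_url(site):
    # URL-encode the "site:" query so the colon and slashes are safe in a URL
    query = urllib.parse.quote_plus("site:" + site)
    return "https://www.google.co.uk/search?q=" + query

print(build_search_url("http://flyandlure.org"))
# → https://www.google.co.uk/search?q=site%3Ahttp%3A%2F%2Fflyandlure.org
```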
First, open up a Jupyter notebook and import the packages below. To install Requests-HTML, enter `pip3 install requests_html` in your terminal. You’ll likely have the other packages pre-installed.
```python
import requests
import urllib
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
```
Next, we’ll create a function called `get_source()`. This takes a URL and returns the raw HTML of the page for us to parse. As this is only a simple task, we’ll use Requests-HTML to handle it; internally, this uses Requests and Beautiful Soup. We’ll catch the exception if the page doesn’t load.
```python
def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)
```
We’ll create another function called `get_results()` next. This takes the URL of the site we want to check and appends it to a Google query that includes the all-important `site:` prefix. It runs `get_source()` and returns the HTML of the search results page.
```python
def get_results(url):
    # URL-encode the site address so it's safe to include in the query string
    query = urllib.parse.quote_plus(url)
    response = get_source("https://www.google.co.uk/search?q=site%3A" + query)
    return response
```
To parse the results HTML we can use Requests-HTML. This is a wrapper around Beautiful Soup, so it allows us to look for the `div` with the CSS ID `result-stats` in the page and return the `text` from inside the tag.
The `string` value will contain some text in the format “About 1,710 results (0.30 seconds)”, but we only want the number of results. We’ll use `split()` to break this up at the spaces, take the second element at index `[1]` which contains the number, replace the commas with nothing, and cast the value to an `int`.
```python
def parse_results(response):
    string = response.html.find("#result-stats", first=True).text
    indexed = int(string.split(' ')[1].replace(',', ''))
    return indexed
```
Finally, we can wrap these up in another function called `count_indexed_pages()`. This takes our `url`, runs `get_results()` to obtain the source of the page, then uses `parse_results()` to return the number of indexed pages.
```python
def count_indexed_pages(url):
    response = get_results(url)
    return parse_results(response)

count_indexed_pages("http://flyandlure.org")
```

```
2150
```
If you have a bunch of competitors you want to monitor, all it takes is a list of URLs and a `for` loop, and we can fetch the data for each one in a matter of seconds and return it in a neat Pandas dataframe.
```python
sites = ['http://flyandlure.org',
         'https://beardybros.co.uk',
         'https://yorkshireflyfishing.org.uk',
         'https://www.flyfishing-and-flytying.co.uk',
         'https://www.turrall.com',
         'https://dgfishing.co.uk']

data = []

for site in sites:
    site_data = {
        'url': site,
        'indexed_pages': count_indexed_pages(site)
    }
    data.append(site_data)

df = pd.DataFrame.from_records(data)
df.sort_values(by='indexed_pages')
```

| | url | indexed_pages |
|---|---|---|
| 4 | https://www.turrall.com | 195 |
| 1 | https://beardybros.co.uk | 227 |
| 2 | https://yorkshireflyfishing.org.uk | 431 |
| 5 | https://dgfishing.co.uk | 524 |
| 3 | https://www.flyfishing-and-flytying.co.uk | 1330 |
| 0 | http://flyandlure.org | 2150 |
Do bear in mind that, as with any form of Google scraping, you’re not technically supposed to do this, so I wouldn’t recommend it for larger-scale projects, where you’re more likely to trigger Google’s bot detection. There are various ways to circumvent this, if you so wish, including using rotating proxies.
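For small-scale use, slowing down between requests and sending a browser-like User-Agent can reduce the chance of being blocked. Here’s a minimal sketch; the `polite_get()` helper, the header string, and the delay range are all illustrative assumptions, not values Google documents or guarantees will work.

```python
import time
import random
import requests

# A browser-like User-Agent string (an assumption; any recent browser UA works)
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/91.0.4472.124 Safari/537.36"
    )
}

def polite_get(url, min_delay=2, max_delay=5):
    """Sleep for a random interval, then fetch the URL with custom headers."""
    # Randomised delay so requests don't arrive at a fixed, bot-like cadence
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=HEADERS, timeout=10)
```

You could swap `polite_get()` into `get_source()` in place of the plain `session.get()` call, though no amount of politeness makes scraping permitted under Google’s terms.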
Matt Clarke, Saturday, March 13, 2021