How to scrape Google search results using Python

Learn to scrape Google search results using Python to save time and collect data that supports a wide range of technical SEO and data science analyses.


Although I suspect you are probably not technically allowed to do it, I doubt there’s an SEO in the land who hasn’t scraped Google search engine results to analyse them, or used an SEO tool that does the same thing. It’s much more convenient than picking through the SERPs to extract links by hand.

In this project, I’ll show you how you can build a relatively robust (but also slightly flawed) web scraper using Requests-HTML that can return a list of URLs from a Google search, so you can analyse the URLs in your technical SEO projects.

If you just want a quick, free way to scrape Google search results using Python, without paying for a SERP API service, then give my EcommerceTools package a try. It lets you scrape Google search results in three lines of code. Here’s how it’s done.

Load the packages

First, open a Jupyter notebook and import the packages below. You’ll likely already have requests and pandas (urllib is part of the Python standard library), but you can install requests_html by entering pip3 install requests_html if you don’t already have it.

import requests
import urllib.parse
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession

Get the page source

Next, we’ll write a little function to pass our URL to Requests-HTML and return the response for the page. It first creates a session, then fetches the response. If the request fails, it catches the exception and prints the error rather than raising it. We’ll scrape the interesting bits in the next step.

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        # Report the error; callers receive None when the request fails.
        print(e)
        return None

Scrape the results

This is the bit where things get interesting, and slightly hacky. I suspect Google does not like people scraping their search results, so you’ll find that there are no convenient CSS class names we can tap into. Those that are present seem to change, causing scrapers to break. To work around this, I’ve used an alternative approach, which is more robust but does have a limitation.

First, we’re using urllib.parse.quote_plus() to URL encode our search query. This will add + characters where spaces sit and ensure that the search term used doesn’t break the URL when we append it. After that, we’ll combine it with the Google search URL and get back the page source using get_source().
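
As a quick illustration of the encoding step, here is a minimal sketch. The helper name build_search_url is hypothetical and not part of the original code; the base URL is the one used in this article.

```python
import urllib.parse


def build_search_url(query, base="https://www.google.co.uk/search?q="):
    """Return a search URL with the query URL-encoded (hypothetical helper)."""
    # quote_plus() turns spaces into "+" and escapes reserved characters
    # such as "&" so they cannot break the query string.
    return base + urllib.parse.quote_plus(query)


print(build_search_url("data science blogs"))
print(build_search_url("R&D tax credits"))
```

Without the encoding, the & in the second query would be read as the start of a new URL parameter and the search term would be cut short.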

Rather than using the current CSS class or XPath to extract the links, I’ve just extracted all the absolute URLs from the page using response.html.absolute_links. This is more resistant to changes in Google’s source code, but it means Google URLs will also be present.

Since it’s only non-Google content in which I’m interested, I’ve removed any URLs with a Google-related URL prefix. The downside is that it will remove legitimate Google URLs in the SERPs.

def scrape_google(query):

    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)

    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.', 
                      'https://google.', 
                      'https://webcache.googleusercontent.', 
                      'http://webcache.googleusercontent.', 
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.')

    # Keep only the URLs that do not start with a Google-related prefix.
    links = [url for url in links if not url.startswith(google_domains)]

    return links

Running the function gives us a list of URLs that were found on the Google search results for our chosen term, with any Google-related URLs removed. This obviously isn’t a perfect match for the actual results; however, it does return the non-Google domains in which I’m interested.

scrape_google("data science blogs")
['https://medium.com/@exastax/top-20-data-science-blogs-and-websites-for-data-scientists-d88b7d99740',
 'https://data-science-blog.com/',
 'https://blog.feedspot.com/data_science_blogs/',
 'https://github.com/rushter/data-science-blogs',
 'https://365datascience.com/51-data-science-blogs/',
 'https://towardsdatascience.com/best-data-science-blogs-to-follow-in-2020-d03044169eb4',
 'https://www.dataquest.io/blog/',
 'https://www.tableau.com/learn/articles/data-science-blogs',
 'https://www.kdnuggets.com/websites/blogs.html',
 'https://www.thinkful.com/blog/data-science-blogs/']
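
If the prefix list becomes tedious to maintain, one possible variant is to filter on the hostname instead. This is only a sketch of an alternative technique, not the approach used in scrape_google() above: the function name is_google_url and the rule for what counts as a Google host are my own assumptions, and the sample links are made up for the demonstration.

```python
from urllib.parse import urlparse


def is_google_url(url):
    """Return True if the URL's hostname looks Google-owned (assumed rule)."""
    host = (urlparse(url).hostname or "").lower()
    # Matches google.com, www.google.co.uk, maps.google.de and similar hosts.
    if "google" in host.split("."):
        return True
    # Cached copies are served from googleusercontent.com subdomains.
    return host == "googleusercontent.com" or host.endswith(".googleusercontent.com")


# Sample links for illustration only.
links = [
    "https://www.google.co.uk/preferences",
    "https://accounts.google.com/ServiceLogin",
    "https://webcache.googleusercontent.com/search?q=cache:abc",
    "https://data-science-blog.com/",
    "https://www.dataquest.io/blog/",
]

print([url for url in links if not is_google_url(url)])
```

Like the prefix approach, this still discards legitimate Google-hosted results that rank in the SERPs, so the same limitation applies.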

You can tweak the code to extract only the links from certain parts of the SERPs, but you’ll find that you need to update it regularly, as Google’s source code changes frequently. For what I needed, this did the job fine.

Want the text instead?

If you’re after the title, snippet, and the URL for each search engine result, try this approach instead. First, create a function to format and URL encode the query, send it to Google, and return the response.

def get_results(query):
    
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)
    
    return response

Next, we’ll parse the response HTML. I’ve pored over the obfuscated HTML and extracted the current CSS selectors that identify the result, the title, the link, and the snippet text. These change frequently, so this may not work in the future without adjusting these values.

def parse_results(response):
    
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".IsZvec"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            'title': result.find(css_identifier_title, first=True).text,
            'link': result.find(css_identifier_link, first=True).attrs['href'],
            'text': result.find(css_identifier_text, first=True).text
        }
        
        output.append(item)
        
    return output
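
Because these selectors change so often, a lookup with first=True can come back as None, and calling .text on it would then raise an AttributeError. A small pair of helpers can guard against that. This is a sketch under assumptions: safe_text and safe_attr are hypothetical names, and the SimpleNamespace object below is only a stand-in for a matched Requests-HTML element so the example runs without a network request.

```python
from types import SimpleNamespace


def safe_text(element):
    """Return element.text, or an empty string if nothing was matched."""
    return element.text if element is not None else ""


def safe_attr(element, name):
    """Return an attribute value, or None if the element or attribute is missing."""
    return element.attrs.get(name) if element is not None else None


# Stand-in for a matched element (illustration only, not a real Element).
found = SimpleNamespace(
    text="Web scraping - Wikipedia",
    attrs={"href": "https://en.wikipedia.org/wiki/Web_scraping"},
)

print(safe_text(found))          # text of a matched element
print(repr(safe_text(None)))     # empty string when the selector found nothing
print(safe_attr(found, "href"))  # link of a matched element
print(safe_attr(None, "href"))   # None when the selector found nothing
```

Inside parse_results(), the title line could then read 'title': safe_text(result.find(css_identifier_title, first=True)), with the link and snippet handled the same way.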

Finally, we’ll wrap up the functions in a google_search() function, which will put everything above together and return a neat list of dictionaries containing the results.

def google_search(query):
    response = get_results(query)
    return parse_results(response)

results = google_search("web scraping")
results
[{'title': 'What is Web Scraping and What is it Used For? | ParseHub',
  'link': 'https://www.parsehub.com/blog/what-is-web-scraping/',
  'text': ''},
 {'title': 'Web scraping - Wikipedia',
  'link': 'https://en.wikipedia.org/wiki/Web_scraping',
  'text': 'Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World\xa0...\n\u200eHistory · \u200eTechniques · \u200eSoftware · \u200eLegal issues'},
 {'title': 'Web Scraper - The #1 web scraping extension',
  'link': 'https://webscraper.io/',
  'text': 'The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed.\n\u200eWeb Scraper · \u200eCloud · \u200eTest Sites · \u200eDocumentation'},
 {'title': 'Web Scraper - Free Web Scraping',
  'link': 'https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en',
  'text': '23 Sept 2020 — With a simple point-and-click interface, the ability to extract thousands of records from a website takes only a few minutes of scraper setup. Web\xa0...'},
 {'title': 'Python Web Scraping Tutorials – Real Python',
  'link': 'https://realpython.com/tutorials/web-scraping/',
  'text': 'Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.'},
 {'title': 'ParseHub | Free web scraping - The most powerful web scraper',
  'link': 'https://www.parsehub.com/',
  'text': 'ParseHub is a free web scraping tool. Turn any site into a spreadsheet or API. As easy as clicking on the data you want to extract.'},
 {'title': 'Web Scraping Explained - WebHarvy',
  'link': 'https://www.webharvy.com/articles/what-is-web-scraping.html',
  'text': 'Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites\xa0...'},
 {'title': 'What Is Web Scraping And How Does Web Crawling Work?',
  'link': 'https://www.zyte.com/learn/what-is-web-scraping/',
  'text': 'Web scraping, also called web data extraction, is the process of extracting or scraping data from websites. Learn about web crawling and how it works.'},
 {'title': "A beginner's guide to web scraping with Python | Opensource ...",
  'link': 'https://opensource.com/article/20/5/web-scraping-python',
  'text': "22 May 2020 — Setting a goal for our web scraping project. Now we have our dependencies installed, but what does it take to scrape a webpage? Let's take a\xa0..."}]
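
To keep the results for later analysis, the list of dictionaries can be written out as CSV. The sketch below assumes the title, link, and text keys shown above and uses the standard library csv module rather than pandas; the function name write_results and the sample rows are my own, for illustration only.

```python
import csv
import io


def write_results(results, fileobj):
    """Write a list of result dictionaries to an open file as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=["title", "link", "text"])
    writer.writeheader()
    writer.writerows(results)


# Two sample rows in the same shape as the output above.
sample = [
    {"title": "Web scraping - Wikipedia",
     "link": "https://en.wikipedia.org/wiki/Web_scraping",
     "text": "Web scraping, web harvesting, or web data extraction."},
    {"title": "Web Scraper - The #1 web scraping extension",
     "link": "https://webscraper.io/",
     "text": "The most popular web scraping extension."},
]

# An in-memory buffer keeps the demo self-contained; a real file opened
# with open("results.csv", "w", newline="") works the same way.
buffer = io.StringIO()
write_results(sample, buffer)
print(buffer.getvalue())
```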

If you want to quickly scrape several pages of Google search results, rather than just the first page of results, check out EcommerceTools instead, or adapt the code above to support pagination.
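
As a starting point for pagination, here is a minimal sketch that only builds the URLs for each results page. It assumes Google’s start query parameter sets the result offset and that each page holds ten results; the function name build_page_urls is hypothetical.

```python
import urllib.parse


def build_page_urls(query, pages=3, per_page=10):
    """Return one search URL per results page (assumes the start offset parameter)."""
    urls = []
    for page in range(pages):
        # urlencode() escapes the query the same way quote_plus() does.
        params = {"q": query, "start": page * per_page}
        urls.append("https://www.google.co.uk/search?" + urllib.parse.urlencode(params))
    return urls


for url in build_page_urls("web scraping", pages=3):
    print(url)
```

Each URL could then be passed to get_source() and the response to parse_results(), collecting the output from every page into a single list.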

Matt Clarke, Saturday, March 13, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
