How to build a web scraper using Requests-HTML

Requests-HTML wraps up the best bits from the Requests and Beautiful Soup packages to create a web scraper that’s quick and easy to use. Here’s how it works.

Unless you’re building a large and complex web scraper using Scrapy or Selenium, you’ll probably reach for Requests and Beautiful Soup. These two packages are brilliant for web scraping, but the Requests-HTML package, which combines the two in an easy-to-use wrapper, makes them even easier to use.

In this web scraping project, we’ll install Requests-HTML and cover the basics of creating a simple web scraper. We’ll look at the different ways of parsing and scraping content from a website, then build a scraper that reads an XML sitemap, crawls the site, and stores the scraped content in a Pandas dataframe.

Load the packages

Requests-HTML is based on Requests and Beautiful Soup, and uses some other bits from parse, pyquery, and fake-useragent. To get started, open a Jupyter notebook and import the requests, pandas and requests_html packages. Installing Requests-HTML pulls in its dependencies automatically, but we’re also importing requests directly so we can handle its exceptions later.

import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession

Fetching a page’s source with Requests-HTML

The requests_html package is one of the easiest ways to get started with web scraping in Python. It combines a system for making HTTP requests with easy-to-use code for parsing the content and scraping out the bits you need. To get started, use HTMLSession() to create a new session, then use get() to fetch your URL. Here’s a really basic example. Printing the response object shows the HTTP status code 200 if the request worked.

url = "http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_marsden_hi_viz_shooting_head_fly_line_review"
session = HTMLSession()
response = session.get(url)
response
<Response [200]>

Since this will throw an ugly exception if something goes wrong, we can wrap it in a try/except block and print the error by catching the exception class from the underlying requests package that Requests-HTML uses. If everything works, this will return the response. As this is something we’ll need throughout our code, we’ll also wrap it in a little function to aid re-use.

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)

response = get_source(url)
response
<Response [200]>
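
Note that get_source() creates a brand new HTMLSession on every call. That’s fine for a handful of requests, but if you’re going to crawl a whole site you may prefer to create the session once and reuse it, which lets requests keep the underlying connection alive. Here’s a minimal sketch of that variation:

session = HTMLSession()

def get_source(url):
    """Return the source code for the provided URL, reusing a single session."""

    try:
        return session.get(url)
    except requests.exceptions.RequestException as e:
        print(e)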

Parsing a page’s source using Requests-HTML

Requests-HTML includes a load of different functions to allow you to scrape the content from within a web page. It’s very similar to the popular Beautiful Soup package, in that it supports CSS and XPath selectors, as well as having various other features to extract common data.

Using find()

The find() function takes a CSS selector and returns any elements that match within the page. In the example below, we’re using the response’s html attribute and calling find() to look for the title element, returning only the first match. The .text attribute returns the text from within the tag, rather than the element itself.

title = response.html.find('title', first=True).text
title
'Sunray Marsden Hi Viz Shooting Head fly line review | Fly&Lure'

You can pass any CSS selector to find(), allowing you to scrape very specific elements from the page. Here, I’m searching for the .reading-time class and returning the text from within.

reading_time = response.html.find('.reading-time', first=True).text
reading_time
'Estimated reading time 6 - 10 minutes'

Using xpath()

XPath uses path expressions that point to specific nodes or node-sets within a document, allowing you to pinpoint exact parts of a page’s source. You can find an element’s XPath by using the inspect element feature in your web browser, then copy it into your code to extract the content you want.

canonical = response.html.xpath("//link[@rel='canonical']/@href")
canonical
['http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_marsden_hi_viz_shooting_head_fly_line_review']
description =  response.html.xpath("//meta[@name='description']/@content")
description
['The Sunray Marsden Hi Viz Shooting Head fly line is a floating weight forward fly line designed to help you get more distance with fewer false casts.']
author =  response.html.xpath("//meta[@name='author']/@content")
author
['Matt Clarke']

You can also use the xpath() approach to parse and scrape any Open Graph data present within the page. For demonstration purposes, here’s how you can extract the image, type, URL, title, and description from the Open Graph tags.

og_image =  response.html.xpath("//meta[@property='og:image']/@content")
og_image
['http://flyandlure.org/images/uploads/large/d4e48616862825ed7f72812355b66e14.jpg']
og_type =  response.html.xpath("//meta[@property='og:type']/@content")
og_type
['article']
og_url =  response.html.xpath("//meta[@property='og:url']/@content")
og_url
['http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_marsden_hi_viz_shooting_head_fly_line_review']
og_title =  response.html.xpath("//meta[@property='og:title']/@content")
og_title
['Sunray Marsden Hi Viz Shooting Head fly line review']
og_description =  response.html.xpath("//meta[@property='og:description']/@content")
og_description
['The Sunray Marsden Hi Viz Shooting Head fly line is a floating weight forward fly line designed to help you get more distance with fewer false casts.']

If you only want to scrape all the absolute URLs from a page you can use absolute_links. Simply run the code below and it will return a set containing every link found on the page.

links = response.html.absolute_links
links
{'http://flyandlure.org/about',
 'http://flyandlure.org/articles',
 'http://flyandlure.org/articles/fly_fishing',
 'http://flyandlure.org/articles/fly_fishing/15_tips_for_fly_fishing_with_boobies',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_december_2019',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_february_2020',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_january_2020',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_june_2020',
 'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_march_2020',
 'http://flyandlure.org/articles/fly_fishing/how_and_why_to_fish_wind_lanes_for_trout',
 'http://flyandlure.org/articles/fly_fishing/how_to_fish_the_diawl_bach_for_trout',
 'http://flyandlure.org/articles/fly_fishing_destinations',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/fulling_mill_fly_box_range_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/fulling_mill_grayling_jigs_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/fulling_mill_masterclass_tapered_leaders_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/fulling_mill_world_class_v2_fluorocarbon_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/guideline_exp5_fly_rod_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/hends_camou_french_leader_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/iain_barr_world_champions_choice_fly_selection_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/opst_commando_head_micro_series_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/orvis_clearwater_sink_tip_type_iii_fly_line_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/rio_two_tone_indicator_tippet_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_competition_float_fly_line_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_distance_intermediate_fly_line_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_el_guapo_streamer_fly_line_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/sunray_short_head_fly_line_review',
 'http://flyandlure.org/articles/fly_fishing_gear_reviews/vision_valu_fly_reel_review',
 'http://flyandlure.org/articles/fly_tying',
 'http://flyandlure.org/articles/lure_fishing',
 'http://flyandlure.org/articles/tagged/29/sunray',
 'http://flyandlure.org/articles/tagged/4/fly_lines',
 'http://flyandlure.org/contact',
 'http://flyandlure.org/copyright',
 'http://flyandlure.org/index.php',
 'http://flyandlure.org/listings',
 'http://flyandlure.org/listings/fly_fishing_clubs',
 'http://flyandlure.org/listings/fly_fishing_instructors',
 'http://flyandlure.org/listings/fly_fishing_shops',
 'http://flyandlure.org/listings/places_to_fly_fish',
 'http://flyandlure.org/privacy',
 'http://flyandlure.org/terms',
 'https://plus.google.com/share?url=http%3A%2F%2Fflyandlure.org%2Farticles%2Ffly_fishing_gear_reviews%2Fsunray_marsden_hi_viz_shooting_head_fly_line_review',
 'https://twitter.com/flyandlure',
 'https://twitter.com/intent/tweet?text=Sunray+Marsden+Hi+Viz+Shooting+Head+fly+line+review%20from%20@flyandlure&url=http%3A%2F%2Fflyandlure.org%2Farticles%2Ffly_fishing_gear_reviews%2Fsunray_marsden_hi_viz_shooting_head_fly_line_review',
 'https://www.facebook.com/flyandlure',
 'https://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fflyandlure.org%2Farticles%2Ffly_fishing_gear_reviews%2Fsunray_marsden_hi_viz_shooting_head_fly_line_review',
 'https://www.instagram.com/fly_and_lure/'}
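
In most cases you’ll only be interested in the internal links, so it can be handy to filter the set down to URLs on your own domain. Here’s a minimal sketch of one way to do that with urlparse from the standard library; the domain value is just an example. If you’d rather have the links exactly as they appear in the page’s href attributes, the related links attribute returns them unmodified, which often means relative URLs.

from urllib.parse import urlparse

domain = "flyandlure.org"
internal_links = {link for link in response.html.absolute_links
                  if urlparse(link).netloc == domain}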

Scrape an XML sitemap

Now we’ve covered the basics, we’ll create a function that uses requests_html to scrape the content of a regular urlset XML sitemap. As explained in my tutorial on scraping and parsing XML sitemaps using Python, these come in several forms, so you may need to adapt the code to suit the format of the sitemap you want to scrape.

The function starts off by calling our get_source() function to fetch the HTTP response object for the URL passed. It then uses html.find() to find all of the loc elements that contain the page URLs, loops through them, puts each one in a dictionary, and appends that to a Pandas dataframe, which it returns at the end.

def scrape_sitemap(url):
    """Scrape the contents of an XML sitemap and return the contents in a dataframe.

    Args:
        url (string): Absolute URL of urlset XML sitemap. 

    Returns: 
        df (dataframe): Pandas dataframe containing sitemap contents.  
    """

    df = pd.DataFrame(columns = ['url'])

    response = get_source(url)

    with response as r:
        urls = r.html.find("loc", first=False)

        for url in urls:        
            row = {'url': url.text}

            df = df.append(row, ignore_index=True)

    return df
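
One caveat: the append() method used above was deprecated and has since been removed in newer versions of Pandas. If the code fails for you, a sketch of an alternative is to collect the rows in a list and build the dataframe in a single step at the end, which is also faster:

def scrape_sitemap(url):
    """Scrape the contents of an XML sitemap and return the contents in a dataframe."""

    response = get_source(url)
    rows = []

    with response as r:
        # Each <loc> element holds one page URL
        for loc in r.html.find("loc", first=False):
            rows.append({'url': loc.text})

    return pd.DataFrame(rows, columns=['url'])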

Running the function on the sitemap for my other website quickly returns a dataframe containing all the site’s URLs, which I’ve saved to a CSV file using to_csv() so it can be reused elsewhere. Finally, printing the tail() shows that my other site now has 1673 pages listed in its sitemap.

df = scrape_sitemap("http://flyandlure.org/sitemap.xml")
df.to_csv("sitemap.csv", index=False)
df.tail(10)
url
1664 http://flyandlure.org/listings/places_to_fly_f...
1665 http://flyandlure.org/listings/places_to_fly_f...
1666 http://flyandlure.org/listings/places_to_fly_f...
1667 http://flyandlure.org/listings/places_to_fly_f...
1668 http://flyandlure.org/listings/places_to_fly_f...
1669 http://flyandlure.org/listings/places_to_fly_f...
1670 http://flyandlure.org/listings/places_to_fly_f...
1671 http://flyandlure.org/listings/places_to_fly_f...
1672 http://flyandlure.org/listings/places_to_fly_f...
1673 http://flyandlure.org/listings/places_to_fly_f...

Scrape all the pages on a site

Next, we’ll use our code snippets from above to iterate over all the URLs in our sitemap dataframe, scrape the content from each page, and store it in a new dataframe. There’s nothing new here. We’re just running the snippets above and assigning their outputs to a row dictionary that we append to the df_pages dataframe at the end of each iteration.

def scrape_site(df, url='url'):
    """Scrapes every page in a Pandas dataframe column. 

    Args:
        df: Pandas dataframe containing the URL list.
        url (optional, string): Optional name of URL column, if not 'url'

    Returns:
        df: Pandas dataframe containing all scraped content.
    """

    df_pages = pd.DataFrame(columns = ['url', 'title', 'description'])

    for index, row in df.iterrows(): 

        response = get_source(row[url])

        with response as r:

            row = {
                'url': row[url],
                'title': r.html.find('title', first=True).text,
                'description': r.html.xpath('//meta[@name="description"]/@content'),
                'type': r.html.xpath("//meta[@property='og:type']/@content"),
                'author': r.html.xpath("//meta[@name='author']/@content"),
                'image': r.html.xpath("//meta[@property='og:image']/@content"),
                'reading_time': r.html.find('.reading-time', first=True),
            }

            df_pages = df_pages.append(row, ignore_index=True)

    return df_pages
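
One thing to watch: unlike the earlier title example, the .reading-time lookup above is stored without .text, because calling .text on a page that has no matching element would raise an AttributeError, since find() returns None when there’s no match. If you want the text itself, a small helper along these lines keeps the loop from crashing on pages that lack the element; the find_text name is just an illustration.

def find_text(r, selector):
    """Return the text of the first element matching the CSS selector, or None."""
    element = r.html.find(selector, first=True)
    return element.text if element else None

# For example, in the row dictionary: 'reading_time': find_text(r, '.reading-time')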

After five minutes of scraping, requests_html has visited each of the 1673 URLs on my site and scraped the selected content from them. As with the previous example, I’ve saved the output to a CSV file so I can use it elsewhere, and I’ve printed the last 10 rows using the Pandas tail() function.

df_pages = scrape_site(df)
df_pages.to_csv("pages.csv", index=False)
df_pages.tail(10)
url title description author image reading_time type
1664 http://flyandlure.org/listings/places_to_fly_f... Paper Mill Fishery, Swansea, Wales | Fly&Lure [Paper Mill Fishery, Swansea, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1665 http://flyandlure.org/listings/places_to_fly_f... Shimano Felindre Big Fish Water, Swansea, Wale... [Shimano Felindre Big Fish Water, Swansea, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1666 http://flyandlure.org/listings/places_to_fly_f... Llandegfedd Reservoir, Torfaen, Wales | Fly&Lure [Llandegfedd Reservoir, Torfaen, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1667 http://flyandlure.org/listings/places_to_fly_f... Dyffryn Springs , Vale of Glamorgan, Wales | F... [Dyffryn Springs , Vale of Glamorgan, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1668 http://flyandlure.org/listings/places_to_fly_f... Chirk Fishery, Wrexham, Wales | Fly&Lure [Chirk Fishery, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1669 http://flyandlure.org/listings/places_to_fly_f... Llandegla Trout Fishery, Wrexham, Wales | Fly&... [Llandegla Trout Fishery, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1670 http://flyandlure.org/listings/places_to_fly_f... Penycae Lower Reservoir, Wrexham, Wales | Fly&... [Penycae Lower Reservoir, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1671 http://flyandlure.org/listings/places_to_fly_f... Penycae Upper Reservoir, Wrexham, Wales | Fly&... [Penycae Upper Reservoir, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1672 http://flyandlure.org/listings/places_to_fly_f... Tree Tops Fly Fishery, Wrexham, Wales | Fly&Lure [Tree Tops Fly Fishery, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]
1673 http://flyandlure.org/listings/places_to_fly_f... Ty Mawr Reservoir, Wrexham, Wales | Fly&Lure [Ty Mawr Reservoir, Wrexham, Wales] [Matt Clarke] [https://maps.googleapis.com/maps/api/staticma... None [article]

Matt Clarke, Friday, March 12, 2021
