How to scrape a site's page titles and meta descriptions

Learn how to use web scraping tools to fetch a site's pages, parse out the page titles and meta descriptions, and return the output in a Pandas dataframe.

Scraping the titles and meta descriptions from every page on a site can tell you a great deal about its content, its underlying content strategy, and its product range. Whether you’re examining your own site or those of your competitors, it’s worth learning some basic web scraping skills to fetch this useful data.

In this project, I’ll show you how to build a simple scraper using urllib, Beautiful Soup, and Pandas that scrapes and parses every page on a website and returns the results in a Pandas dataframe. Here’s how it’s done.

Load the packages

First, open a Python script or Jupyter notebook and import the pandas, urllib and BeautifulSoup packages. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.
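For example, the two third-party packages used here can be installed like this (urllib is part of the Python standard library, so it needs no installation):

pip3 install pandas beautifulsoup4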

import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup

Load the data

Next, load up the list of URLs you want to scrape. I’m assuming that you already have these stored in a CSV file that you can load into Pandas. However, if you need to construct the URL list, check out my guide to parsing XML sitemaps, which explains how you can obtain the URL for every page on a site.
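As a rough alternative to a pre-made CSV, a sketch like the one below can build the URL list directly from an XML sitemap. It assumes the sitemap lives at /sitemap.xml, and that lxml is installed so Beautiful Soup can use its xml parser; the URL here is just a placeholder.

# Hypothetical alternative: build the URL list from an XML sitemap.
# Assumes the sitemap is at /sitemap.xml and lxml is installed.
sitemap = urllib.request.urlopen('https://example.com/sitemap.xml')
locs = BeautifulSoup(sitemap, 'xml').find_all('loc')
df = pd.DataFrame({'loc': [loc.text for loc in locs]})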

df = pd.read_csv('sitemap.csv')
df = df[['loc']]
df.head()
loc
0 https://themarket.co.uk
1 https://themarket.co.uk/
2 https://themarket.co.uk/auctions/coming-soon
3 https://themarket.co.uk/auctions/live
4 https://themarket.co.uk/auctions/no-reserve
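Note that the sample above includes both https://themarket.co.uk and https://themarket.co.uk/, which are almost certainly the same page. If your list contains near-duplicates like this, you may want to normalise and deduplicate the URLs first. Whether stripping trailing slashes is safe depends on the site’s URL scheme, so treat this as an optional sketch:

# Optional: normalise trailing slashes and drop duplicate URLs.
df['loc'] = df['loc'].str.rstrip('/')
df = df.drop_duplicates(subset='loc').reset_index(drop=True)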

Scrape the URL

First, we’ll create a simple function to scrape the content of a URL and return its parsed HTML source. There are lots of different ways to perform web scraping tasks in Python. For larger projects, I’d highly recommend Scrapy, as it supports threading and is much quicker. However, for smaller projects, such as scraping your own site, requests, urllib, and Beautiful Soup are fine.

In the below function, I’ve used urlopen() from urllib.request to open an HTTP connection to the page. I’ve passed that response to Beautiful Soup and used html.parser to parse the source code. Depending on the site, you may also need to obtain the page’s character encoding and pass that to Beautiful Soup for things to work seamlessly, which is what the from_encoding argument handles. The function returns the parsed page as a BeautifulSoup object called soup.

def get_page(url):
    """Scrape a URL and return the parsed HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (BeautifulSoup): Parsed HTML of the scraped page.
    """

    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))

    return soup

soup = get_page("https://themarket.co.uk")
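One caveat: some sites block requests that use urllib’s default user agent. If you get HTTP 403 errors, a variant of get_page() that sends a browser-style User-Agent header usually helps. This is only a sketch, and the header value below is an arbitrary example:

def get_page_with_headers(url):
    """Variant of get_page() that sends a custom User-Agent header."""

    request = urllib.request.Request(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (compatible; meta-scraper/1.0)'}
    )
    response = urllib.request.urlopen(request)
    return BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))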

Parse the meta data

Next, we’ll create a function to parse the soup returned by Beautiful Soup. We’ll run Beautiful Soup’s find() function on this to locate the meta name="description" element and extract the content attribute from within. If a page has no meta description, the function returns None.

def get_description(soup):
    """Return the meta description content.

    Args:
        soup (BeautifulSoup): Parsed HTML from Beautiful Soup.

    Returns:
        value (string): Parsed value, or None if no description is found.
    """

    description = soup.find("meta", attrs={"name": "description"})
    if description:
        return description.get("content")
    return None

meta = get_description(soup)
meta
'The Market Collectable Car Auctions No buyer fees, just 5% + VAT seller fees, see how much more we return. 90% sale rate in 2020. Signup for our weekly email.'
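Not every page uses a standard meta description; some only set the Open Graph equivalent. If you want a fallback, a variant of get_description() could also check the og:description property. This variant is my own addition, not part of the original function:

def get_description_with_fallback(soup):
    """Variant of get_description() that falls back to og:description."""

    description = soup.find("meta", attrs={"name": "description"})
    if description is None:
        description = soup.find("meta", attrs={"property": "og:description"})
    if description:
        return description.get("content")
    return None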

Fetch the page title

We can now repeat this process, using Beautiful Soup to parse the soup again and extract the title element from the page. This returns the title string for each page examined.

def get_title(soup):
    """Return the page title.

    Args:
        soup (BeautifulSoup): Parsed HTML from Beautiful Soup.

    Returns:
        value (string): Parsed value, or None if no title is found.
    """

    title = soup.find("title")
    if title:
        return title.string
    return None

title = get_title(soup)
title
'Classic and Collectable Car Auctions: Cars for Sale'

Fetch all the meta data on a site

Finally, we can put all the steps together. We’ll loop through the rows in the dataframe using iterrows(), scrape each page’s HTML, and parse the content to return the title and description. We’ll collect the results in a list of dictionaries and then build a Pandas dataframe called df_pages to store the url, title, and description of each page.

pages = []

for index, row in df.iterrows():

    soup = get_page(row['loc'])
    title = get_title(soup)
    description = get_description(soup)

    pages.append({
        'url': row['loc'],
        'title': title,
        'description': description
    })

df_pages = pd.DataFrame(pages, columns=['url', 'title', 'description'])
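Be aware that urlopen() raises an exception for any page that times out or returns an error status, which would stop the loop partway through a larger site. A hypothetical wrapper like get_page_safe() below returns None on failure instead, so you can swap it in for get_page() and skip any rows where it returns None:

def get_page_safe(url):
    """Hypothetical wrapper around get_page() that returns None on failure
    instead of raising, so one bad URL doesn't stop a long crawl."""

    try:
        return get_page(url)
    except Exception as e:
        print('Failed to fetch', url, '-', e)
        return None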

After a few minutes (depending on the size of the site), we get back a Pandas dataframe containing all the data we need for our analysis.

df_pages.head()
url title description
0 https://themarket.co.uk Classic and Collectable Car Auctions: Cars for... The Market Collectable Car Auctions No buyer f...
1 https://themarket.co.uk/ Classic and Collectable Car Auctions: Cars for... The Market Collectable Car Auctions No buyer f...
2 https://themarket.co.uk/auctions/coming-soon Classic and Collectable Car Auctions: Cars for... Search results Upcoming Auctions
3 https://themarket.co.uk/auctions/live Classic and Collectable Car Auctions: Cars for... Search Results Live Listings: Classic Cars for...
4 https://themarket.co.uk/auctions/no-reserve Classic and Collectable Car Auctions: Cars for... Search results No Reserve Listings: Classic Ca...
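Because titles beyond roughly 60 characters and descriptions beyond roughly 160 characters tend to be truncated in search results, it can be worth recording their lengths before writing everything out to a CSV for later analysis. The column and file names below are my own, hypothetical choices:

# Record string lengths for a quick SEO sanity check, then save.
df_pages['title_length'] = df_pages['title'].str.len()
df_pages['description_length'] = df_pages['description'].str.len()
df_pages.to_csv('page-titles-and-descriptions.csv', index=False)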

Matt Clarke, Friday, March 12, 2021
