Many websites include Open Graph protocol data in their document head. This structured data allows social networks, such as Facebook and Twitter, to access specific elements of the page’s content to improve the quality of tweets and shares.
Open Graph protocol data is also very useful for web scraping, as it allows you to easily extract the key elements of any page on a site, such as its title, description, image, and even other media elements such as videos and audio files. Here’s how you can build a web scraper to extract it from a site.
Open a Python script or Jupyter notebook and import the `pandas`, `urllib`, and `bs4` packages. We’ll be using Pandas for manipulating our data, `urllib` for fetching the HTML of each page, and the `BeautifulSoup` class from `bs4` for parsing the HTML.
```python
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
```
Next, load up a Pandas dataframe containing the URLs you want to scrape. If you want to obtain a list of all the URLs on a site, check out my guide to parsing and scraping XML sitemaps, which explains how this is done.
```python
df = pd.read_csv('sitemap.csv')
df = df[['loc']]
df.head()
```

|   | loc |
|---|-----|
| 0 | https://themarket.co.uk |
| 1 | https://themarket.co.uk/ |
| 2 | https://themarket.co.uk/auctions/coming-soon |
| 3 | https://themarket.co.uk/auctions/live |
| 4 | https://themarket.co.uk/auctions/no-reserve |
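If you don’t have a sitemap export to hand, the same single-column dataframe can be built from a plain Python list of URLs. A minimal sketch, where the URLs are illustrative placeholders:

```python
import pandas as pd

# Build the same single-column dataframe from a hand-made list of URLs.
# The URLs below are illustrative placeholders.
urls = [
    "https://example.com/",
    "https://example.com/about",
]
df = pd.DataFrame({"loc": urls})
print(df.shape)  # (2, 1)
```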
The first step in web scraping is to fetch the source code of the page you want to scrape and parse. There are many packages available to do this in Python. Scrapy would be my recommendation for larger projects, but `requests` and `urllib` work fine for simple tasks like this.
The function below takes a `url` and uses `urlopen()` to grab the HTTP response. Using this object, we determine the correct character encoding used on the page and pass the response to Beautiful Soup, which returns the parsed HTML for the page.
```python
def get_page(url):
    """Scrapes a URL and returns the parsed HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (BeautifulSoup): Parsed HTML of the scraped page.
    """

    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
    return soup
```
```python
soup = get_page("https://www.bbc.co.uk/news/av/uk-politics-44820849")
```
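One caveat: some servers reject requests that use urllib’s default user agent. If you hit HTTP 403 errors, a variant of `get_page()` that sends a browser-like User-Agent header usually helps. A sketch, where the function name and header string are illustrative choices, not fixed requirements:

```python
import urllib.request
from bs4 import BeautifulSoup

def get_page_with_headers(url):
    """Scrape a URL, sending an explicit User-Agent header.

    Some sites return 403 Forbidden for urllib's default user agent,
    so we identify ourselves explicitly. The header value below is an
    illustrative placeholder, not a required string.
    """

    request = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; og-scraper/0.1)"},
    )
    response = urllib.request.urlopen(request)
    return BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
```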
Parsing the Open Graph protocol data from within the `soup` HTML returned by Beautiful Soup is pretty straightforward. To keep things simple, I’ve created a function for each Open Graph element we want to scrape. These all work in the same way, but look for a different `meta` `property` value using Beautiful Soup’s `find()` function, and return the `content` attribute within the tag.
```python
def get_og_title(soup):
    """Return the Open Graph title.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:title"):
        return soup.find("meta", property="og:title")["content"]
    return None
```

```python
og_title = get_og_title(soup)
og_title
```

'Trump baby blimp launched in London'
```python
def get_og_locale(soup):
    """Return the Open Graph locale.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:locale"):
        return soup.find("meta", property="og:locale")["content"]
    return None
```

```python
og_locale = get_og_locale(soup)
og_locale
```

'en_GB'
```python
def get_og_description(soup):
    """Return the Open Graph description.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:description"):
        return soup.find("meta", property="og:description")["content"]
    return None
```

```python
og_description = get_og_description(soup)
og_description
```

'A giant blimp of Donald Trump as a baby is floating above central London.'
```python
def get_og_site_name(soup):
    """Return the Open Graph site name.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:site_name"):
        return soup.find("meta", property="og:site_name")["content"]
    return None
```

```python
og_site_name = get_og_site_name(soup)
og_site_name
```

'BBC News'
```python
def get_og_image(soup):
    """Return the Open Graph image.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:image"):
        return soup.find("meta", property="og:image")["content"]
    return None
```

```python
og_image = get_og_image(soup)
og_image
```

'https://ichef.bbci.co.uk/images/ic/400xn/p06dmz9z.jpg'
```python
def get_og_url(soup):
    """Return the Open Graph URL.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:url"):
        return soup.find("meta", property="og:url")["content"]
    return None
```

```python
og_url = get_og_url(soup)
og_url
```

'https://www.bbc.co.uk/news/av/uk-politics-44820849'
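Since the six functions above differ only in the property they look up, they could be collapsed into a single helper that takes the property name as an argument. A sketch of that refactor, where `get_og_property` is a hypothetical name and the HTML snippet is a made-up example:

```python
from bs4 import BeautifulSoup

def get_og_property(soup, property_name):
    """Return the content of any Open Graph meta tag.

    Args:
        soup: HTML from Beautiful Soup.
        property_name (string): Full property name, e.g. 'og:title'.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    tag = soup.find("meta", property=property_name)
    if tag and tag.has_attr("content"):
        return tag["content"]
    return None

# Quick demonstration against a made-up document head.
html = '<head><meta property="og:title" content="Example title" /></head>'
soup_example = BeautifulSoup(html, "html.parser")
print(get_og_property(soup_example, "og:title"))  # Example title
print(get_og_property(soup_example, "og:image"))  # None
```

One `find()` call per property keeps each page lookup to a single pass, and the `None` fallback matches the behaviour of the individual functions above.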
Finally, we can put these all together. First we’ll create a Pandas dataframe including a column for each of the Open Graph values we’re going to scrape. Then we’ll loop through each `loc` URL in our dataframe, parse the content from the `soup` HTML using the functions above, and append each page of data to the dataframe.
```python
df_pages = pd.DataFrame(columns=['og_title', 'og_description', 'og_image',
                                 'og_site_name', 'og_locale'])

for index, row in df.iterrows():
    soup = get_page(row['loc'])
    og_title = get_og_title(soup)
    og_description = get_og_description(soup)
    og_image = get_og_image(soup)
    og_site_name = get_og_site_name(soup)

    page = {
        'url': row['loc'],
        'og_title': og_title,
        'og_description': og_description,
        'og_image': og_image,
        'og_site_name': og_site_name,
    }

    # DataFrame.append() was removed in pandas 2.0, so use concat() instead
    df_pages = pd.concat([df_pages, pd.DataFrame([page])], ignore_index=True)
```
Our final dataframe includes the Open Graph data for every page in the original dataframe of URLs. Some Open Graph attributes aren’t present on the site scraped, so there are some `None` and `NaN` values, but we’ve got loads of data to work with for minimal effort.
```python
df_pages.head()
```

|   | og_title | og_description | og_image | og_site_name | og_locale | url |
|---|---|---|---|---|---|---|
| 0 | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk |
| 1 | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/ |
| 2 | Classic and Collectable Car Auctions: Cars for... | Search results Upcoming Auctions | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/coming-soon |
| 3 | Classic and Collectable Car Auctions: Cars for... | Search Results Live Listings: Classic Cars for... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/live |
| 4 | Classic and Collectable Car Auctions: Cars for... | Search results No Reserve Listings: Classic Ca... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/no-reserve |
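It’s worth persisting the results once the loop has run, so the crawl doesn’t need repeating. A minimal sketch, where `og_data.csv` is an illustrative filename and the stand-in dataframe replaces the real `df_pages` from above:

```python
import pandas as pd

# A small stand-in dataframe; in practice this would be df_pages from above.
df_pages = pd.DataFrame([
    {"url": "https://example.com/", "og_title": "Example", "og_site_name": None},
])

# Persist the scraped data to disk.
df_pages.to_csv("og_data.csv", index=False)

# Reload to confirm the round trip worked.
reloaded = pd.read_csv("og_data.csv")
print(len(reloaded))  # 1
```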
Matt Clarke, Friday, March 12, 2021