Many websites include Open Graph protocol data in their document head. This structured data allows social networks, such as Facebook and Twitter, to access specific elements of the page’s content to improve the quality of tweets and shares.
Open Graph protocol data is also very useful for web scraping, as it allows you to easily extract the key elements of any page on a site, such as its title, description, image, and even other media elements such as videos and audio files. Here’s how you can build a web scraper to extract it from a site.
Open a Python script or Jupyter notebook and import the `pandas`, `urllib`, and `bs4` packages. We’ll be using Pandas for manipulating our data, `urllib` for fetching the HTML of each page, and the `BeautifulSoup` class from `bs4` for parsing the HTML.
```python
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
```
Next, load up a Pandas dataframe containing the URLs you want to scrape. If you want to obtain a list of all the URLs on a site, check out my guide to parsing and scraping XML sitemaps, which explains how this is done.
```python
df = pd.read_csv('sitemap.csv')
df = df[['loc']]
df.head()
```

|   | loc |
|---|-----|
| 0 | https://themarket.co.uk |
| 1 | https://themarket.co.uk/ |
| 2 | https://themarket.co.uk/auctions/coming-soon |
| 3 | https://themarket.co.uk/auctions/live |
| 4 | https://themarket.co.uk/auctions/no-reserve |
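If you don’t have a sitemap export to hand, the same single-column dataframe can be built from a plain Python list of URLs. A minimal sketch, where the URLs are illustrative placeholders:

```python
import pandas as pd

# Build the same single-column dataframe from a hand-made list of URLs.
# The URLs below are illustrative placeholders.
urls = [
    "https://example.com/",
    "https://example.com/about",
]
df = pd.DataFrame({"loc": urls})
print(df.shape)  # (2, 1)
```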
The first step in web scraping is to fetch the source code of the page you want to scrape and parse. There are many packages available to do this in Python. Scrapy would be my recommendation for larger projects, but `requests` and `urllib` work fine for simple tasks like this.
The function below takes a `url` and uses `urlopen()` to grab the HTTP response. Using this object, we determine the correct character encoding used on the page and pass the response to Beautiful Soup, which returns the parsed HTML for the page.
```python
def get_page(url):
    """Scrapes a URL and returns the parsed HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (BeautifulSoup): Parsed HTML of the scraped page.
    """

    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
    return soup
```
```python
soup = get_page("https://www.bbc.co.uk/news/av/uk-politics-44820849")
```
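One caveat: some servers reject requests that use urllib’s default user agent. If you hit HTTP 403 errors, a variant of `get_page()` that sends a browser-like User-Agent header usually helps. A sketch, where the function name and header string are illustrative choices, not fixed requirements:

```python
import urllib.request
from bs4 import BeautifulSoup

def get_page_with_headers(url):
    """Scrape a URL, sending an explicit User-Agent header.

    Some sites return 403 Forbidden for urllib's default user agent,
    so we identify ourselves explicitly. The header value below is an
    illustrative placeholder, not a required string.
    """

    request = urllib.request.Request(
        url,
        headers={"User-Agent": "Mozilla/5.0 (compatible; og-scraper/0.1)"},
    )
    response = urllib.request.urlopen(request)
    return BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
```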
Parsing the Open Graph protocol data from within the `soup` HTML returned by Beautiful Soup is pretty straightforward. To keep things simple, I’ve created a function for each Open Graph element we want to scrape. These all work in the same way, but look for a different `meta` `property` value using Beautiful Soup’s `find()` function, and return the `content` attribute within the tag.
```python
def get_og_title(soup):
    """Return the Open Graph title.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:title"):
        return soup.find("meta", property="og:title")["content"]
    return None
```

```python
og_title = get_og_title(soup)
og_title
```

'Trump baby blimp launched in London'
```python
def get_og_locale(soup):
    """Return the Open Graph locale.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:locale"):
        return soup.find("meta", property="og:locale")["content"]
    return None
```

```python
og_locale = get_og_locale(soup)
og_locale
```

'en_GB'
```python
def get_og_description(soup):
    """Return the Open Graph description.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:description"):
        return soup.find("meta", property="og:description")["content"]
    return None
```

```python
og_description = get_og_description(soup)
og_description
```

'A giant blimp of Donald Trump as a baby is floating above central London.'
```python
def get_og_site_name(soup):
    """Return the Open Graph site name.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:site_name"):
        return soup.find("meta", property="og:site_name")["content"]
    return None
```

```python
og_site_name = get_og_site_name(soup)
og_site_name
```

'BBC News'
```python
def get_og_image(soup):
    """Return the Open Graph image.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:image"):
        return soup.find("meta", property="og:image")["content"]
    return None
```

```python
og_image = get_og_image(soup)
og_image
```

'https://ichef.bbci.co.uk/images/ic/400xn/p06dmz9z.jpg'
```python
def get_og_url(soup):
    """Return the Open Graph URL.

    Args:
        soup: HTML from Beautiful Soup.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    if soup.find("meta", property="og:url"):
        return soup.find("meta", property="og:url")["content"]
    return None
```

```python
og_url = get_og_url(soup)
og_url
```

'https://www.bbc.co.uk/news/av/uk-politics-44820849'
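Since the six functions above differ only in the property they look up, they could be collapsed into a single helper that takes the property name as an argument. A sketch of that refactor, where `get_og_property` is a hypothetical name and the HTML snippet is a made-up example:

```python
from bs4 import BeautifulSoup

def get_og_property(soup, property_name):
    """Return the content of any Open Graph meta tag.

    Args:
        soup: HTML from Beautiful Soup.
        property_name (string): Full property name, e.g. 'og:title'.

    Returns:
        value: Parsed content, or None if the tag is absent.
    """

    tag = soup.find("meta", property=property_name)
    if tag and tag.has_attr("content"):
        return tag["content"]
    return None

# Quick demonstration against a made-up document head.
html = '<head><meta property="og:title" content="Example title" /></head>'
soup_example = BeautifulSoup(html, "html.parser")
print(get_og_property(soup_example, "og:title"))  # Example title
print(get_og_property(soup_example, "og:image"))  # None
```

One `find()` call per property keeps each page lookup to a single pass, and the `None` fallback matches the behaviour of the individual functions above.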
Finally, we can put these all together. First we’ll create a Pandas dataframe including a column for each of the Open Graph values we’re going to scrape. Then we’ll loop through each `loc` URL in our dataframe, parse the content from the `soup` HTML using the functions above, and append each page of data to the dataframe.
```python
df_pages = pd.DataFrame(columns=['og_title', 'og_description', 'og_image',
                                 'og_site_name', 'og_locale'])

for index, row in df.iterrows():
    soup = get_page(row['loc'])
    og_title = get_og_title(soup)
    og_description = get_og_description(soup)
    og_image = get_og_image(soup)
    og_site_name = get_og_site_name(soup)

    page = {
        'url': row['loc'],
        'og_title': og_title,
        'og_description': og_description,
        'og_image': og_image,
        'og_site_name': og_site_name,
    }

    # DataFrame.append() was removed in pandas 2.0, so use concat() instead
    df_pages = pd.concat([df_pages, pd.DataFrame([page])], ignore_index=True)
```
Our final dataframe includes the Open Graph data for every page in the original dataframe of URLs. Some Open Graph attributes aren’t present on the site scraped, so there are some `None` and `NaN` values, but we’ve got loads of data to work with for minimal effort.
```python
df_pages.head()
```

|   | og_title | og_description | og_image | og_site_name | og_locale | url |
|---|---|---|---|---|---|---|
| 0 | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk |
| 1 | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/ |
| 2 | Classic and Collectable Car Auctions: Cars for... | Search results Upcoming Auctions | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/coming-soon |
| 3 | Classic and Collectable Car Auctions: Cars for... | Search Results Live Listings: Classic Cars for... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/live |
| 4 | Classic and Collectable Car Auctions: Cars for... | Search results No Reserve Listings: Classic Ca... | https://themarket.co.uk/assets/img/apple-touch... | None | NaN | https://themarket.co.uk/auctions/no-reserve |
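It’s worth persisting the results once the loop has run, so the crawl doesn’t need repeating. A minimal sketch, where `og_data.csv` is an illustrative filename and the stand-in dataframe replaces the real `df_pages` from above:

```python
import pandas as pd

# A small stand-in dataframe; in practice this would be df_pages from above.
df_pages = pd.DataFrame([
    {"url": "https://example.com/", "og_title": "Example", "og_site_name": None},
])

# Persist the scraped data to disk.
df_pages.to_csv("og_data.csv", index=False)

# Reload to confirm the round trip worked.
reloaded = pd.read_csv("og_data.csv")
print(len(reloaded))  # 1
```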
Matt Clarke, Friday, March 12, 2021