Scraping the titles and meta descriptions from every page on a site can tell you a great deal about its content, its underlying content strategy, and its product range, among other things. Whether you’re examining your own site or those of your competitors, it’s worth learning some basic web scraping skills to fetch this useful data.
In this project, I’ll show you how to create a simple scraper using urllib, Beautiful Soup, and pandas to scrape and parse every page on a website and return the information in a Pandas dataframe. Here’s how it’s done.
First, open a Python script or Jupyter notebook and import the pandas, urllib, and BeautifulSoup packages. Any packages you don’t have can be installed by typing pip3 install package-name in your terminal.
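Of the three, only two need installing, since urllib is part of the Python standard library. Assuming you have neither, something like the below should cover it (beautifulsoup4 is the name Beautiful Soup is distributed under on PyPI):

pip3 install pandas beautifulsoup4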
import pandas as pd
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
Next, load up the list of URLs you want to scrape. I’m assuming that you already have these stored in a CSV file that you can load into Pandas. However, if you need to construct the URL list, check out my guide to parsing XML sitemaps, which explains how you can obtain the URL for every page on a site.
df = pd.read_csv('sitemap.csv')
df = df[['loc']]
df.head()
| | loc |
|---|---|
| 0 | https://themarket.co.uk |
| 1 | https://themarket.co.uk/ |
| 2 | https://themarket.co.uk/auctions/coming-soon |
| 3 | https://themarket.co.uk/auctions/live |
| 4 | https://themarket.co.uk/auctions/no-reserve |
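If you don’t have a CSV to hand, one quick alternative (a shortcut of my own, not from the sitemap guide above) is pandas’ read_xml() function, which can usually read a standard XML sitemap directly. The sitemap URL below is purely a hypothetical example, and you’ll need lxml installed for this to work.

# Load the URLs straight from an XML sitemap (hypothetical sitemap URL)
df = pd.read_xml('https://example.com/sitemap.xml')
df = df[['loc']]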
First, we’ll create a simple function to scrape the content of a URL and return the HTML source code within. There are lots of different ways to perform web scraping tasks in Python. For larger projects, I’d highly recommend using Scrapy, as it supports threading and is much quicker. However, for smaller projects, such as scraping your own site, requests, urllib, and BeautifulSoup are fine.
In the below function, I’ve used urlopen() from urllib.request to open an HTTP connection to the page. I’ve passed that response to Beautiful Soup and used the html.parser to parse the source code. Depending on the site, you may also need to obtain the page’s character encoding and pass that to Beautiful Soup for things to work seamlessly. The function returns the parsed page in a variable called soup.
def get_page(url):
    """Scrapes a URL and returns the parsed HTML source.

    Args:
        url (string): Fully qualified URL of a page.

    Returns:
        soup (BeautifulSoup): Parsed HTML source of scraped page.
    """

    # Open the URL and pass the response to Beautiful Soup, along with the
    # page's character encoding so the markup gets decoded correctly
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response,
                         'html.parser',
                         from_encoding=response.info().get_param('charset'))
    return soup
soup = get_page("https://themarket.co.uk")
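One caveat: urlopen() raises an exception, such as urllib.error.HTTPError or urllib.error.URLError, when a page can’t be fetched, which would halt a crawl partway through. If you want failures to be skipped rather than fatal, you could wrap the call in a try/except. Here’s a minimal sketch (get_page_safe is a hypothetical helper of my own, not part of the original code):

import urllib.error

def get_page_safe(url):
    """Fetch and parse a page, returning None if the request fails."""
    try:
        return get_page(url)
    except (urllib.error.HTTPError, urllib.error.URLError):
        # Skip pages that error out or can't be reached
        return None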
Next, we’ll create a function to parse the soup returned by Beautiful Soup. We’ll use Beautiful Soup’s find_all() function to check whether a meta name="description" element is present, then use find() to extract the content from within it.
def get_description(soup):
    """Return the meta description content.

    Args:
        soup: Parsed HTML from Beautiful Soup.

    Returns:
        value (string): Parsed value.
    """

    # Return the content of the first meta description, if one exists
    if soup.find_all("meta", attrs={"name": "description"}):
        return soup.find("meta", attrs={"name": "description"}).get("content")
    else:
        return
meta = get_description(soup)
meta
'The Market Collectable Car Auctions No buyer fees, just 5% + VAT seller fees, see how much more we return. 90% sale rate in 2020. Signup for our weekly email.'
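Not every page defines a meta description, which is why the function returns None when the tag is absent. You can confirm the behaviour with a made-up fragment of HTML:

# A hypothetical page with no meta description returns None
empty_soup = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
print(get_description(empty_soup))  # None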
We can now repeat this process by using Beautiful Soup to parse the HTML soup again and extract the title element from the page. This returns the title string for each page examined.
def get_title(soup):
    """Return the page title.

    Args:
        soup: Parsed HTML from Beautiful Soup.

    Returns:
        value (string): Parsed value.
    """

    # Return the text of the title element, if one exists
    if soup.find_all("title"):
        return soup.find("title").string
    else:
        return
title = get_title(soup)
title
'Classic and Collectable Car Auctions: Cars for Sale'
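The same pattern works for any other head element you might want to audit. As a sketch, here’s a generic helper of my own (get_meta_content is hypothetical, not part of the original tutorial) that returns the content of any named meta tag:

def get_meta_content(soup, name):
    """Return the content of the named meta tag, or None if it's absent."""
    tag = soup.find("meta", attrs={"name": name})
    return tag.get("content") if tag else None

# For example, check each page's indexing directives
get_meta_content(soup, "robots")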
Finally, we can put all the steps together. We’ll create a Pandas dataframe called df_pages to store the url, title, and description of each page. Then, we’ll loop through the rows in the dataframe using iterrows(), scrape each page’s HTML, and parse the content to return the title and description. We’ll then store these in our dataframe.
df_pages = pd.DataFrame(columns=['url', 'title', 'description'])

for index, row in df.iterrows():

    # Scrape each page and parse out its title and meta description
    soup = get_page(row['loc'])
    title = get_title(soup)
    description = get_description(soup)

    page = pd.DataFrame([{
        'url': row['loc'],
        'title': title,
        'description': description
    }])

    # DataFrame.append() was removed in pandas 2.0, so use concat instead
    df_pages = pd.concat([df_pages, page], ignore_index=True)
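If you’re scraping a site you don’t own, it’s also polite to pause between requests so you don’t hammer the server. A simple tweak, and one I’m adding myself here, is a short delay inside the loop using time.sleep():

import time

for index, row in df.iterrows():
    soup = get_page(row['loc'])
    # ...parse and store the results as above...
    time.sleep(1)  # wait a second between requests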
After a few minutes (depending on the size of the site), we get back a Pandas dataframe containing all the data we need for our analysis.
df_pages.head()
| | url | title | description |
|---|---|---|---|
| 0 | https://themarket.co.uk | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... |
| 1 | https://themarket.co.uk/ | Classic and Collectable Car Auctions: Cars for... | The Market Collectable Car Auctions No buyer f... |
| 2 | https://themarket.co.uk/auctions/coming-soon | Classic and Collectable Car Auctions: Cars for... | Search results Upcoming Auctions |
| 3 | https://themarket.co.uk/auctions/live | Classic and Collectable Car Auctions: Cars for... | Search Results Live Listings: Classic Cars for... |
| 4 | https://themarket.co.uk/auctions/no-reserve | Classic and Collectable Car Auctions: Cars for... | Search results No Reserve Listings: Classic Ca... |
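From here, you may want to save the dataframe so you can analyse the data elsewhere. A one-liner does it (pages.csv is just an example filename):

df_pages.to_csv('pages.csv', index=False)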
Matt Clarke, Friday, March 12, 2021