RSS feeds have been a mainstay on the web for over 20 years now. These XML-based documents are generated by web servers and designed to be read in RSS feed readers to allow readers to be kept up-to-date with any new posts added to the feed, without the need for the user to visit the site. They’re a great way to keep up with content across numerous sites.
From a data science perspective, RSS feeds also represent a great way to gain quick and easy access to structured data on a site’s editorial content, such as blog posts or articles. In this project, I’ll show you how you can read an RSS feed in Python using web scraping. We’ll handle everything step-by-step, from scraping the RSS source code, to parsing the RSS feed contents, and outputting the text into a Pandas dataframe. Let’s get started.
For this project we’ll be using three Python packages: Requests, Requests-HTML, and Pandas. Requests is one of the most popular Python libraries and is used for making HTTP requests to servers to fetch data. Requests-HTML is a web scraping library that combines Requests with the Beautiful Soup parsing package, while Pandas is used for data storage and manipulation.
Open a Jupyter notebook and import the below packages and modules. You may need to install requests_html
, but you’ll usually have requests
and pandas
pre-installed. You can do that by entering a pip3 install package-name
command in your code cell or terminal.
!pip3 install requests_html
import requests
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
The first step of reading an RSS feed in Python requires us to fetch the source of the feed itself. We can do this
using the HTMLSession()
feature of requests_html
. The try except block creates a session, then fetches the URL, and returns the
source code of the page in its response
object. If it fails, we’ll use requests
to throw an exception telling us
why. We’ll wrap this code in a function that we can re-use in any other web scraping projects we tackle.
def get_source(url):
"""Return the source code for the provided URL.
Args:
url (string): URL of the page to scrape.
Returns:
response (object): HTTP response object from requests_html.
"""
try:
session = HTMLSession()
response = session.get(url)
return response
except requests.exceptions.RequestException as e:
print(e)
Next we’ll build the Python RSS feed parser. Before we start, we first need to examine the XML elements within the code of the feed itself. RSS feeds come in various dialects, which you can determine by reading the XML declaration on the first line of the file. Mine is written in the 2005 version of Atom: <rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
.
The Python RSS parser we build next needs to specifically detect certain elements from within the feed, which can have slightly different formats depending on the RSS dialect used. Each article in the feed is usually wrapped in an ,<item>
element. Mine contains the below data, which our RSS parser needs to detect and extract.
<item>
<title>...</title>
<description>...</description>
<pubDate>Sat, 22 May 2021 00:00:00 +0000</pubDate>
<link>https://practicaldatascience.co.uk/data-science/19-python-seo-projects-that-will-improve-your-site</link>
<guid isPermaLink="true">https://practicaldatascience.co.uk/data-science/19-python-seo-projects-that-will-improve-your-site</guid>
<category>Web scraping</category>
<category>Python</category>
<category>Pandas</category>
<category>Technical SEO</category>
<category>Data Science</category>
</item>
Now we know what XML tags we need to extract, we can build the RSS parser itself. To do this, I’ve created a function called get_feed()
which takes the URL of the RSS feed and passes it to the get_source()
function we created above, returning the raw XML code of the feed itself in an element called response
.
We then create a Pandas dataframe in which to store the parsed data, then loop through each item
element found
within the response
. When each item
is detected, we then use the find()
function to look for the element names, i.e. title
, pubDate
, link
, and description
, and we extract the text from within by appending the .text
argument.
Finally, we add the contents of each item
to a dictionary called row
which maps the data to our dataframe, and then uses the Pandas append()
function to add the row of data to the dataframe, without adding an index. At the end, we return a dataframe containing all the feed contents we’ve scraped.
def get_feed(url):
"""Return a Pandas dataframe containing the RSS feed contents.
Args:
url (string): URL of the RSS feed to read.
Returns:
df (dataframe): Pandas dataframe containing the RSS feed contents.
"""
response = get_source(url)
df = pd.DataFrame(columns = ['title', 'pubDate', 'guid', 'description'])
with response as r:
items = r.html.find("item", first=False)
for item in items:
title = item.find('title', first=True).text
pubDate = item.find('pubDate', first=True).text
guid = item.find('guid', first=True).text
description = item.find('description', first=True).text
row = {'title': title, 'pubDate': pubDate, 'guid': guid, 'description': description}
df = df.append(row, ignore_index=True)
return df
The final step is to run our get_feed()
function. We’ll pass this the address of my RSS feed (which is found at /feed.xml
) and the function will fetch the feed, scrape the RSS code, parse the contents, and write the output to a Pandas dataframe. We can then examine or use that dataframe as we wish, or write its output to a range of other formats, including CSV or SQL.
url = "https://practicaldatascience.co.uk/feed.xml"
df = get_feed(url)
df.head()
title | pubDate | guid | description | |
---|---|---|---|---|
0 | 19 Python SEO projects that will improve your ... | Sat, 22 May 2021 00:00:00 +0000 | <p>Although I have never really considered mys... | |
1 | How to identify internal and external links us... | Sun, 09 May 2021 00:00:00 +0000 | <p>Internal linking helps improve the user exp... | |
2 | How to create a basic Marketing Mix Model in s... | Mon, 03 May 2021 00:00:00 +0000 | <p>Marketing Mix Models (MMMs) utilise multiva... | |
3 | How to scrape Google results in three lines of... | Sun, 02 May 2021 00:00:00 +0000 | <p>EcommerceTools makes it really quick and ea... | |
4 | How to make time series forecasts with Neural ... | Sun, 02 May 2021 00:00:00 +0000 | <p>The Neural Prophet model is relatively new ... |
Matt Clarke, Sunday, May 23, 2021