Beautiful Soup is one of the most powerful libraries for performing web scraping in Python. Here's a step-by-step guide to using it to scrape a website.

How to create a Python web scraper using Beautiful Soup
Web scraping is a really useful skill in data science. We obviously need data for our models and analyses, but it’s not always easily available, so building our own datasets through web scraping is often the only way to get what we need.

We’re fortunate in the Python community to have access to a number of powerful web scraping libraries, including Scrapy, Selenium, and Beautiful Soup, which make it much easier to develop custom scrapers that extract content from websites using XPath expressions or CSS selectors.

Seeing as I recently needed to use [web scraping](/data-science/16-python-web-scraping-projects-for-ecommerce-and-seo) to scrape some information on the courses offered by DataCamp for this site’s page on Data Science Courses, I thought I’d show you how I did it. It’s obviously a very specific example, but the steps below are easily transferred to whatever site you need to scrape. Let’s get started.

Load the packages

For this project we’ll need the re package for creating some Python regular expressions to extract specific chunks of content, math for rounding up the page count, pandas for manipulating and storing scraped data, urllib for working with URLs, and the BeautifulSoup package from bs4 for scraping the content out of the HTML. Load up the packages at the top of a Jupyter notebook and install any you don’t have by entering pip3 install package-name in your terminal (Beautiful Soup is published as beautifulsoup4).

import re
import math
from urllib.request import Request, urlopen
import pandas as pd
from bs4 import BeautifulSoup as soup
pd.set_option('display.max_columns', 2)

Fetch the raw HTML

We’re going to scrape the search results from the DataCamp website to extract the details on the courses they provide. To do this we use Request() and pass two arguments: the URL of the page we want to scrape, and the headers denoting our user agent. Without the headers, many servers will reject the request for the page and return a 403 status code.

We can then pass the result object to urlopen() and assign the output to a variable called page. Finally, we pass page to Beautiful Soup’s soup() function and define the HTML parser we want to use as html.parser, to return all of the source code from the page. To allow us to re-use the code later, I’ve wrapped it in a function called get_soup().

example_url = ''

def get_soup(url):
    """Fetch the raw HTML for a URL using Request and Beautiful Soup.

    Args:
        url (str): URL of page to fetch.

    Returns:
        soup (object): HTML code of fetched page
    """

    result = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(result).read()
    return soup(page, "html.parser")
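
A 403 response or a network failure will raise an exception inside urlopen() and stop the notebook, so it can be worth wrapping the request. The following is only a sketch of one way to do that, not the function used in this article: build_request() and fetch_html() are hypothetical helper names, and the example.com address is a placeholder rather than the real search URL.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # Raised for responses such as 403 or 404
        print(f"Server returned {error.code} for {url}")
    except URLError as error:
        # Raised when the host cannot be reached at all
        print(f"Could not reach {url}: {error.reason}")
    return None


# Placeholder URL (an assumption); nothing is downloaded on this line
request = build_request("https://example.com/search?q=python")
print(request.get_header("User-agent"))
```

Note that urllib stores header names with only the first letter capitalised, which is why the header is read back as User-agent.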

Examine the HTML scraped

Now we’ve got some raw HTML from typical DataCamp search results, we can examine the code to identify which elements in the page we need to extract. Looking at the DataCamp search page source code reveals that the first 50 results are shown, but any results that follow are revealed only when you click a link. Clicking the link changes the search URL and appends a page variable with an integer containing the page number.

For example, the base search URL will return the first 50 search results, but the same URL with a page variable such as p=2 appended returns the following 50. We can use this to paginate through the results, if we can identify the number of results returned by each search.

html = get_soup(example_url)

Handling search result pagination

In order to loop through each paginated page of search results we need to know how many search results there are in total. The easiest way to do this is by extracting the results count from the page. Examining the raw HTML returned in the html variable reveals that the text containing the number of results is stored in a div with the class dc-u-mt-16 dc-u-lh-1.

All we need to do is create a function to find this specific class name in the page and return the integer value at the beginning, which shows the number of search results for the given search term used.

<div class="dc-u-mt-16 dc-u-lh-1">
180 results for "<span class="dc-u-fw-bold dc-u-fst-italic">python</span>"
</div>

We’ll create a function called get_total_pages() to do this and will pass it the raw HTML stored in the html variable. If the html variable is present, we’ll use the Beautiful Soup find() function to look for the div containing a class with the value dc-u-mt-16 dc-u-lh-1. We can append get_text() to extract the text from within the tags, which discards the HTML markup, and pass in the optional argument strip=True to trim the surrounding whitespace and line breaks.

Next, we need to extract only the integer at the beginning containing the number of results found for the search. The easiest way to do this is using the re package’s findall() function. We create a regular expression containing \d+ to extract the numbers and add them to the variable results, then we return element 0 and cast the value from a str to int.

Finally, as there are 50 search results per page, we need to divide the results value by results_per_page and obtain the ceil of the value (since we can’t have a part page). This then returns the total number of pages of results, even though the specific value isn’t shown on the page.
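
To check the arithmetic above before wiring it into a function, here’s a small worked example. It assumes the text pulled from the div reads 180 results for "python", as in the HTML snippet shown earlier; the variable names are just for illustration.

```python
import math
import re

# Text as it would look after get_text(strip=True) on the sample div
sample_text = '180 results for "python"'
results_per_page = 50

numbers = re.findall(r'\d+', sample_text)  # every run of digits in the text
total_results = int(numbers[0])            # first match is the results count
total_pages = math.ceil(total_results / results_per_page)  # 3.6 rounds up to 4

print(total_results, total_pages)
```

With 180 results and 50 per page, we get three full pages plus a fourth page holding the remaining 30 results.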

def get_total_pages(soup):
    """Return the number of pages of results found for the search on DataCamp.

    Args:
        soup (object): HTML object from Beautiful Soup.

    Returns:
        pages (int): Total pages of results found.
    """

    results_per_page = 50

    if soup:
        total_results = soup.find('div', 
                                  attrs={'class':'dc-u-mt-16 dc-u-lh-1'}).get_text(strip=True)
        results = re.findall(r'\d+', total_results)
        results = int(results[0])
        pages = math.ceil(results / results_per_page)

        return pages
results = get_total_pages(html)
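
One thing to watch is that find() returns None when the div isn’t on the page (for instance, if a search has no results or the class names change), and calling get_text() on None raises an error. Below is a sketch of a more defensive variant, tested against the sample HTML from above rather than the live site. The count_pages() name and the choice to return 0 when nothing is found are my assumptions, not part of the original function.

```python
import math
import re

from bs4 import BeautifulSoup

sample_html = '''
<div class="dc-u-mt-16 dc-u-lh-1">
180 results for "<span class="dc-u-fw-bold dc-u-fst-italic">python</span>"
</div>
'''


def count_pages(parsed, results_per_page=50):
    """Return the number of result pages, or 0 if no count is found."""
    div = parsed.find('div', attrs={'class': 'dc-u-mt-16 dc-u-lh-1'})
    if div is None:
        return 0  # the results div is missing from the page
    match = re.search(r'\d+', div.get_text(strip=True))
    if match is None:
        return 0  # the div exists but holds no number
    return math.ceil(int(match.group()) / results_per_page)


pages_found = count_pages(BeautifulSoup(sample_html, 'html.parser'))
pages_missing = count_pages(BeautifulSoup('<p>No results</p>', 'html.parser'))
print(pages_found, pages_missing)
```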

Define the search results URLs to scrape

Next, we’ll define the search terms we want to search for on DataCamp and assign them to a list called search_terms. We’ll loop over these search terms and fetch the HTML for each page, including any subsequent paginated pages, if we find any. For now, we’ll just print the URLs and check that they work as expected.

search_terms = ['python', 'r', 'sql', 'git', 'shell', 'spreadsheets', 
                'theory', 'scala', 'excel', 'tableau', 'power%20bi']
search_url = ''
for search_term in search_terms:

    url = search_url+search_term
    html = get_soup(url)
    total_pages = get_total_pages(html)

    i = 1
    while(i <= total_pages):
        print(url+'&p='+str(i))
        i += 1
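
The power%20bi entry in the list is already percent-encoded by hand. If you’d rather keep the search terms readable, the standard library can do the encoding for you. This is a sketch only: base_search_url is a made-up placeholder and build_page_urls() is a hypothetical helper, though the &p= page variable matches the one used in this article.

```python
from urllib.parse import quote

# Placeholder address (an assumption), standing in for the real search URL
base_search_url = 'https://example.com/search?q='


def build_page_urls(term, total_pages):
    """Return one URL per page of results for a search term."""
    query_url = base_search_url + quote(term)  # 'power bi' becomes 'power%20bi'
    return [f'{query_url}&p={page}' for page in range(1, total_pages + 1)]


urls = build_page_urls('power bi', 2)
for page_url in urls:
    print(page_url)
```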

Storing the scraped content

Now we have identified the terms we want to search for, have created a function to scrape the raw HTML, and can count the number of search results found, we can move on to the more complicated step of actually running the searches and extracting the course summaries. First, we’ll create an empty Pandas dataframe in which to store the results we scrape from the page.

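df = pd.DataFrame(columns=[
    'course_title', 'course_summary', 'course_duration', 'course_categories',
    'course_author', 'course_type', 'course_provider', 'course_url'])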

Fetching the HTML and parsing the content

Now the tricky bit. We’ll use the code above to loop through each search term, fetch the HTML from each page, and then extract all of the article elements from the page which contain the individual course descriptions. Then, we’ll create another loop and iterate through each of the article elements returned by the Beautiful Soup find_all() function.

After identifying the corresponding elements in the source code using the inspect element feature of Chrome, we can then use find() to extract each element. Since get_text() returns the text of an element and all of its children, we pass the strip=True argument to trim stray whitespace and line breaks from each piece of text. The separator=' ' argument joins those pieces with a space, so that words from adjacent tags don’t run together.
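
To see what the two get_text() arguments actually do, here’s a quick comparison on a made-up paragraph (the sample HTML is mine, not taken from DataCamp).

```python
from bs4 import BeautifulSoup

# Made-up HTML with a nested tag and some stray line breaks
sample = '<p>\n  Learn <b>Python</b>\n  basics\n</p>'
paragraph = BeautifulSoup(sample, 'html.parser').find('p')

raw = paragraph.get_text()                               # whitespace kept as-is
stripped = paragraph.get_text(strip=True)                # pieces trimmed, then joined
spaced = paragraph.get_text(strip=True, separator=' ')   # pieces joined with a space

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```

Using strip=True on its own gives 'LearnPythonbasics', because each piece of text is trimmed and then joined with nothing in between. Adding separator=' ' gives the readable 'Learn Python basics'.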

Finally, once we’ve extracted all of the items, we’ll assign them to a dictionary called row and then add it as a new row to the empty dataframe we created above. After a few minutes of scraping, our df dataframe should contain the details on all of the courses offered.

for search_term in search_terms:

    url = search_url+search_term
    html = get_soup(url)
    total_pages = get_total_pages(html)

    i = 1
    while(i <= total_pages):   

        current_url = url+'&p='+str(i)


        html = get_soup(current_url)
        articles = html.find_all('article')

        for course in articles:
            course_title = course.find('h4').get_text(strip=True, separator=' ')
            course_summary = course.find('p').get_text(strip=True, separator=' ')
            course_url = course.find('a', attrs={'class':'shim'}).attrs['href']
            course_url = ''+course_url
            spans = course.find_all('span')
            course_duration = spans[0].get_text(strip=True, separator=' ')
            course_categories = spans[3].get_text(strip=True, separator=' ')
            course_author = spans[6].get_text(strip=True, separator=' ')
            course_type = spans[9].get_text(strip=True, separator=' ')

            row = {
                'course_title': course_title,
                'course_summary': course_summary,
                'course_duration': course_duration,
                'course_categories': course_categories,
                'course_author': course_author, 
                'course_type': course_type,
                'course_provider': 'DataCamp',
                'course_url': course_url
            }

            # DataFrame.append() was removed in pandas 2.0, so concatenate a one-row dataframe
            df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

        i += 1
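
Growing a dataframe one row at a time copies the data on every iteration, which gets slow on bigger scrapes. A common alternative, sketched here with made-up course data rather than real scraped results, is to collect the row dictionaries in a plain list and build the dataframe once at the end.

```python
import pandas as pd

# Made-up stand-ins for values that would come from the scraping loop
scraped = [('Intro to Python', 'Course'), ('Intro to SQL', 'Course')]

rows = []
for title, course_type in scraped:
    rows.append({
        'course_title': title,
        'course_type': course_type,
        'course_provider': 'DataCamp',
    })

# Build the dataframe in one go from the list of dictionaries
courses = pd.DataFrame(rows)
print(courses.shape)
```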

Tidy the data

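The scraped text needs a little cleaning before we can use it. We’ll replace any literal \n sequences with spaces, remove double quotes from the titles and summaries, and create a course_slug column by lowercasing each title and swapping every run of non-alphanumeric characters for an underscore.

df = df.replace(r'\\n',' ', regex=True)
df['course_title'] = df['course_title'].replace('"','',regex=True)
df['course_summary'] = df['course_summary'].replace('"','',regex=True)
df['course_slug'] = df['course_title'].str.lower().str.strip().replace('[^0-9a-zA-Z]+','_',regex=True)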
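
The slug pattern is easier to follow on a single string. The sketch below applies the same regular expression with the re package to a couple of made-up titles. The make_slug() helper is hypothetical, and trimming the leading and trailing underscores is an extra step of mine that the dataframe code in this article doesn’t do.

```python
import re


def make_slug(title):
    """Lowercase a title and replace runs of non-alphanumerics with '_'."""
    slug = re.sub(r'[^0-9a-zA-Z]+', '_', title.lower().strip())
    # Extra step (an assumption): drop underscores left at either end
    return slug.strip('_')


first_slug = make_slug('Introduction to Python (Part 1)')
second_slug = make_slug('  Data Science: R & SQL ')
print(first_slug)
print(second_slug)
```

Without the final strip('_'), the first title would end in an underscore, because the closing bracket is also replaced.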

Remove duplicates

Some of the courses on DataCamp appear for multiple search terms, so we’ll have some duplicates in our dataframe. To remove the duplicate courses, we can use the Pandas drop_duplicates() function and pass in the course_title column. We’ll tell Pandas to keep the first value found, and drop the rest inplace.

df.drop_duplicates('course_title', keep='first', inplace=True)
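
Here’s a small illustration of what keep='first' does, using a made-up dataframe in which one course turns up under two search terms. The search_term column exists only in this example.

```python
import pandas as pd

# Made-up data: the same course is found by two different searches
sample = pd.DataFrame({
    'course_title': ['Intro to Python', 'Intro to SQL', 'Intro to Python'],
    'search_term': ['python', 'sql', 'theory'],
})

# Keep the first row seen for each title and renumber the index
deduped = sample.drop_duplicates('course_title', keep='first').reset_index(drop=True)
print(deduped)
```

The row found under the theory search is dropped, since the same title had already appeared under python.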

To check this has worked, we run df.course_type.value_counts(), which reveals we’ve got 196 unique courses and 83 unique projects.

Course     196
Project     83
Name: course_type, dtype: int64

Finally, to save the data we’ve scraped from DataCamp we’ll use to_csv() to write the data to a CSV file. While this is obviously quite a specific example, the concepts shown here will work on any site and show how easy it can be to scrape web data and reformat it to display in Pandas, allowing you to construct your own custom data sets.
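
As a rough sketch of that last step, the example below writes a one-row, made-up dataframe to an in-memory buffer so you can see the CSV text without creating a file. To save to disk you’d pass a filename instead; the datacamp_courses.csv name in the comment is just an assumption.

```python
import io

import pandas as pd

# Made-up row standing in for the scraped dataframe
courses = pd.DataFrame([{'course_title': 'Intro to Python', 'course_type': 'Course'}])

buffer = io.StringIO()
courses.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)

# To write a file instead (hypothetical filename):
# courses.to_csv('datacamp_courses.csv', index=False)
```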


