How to create a Python web scraper using Beautiful Soup

Picture by Foodie Factor, Pexels.

Web scraping is a really useful skill in data science. We obviously need data for our models and analyses, but it’s not always easily available, so building our own datasets through web scraping is often the only way to get what we need.

We’re fortunate in the Python community to have access to a number of powerful web scraping libraries, including Scrapy, Selenium, and Beautiful Soup, which make it much quicker to develop custom scrapers that extract content from websites using XPath expressions or CSS selectors, depending on the library.

Seeing as I recently needed to use [web scraping](/data-science/16-python-web-scraping-projects-for-ecommerce-and-seo) to scrape some information on the courses offered by DataCamp for this site’s page on Data Science Courses, I thought I’d show you how I did it. It’s a very specific example, but the steps below are easily transferred to whatever site you need to scrape. Let’s get started.

Load the packages

For this project we’ll need the re package for creating some Python regular expressions to extract specific chunks of content, math for rounding up the number of result pages, pandas for manipulating and storing scraped data, urllib for working with URLs, and the BeautifulSoup class from bs4 for scraping the content out of the HTML. Load up the packages at the top of a Jupyter notebook and install any you don’t have by entering pip3 install package-name in your terminal.

import re
import math
from urllib.request import Request, urlopen
import pandas as pd
from bs4 import BeautifulSoup as soup

pd.set_option('display.max_columns', 2)

Fetch the raw HTML

We’re going to scrape the search results from the DataCamp website to extract the details on the courses they provide. To do this we use Request() and pass two arguments: the URL of the page we want to scrape, and the headers denoting our user agent. Without the headers, many servers will reject the request for the page and return a 403 status code.

We can then pass the result object to urlopen(), call read() on the response, and assign the output to a variable called page. Finally, we pass page to soup(), which is the alias we gave to the BeautifulSoup class when importing it, and define the HTML parser we want to use as html.parser, to return a parsed copy of the source code from the page. To allow us to re-use the code later, I’ve wrapped it in a function called get_soup().

example_url = 'https://www.datacamp.com/search?q=python'

def get_soup(url):
    """Fetch the raw HTML for a URL using Request and Beautiful Soup. 

    Args:
        url (str): URL of page to fetch.

    Returns: 
        soup (object): HTML code of fetched page

    """

    result = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(result).read()
    return soup(page, "html.parser")
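As a hedged variation on get_soup(), the sketch below separates building the request from fetching it, and catches the HTTPError and URLError exceptions that urlopen() can raise, so a blocked or missing page returns None rather than stopping the whole scrape. The helper names build_request() and fetch_html(), and the 10 second timeout, are my own assumptions and not part of the original code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0'):
    """Build a Request carrying a User-Agent header (hypothetical helper)."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None on an HTTP or URL error."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server responded, but with an error status such as 403 or 404.
        print(f'HTTP error {error.code} for {url}')
    except URLError as error:
        # The server could not be reached at all.
        print(f'Could not reach {url}: {error.reason}')
    return None


# Inspect the request without sending it; urllib stores header names capitalised.
request = build_request('https://www.datacamp.com/search?q=python')
print(request.get_header('User-agent'))
```

HTTPError is a subclass of URLError, which is why it is caught first.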

Examine the HTML scraped

Now we’ve got some raw HTML from a typical DataCamp search results page, we can examine the code to identify which elements in the page we need to extract. Looking at the DataCamp search page source code reveals that the first 50 results are shown, but any results that follow are revealed only when you click a link. Clicking the link changes the search URL and appends a p parameter containing the page number as an integer.

For example, https://www.datacamp.com/search?q=python will return the first 50 search results, but https://www.datacamp.com/search?q=python&p=2 returns the following 50. We can use this to paginate through the results, if we can identify the total number of results returned by each search.

html = get_soup(example_url)

Handling search result pagination

In order to loop through each paginated page of search results we need to know how many search results there are in total, since each page shows up to 50 of them. The easiest way to do this is by extracting the results count from the page. Examining the raw HTML returned in the html variable reveals that the text containing the number of results is stored in a div with the class dc-u-mt-16 dc-u-lh-1.

All we need to do is create a function to find this specific class name in the page and return the integer value at the beginning, which shows the number of search results for the given search term used.

<div class="dc-u-mt-16 dc-u-lh-1">
180 results for "<span class="dc-u-fw-bold dc-u-fst-italic">python</span>"
</div>

We’ll create a function called get_total_pages() to do this and will pass it the parsed HTML stored in the html variable. If the object is present, we’ll use the Beautiful Soup find() method to look for the div whose class attribute is dc-u-mt-16 dc-u-lh-1. We can chain get_text() to extract the text from within the tags, which drops the HTML markup, and pass in the optional argument strip=True to trim the leading and trailing whitespace from the text.

Next, we need to extract only the integer at the beginning containing the number of results found for the search. The easiest way to do this is using the re package’s findall() function. We create a regular expression containing \d+ to extract the runs of digits and assign them to the variable results, then we take element 0 and cast the value from a str to an int.

Finally, as there are 50 search results per page, we need to divide the results value by results_per_page and obtain the ceil of the value (since we can’t have a part page). This then returns the total number of pages of results, even though the specific value isn’t shown on the page.

def get_total_pages(soup):
    """Return the number of pages of results found for the search on DataCamp. 

    Args:
        soup (object): HTML object from Beautiful Soup. 

    Returns: 
        pages (int): Total pages of results found. 
    """

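    results_per_page = 50

    if soup:
        # find() returns None when the results div is missing, so check before reading it.
        div = soup.find('div', attrs={'class': 'dc-u-mt-16 dc-u-lh-1'})
        results = re.findall(r'\d+', div.get_text(strip=True)) if div else []
        if results:
            return math.ceil(int(results[0]) / results_per_page)

    # Nothing to paginate if the results count could not be found.
    return 0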

results = get_total_pages(html)
results
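To make the arithmetic concrete, here is a small sketch that works on the text alone, without fetching anything. The function name pages_from_text() is hypothetical, and only the "180 results" string comes from the page shown above; the other two inputs are made up to show the edge cases.

```python
import math
import re


def pages_from_text(text, results_per_page=50):
    """Return the number of result pages implied by text such as '180 results for "python"'."""
    match = re.search(r'\d+', text)
    if match is None:
        # No number found, so assume there is nothing to paginate.
        return 0
    total_results = int(match.group())
    return math.ceil(total_results / results_per_page)


print(pages_from_text('180 results for "python"'))  # 180 / 50 = 3.6, rounded up to 4
print(pages_from_text('50 results for "sql"'))      # exactly one full page, so 1
print(pages_from_text('No results found'))          # no number in the text, so 0
```

One caveat: a count written with a thousands separator, such as 1,234, would be read as 1 by this pattern, so the text would need the comma removed first.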

Define the search results URLs to scrape

Next, we’ll define the search terms we want to search for on DataCamp and assign them to a list called search_terms. We’ll loop over these search terms and fetch the HTML for each page, including any subsequent paginated pages, if we find any. For now, we’ll just print the URLs and check that they work as expected.

search_terms = ['python', 'r', 'sql', 'git', 'shell', 'spreadsheets', 
                'theory', 'scala', 'excel', 'tableau', 'power%20bi']
search_url = 'https://datacamp.com/search?q='

for search_term in search_terms:

    url = search_url+search_term
    html = get_soup(url)
    total_pages = get_total_pages(html)

    i = 1
    while(i <= total_pages):
        print(url+'&p='+str(i))
        i += 1

https://datacamp.com/search?q=python&p=1
https://datacamp.com/search?q=python&p=2
https://datacamp.com/search?q=python&p=3
https://datacamp.com/search?q=python&p=4
https://datacamp.com/search?q=r&p=1
https://datacamp.com/search?q=r&p=2
https://datacamp.com/search?q=r&p=3
https://datacamp.com/search?q=r&p=4
https://datacamp.com/search?q=r&p=5
https://datacamp.com/search?q=r&p=6
https://datacamp.com/search?q=r&p=7
https://datacamp.com/search?q=r&p=8
https://datacamp.com/search?q=r&p=9
https://datacamp.com/search?q=sql&p=1
https://datacamp.com/search?q=git&p=1
https://datacamp.com/search?q=shell&p=1
https://datacamp.com/search?q=spreadsheets&p=1
https://datacamp.com/search?q=theory&p=1
https://datacamp.com/search?q=scala&p=1
https://datacamp.com/search?q=excel&p=1
https://datacamp.com/search?q=tableau&p=1
https://datacamp.com/search?q=power%20bi&p=1
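The loop above builds each URL by joining strings, which means search terms containing spaces have to be encoded by hand (hence power%20bi). As an alternative sketch, urllib.parse can do the encoding for us. The function names below are hypothetical, and the q and p parameter names are simply copied from the URLs printed above.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = 'https://datacamp.com/search'


def build_search_url(term, page):
    """Return the search URL for a term and a 1-based page number."""
    # quote_via=quote encodes spaces as %20 rather than the default +.
    query = urlencode({'q': term, 'p': page}, quote_via=quote)
    return f'{SEARCH_URL}?{query}'


def build_search_urls(term, total_pages):
    """Return one URL per page of results."""
    return [build_search_url(term, page) for page in range(1, total_pages + 1)]


for url in build_search_urls('power bi', 2):
    print(url)
```

With this approach the search term can be written naturally as 'power bi' in the search_terms list.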

Storing the scraped content

Now we have identified the terms we want to search for, have created a function to scrape the raw HTML, and can count the number of search results found, we can move on to the more complicated step of actually running the searches and extracting the course summaries. First, we’ll create an empty Pandas dataframe in which to store the results we scrape from the page.

df = pd.DataFrame(columns=[
    'course_title',
    'course_summary',
    'course_duration',
    'course_categories',
    'course_author',
    'course_type',
    'course_provider',
    'course_url'
])

Fetching the HTML and parsing the content

Now the tricky bit. We’ll use the code above to loop through each search term, fetch the HTML from each page, and then extract all of the article elements from the page which contain the individual course descriptions. Then, we’ll create another loop and iterate through each of the article elements returned by the Beautiful Soup find_all() function.

After identifying the corresponding elements in the source code using the inspect element feature of Chrome, we can then use find() to extract each element. get_text() already removes the HTML tags, so the strip=True argument is there to trim the whitespace from each piece of text. The separator=' ' argument then joins the pieces with a space, which stops words running together where a nested tag has been removed.
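The effect of these two get_text() arguments is easiest to see on a small example. The markup below is hypothetical, written to show the behaviour rather than copied from DataCamp, and the mark tag in particular is only my guess at how a highlighted search term might be wrapped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a nested tag and surrounding whitespace.
snippet = '<h4>\n  Introduction to <em>Python</em>\n</h4>'
heading = BeautifulSoup(snippet, 'html.parser').find('h4')

print(repr(heading.get_text()))                           # '\n  Introduction to Python\n'
print(repr(heading.get_text(strip=True)))                 # 'Introduction toPython'
print(repr(heading.get_text(strip=True, separator=' ')))  # 'Introduction to Python'

# A tag that wraps part of a word is also split by the separator.
highlighted = BeautifulSoup('<h4><mark>R</mark>eal-time Insights</h4>', 'html.parser')
print(repr(highlighted.h4.get_text(strip=True, separator=' ')))  # 'R eal-time Insights'
```

The last line may explain titles such as 'R eal-time Insights from Social Media Data' in the scraped output further down: if the site wraps the matched search term in its own tag, the separator inserts a space inside the word.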

Finally, once we’ve extracted all of the items, we’ll assign them to a dictionary called row and then add it as a new row to the empty dataframe we created above. After a few minutes of scraping, our df dataframe should contain the details of all of the courses offered.

for search_term in search_terms:

    url = search_url+search_term
    html = get_soup(url)
    total_pages = get_total_pages(html)

    i = 1
    while(i <= total_pages):   

        current_url = url+'&p='+str(i)

        print(current_url)

        html = get_soup(current_url)
        articles = html.find_all('article')

        for course in articles:
            course_title = course.find('h4').get_text(strip=True, separator=' ')
            course_summary = course.find('p').get_text(strip=True, separator=' ')
            course_url = course.find('a', attrs={'class':'shim'}).attrs['href']
            course_url = 'https://datacamp.com'+course_url
            spans = course.find_all('span')
            course_duration = spans[0].get_text(strip=True, separator=' ')
            course_categories = spans[3].get_text(strip=True, separator=' ')
            course_author = spans[6].get_text(strip=True, separator=' ')
            course_type = spans[9].get_text(strip=True, separator=' ')

            row = {
                'course_title': course_title,
                'course_summary': course_summary,
                'course_duration': course_duration,
                'course_categories': course_categories,
                'course_author': course_author, 
                'course_type': course_type,
                'course_provider': 'DataCamp',
                'course_url': course_url
            }

            # DataFrame.append() was removed in pandas 2.0, so use pd.concat() instead.
            df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

        i += 1

https://datacamp.com/search?q=python&p=1
https://datacamp.com/search?q=python&p=2
https://datacamp.com/search?q=python&p=3
https://datacamp.com/search?q=python&p=4
https://datacamp.com/search?q=r&p=1
https://datacamp.com/search?q=r&p=2
https://datacamp.com/search?q=r&p=3
https://datacamp.com/search?q=r&p=4
https://datacamp.com/search?q=r&p=5
https://datacamp.com/search?q=r&p=6
https://datacamp.com/search?q=r&p=7
https://datacamp.com/search?q=r&p=8
https://datacamp.com/search?q=r&p=9
https://datacamp.com/search?q=sql&p=1
https://datacamp.com/search?q=git&p=1
https://datacamp.com/search?q=shell&p=1
https://datacamp.com/search?q=spreadsheets&p=1
https://datacamp.com/search?q=theory&p=1
https://datacamp.com/search?q=scala&p=1
https://datacamp.com/search?q=excel&p=1
https://datacamp.com/search?q=tableau&p=1
https://datacamp.com/search?q=power%20bi&p=1
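Adding one row to a dataframe per course copies the dataframe on every iteration, which gets slower as the results grow. A common alternative, sketched below, is to collect the row dictionaries in a plain list and build the dataframe once at the end. The scraped list is stand-in data so the example runs without any network access, and only three of the columns are shown to keep it short.

```python
import pandas as pd

columns = ['course_title', 'course_type', 'course_provider']

# Stand-in values; in the scraper these would come from each article element.
scraped = [
    ('Introduction to Python', 'Course'),
    ('Word Frequency in Moby Dick', 'Project'),
]

rows = []
for course_title, course_type in scraped:
    rows.append({
        'course_title': course_title,
        'course_type': course_type,
        'course_provider': 'DataCamp',
    })

# Build the dataframe once, after the loop has finished.
courses = pd.DataFrame(rows, columns=columns)
print(courses.shape)  # (2, 3)
```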

Tidy the data

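# Replace any newlines left in the scraped text with spaces.
df = df.replace(r'\n', ' ', regex=True)
# Remove double quotes from the titles and summaries.
df['course_title'] = df['course_title'].replace('"','',regex=True)
df['course_summary'] = df['course_summary'].replace('"','',regex=True)
# Build a slug from each title, e.g. introduction_to_python.
df['course_slug'] = df['course_title'].str.lower().str.strip().replace('[^0-9a-zA-Z]+','_',regex=True)
df.head()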

	course_title	...	course_slug
0	Introduction to Python	...	introduction_to_python
1	Intermediate Python	...	intermediate_python
2	Introduction to Data Science in Python	...	introduction_to_data_science_in_python
3	Data Manipulation with pandas	...	data_manipulation_with_pandas
4	Supervised Learning with scikit-learn	...	supervised_learning_with_scikit_learn

5 rows × 9 columns
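The slug step is the least obvious of these, so here is the same transformation applied to plain strings with the re package. The function name make_slug() is hypothetical; the pattern is the one used for the course_slug column.

```python
import re


def make_slug(title):
    """Lower-case a title and replace each run of non-alphanumeric characters with '_'."""
    return re.sub(r'[^0-9a-zA-Z]+', '_', title.lower().strip())


print(make_slug('Supervised Learning with scikit-learn'))  # supervised_learning_with_scikit_learn
print(make_slug('Python Data Science Toolbox (Part 1)'))   # python_data_science_toolbox_part_1_
```

Note the trailing underscore in the second slug, which comes from the closing bracket. Appending .strip('_') to the result would remove it if that matters for your use.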

df.course_type.value_counts()

Course     790
Project    625
Name: course_type, dtype: int64

df.course_title.unique()

array(['Introduction to Python', 'Intermediate Python',
       'Introduction to Data Science in Python',
       'Data Manipulation with pandas',
       'Supervised Learning with scikit-learn',
       'Python Data Science Toolbox (Part 1)',
       'Python Data Science Toolbox (Part 2)', 'Joining Data with pandas',
       'Introduction to Data Visualization with Matplotlib',
       'Introduction to Importing Data in Python',
       'Statistical Thinking in Python (Part 1)',
       'Writing Efficient Python Code', 'Cleaning Data in Python',
       'Introduction to Data Visualization with Seaborn',
       'Introduction to Deep Learning in Python',
       'Unsupervised Learning in Python', 'Writing Functions in Python',
       'Introduction to PySpark', 'Object-Oriented Programming in Python',
       'Intermediate Importing Data in Python',
       'Introduction to Data Visualization in Python',
       'Machine Learning with Tree-Based Models in Python',
       'Introduction to Python for Finance',
       'Introduction to Data Engineering',
       'Exploratory Data Analysis in Python', 'pandas Foundations',
       'Statistical Thinking in Python (Part 2)',
       'Data Types for Data Science in Python', 'Web Scraping in Python',
       'Intermediate Data Visualization with Seaborn',
       'Working with Dates and Times in Python',
       'Introduction to Natural Language Processing in Python',
       'Streamlined Data Ingestion with pandas',
       'Cluster Analysis in Python',
       'Analyzing Police Activity with pandas',
       'Building Recommendation Engines in Python',
       'Machine Learning for Time Series Data in Python',
       'Manipulating DataFrames with pandas',
       'Introduction to TensorFlow in Python',
       'Image Processing in Python', 'Linear Classifiers in Python',
       'Manipulating Time Series Data in Python',
       'Case Study: School Budgeting with Machine Learning in Python',
       'Preprocessing for Machine Learning in Python',
       'Extreme Gradient Boosting with XGBoost',
       'Regular Expressions in Python',
       'Introduction to Databases in Python',
       'Big Data Fundamentals with PySpark',
       'Software Engineering for Data Scientists in Python',
       'Unit Testing for Data Science in Python',
       'Name Game: Gender Prediction using Sound',
       'Exploring the Bitcoin Cryptocurrency Market',
       'Real-time Insights from Social Media Data',
       'A Network Analysis of Game of Thrones',
       'Disney Movies and Box Office Success',
       'Analyze Your Runkeeper Fitness Data',
       'Comparing Cosmetics by Ingredients',
       'TV, Halftime Shows, and the Big Game',
       'Risk and Returns: The Sharpe Ratio',
       'Find Movie Similarity from Plot Summaries',
       'Give Life: Predict Blood Donations',
       'The Android App Market on Google Play',
       'Extract Stock Sentiment from News Headlines',
       'Book Recommendations from Charles Darwin',
       'Predicting Credit Card Approvals',
       'Naïve Bees: Deep Learning with Images',
       'Up and Down With the Kardashians',
       'ASL Recognition with Deep Learning',
       "Which Debts Are Worth the Bank's Effort?",
       'Do Left-handed People Really Die Young?',
       'Who Is Drunk and When in Ames, Iowa?',
       "Who's Tweeting? Trump or Trudeau?",
       'Reducing Traffic Mortality in the USA',
       'Classify Song Genres from Audio Data',
       'A Visual History of Nobel Prize Winners',
       'Naïve Bees: Predict Species from Images',
       'Generating Keywords for Google Ads',
       'Word Frequency in Moby Dick',
       'Naïve Bees: Image Loading and Processing',
       'Introduction to DataCamp Projects',
       'A New Era of Data Analysis in Baseball',
       'Dr. Semmelweis and the Discovery of Handwashing',
       'Mobile Games A/B Testing with Cookie Cats',
       'The GitHub History of the Scala Language',
       'The Hottest Topics in Machine Learning',
       'Bad passwords and the NIST guidelines',
       "Recreating John Snow's Ghost Map",
       'Exploring the Evolution of Linux', 'Exploring 67 years of LEGO',
       'Rise and Fall of Programming Languages', 'Introduction to R',
       'Intermediate R', 'Introduction to the Tidyverse',
       'Introduction to Data Visualization with ggplot2',
       'Data Manipulation with dplyr',
       'Introduction to Importing Data in R',
       'Supervised Learning in R : Classification',
       'Joining Data with dplyr',
       'Intermediate Data Visualization with ggplot2',
       'Correlation and Regression in R', 'Introduction to Data in R',
       'Cleaning Data in R', 'Exploratory Data Analysis in R',
       'Unsupervised Learning in R', 'Writing Efficient R Code',
       'Intermediate Importing Data in R',
       'Introduction to Writing Functions in R',
       'Supervised Learning in R : Regression',
       'Case Study: Exploratory Data Analysis in R',
       'Building Web Applications with Shiny in R',
       'Working with Dates and Times in R',
       'Multiple and Logistic Regression in R',
       'Data Manipulation with data.table in R',
       'Introduction to R for Finance', 'Cluster Analysis in R',
       'Web Scraping in R', 'Reshaping Data with tidyr',
       'Time Series Analysis in R', 'Working with Data in the Tidyverse',
       'Machine Learning with caret in R', 'Forecasting in R',
       'Joining Data with data.table in R',
       'Communicating with Data in the Tidyverse',
       'Reporting with R Markdown',
       'Manipulating Time Series Data with xts and zoo in R',
       'Parallel Programming in R',
       'String Manipulation with stringr in R',
       'Intermediate R for Finance',
       'Modeling with Data in the Tidyverse',
       'Introduction to Text Analysis in R',
       'Foundations of Probability in R', 'Foundations of Inference',
       'Data Visualization in R', 'Generalized Linear Models in R',
       'Categorical Data in the Tidyverse',
       'Fundamentals of Bayesian Data Analysis in R',
       'Introduction to Statistics in R',
       'Introduction to Regression in R', 'Financial Trading in R',
       'ARIMA Models in R', "Text Mining America's Toughest Game Show",
       'Wrangling and Visualizing Musical Data',
       'Importing and Cleaning Data',
       'Exploring the Kaggle Data Science Survey',
       'Modeling the Volatility of US Bond Yields',
       'What Makes a Pokémon Legendary?',
       "Kidney Stones and Simpson's Paradox",
       'Health Survey Data Analysis of BMI',
       'Trends in Maryland Crime Rates',
       'Are You Ready for the Zombie Apocalypse?',
       'The Impact of Climate Change on Birds',
       'Clustering Bustabit Gambling Behavior',
       'Planning Public Policy in Argentina',
       'Phyllotaxis: Draw Flowers Using Mathematics',
       'Data Science for Social Good: Crime Study',
       'Degrees That Pay You Back', 'Gender Bias in Graduate Admissions',
       'Going Down to South Park: A Text Analysis',
       'Clustering Heart Disease Patient Data', 'Where Are the Fishes?',
       'Functions for Food Price Forecasts',
       "A Text Analysis of Trump's Tweets",
       'Predict Taxi Fares with Random Forests',
       'Drunken Datetimes in Ames, Iowa',
       'Where Would You Open a Chipotle?',
       "Explore 538's Halloween Candy Rankings",
       'What Your Heart Rate Is Telling You',
       'Partnering to Protect You from Peril',
       'Classify Suspected Infection in Patients',
       'Scout your Athletics Fantasy Team',
       'Visualizing Inequalities in Life Expectancy',
       'Level Difficulty in Candy Crush Saga',
       'R eal-time Insights from Social Media Data',
       'R isk and R eturns: The Sharpe R atio',
       'R educing Traffic Mortality in the USA',
       "R ecreating John Snow's Ghost Map",
       'Book R ecommendations from Charles Darwin',
       'ASL R ecognition with Deep Learning',
       'Analyze Your R unkeeper Fitness Data',
       'Do Left-handed People R eally Die Young?', 'Introduction to SQL',
       'Joining Data in SQL', 'Intermediate SQL',
       'Introduction to SQL Server',
       'Introduction to Relational Databases in SQL',
       'Exploratory Data Analysis in SQL',
       'PostgreSQL Summary Stats and Window Functions', 'Database Design',
       'Intermediate SQL Server',
       'Functions for Manipulating Data in PostgreSQL',
       'Time Series Analysis in SQL Server',
       'Analyzing Business Data in SQL',
       'Functions for Manipulating Data in SQL Server',
       'Improving Query Performance in SQL Server',
       'Transactions and Error Handling in SQL Server',
       'Data-Driven Decision Making in SQL',
       'Building and Optimizing Triggers in SQL Server',
       'Hierarchical and Recursive Queries in SQL Server',
       'Reporting in SQL',
       'Writing Functions and Stored Procedures in SQL Server',
       'Applying SQL to Real-World Problems',
       'Cleaning Data in SQL Server Databases',
       'Introduction to Oracle SQL', 'Creating PostgreSQL Databases',
       'Improving Query Performance in PostgreSQL',
       'Cleaning Data in PostgreSQL Databases',
       'Transactions and Error Handling in PostgreSQL',
       'Introduction to Spark SQL in Python',
       'Introduction to MongoDB in Python', 'Data Processing in Shell',
       'Analyze International Debt Statistics', 'Introduction to Git',
       'Course Creation at DataCamp',
       'Foundations of Functional Programming with purrr',
       'The Git Hub History of the Scala Language',
       'Introduction to Shell', 'Conda Essentials',
       'Introduction to Bash Scripting',
       'Building and Distributing Packages with Conda',
       'Inference for Categorical Data in R',
       'Recurrent Neural Networks for Language Modeling in Python',
       'Market Basket Analysis in Python',
       'Data Analysis in Spreadsheets', 'Pivot Tables in Spreadsheets',
       'Introduction to Statistics in Spreadsheets',
       'Intermediate Spreadsheets', 'Data Visualization in Spreadsheets',
       'Introduction to Spreadsheets',
       'Financial Analytics in Spreadsheets',
       'Financial Modeling in Spreadsheets',
       'Marketing Analytics in Spreadsheets',
       'Conditional Formatting in Spreadsheets',
       'Error and Uncertainty in Spreadsheets',
       'Loan Amortization in Spreadsheets',
       'Options Trading in Spreadsheets',
       'Pandas Joins for Spreadsheet Users',
       'Merging DataFrames with pandas', 'Python for Spreadsheet Users',
       'Data Science for Everyone', 'Machine Learning for Everyone',
       'Data Engineering for Everyone', 'Data Visualization for Everyone',
       'Data Science for Business', 'Cloud Computing for Everyone',
       'Machine Learning for Business',
       'Practicing Machine Learning Interview Questions in R',
       'Intermediate Portfolio Analysis in R',
       'Introduction to Portfolio Analysis in R', 'Factor Analysis in R',
       'Inference for Numerical Data in R',
       'Foundations of Probability in Python',
       'Practicing Machine Learning Interview Questions in Python',
       'Fraud Detection in Python', 'Introduction to Scala',
       'Scala ble Data Processing in R',
       'Visualizing Big Data with Trelliscope in R',
       'Parallel Programming with Dask in Python',
       'Structural Equation Modeling with lavaan in R',
       'Differential Expression Analysis with limma in R',
       'Ensemble Methods in Python',
       'Survey and Measurement Development in R',
       'Multivariate Probability Distributions in R',
       'Data Analysis in Excel',
       'Introduction to Natural Language Processing in R',
       'Importing and Managing Financial Data in Python',
       'Financial Forecasting in Python', 'Connecting Data in Tableau',
       'Creating Robust Workflows in Python',
       'Introduction to Bioconductor in R', 'Introduction to Tableau',
       'Analyzing Data in Tableau', 'Creating Dashboards in Tableau',
       'Introduction to Power BI', 'Machine Learning with PySpark',
       'Statistical Simulation in Python',
       'Natural Language Generation in Python',
       'Writing Efficient Code with pandas', 'Probability Puzzles in R',
       'Gender Bi as in Graduate Admissions'], dtype=object)

Remove duplicates

Some of the courses on DataCamp appear for multiple search terms, so we’ll have some duplicates in our dataframe. To remove the duplicate courses, we can use the Pandas drop_duplicates() function and pass in the course_title column. We’ll tell Pandas to keep the first value found and drop the rest, and set inplace=True so the change is applied to df directly.

df.drop_duplicates('course_title', keep='first', inplace=True)

To check this has worked, we re-run df.course_type.value_counts(), which reveals we’ve got 196 unique courses and 83 unique projects.

df.course_type.value_counts()

Course     196
Project     83
Name: course_type, dtype: int64
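One thing to watch: de-duplicating on course_title only catches titles that match exactly, so a title that picked up a stray space during scraping (such as 'R eal-time Insights from Social Media Data' in the list above) counts as a different course. The sketch below uses stand-in rows with placeholder URLs to show the difference between de-duplicating on the title and on the URL. Whether each course really has a single stable URL is an assumption you would need to check against the scraped data.

```python
import pandas as pd

# Stand-in rows; the URLs are placeholders, not real course addresses.
courses = pd.DataFrame({
    'course_title': [
        'Intermediate SQL',
        'Intermediate SQL',
        'Real-time Insights from Social Media Data',
        'R eal-time Insights from Social Media Data',
    ],
    'course_url': ['/courses/a', '/courses/a', '/projects/b', '/projects/b'],
})

by_title = courses.drop_duplicates('course_title', keep='first')
by_url = courses.drop_duplicates('course_url', keep='first')

print(len(by_title))  # 3: both spellings of the same title survive
print(len(by_url))    # 2
```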

Finally, to save the data we’ve scraped from DataCamp we’ll use to_csv() to write the data to a CSV file. While this is obviously quite a specific example, the concepts shown here will work on any site and show how easy it can be to scrape web data and reformat it to display in Pandas, allowing you to construct your own custom data sets.

df.to_csv('datacamp-courses.csv')
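By default to_csv() also writes the dataframe index as an unnamed first column. If you don’t want that, passing index=False leaves it out. The sketch below uses a one-row stand-in dataframe and writes to an in-memory buffer instead of a file, so it leaves nothing behind on disk.

```python
import io

import pandas as pd

courses = pd.DataFrame({
    'course_title': ['Introduction to SQL'],
    'course_type': ['Course'],
})

# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
courses.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the CSV back to confirm that only the two named columns were written.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```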

Matt Clarke, Tuesday, March 09, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.