For larger web scraping projects, the Scrapy web scraping Python package is one of the most effective tools. It’s powerful, fast, and has a huge range of features. However, it’s much more fiddly and time-consuming to set up. Advertools solves this problem by utilising the power of Scrapy with far less of the code and hassle.
Advertools (written by Elias Dabbas) is a popular package in the Python SEO community for its simplicity and speed, and the fact that it scrapes a wide range of website content automatically, without the need for you to write any custom code. In this simple project, I’ll show you the example code you need to get up and running to scrape a website using Advertools in list mode. It’s great fun and very easy.
To get started, open a Jupyter notebook and install the Advertools package using the Pip package management system. This will install Advertools and all its dependencies. Advertools is powered by Scrapy under the hood, but massively simplifies the process of scraping a website, fetching a wide range of data by default without the need for custom extraction.
!pip3 install advertools
Once the package has installed, you’ll need to import the Advertools and Pandas packages. The convention is to alias these as `adv` and `pd` to keep the code cleaner and more consistent. Since we’ll be dealing with large Pandas dataframes containing many rows and wide columns, we’ll use the Pandas `set_option()` function to increase the default display limits, so we can easily view the data returned.
import pandas as pd
import advertools as adv
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)
Typically, web crawlers and scrapers (such as Screaming Frog) have two “modes”: list mode and discovery mode. As the name suggests, list mode crawls and scrapes pages from a specific list you provide, while discovery mode takes an initial URL and then follows each link to eventually find every URL that can be scraped. We’ll be using list mode in this project.
The easiest way to quickly obtain a list of URLs to scrape when using a web scraper in list mode is to scrape and parse the XML sitemap for the website. This includes a list of the site’s key pages, and some other data about them to guide search engines on what they should check. We can use the Advertools `sitemap_to_df()` function to scrape the XML sitemap and return the output in a Pandas dataframe.
df_sitemap = adv.sitemap_to_df('https://practicaldatascience.co.uk/sitemap.xml')
2022-07-15 06:50:17,487 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://practicaldatascience.co.uk/sitemap.xml
df_sitemap.head()
loc | lastmod | sitemap | etag | sitemap_size_mb | download_date | |
---|---|---|---|---|---|---|
0 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
1 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
2 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
3 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
4 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-lea... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
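If you only want to crawl a subset of the site, you can filter the sitemap dataframe before building the URL list. Here’s a minimal sketch using a small hypothetical dataframe in place of the real `df_sitemap` (the same `str.contains()` filter works on the output of `sitemap_to_df()`):

```python
import pandas as pd

# Hypothetical sitemap data; in practice this would be df_sitemap from sitemap_to_df()
df_sitemap = pd.DataFrame({
    'loc': [
        'https://example.com/data-science/post-one',
        'https://example.com/machine-learning/post-two',
        'https://example.com/data-science/post-three',
    ]
})

# Keep only the URLs in the /data-science/ section of the site
df_subset = df_sitemap[df_sitemap['loc'].str.contains('/data-science/')]
print(df_subset['loc'].to_list())
```

This keeps the crawl focused on one section, which is handy on very large sites where crawling every sitemap URL would take a long time.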
Next, we need to create a Python list containing the URLs we want Advertools to scrape. This is very easy. We simply select the Pandas column and then append the `to_list()` function, like this: `df_sitemap['loc'].to_list()`. This will take every URL in the `loc` column and place it in a Python list that we’ll assign to the variable `url_list`.
We can then pass `url_list` to the `crawl()` function to crawl and scrape the website. We’ll use the `follow_links=False` argument to ensure the crawl sticks only to the pages in our list, not those it may discover. Advertools crawls run asynchronously via Scrapy, so they are very fast. The output is returned in JSON lines format, so we’ll store this in `output.jl`.
url_list = df_sitemap['loc'].to_list()
adv.crawl(url_list, 'output.jl', follow_links=False)
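By default the crawl runs as fast as Scrapy allows. If you want to be gentler on the target server, `adv.crawl()` also accepts a `custom_settings` dictionary of standard Scrapy settings. A minimal sketch (the delay, concurrency, and user agent values below are illustrative, not recommendations):

```python
# Hypothetical politeness settings passed to adv.crawl() via custom_settings.
# These are standard Scrapy settings; adjust the values to suit the target site.
custom_settings = {
    'DOWNLOAD_DELAY': 1,             # wait one second between requests
    'CONCURRENT_REQUESTS': 4,        # limit the number of parallel requests
    'USER_AGENT': 'my-crawler/1.0',  # identify your crawler (hypothetical name)
}

# adv.crawl(url_list, 'output.jl', follow_links=False, custom_settings=custom_settings)
```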
Once the `crawl()` function has run, we can view the JSON lines crawl output from Advertools by converting it to a Pandas dataframe using the Pandas `read_json()` function with the `lines=True` argument. If you print the `info()` of the returned Pandas dataframe, you’ll be able to see what content Advertools detected within each page. Each of these values has been scraped, extracted, and stored in the dataframe, without the need for you to write any custom code to parse them.
df_crawl = pd.read_json('output.jl', lines=True)
df_crawl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 638 entries, 0 to 637
Data columns (total 83 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 url 638 non-null object
1 title 638 non-null object
2 meta_desc 638 non-null object
3 viewport 638 non-null object
4 charset 638 non-null object
5 h1 580 non-null object
6 h2 520 non-null object
7 h3 638 non-null object
8 h4 636 non-null object
9 canonical 638 non-null object
10 og:locale 638 non-null object
11 og:title 638 non-null object
12 og:description 638 non-null object
13 og:image 638 non-null object
14 og:url 638 non-null object
15 og:type 512 non-null object
16 twitter:card 638 non-null object
17 twitter:site 638 non-null object
18 twitter:creator 638 non-null object
19 twitter:title 638 non-null object
20 twitter:description 638 non-null object
21 twitter:image 638 non-null object
22 twitter:url 638 non-null object
23 jsonld_@context 638 non-null object
24 jsonld_@type 638 non-null object
25 jsonld_itemListElement 512 non-null object
26 jsonld_1_@context 512 non-null object
27 jsonld_1_@type 512 non-null object
28 jsonld_1_name 512 non-null object
29 jsonld_1_@id 512 non-null object
30 jsonld_1_nationality 512 non-null object
31 jsonld_1_gender 512 non-null object
32 jsonld_1_Description 512 non-null object
33 jsonld_1_jobTitle 512 non-null object
34 jsonld_1_url 512 non-null object
35 jsonld_1_image 512 non-null object
36 jsonld_1_sameAs 512 non-null object
37 jsonld_1_alumniOf 512 non-null object
38 body_text 638 non-null object
39 size 638 non-null int64
40 download_timeout 638 non-null int64
41 download_slot 638 non-null object
42 download_latency 638 non-null float64
43 depth 638 non-null int64
44 status 638 non-null int64
45 links_url 638 non-null object
46 links_text 638 non-null object
47 links_nofollow 638 non-null object
48 nav_links_url 638 non-null object
49 nav_links_text 638 non-null object
50 nav_links_nofollow 638 non-null object
51 footer_links_url 638 non-null object
52 footer_links_text 638 non-null object
53 footer_links_nofollow 638 non-null object
54 img_src 636 non-null object
55 img_alt 636 non-null object
56 ip_address 638 non-null object
57 crawl_time 638 non-null datetime64[ns]
58 resp_headers_age 638 non-null int64
59 resp_headers_cache-control 638 non-null object
60 resp_headers_content-type 638 non-null object
61 resp_headers_date 638 non-null object
62 resp_headers_etag 638 non-null object
63 resp_headers_server 638 non-null object
64 resp_headers_strict-transport-security 638 non-null object
65 resp_headers_vary 638 non-null object
66 resp_headers_x-nf-request-id 638 non-null object
67 request_headers_accept 638 non-null object
68 request_headers_accept-language 638 non-null object
69 request_headers_user-agent 638 non-null object
70 request_headers_accept-encoding 638 non-null object
71 resp_headers_content-length 437 non-null float64
72 h5 44 non-null object
73 jsonld_name 126 non-null object
74 jsonld_@id 126 non-null object
75 jsonld_nationality 126 non-null object
76 jsonld_gender 126 non-null object
77 jsonld_Description 126 non-null object
78 jsonld_jobTitle 126 non-null object
79 jsonld_url 126 non-null object
80 jsonld_image 126 non-null object
81 jsonld_sameAs 126 non-null object
82 jsonld_alumniOf 126 non-null object
dtypes: datetime64[ns](1), float64(2), int64(5), object(75)
memory usage: 413.8+ KB
If we view the first row of the dataframe and transpose the output with `.T`, we’ll see all the data Advertools found on the first page scraped. We’ve got the URL, title, meta description and other data from the document head, plus the headings, canonical and tons of other information.
Any values that are present from multiple elements, such as multiple headings of the same level, are joined with the `@@` separator. These can be split into a Python list and parsed separately, though it would be great if Advertools did that by default or as an option. Most (but not all) schema markup found within a page is also extracted and placed in its own column, so for some sites, you may not need to write that much custom scraping code.
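For example, to turn a `@@`-delimited column back into per-page Python lists, you can split each cell on the separator with `str.split()`. A minimal sketch with a hypothetical `h3` value:

```python
import pandas as pd

# Hypothetical @@-delimited value, as returned in columns such as h3 or links_url
headings = pd.Series(['Introduction to Python@@Intermediate Python@@Other posts'])

# Split each cell on the @@ separator to get one Python list per row
split_headings = headings.str.split('@@')
print(split_headings[0])
```

The same approach works on any of the multi-value columns, such as `links_url` or `img_alt`.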
df_crawl.head(1).T
0 | |
---|---|
url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
title | How to create a Python virtual environment for Jupyter |
meta_desc | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
viewport | width=device-width, initial-scale=1, shrink-to-fit=no |
charset | utf-8 |
h1 | How to create a Python virtual environment for Jupyter |
h2 | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
h3 | Introduction to Python@@Intermediate Python@@Introduction to Data Science in Python@@Other posts... |
h4 | Creating a virtual environment@@Activating the virtual environment@@Running a Jupyter notebook@@... |
canonical | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
og:locale | en_GB |
og:title | How to create a Python virtual environment for Jupyter |
og:description | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
og:image | https://practicaldatascience.co.uk/assets/images/posts/mac.png |
og:url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
og:type | article |
twitter:card | summary_large_image |
twitter:site | @ |
twitter:creator | @ |
twitter:title | How to create a Python virtual environment for Jupyter |
twitter:description | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
twitter:image | https://practicaldatascience.co.uk/assets/images/posts/mac.png |
twitter:url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
jsonld_@context | http://schema.org |
jsonld_@type | BreadcrumbList |
jsonld_itemListElement | [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://practicaldatascience.co.uk/', 'na... |
jsonld_1_@context | https://schema.org/ |
jsonld_1_@type | Person |
jsonld_1_name | Matt Clarke |
jsonld_1_@id | https://practicaldatascience.co.uk/about |
jsonld_1_nationality | British |
jsonld_1_gender | Male |
jsonld_1_Description | Ecommerce and Marketing data science specialist |
jsonld_1_jobTitle | Ecommerce and Marketing Director |
jsonld_1_url | https://practicaldatascience.co.uk |
jsonld_1_image | https://practicaldatascience.co.uk/assets/images/posts/matt-clarke.jpg |
jsonld_1_sameAs | [https://twitter.com/EcommerceMatt, https://www.linkedin.com/in/mattclarke/, https://practicalda... |
jsonld_1_alumniOf | [{'@type': 'EducationalOrganization', 'name': 'Imperial College London', 'sameAs': 'https://ic.a... |
body_text | \n Data Science \n \n Machine Learning \n \n... |
size | 67291 |
download_timeout | 180 |
download_slot | practicaldatascience.co.uk |
download_latency | 0.492844 |
depth | 0 |
status | 200 |
links_url | https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr... |
links_text | \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@... |
links_nofollow | False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@True@@True@@... |
nav_links_url | https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr... |
nav_links_text | \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@... |
nav_links_nofollow | False@@False@@False@@False@@False@@False |
footer_links_url | https://practicaldatascience.co.uk/data-science@@https://practicaldatascience.co.uk/machine-lear... |
footer_links_text | Data Science@@Machine Learning@@Data Engineering@@Data Science Courses@@Sitemap@@About@@LinkedIn... |
footer_links_nofollow | False@@False@@False@@False@@False@@False@@True@@True@@False |
img_src | data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAMAAAACCAQAAAA3fa6RAAAADklEQVR42mNkAANGCAUAACMAA2... |
img_alt | How to create a Python virtual environment for Jupyter@@Test anything in a Jupyter venv without ... |
ip_address | 104.198.14.52 |
crawl_time | 2022-07-15 06:48:35 |
resp_headers_age | 0 |
resp_headers_cache-control | public, max-age=0, must-revalidate |
resp_headers_content-type | text/html; charset=UTF-8 |
resp_headers_date | Fri, 15 Jul 2022 06:48:35 GMT |
resp_headers_etag | "6e837a095eb5cdabf6f09565f0bc00f3-ssl-df" |
resp_headers_server | Netlify |
resp_headers_strict-transport-security | max-age=31536000 |
resp_headers_vary | Accept-Encoding |
resp_headers_x-nf-request-id | 01G809VGJ6N7QE836FRY40269S |
request_headers_accept | text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 |
request_headers_accept-language | en |
request_headers_user-agent | advertools/0.13.1 |
request_headers_accept-encoding | gzip, deflate, br |
resp_headers_content-length | NaN |
h5 | NaN |
jsonld_name | NaN |
jsonld_@id | NaN |
jsonld_nationality | NaN |
jsonld_gender | NaN |
jsonld_Description | NaN |
jsonld_jobTitle | NaN |
jsonld_url | NaN |
jsonld_image | NaN |
jsonld_sameAs | NaN |
jsonld_alumniOf | NaN |
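Since the crawl output is a regular Pandas dataframe, common SEO checks become one-liners. As an illustration, here’s a sketch (using a small hypothetical dataframe in place of the real `df_crawl`) that flags page titles longer than 60 characters:

```python
import pandas as pd

# Hypothetical crawl output; in practice use df_crawl from read_json()
df_crawl = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b'],
    'title': ['Short title',
              'A much longer page title that runs well past the sixty character limit'],
})

# Measure title lengths and flag any that exceed 60 characters
df_crawl['title_length'] = df_crawl['title'].str.len()
long_titles = df_crawl[df_crawl['title_length'] > 60]
print(long_titles[['url', 'title_length']])
```

You could apply the same pattern to `meta_desc`, `h1`, or any other column in the crawl output.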
Finally, we can save the Pandas dataframe containing our scraped data to a CSV using the Pandas `to_csv()` function with the `index=False` argument to prevent Pandas adding an additional index column.
df_crawl.to_csv('output.csv', index=False)
In the next example, I’ll show you how you can perform custom extraction using Advertools to find, scrape, and parse custom elements of page content when you’re scraping a website. This is a little more complicated because you need to identify the HTML elements containing the data you want to extract but, on the plus side, Advertools extracts so much data by default that custom extractions are only needed for the odd element.
Matt Clarke, Friday, July 15, 2022