How to scrape a website using Advertools

Learn how to perform web scraping with Advertools in list mode to scrape an XML sitemap and scrape the content from a URL list and return the data in a Pandas dataframe.

Picture by Nimit Kansagra, Pexels.

For larger web scraping projects, the Scrapy web scraping Python package is one of the most effective tools. It’s powerful and fast and has a huge range of features. However, it can be fiddly and time-consuming to set up. Advertools solves this problem by harnessing the power of Scrapy with far less code and hassle.

Advertools (written by Elias Dabbas) is a popular package in the Python SEO community for its simplicity and speed, and the fact that it scrapes a wide range of website content automatically, without the need for you to write any custom code. In this simple project, I’ll show you the example code you need to get up and running to scrape a website using Advertools in list mode. It’s great fun and very easy.

Install the packages

To get started, open a Jupyter notebook and install the Advertools package using the Pip package management system. This will install Advertools and all its dependencies. Advertools is powered by other scraping tools, such as Scrapy, and massively simplifies the process of scraping a website, fetching a wide range of data by default without the need for custom extraction.

!pip3 install advertools

Import the packages

Once the package has installed you’ll need to import the Advertools package and the Pandas package. The convention is to alias these packages as adv and pd respectively to keep the code cleaner and more consistent. Since we’ll be dealing with large Pandas dataframes containing many rows and wide columns we’ll use the Pandas set_option() function to raise the default display limits for rows and column width, so we can easily view the data returned.

import pandas as pd
import advertools as adv
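# Use the full option names; the short forms can be ambiguous in newer Pandas versions
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)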

Scrape the sitemap

Typically, web crawlers and scrapers (such as Screaming Frog) have two “modes”: list mode and discovery mode. As the name suggests, list mode crawls and scrapes pages from a specific list you provide, while discovery mode takes an initial URL and then follows the links on each page it finds until it has found every URL that can be scraped. We’ll be using list mode in this project.
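To make the difference between the two modes concrete, here is a minimal sketch that simulates both against a tiny in-memory "site". Nothing here touches the network or uses Advertools; the SITE dictionary and both function names are made up purely for illustration.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/a"],
    "/c": [],
}


def crawl_list_mode(urls):
    """Visit only the URLs supplied, ignoring any links on those pages."""
    return [url for url in urls if url in SITE]


def crawl_discovery_mode(start_url):
    """Start from one URL and follow links breadth-first until none are new."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


list_result = crawl_list_mode(["/a", "/b"])
discovery_result = crawl_discovery_mode("/")
print(list_result)       # ['/a', '/b']
print(discovery_result)  # ['/', '/a', '/b', '/c']
```

List mode never reaches /c because it was not in the list, whereas discovery mode finds it by following the link from /a.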

The easiest way to quickly obtain a list of URLs to scrape when using a web scraper in list mode is to scrape and parse the XML sitemap for the website. The sitemap lists the site’s key pages, along with some other data about them (such as when each was last modified) to guide search engines on what they should check. We can use the Advertools sitemap_to_df() function to scrape the XML sitemap and return the output in a Pandas dataframe.

df_sitemap = adv.sitemap_to_df('https://practicaldatascience.co.uk/sitemap.xml')
2022-07-15 06:50:17,487 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://practicaldatascience.co.uk/sitemap.xml
df_sitemap.head()
loc lastmod sitemap etag sitemap_size_mb download_date
0 https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... 2021-03-01 00:00:00+00:00 https://practicaldatascience.co.uk/sitemap.xml "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" 0.046545 2022-07-15 06:50:17.506204+00:00
1 https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas 2021-03-01 00:00:00+00:00 https://practicaldatascience.co.uk/sitemap.xml "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" 0.046545 2022-07-15 06:50:17.506204+00:00
2 https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your... 2021-03-01 00:00:00+00:00 https://practicaldatascience.co.uk/sitemap.xml "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" 0.046545 2022-07-15 06:50:17.506204+00:00
3 https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix 2021-03-01 00:00:00+00:00 https://practicaldatascience.co.uk/sitemap.xml "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" 0.046545 2022-07-15 06:50:17.506204+00:00
4 https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-lea... 2021-03-01 00:00:00+00:00 https://practicaldatascience.co.uk/sitemap.xml "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" 0.046545 2022-07-15 06:50:17.506204+00:00
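Because the sitemap comes back as an ordinary Pandas dataframe, you can trim it before scraping. The sketch below uses a small stand-in dataframe with made-up example.com URLs (not real sitemap data) that has the same loc and lastmod columns, and shows two possible filters: by site section and by last modified date.

```python
import pandas as pd

# Stand-in for the dataframe returned by adv.sitemap_to_df(); the URLs are made up.
df_sitemap_demo = pd.DataFrame({
    "loc": [
        "https://example.com/data-science/post-a",
        "https://example.com/machine-learning/post-b",
        "https://example.com/data-science/post-a",  # duplicate entry
        "https://example.com/data-science/post-c",
    ],
    "lastmod": pd.to_datetime(
        ["2021-03-01", "2021-03-01", "2021-03-01", "2022-01-15"], utc=True
    ),
})

# Keep only the URLs in one section of the site, then drop duplicates.
in_section = df_sitemap_demo["loc"].str.contains("/data-science/", regex=False)
section_urls = df_sitemap_demo.loc[in_section, "loc"].drop_duplicates().to_list()

# Keep only the URLs modified on or after a cut-off date.
cutoff = pd.Timestamp("2022-01-01", tz="UTC")
recent_urls = df_sitemap_demo.loc[df_sitemap_demo["lastmod"] >= cutoff, "loc"].to_list()

print(section_urls)
print(recent_urls)
```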

Scrape the website in list mode

Next, we need to create a Python list containing the URLs we want to get Advertools to scrape. This is very easy. We simply select the Pandas column and then call its to_list() method like this: df_sitemap['loc'].to_list(). This will take every URL in the loc column and place it in a Python list that we’ll assign to the variable url_list.

We can then pass url_list to the crawl() function to crawl and scrape the website. We’ll use the follow_links=False argument to ensure the crawl sticks only to the pages in our list, not those it may discover. Advertools crawls run asynchronously via Scrapy so they are very fast. Rather than returning a value, crawl() writes its output to a file in JSON lines format, so we’ll store this in output.jl.

url_list = df_sitemap['loc'].to_list()
adv.crawl(url_list, 'output.jl', follow_links=False)
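If you want to be gentler on the site you are scraping, one option is to tidy the URL list first and pass Scrapy settings through the custom_settings argument of crawl(). The sketch below is only an illustration: build_url_list() and polite_crawl() are hypothetical helper names, the example.com URLs are made up, and the one-second delay is an assumed value you would tune yourself.

```python
import pandas as pd


def build_url_list(df, column="loc"):
    """Return the unique, non-null URLs from a dataframe column, in order."""
    return df[column].dropna().drop_duplicates().to_list()


def polite_crawl(url_list, output_file="output.jl"):
    """Crawl the given URLs in list mode with conservative Scrapy settings."""
    import advertools as adv  # imported here so build_url_list() works without it

    adv.crawl(
        url_list,
        output_file,
        follow_links=False,
        custom_settings={
            "DOWNLOAD_DELAY": 1,     # assumed value: seconds to wait between requests
            "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt
        },
    )


# Made-up sitemap rows, including a missing value and a duplicate.
demo = pd.DataFrame({
    "loc": [
        "https://example.com/a",
        None,
        "https://example.com/a",
        "https://example.com/b",
    ]
})
demo_urls = build_url_list(demo)
print(demo_urls)

# polite_crawl(demo_urls)  # uncomment to run a real crawl
```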

View the crawl data JSON

Once the crawl() function has run, we can view the JSON lines crawl output from Advertools by converting it to a Pandas dataframe using the Pandas read_json() function with the lines=True argument. If you call info() on the returned Pandas dataframe, you’ll be able to see what content Advertools detected within the pages. Each of these values has been scraped, extracted, and stored in the dataframe, without the need for you to write any custom code to parse them.

df_crawl = pd.read_json('output.jl', lines=True)
df_crawl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 638 entries, 0 to 637
Data columns (total 83 columns):
 #   Column                                  Non-Null Count  Dtype         
---  ------                                  --------------  -----         
 0   url                                     638 non-null    object        
 1   title                                   638 non-null    object        
 2   meta_desc                               638 non-null    object        
 3   viewport                                638 non-null    object        
 4   charset                                 638 non-null    object        
 5   h1                                      580 non-null    object        
 6   h2                                      520 non-null    object        
 7   h3                                      638 non-null    object        
 8   h4                                      636 non-null    object        
 9   canonical                               638 non-null    object        
 10  og:locale                               638 non-null    object        
 11  og:title                                638 non-null    object        
 12  og:description                          638 non-null    object        
 13  og:image                                638 non-null    object        
 14  og:url                                  638 non-null    object        
 15  og:type                                 512 non-null    object        
 16  twitter:card                            638 non-null    object        
 17  twitter:site                            638 non-null    object        
 18  twitter:creator                         638 non-null    object        
 19  twitter:title                           638 non-null    object        
 20  twitter:description                     638 non-null    object        
 21  twitter:image                           638 non-null    object        
 22  twitter:url                             638 non-null    object        
 23  jsonld_@context                         638 non-null    object        
 24  jsonld_@type                            638 non-null    object        
 25  jsonld_itemListElement                  512 non-null    object        
 26  jsonld_1_@context                       512 non-null    object        
 27  jsonld_1_@type                          512 non-null    object        
 28  jsonld_1_name                           512 non-null    object        
 29  jsonld_1_@id                            512 non-null    object        
 30  jsonld_1_nationality                    512 non-null    object        
 31  jsonld_1_gender                         512 non-null    object        
 32  jsonld_1_Description                    512 non-null    object        
 33  jsonld_1_jobTitle                       512 non-null    object        
 34  jsonld_1_url                            512 non-null    object        
 35  jsonld_1_image                          512 non-null    object        
 36  jsonld_1_sameAs                         512 non-null    object        
 37  jsonld_1_alumniOf                       512 non-null    object        
 38  body_text                               638 non-null    object        
 39  size                                    638 non-null    int64         
 40  download_timeout                        638 non-null    int64         
 41  download_slot                           638 non-null    object        
 42  download_latency                        638 non-null    float64       
 43  depth                                   638 non-null    int64         
 44  status                                  638 non-null    int64         
 45  links_url                               638 non-null    object        
 46  links_text                              638 non-null    object        
 47  links_nofollow                          638 non-null    object        
 48  nav_links_url                           638 non-null    object        
 49  nav_links_text                          638 non-null    object        
 50  nav_links_nofollow                      638 non-null    object        
 51  footer_links_url                        638 non-null    object        
 52  footer_links_text                       638 non-null    object        
 53  footer_links_nofollow                   638 non-null    object        
 54  img_src                                 636 non-null    object        
 55  img_alt                                 636 non-null    object        
 56  ip_address                              638 non-null    object        
 57  crawl_time                              638 non-null    datetime64[ns]
 58  resp_headers_age                        638 non-null    int64         
 59  resp_headers_cache-control              638 non-null    object        
 60  resp_headers_content-type               638 non-null    object        
 61  resp_headers_date                       638 non-null    object        
 62  resp_headers_etag                       638 non-null    object        
 63  resp_headers_server                     638 non-null    object        
 64  resp_headers_strict-transport-security  638 non-null    object        
 65  resp_headers_vary                       638 non-null    object        
 66  resp_headers_x-nf-request-id            638 non-null    object        
 67  request_headers_accept                  638 non-null    object        
 68  request_headers_accept-language         638 non-null    object        
 69  request_headers_user-agent              638 non-null    object        
 70  request_headers_accept-encoding         638 non-null    object        
 71  resp_headers_content-length             437 non-null    float64       
 72  h5                                      44 non-null     object        
 73  jsonld_name                             126 non-null    object        
 74  jsonld_@id                              126 non-null    object        
 75  jsonld_nationality                      126 non-null    object        
 76  jsonld_gender                           126 non-null    object        
 77  jsonld_Description                      126 non-null    object        
 78  jsonld_jobTitle                         126 non-null    object        
 79  jsonld_url                              126 non-null    object        
 80  jsonld_image                            126 non-null    object        
 81  jsonld_sameAs                           126 non-null    object        
 82  jsonld_alumniOf                         126 non-null    object        
dtypes: datetime64[ns](1), float64(2), int64(5), object(75)
memory usage: 413.8+ KB
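The non-null counts above vary because not every page contains every element. As a rough illustration of how that arises, the sketch below reads three made-up JSON lines records from memory instead of from output.jl, assuming that an element missing from a page is simply left out of that page's record.

```python
import io

import pandas as pd

# Three made-up crawl records in JSON lines format: one JSON object per line.
# The second page has no h1, so that key is absent from its record.
jsonlines = io.StringIO(
    '{"url": "https://example.com/a", "title": "A", "h1": "Page A"}\n'
    '{"url": "https://example.com/b", "title": "B"}\n'
    '{"url": "https://example.com/c", "title": "C", "h1": "Page C"}\n'
)

df_demo = pd.read_json(jsonlines, lines=True)

# Keys missing from a record become NaN, which lowers the non-null count.
h1_count = int(df_demo["h1"].notna().sum())
missing_h1 = df_demo.loc[df_demo["h1"].isna(), "url"].to_list()

print(h1_count)
print(missing_h1)
```

The same isna() filter applied to df_crawl would list the pages that have no h1, which is a quick way to turn the info() summary into something actionable.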

If we view the first row of the dataframe and transpose the output with .T we’ll see all the data Advertools found on the first page scraped. We’ve got the URL, title, meta description and other data from the document head, plus the headings, canonical and tons of other information.

Where a page contains several elements of the same type, such as multiple headings of the same level, their values are joined into a single string with the @@ separator. These can be split into a Python list and parsed separately, though it would be great if Advertools did that by default or as an option. Most (but not all) schema markup found within a page is also extracted and placed in its own column, so for some sites, you may not need to write that much custom scraping code.

df_crawl.head(1).T
0
url https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j...
title How to create a Python virtual environment for Jupyter
meta_desc Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua...
viewport width=device-width, initial-scale=1, shrink-to-fit=no
charset utf-8
h1 How to create a Python virtual environment for Jupyter
h2 Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua...
h3 Introduction to Python@@Intermediate Python@@Introduction to Data Science in Python@@Other posts...
h4 Creating a virtual environment@@Activating the virtual environment@@Running a Jupyter notebook@@...
canonical https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j...
og:locale en_GB
og:title How to create a Python virtual environment for Jupyter
og:description Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua...
og:image https://practicaldatascience.co.uk/assets/images/posts/mac.png
og:url https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j...
og:type article
twitter:card summary_large_image
twitter:site @
twitter:creator @
twitter:title How to create a Python virtual environment for Jupyter
twitter:description Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua...
twitter:image https://practicaldatascience.co.uk/assets/images/posts/mac.png
twitter:url https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j...
jsonld_@context http://schema.org
jsonld_@type BreadcrumbList
jsonld_itemListElement [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://practicaldatascience.co.uk/', 'na...
jsonld_1_@context https://schema.org/
jsonld_1_@type Person
jsonld_1_name Matt Clarke
jsonld_1_@id https://practicaldatascience.co.uk/about
jsonld_1_nationality British
jsonld_1_gender Male
jsonld_1_Description Ecommerce and Marketing data science specialist
jsonld_1_jobTitle Ecommerce and Marketing Director
jsonld_1_url https://practicaldatascience.co.uk
jsonld_1_image https://practicaldatascience.co.uk/assets/images/posts/matt-clarke.jpg
jsonld_1_sameAs [https://twitter.com/EcommerceMatt, https://www.linkedin.com/in/mattclarke/, https://practicalda...
jsonld_1_alumniOf [{'@type': 'EducationalOrganization', 'name': 'Imperial College London', 'sameAs': 'https://ic.a...
body_text \n Data Science \n \n Machine Learning \n \n...
size 67291
download_timeout 180
download_slot practicaldatascience.co.uk
download_latency 0.492844
depth 0
status 200
links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
links_nofollow False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@True@@True@@...
nav_links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
nav_links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
nav_links_nofollow False@@False@@False@@False@@False@@False
footer_links_url https://practicaldatascience.co.uk/data-science@@https://practicaldatascience.co.uk/machine-lear...
footer_links_text Data Science@@Machine Learning@@Data Engineering@@Data Science Courses@@Sitemap@@About@@LinkedIn...
footer_links_nofollow False@@False@@False@@False@@False@@False@@True@@True@@False
img_src ...
img_alt How to create a Python virtual environment for Jupyter@@Test anything in a Jupyter venv without ...
ip_address 104.198.14.52
crawl_time 2022-07-15 06:48:35
resp_headers_age 0
resp_headers_cache-control public, max-age=0, must-revalidate
resp_headers_content-type text/html; charset=UTF-8
resp_headers_date Fri, 15 Jul 2022 06:48:35 GMT
resp_headers_etag "6e837a095eb5cdabf6f09565f0bc00f3-ssl-df"
resp_headers_server Netlify
resp_headers_strict-transport-security max-age=31536000
resp_headers_vary Accept-Encoding
resp_headers_x-nf-request-id 01G809VGJ6N7QE836FRY40269S
request_headers_accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
request_headers_accept-language en
request_headers_user-agent advertools/0.13.1
request_headers_accept-encoding gzip, deflate, br
resp_headers_content-length NaN
h5 NaN
jsonld_name NaN
jsonld_@id NaN
jsonld_nationality NaN
jsonld_gender NaN
jsonld_Description NaN
jsonld_jobTitle NaN
jsonld_url NaN
jsonld_image NaN
jsonld_sameAs NaN
jsonld_alumniOf NaN
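To work with the @@-separated values, you can split them with standard Pandas string methods. The sketch below uses a small made-up dataframe in place of df_crawl and shows both options: a list per page, or one row per value.

```python
import pandas as pd

# Made-up stand-in for two rows of crawl output.
df_demo = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b"],
    "h3": ["Intro@@Setup@@Summary", "Overview"],
})

# Turn each "@@"-delimited string into a Python list.
df_demo["h3_list"] = df_demo["h3"].str.split("@@")

# Or create one row per heading, keeping the URL alongside each value.
df_headings = df_demo[["url", "h3_list"]].explode("h3_list", ignore_index=True)

print(df_demo["h3_list"].to_list())
print(df_headings)
```

The exploded form is handy for counting or filtering individual headings or links, while the list form keeps one row per page.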

Export the data to CSV

Finally, we can save the Pandas dataframe containing our scraped data to a CSV file using the Pandas to_csv() function, with the index=False argument to stop Pandas writing the dataframe index as an extra column.

df_crawl.to_csv('output.csv', index=False)
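If you are unsure what index=False changes, the short sketch below compares the two outputs on a made-up one-row dataframe. It writes the CSV to a string rather than a file, which to_csv() does when no path is given.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"url": ["https://example.com/a"], "status": [200]})

with_index = df_demo.to_csv()                # default: index written as the first column
without_index = df_demo.to_csv(index=False)  # only the data columns

header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(header_with_index)     # ,url,status
print(header_without_index)  # url,status

# Reading the index-free CSV back gives the original columns.
df_back = pd.read_csv(io.StringIO(without_index))
print(list(df_back.columns))
```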

In the next example, I’ll show you how you can perform custom extraction using Advertools to find, scrape, and parse custom elements of page content when you’re scraping a website. This is a little more complicated because you need to identify the HTML elements containing the data you want to extract but, on the plus side, Advertools extracts so much data by default that custom extractions are only needed for the odd element.

Matt Clarke, Friday, July 15, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.