For larger web scraping projects, the Scrapy web scraping Python package is one of the most effective tools. It’s powerful, fast, and has a huge range of features. However, it’s much more fiddly and time-consuming to set up. Advertools solves this problem by utilising the power of Scrapy with far less of the code and hassle.
Advertools (written by Elias Dabbas) is a popular package in the Python SEO community for its simplicity and speed, and the fact that it scrapes a wide range of website content automatically, without the need for you to write any custom code. In this simple project, I’ll show you the example code you need to get up and running to scrape a website using Advertools in list mode. It’s great fun and very easy.
To get started, open a Jupyter notebook and install the Advertools package using the Pip package management system. This will install Advertools and all its dependencies. Advertools is powered by Scrapy under the hood, but massively simplifies the process of scraping a website, fetching a wide range of data by default without the need for custom extraction.
!pip3 install advertools
Once the package has installed, you’ll need to import the Advertools and Pandas packages. The convention is to alias these as `adv` and `pd` to keep the code cleaner and more consistent. Since we’ll be dealing with large Pandas dataframes containing many rows and wide columns, we’ll use the Pandas `set_option()` function to increase the default display limits, so we can easily view the data returned.
import pandas as pd
import advertools as adv
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)
Typically, web crawlers and scrapers (such as Screaming Frog) have two “modes”: list mode and discovery mode. As the name suggests, list mode crawls and scrapes pages from a specific list you provide, while discovery mode takes an initial URL and then follows each link to eventually find every URL that can be scraped. We’ll be using list mode in this project.
The easiest way to quickly obtain a list of URLs to scrape when using a web scraper in list mode is to scrape and parse the XML sitemap for the website. This includes a list of the site’s key pages, and some other data about them to guide search engines on what they should check. We can use the Advertools `sitemap_to_df()` function to scrape the XML sitemap and return the output in a Pandas dataframe.
df_sitemap = adv.sitemap_to_df('https://practicaldatascience.co.uk/sitemap.xml')
2022-07-15 06:50:17,487 | INFO | sitemaps.py:536 | sitemap_to_df | Getting https://practicaldatascience.co.uk/sitemap.xml
df_sitemap.head()
loc | lastmod | sitemap | etag | sitemap_size_mb | download_date | |
---|---|---|---|---|---|---|
0 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
1 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
2 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
3 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
4 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-lea... | 2021-03-01 00:00:00+00:00 | https://practicaldatascience.co.uk/sitemap.xml | "942a323e3a4a2bd3bf96cbb92c3aba68-ssl" | 0.046545 | 2022-07-15 06:50:17.506204+00:00 |
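If you only want to crawl a subset of the site, you can filter the sitemap dataframe before building the URL list. Here’s a minimal sketch using a small hypothetical dataframe in place of the real `df_sitemap` (the same `str.contains()` filter works on the output of `sitemap_to_df()`):

```python
import pandas as pd

# Hypothetical sitemap data; in practice this would be df_sitemap from sitemap_to_df()
df_sitemap = pd.DataFrame({
    'loc': [
        'https://example.com/data-science/post-one',
        'https://example.com/machine-learning/post-two',
        'https://example.com/data-science/post-three',
    ]
})

# Keep only the URLs in the /data-science/ section of the site
df_subset = df_sitemap[df_sitemap['loc'].str.contains('/data-science/')]
print(df_subset['loc'].to_list())
```

This keeps the crawl focused on one section, which is handy on very large sites where crawling every sitemap URL would take a long time.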
Next, we need to create a Python list containing the URLs we want Advertools to scrape. This is very easy. We simply select the Pandas column and then append the `to_list()` function, like this: `df_sitemap['loc'].to_list()`. This will take every URL in the `loc` column and place it in a Python list that we’ll assign to the variable `url_list`.
We can then pass `url_list` to the `crawl()` function to crawl and scrape the website. We’ll use the `follow_links=False` argument to ensure the crawl sticks only to the pages in our list, not those it may discover. Advertools crawls run asynchronously via Scrapy, so they are very fast. The output is returned in JSON lines format, so we’ll store this in `output.jl`.
url_list = df_sitemap['loc'].to_list()
adv.crawl(url_list, 'output.jl', follow_links=False)
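By default the crawl runs as fast as Scrapy allows. If you want to be gentler on the target server, `adv.crawl()` also accepts a `custom_settings` dictionary of standard Scrapy settings. A minimal sketch (the delay, concurrency, and user agent values below are illustrative, not recommendations):

```python
# Hypothetical politeness settings passed to adv.crawl() via custom_settings.
# These are standard Scrapy settings; adjust the values to suit the target site.
custom_settings = {
    'DOWNLOAD_DELAY': 1,             # wait one second between requests
    'CONCURRENT_REQUESTS': 4,        # limit the number of parallel requests
    'USER_AGENT': 'my-crawler/1.0',  # identify your crawler (hypothetical name)
}

# adv.crawl(url_list, 'output.jl', follow_links=False, custom_settings=custom_settings)
```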
Once the `crawl()` function has run, we can view the JSON lines crawl output from Advertools by converting it to a Pandas dataframe using the Pandas `read_json()` function with the `lines=True` argument. If you print the `info()` of the returned Pandas dataframe, you’ll be able to see what content Advertools detected within each page. Each of these values has been scraped, extracted, and stored in the dataframe, without the need for you to write any custom code to parse them.
df_crawl = pd.read_json('output.jl', lines=True)
df_crawl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 638 entries, 0 to 637
Data columns (total 83 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 url 638 non-null object
1 title 638 non-null object
2 meta_desc 638 non-null object
3 viewport 638 non-null object
4 charset 638 non-null object
5 h1 580 non-null object
6 h2 520 non-null object
7 h3 638 non-null object
8 h4 636 non-null object
9 canonical 638 non-null object
10 og:locale 638 non-null object
11 og:title 638 non-null object
12 og:description 638 non-null object
13 og:image 638 non-null object
14 og:url 638 non-null object
15 og:type 512 non-null object
16 twitter:card 638 non-null object
17 twitter:site 638 non-null object
18 twitter:creator 638 non-null object
19 twitter:title 638 non-null object
20 twitter:description 638 non-null object
21 twitter:image 638 non-null object
22 twitter:url 638 non-null object
23 jsonld_@context 638 non-null object
24 jsonld_@type 638 non-null object
25 jsonld_itemListElement 512 non-null object
26 jsonld_1_@context 512 non-null object
27 jsonld_1_@type 512 non-null object
28 jsonld_1_name 512 non-null object
29 jsonld_1_@id 512 non-null object
30 jsonld_1_nationality 512 non-null object
31 jsonld_1_gender 512 non-null object
32 jsonld_1_Description 512 non-null object
33 jsonld_1_jobTitle 512 non-null object
34 jsonld_1_url 512 non-null object
35 jsonld_1_image 512 non-null object
36 jsonld_1_sameAs 512 non-null object
37 jsonld_1_alumniOf 512 non-null object
38 body_text 638 non-null object
39 size 638 non-null int64
40 download_timeout 638 non-null int64
41 download_slot 638 non-null object
42 download_latency 638 non-null float64
43 depth 638 non-null int64
44 status 638 non-null int64
45 links_url 638 non-null object
46 links_text 638 non-null object
47 links_nofollow 638 non-null object
48 nav_links_url 638 non-null object
49 nav_links_text 638 non-null object
50 nav_links_nofollow 638 non-null object
51 footer_links_url 638 non-null object
52 footer_links_text 638 non-null object
53 footer_links_nofollow 638 non-null object
54 img_src 636 non-null object
55 img_alt 636 non-null object
56 ip_address 638 non-null object
57 crawl_time 638 non-null datetime64[ns]
58 resp_headers_age 638 non-null int64
59 resp_headers_cache-control 638 non-null object
60 resp_headers_content-type 638 non-null object
61 resp_headers_date 638 non-null object
62 resp_headers_etag 638 non-null object
63 resp_headers_server 638 non-null object
64 resp_headers_strict-transport-security 638 non-null object
65 resp_headers_vary 638 non-null object
66 resp_headers_x-nf-request-id 638 non-null object
67 request_headers_accept 638 non-null object
68 request_headers_accept-language 638 non-null object
69 request_headers_user-agent 638 non-null object
70 request_headers_accept-encoding 638 non-null object
71 resp_headers_content-length 437 non-null float64
72 h5 44 non-null object
73 jsonld_name 126 non-null object
74 jsonld_@id 126 non-null object
75 jsonld_nationality 126 non-null object
76 jsonld_gender 126 non-null object
77 jsonld_Description 126 non-null object
78 jsonld_jobTitle 126 non-null object
79 jsonld_url 126 non-null object
80 jsonld_image 126 non-null object
81 jsonld_sameAs 126 non-null object
82 jsonld_alumniOf 126 non-null object
dtypes: datetime64[ns](1), float64(2), int64(5), object(75)
memory usage: 413.8+ KB
If we view the first row of the dataframe and transpose the output with `.T`, we’ll see all the data Advertools found on the first page scraped. We’ve got the URL, title, meta description and other data from the document head, plus the headings, canonical and tons of other information.
Any values that are present from multiple elements, such as multiple headings of the same level, are joined with the `@@` separator. These can be split into a Python list and parsed separately, though it would be great if Advertools did that by default or as an option. Most (but not all) schema markup found within a page is also extracted and placed in its own column, so for some sites, you may not need to write that much custom scraping code.
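For example, to turn a `@@`-delimited column back into per-page Python lists, you can split each cell on the separator with `str.split()`. A minimal sketch with a hypothetical `h3` value:

```python
import pandas as pd

# Hypothetical @@-delimited value, as returned in columns such as h3 or links_url
headings = pd.Series(['Introduction to Python@@Intermediate Python@@Other posts'])

# Split each cell on the @@ separator to get one Python list per row
split_headings = headings.str.split('@@')
print(split_headings[0])
```

The same approach works on any of the multi-value columns, such as `links_url` or `img_alt`.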
df_crawl.head(1).T
0 | |
---|---|
url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
title | How to create a Python virtual environment for Jupyter |
meta_desc | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
viewport | width=device-width, initial-scale=1, shrink-to-fit=no |
charset | utf-8 |
h1 | How to create a Python virtual environment for Jupyter |
h2 | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
h3 | Introduction to Python@@Intermediate Python@@Introduction to Data Science in Python@@Other posts... |
h4 | Creating a virtual environment@@Activating the virtual environment@@Running a Jupyter notebook@@... |
canonical | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
og:locale | en_GB |
og:title | How to create a Python virtual environment for Jupyter |
og:description | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
og:image | https://practicaldatascience.co.uk/assets/images/posts/mac.png |
og:url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
og:type | article |
twitter:card | summary_large_image |
twitter:site | @ |
twitter:creator | @ |
twitter:title | How to create a Python virtual environment for Jupyter |
twitter:description | Learn how to create a Python virtual environment for your Jupyter notebook using venv and virtua... |
twitter:image | https://practicaldatascience.co.uk/assets/images/posts/mac.png |
twitter:url | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-j... |
jsonld_@context | http://schema.org |
jsonld_@type | BreadcrumbList |
jsonld_itemListElement | [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://practicaldatascience.co.uk/', 'na... |
jsonld_1_@context | https://schema.org/ |
jsonld_1_@type | Person |
jsonld_1_name | Matt Clarke |
jsonld_1_@id | https://practicaldatascience.co.uk/about |
jsonld_1_nationality | British |
jsonld_1_gender | Male |
jsonld_1_Description | Ecommerce and Marketing data science specialist |
jsonld_1_jobTitle | Ecommerce and Marketing Director |
jsonld_1_url | https://practicaldatascience.co.uk |
jsonld_1_image | https://practicaldatascience.co.uk/assets/images/posts/matt-clarke.jpg |
jsonld_1_sameAs | [https://twitter.com/EcommerceMatt, https://www.linkedin.com/in/mattclarke/, https://practicalda... |
jsonld_1_alumniOf | [{'@type': 'EducationalOrganization', 'name': 'Imperial College London', 'sameAs': 'https://ic.a... |
body_text | \n Data Science \n \n Machine Learning \n \n... |
size | 67291 |
download_timeout | 180 |
download_slot | practicaldatascience.co.uk |
download_latency | 0.492844 |
depth | 0 |
status | 200 |
links_url | https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr... |
links_text | \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@... |
links_nofollow | False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@True@@True@@... |
nav_links_url | https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr... |
nav_links_text | \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@... |
nav_links_nofollow | False@@False@@False@@False@@False@@False |
footer_links_url | https://practicaldatascience.co.uk/data-science@@https://practicaldatascience.co.uk/machine-lear... |
footer_links_text | Data Science@@Machine Learning@@Data Engineering@@Data Science Courses@@Sitemap@@About@@LinkedIn... |
footer_links_nofollow | False@@False@@False@@False@@False@@False@@True@@True@@False |
img_src | data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAMAAAACCAQAAAA3fa6RAAAADklEQVR42mNkAANGCAUAACMAA2... |
img_alt | How to create a Python virtual environment for Jupyter@@Test anything in a Jupyter venv without ... |
ip_address | 104.198.14.52 |
crawl_time | 2022-07-15 06:48:35 |
resp_headers_age | 0 |
resp_headers_cache-control | public, max-age=0, must-revalidate |
resp_headers_content-type | text/html; charset=UTF-8 |
resp_headers_date | Fri, 15 Jul 2022 06:48:35 GMT |
resp_headers_etag | "6e837a095eb5cdabf6f09565f0bc00f3-ssl-df" |
resp_headers_server | Netlify |
resp_headers_strict-transport-security | max-age=31536000 |
resp_headers_vary | Accept-Encoding |
resp_headers_x-nf-request-id | 01G809VGJ6N7QE836FRY40269S |
request_headers_accept | text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 |
request_headers_accept-language | en |
request_headers_user-agent | advertools/0.13.1 |
request_headers_accept-encoding | gzip, deflate, br |
resp_headers_content-length | NaN |
h5 | NaN |
jsonld_name | NaN |
jsonld_@id | NaN |
jsonld_nationality | NaN |
jsonld_gender | NaN |
jsonld_Description | NaN |
jsonld_jobTitle | NaN |
jsonld_url | NaN |
jsonld_image | NaN |
jsonld_sameAs | NaN |
jsonld_alumniOf | NaN |
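Since the crawl output is a regular Pandas dataframe, common SEO checks become one-liners. As an illustration, here’s a sketch (using a small hypothetical dataframe in place of the real `df_crawl`) that flags page titles longer than 60 characters:

```python
import pandas as pd

# Hypothetical crawl output; in practice use df_crawl from read_json()
df_crawl = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b'],
    'title': ['Short title',
              'A much longer page title that runs well past the sixty character limit'],
})

# Measure title lengths and flag any that exceed 60 characters
df_crawl['title_length'] = df_crawl['title'].str.len()
long_titles = df_crawl[df_crawl['title_length'] > 60]
print(long_titles[['url', 'title_length']])
```

You could apply the same pattern to `meta_desc`, `h1`, or any other column in the crawl output.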
Finally, we can save the Pandas dataframe containing our scraped data to a CSV using the Pandas `to_csv()` function with the `index=False` argument to prevent Pandas adding an additional index column.
df_crawl.to_csv('output.csv', index=False)
In the next example, I’ll show you how you can perform custom extraction using Advertools to find, scrape, and parse custom elements of page content when you’re scraping a website. This is a little more complicated because you need to identify the HTML elements containing the data you want to extract but, on the plus side, Advertools extracts so much data by default that custom extractions are only needed for the odd element.
Matt Clarke, Friday, July 15, 2022