How to use CSS and XPath custom extraction in Advertools

Learn how to perform custom extraction when web scraping with Advertools using CSS selectors and XPath to scrape and parse specific HTML elements.

Picture by Skylar Kang, Pexels.

The Advertools web scraping package, which is popular in the Python SEO community, automatically extracts a wide range of page elements, such as the title, meta description, and various schema.org and OpenGraph elements, without you needing to write custom code to scrape them.

However, sometimes you’ll need to scrape and parse very specific page elements that aren’t covered by these defaults, which is where custom extraction comes in. Custom extraction is a web scraping technique that lets you scrape specific page elements, such as the number of reviews, the tags on a post, or the price of a product, and can be achieved using either CSS identifiers or XPath path expressions, or both.

The “custom extraction” terminology is most commonly used with web scraping tools such as Screaming Frog and Advertools, which return a range of standard page elements by default, meaning custom extraction is only required for more specific use cases. With more basic web scraping approaches, such as roll-your-own scrapers built using Requests, Requests-HTML, and BeautifulSoup, everything you do is effectively custom extraction, but with more sophisticated scrapers, like Scrapy, Advertools, and Screaming Frog, it’s the exception.

How does custom extraction work?

Custom extraction in web scraping can be performed in two main ways: via CSS selectors and via XPath. Most web scraping tools support both, and you may need to use both, depending on the complexity of the items you’re trying to extract. CSS selectors and XPath are relatively simple to use in theory, but can be puzzling to those without a solid understanding of HTML and CSS front-end design and development.

These two methods of custom extraction basically define the unique path in the HTML that points to the specific page element, or page elements, you want to scrape. For example, you might want to find and scrape all the tags from a blog post, or return all the products being recommended in a “Customers also bought” box.

To find the CSS identifiers or XPath to use, you’d need to carefully examine the HTML of the page, inspect the element you want to scrape, and construct a CSS rule or XPath expression that uniquely identifies the element. It sounds easy, but it can be quite a fiddly task requiring lots of trial and error. Also, if the site changes its layout, your CSS or XPath selectors may stop matching and the content will no longer be extracted.

Import the packages

First, install Advertools by entering pip3 install advertools in your terminal, or !pip3 install advertools in a Jupyter notebook, then import the Pandas and Advertools Python packages. You’ll want to use the Pandas set_option() function to increase the maximum number of rows displayed and the maximum column width from their defaults, so you can see all the data in the dataframe.

import pandas as pd
import advertools as adv

# Use the full 'display.' option names to avoid ambiguity in newer Pandas versions
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 100)

Create a crawl list

Next, we’ll create a list of URLs to crawl. Rather than crawling the whole site by scraping the XML sitemap and crawling every page, it’s a good idea to construct a smaller list of representative URLs instead, as it’s much quicker for debugging.

url_list = [
    'https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-ga4-with-python',
    'https://practicaldatascience.co.uk/machine-learning/how-to-create-a-naive-bayes-text-classification-model-using-scikit-learn',
    'https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-support-tickets-using-naive-bayes'
]

Run a basic crawl

To show what happens when you run a basic Advertools crawl without custom extraction, we’ll pass our url_list to the crawl() function and write the output to the JSON lines file output_custom.jl, using the follow_links=False argument so only the URLs in the list are crawled, not the links found within each page.

adv.crawl(url_list, 
          'output_custom.jl', 
          follow_links=False)

After the crawl, we can then use the Pandas read_json() function to load the JSON lines file and display the elements scraped from the page in a Pandas dataframe. As you can see, Advertools returns loads of page elements by default, so you’ll only need to perform custom extraction if the specific HTML elements you want aren’t already in this dataframe.

df_basic_crawl = pd.read_json('output_custom.jl', lines=True)
df_basic_crawl.head(1).T
0
url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
title How to query the Google Analytics Data API for GA4 using Python
meta_desc Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
viewport width=device-width, initial-scale=1, shrink-to-fit=no
charset utf-8
h1 How to query the Google Analytics Data API for GA4 using Python
h2 Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
h3 Install the package@@Configure your settings@@Create your request@@Full code example@@Dimension,...
h4 \n How to calculate abandonment and completion rates using the Google...
canonical https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
og:locale en_GB
og:title How to query the Google Analytics Data API for GA4 using Python
og:description Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
og:image https://practicaldatascience.co.uk/assets/images/posts/gapandas4.jpg
og:url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
og:type article
twitter:card summary_large_image
twitter:site @
twitter:creator @
twitter:title How to query the Google Analytics Data API for GA4 using Python
twitter:description Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
twitter:image https://practicaldatascience.co.uk/assets/images/posts/gapandas4.jpg
twitter:url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
jsonld_@context http://schema.org
jsonld_@type BreadcrumbList
jsonld_itemListElement [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://practicaldatascience.co.uk/', 'na...
jsonld_1_@context https://schema.org/
jsonld_1_@type Person
jsonld_1_name Matt Clarke
jsonld_1_@id https://practicaldatascience.co.uk/about
jsonld_1_nationality British
jsonld_1_gender Male
jsonld_1_Description Ecommerce and Marketing data science specialist
jsonld_1_jobTitle Ecommerce and Marketing Director
jsonld_1_url https://practicaldatascience.co.uk
jsonld_1_image https://practicaldatascience.co.uk/assets/images/posts/matt-clarke.jpg
jsonld_1_sameAs [https://twitter.com/EcommerceMatt, https://www.linkedin.com/in/mattclarke/, https://practicalda...
jsonld_1_alumniOf [{'@type': 'EducationalOrganization', 'name': 'Imperial College London', 'sameAs': 'https://ic.a...
body_text \n Data Science \n \n Machine Learning \n \n...
size 116293
download_timeout 180
download_slot practicaldatascience.co.uk
download_latency 0.168855
depth 0
status 200
links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
links_nofollow False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@True@...
nav_links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
nav_links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
nav_links_nofollow False@@False@@False@@False@@False@@False
footer_links_url https://practicaldatascience.co.uk/data-science@@https://practicaldatascience.co.uk/machine-lear...
footer_links_text Data Science@@Machine Learning@@Data Engineering@@Data Science Courses@@Sitemap@@About@@LinkedIn...
footer_links_nofollow False@@False@@False@@False@@False@@False@@True@@True@@False
img_src data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAMAAAACCAQAAAA3fa6RAAAADklEQVR42mNkAANGCAUAACMAA2...
img_alt How to query the Google Analytics Data API for GA4 using Python@@Matt Clarke@@How to calculate a...
ip_address 104.198.14.52
crawl_time 2022-07-15 07:56:30
resp_headers_content-length 10777
resp_headers_age 85
resp_headers_cache-control public, max-age=0, must-revalidate
resp_headers_content-type text/html; charset=UTF-8
resp_headers_date Fri, 15 Jul 2022 07:55:05 GMT
resp_headers_etag "d1fdd3aa4105279bf0385029c4a5a640-ssl-df"
resp_headers_server Netlify
resp_headers_strict-transport-security max-age=31536000
resp_headers_vary Accept-Encoding
resp_headers_x-nf-request-id 01G80DQVRXY4T7FBRRXCCHYR5C
request_headers_accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
request_headers_accept-language en
request_headers_user-agent advertools/0.13.1
request_headers_accept-encoding gzip, deflate, br

Create a custom extraction using CSS identifiers

Next, we’ll create a custom extraction in Advertools using CSS identifiers. As I used to be a web developer, I’m most comfortable using CSS identifiers for custom extraction when web scraping. They’re much more intuitive than XPath expressions, but they lack some of XPath’s versatility and power, so you may encounter situations where you’ll need to use XPath instead of CSS.

We’ll extract two elements from the page: a list of post tags assigned to each post, and the post date shown at the foot of each article. You can pass any number of CSS selectors to the css_selectors argument as a Python dictionary. The dictionary key, e.g. custom_post_tags, becomes the Pandas column name for the value, while the CSS identifier used for the custom extraction, e.g. .post-tags a::text, goes in the value. I prefix mine with custom_ to keep them all together.

To find the CSS identifier you can use the “Inspect element” feature in your web browser, use the view source option, or use a Chrome extension such as Selector Gadget. The ::text part is important: it returns the text from within the element, rather than the element itself, i.e. its raw HTML, which is probably not what you want.

adv.crawl(url_list, 
          'output_custom_css.jl', 
          follow_links=False, 
          css_selectors={
              'custom_post_tags': '.post-tags a::text', 
              'custom_post_date': 'span.date::text'
          })
df_css = pd.read_json('output_custom_css.jl', lines=True)
df_css[['title', 'custom_post_tags', 'custom_post_date']].head()
title custom_post_tags custom_post_date
0 How to query the Google Analytics Data API for GA4 using Python Data Science@@Google Analytics@@Pandas@@Web analytics Matt Clarke, Wednesday, June 22, 2022
1 How to create a Naive Bayes text classification model using scikit-learn Machine Learning@@Natural Language Processing@@scikit-learn Matt Clarke, Sunday, May 08, 2022
2 How to classify customer support tickets using Naive Bayes Machine Learning@@Customer experience@@Natural Language Processing@@Technical ecommerce@@scikit-... Matt Clarke, Friday, August 13, 2021
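Note that when a selector matches multiple elements, Advertools joins the values into a single string with the @@ separator, as in the custom_post_tags column above. You can split these back into Python lists with Pandas; here’s a quick sketch using a hypothetical row mimicking the output above.

```python
import pandas as pd

# Hypothetical dataframe mimicking Advertools' '@@'-joined crawl output
df = pd.DataFrame({
    'custom_post_tags': ['Data Science@@Google Analytics@@Pandas@@Web analytics']
})

# Split each '@@'-joined string back into a list per row
df['custom_post_tags'] = df['custom_post_tags'].str.split('@@')
print(df['custom_post_tags'].iloc[0])
# ['Data Science', 'Google Analytics', 'Pandas', 'Web analytics']
```

From there you can use Pandas functions such as explode() if you want one tag per row.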

Create a custom extraction using XPath

XPath uses “path expressions” to select nodes or node sets within an XML or HTML document, and is a W3C standard also used within XSLT. It takes a bit of effort to get used to the syntax, but it’s quite powerful and lets you do things you can’t easily achieve using CSS identifiers.

For example, XPath predicates allow you to create path expressions that select the last element of a type, or items where a scraped value falls within a certain range. They’re really handy when you know how to use them effectively.

XPath path expressions are, I think, much harder to come up with than CSS identifiers. Web browsers such as Google Chrome and Firefox do have a “Copy XPath” option, but it rarely produces an expression that works consistently across pages. Selector Gadget may be of use, but invariably I find myself writing my own custom XPath expressions. Here’s how we’d scrape the same elements as above using XPath path expressions.

adv.crawl(url_list, 
          'output_custom_xpath.jl', 
          follow_links=False, 
          xpath_selectors={
              'custom_post_tags': '//div[@class="post-tags"]/a/text()', 
              'custom_post_date': '//p/span/text()'
          })
df_xpath = pd.read_json('output_custom_xpath.jl', lines=True)
df_xpath[['title', 'custom_post_tags', 'custom_post_date']].head()
title custom_post_tags custom_post_date
0 How to query the Google Analytics Data API for GA4 using Python Data Science@@Google Analytics@@Pandas@@Web analytics Matt Clarke, Wednesday, June 22, 2022
1 How to classify customer support tickets using Naive Bayes Machine Learning@@Customer experience@@Natural Language Processing@@Technical ecommerce@@scikit-... Matt Clarke, Friday, August 13, 2021
2 How to create a Naive Bayes text classification model using scikit-learn Machine Learning@@Natural Language Processing@@scikit-learn Matt Clarke, Sunday, May 08, 2022

Matt Clarke, Friday, July 15, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.