How to use CSS and XPath custom extraction in Advertools

Learn how to perform custom extraction when web scraping with Advertools using CSS selectors and XPath to scrape and parse specific HTML elements.

Picture by Skylar Kang, Pexels.

The Advertools web scraping package, which is popular in the Python SEO community, automatically extracts a wide range of page elements, such as the title, meta description, and various schema.org and OpenGraph elements, without you needing to write custom code to scrape them.

However, sometimes you’ll need to scrape and parse very specific page elements that aren’t covered by these defaults, which is where custom extraction comes in. Custom extraction is a web scraping technique that lets you scrape specific page elements, such as the number of reviews, the tags on a post, or the price of a product, and can be achieved using either CSS identifiers or XPath path expressions, or both.

The “custom extraction” terminology is most commonly used with web scraping tools such as Screaming Frog and Advertools, which return a range of standard page elements by default, meaning custom extraction is only required for more specific use cases. With more basic web scraping approaches, such as roll-your-own scrapers built using Requests, Requests-HTML, and BeautifulSoup, everything you do is effectively custom extraction, but with more sophisticated scrapers, like Scrapy, Advertools, and Screaming Frog, it’s the exception.

How does custom extraction work?

Custom extraction in web scraping can be performed in two main ways: via CSS selectors and via XPath. Most web scraping tools support both, and you may need to use both, depending on the complexity of the items you’re trying to extract. CSS selectors and XPath are relatively simple to use in theory, but can be puzzling to those without a solid understanding of HTML and CSS front-end design and development.

These two methods of custom extraction basically define the unique path in the HTML that points to the specific page element, or page elements, you want to scrape. For example, you might want to find and scrape all the tags from a blog post, or return all the products being recommended in a “Customers also bought” box.

To find the CSS identifiers or XPath to use, you’d need to carefully examine the HTML of the page, inspect the element you want to scrape, and construct a CSS rule or XPath expression that uniquely identifies the element. It sounds easy, but it can be quite a fiddly task requiring lots of trial and error. Also, if the site changes its layout, your CSS or XPath selectors may stop matching and the content will no longer be extracted.

Import the packages

First, install Advertools by entering pip3 install advertools in your terminal, or !pip3 install advertools in a Jupyter notebook, then import the Pandas and Advertools Python packages. You’ll want to use the Pandas set_option() function to increase the maximum number of rows displayed and the maximum column width from their defaults, so you can see all the data in the dataframe.

import pandas as pd
import advertools as adv

# Use the full 'display.' option names to avoid ambiguity in newer Pandas versions
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 100)

Create a crawl list

Next, we’ll create a list of URLs to crawl. Rather than crawling the whole site by scraping the XML sitemap and crawling every page, it’s a good idea to construct a smaller list of representative URLs instead, as it’s much quicker for debugging.

url_list = [
    'https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-ga4-with-python',
    'https://practicaldatascience.co.uk/machine-learning/how-to-create-a-naive-bayes-text-classification-model-using-scikit-learn',
    'https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-support-tickets-using-naive-bayes'
]

Run a basic crawl

To show what happens when you run a basic Advertools crawl without custom extraction, we’ll pass our url_list to the crawl() function and write the output to the JSON lines file output_custom.jl, using the follow_links=False argument so only the URLs in the list are crawled, not the links found within each page.

adv.crawl(url_list, 
          'output_custom.jl', 
          follow_links=False)

After the crawl, we can then use the Pandas read_json() function to load the JSON lines file and display the elements scraped from the page in a Pandas dataframe. As you can see, Advertools returns loads of page elements by default, so you’ll only need to perform custom extraction if the specific HTML elements you want aren’t already in this dataframe.

df_basic_crawl = pd.read_json('output_custom.jl', lines=True)
df_basic_crawl.head(1).T
0
url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
title How to query the Google Analytics Data API for GA4 using Python
meta_desc Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
viewport width=device-width, initial-scale=1, shrink-to-fit=no
charset utf-8
h1 How to query the Google Analytics Data API for GA4 using Python
h2 Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
h3 Install the package@@Configure your settings@@Create your request@@Full code example@@Dimension,...
h4 \n How to calculate abandonment and completion rates using the Google...
canonical https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
og:locale en_GB
og:title How to query the Google Analytics Data API for GA4 using Python
og:description Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
og:image https://practicaldatascience.co.uk/assets/images/posts/gapandas4.jpg
og:url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
og:type article
twitter:card summary_large_image
twitter:site @
twitter:creator @
twitter:title How to query the Google Analytics Data API for GA4 using Python
twitter:description Learn how to query the Google Analytics Data API for GA4 using Python with GAPandas4 to fetch yo...
twitter:image https://practicaldatascience.co.uk/assets/images/posts/gapandas4.jpg
twitter:url https://practicaldatascience.co.uk/data-science/how-to-query-the-google-analytics-data-api-for-g...
jsonld_@context http://schema.org
jsonld_@type BreadcrumbList
jsonld_itemListElement [{'@type': 'ListItem', 'position': 1, 'item': {'@id': 'https://practicaldatascience.co.uk/', 'na...
jsonld_1_@context https://schema.org/
jsonld_1_@type Person
jsonld_1_name Matt Clarke
jsonld_1_@id https://practicaldatascience.co.uk/about
jsonld_1_nationality British
jsonld_1_gender Male
jsonld_1_Description Ecommerce and Marketing data science specialist
jsonld_1_jobTitle Ecommerce and Marketing Director
jsonld_1_url https://practicaldatascience.co.uk
jsonld_1_image https://practicaldatascience.co.uk/assets/images/posts/matt-clarke.jpg
jsonld_1_sameAs [https://twitter.com/EcommerceMatt, https://www.linkedin.com/in/mattclarke/, https://practicalda...
jsonld_1_alumniOf [{'@type': 'EducationalOrganization', 'name': 'Imperial College London', 'sameAs': 'https://ic.a...
body_text \n Data Science \n \n Machine Learning \n \n...
size 116293
download_timeout 180
download_slot practicaldatascience.co.uk
download_latency 0.168855
depth 0
status 200
links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
links_nofollow False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@False@@True@...
nav_links_url https://practicaldatascience.co.uk/@@https://practicaldatascience.co.uk/data-science@@https://pr...
nav_links_text \n \n Practical Data Science\n @@Data Science@@Machine Learning@@Data Engineering@...
nav_links_nofollow False@@False@@False@@False@@False@@False
footer_links_url https://practicaldatascience.co.uk/data-science@@https://practicaldatascience.co.uk/machine-lear...
footer_links_text Data Science@@Machine Learning@@Data Engineering@@Data Science Courses@@Sitemap@@About@@LinkedIn...
footer_links_nofollow False@@False@@False@@False@@False@@False@@True@@True@@False
img_src data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAMAAAACCAQAAAA3fa6RAAAADklEQVR42mNkAANGCAUAACMAA2...
img_alt How to query the Google Analytics Data API for GA4 using Python@@Matt Clarke@@How to calculate a...
ip_address 104.198.14.52
crawl_time 2022-07-15 07:56:30
resp_headers_content-length 10777
resp_headers_age 85
resp_headers_cache-control public, max-age=0, must-revalidate
resp_headers_content-type text/html; charset=UTF-8
resp_headers_date Fri, 15 Jul 2022 07:55:05 GMT
resp_headers_etag "d1fdd3aa4105279bf0385029c4a5a640-ssl-df"
resp_headers_server Netlify
resp_headers_strict-transport-security max-age=31536000
resp_headers_vary Accept-Encoding
resp_headers_x-nf-request-id 01G80DQVRXY4T7FBRRXCCHYR5C
request_headers_accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
request_headers_accept-language en
request_headers_user-agent advertools/0.13.1
request_headers_accept-encoding gzip, deflate, br

Create a custom extraction using CSS identifiers

Next, we’ll create a custom extraction in Advertools using CSS identifiers. As I used to be a web developer, I’m most comfortable using CSS identifiers for custom extraction when web scraping. They’re much more intuitive than XPath expressions, but they lack some of XPath’s versatility and power, so you may encounter situations where you’ll need to use XPath instead of CSS.

We’ll extract two elements from the page: a list of post tags assigned to each post, and the post date shown at the foot of each article. You can pass any number of CSS selectors to the css_selectors argument as a Python dictionary. The dictionary key, e.g. custom_post_tags, becomes the Pandas column name for the value, while the CSS identifier used for the custom extraction, e.g. .post-tags a::text, goes in the value. I prefix mine with custom_ to keep them all together.

To find the CSS identifier you can use the “Inspect element” feature in your web browser, use the view source option, or use a Chrome extension such as Selector Gadget. The ::text part is important: it returns the text from within the element, rather than the element itself, i.e. its raw HTML, which is probably not what you want.

adv.crawl(url_list, 
          'output_custom_css.jl', 
          follow_links=False, 
          css_selectors={
              'custom_post_tags': '.post-tags a::text', 
              'custom_post_date': 'span.date::text'
          })
df_css = pd.read_json('output_custom_css.jl', lines=True)
df_css[['title', 'custom_post_tags', 'custom_post_date']].head()
title custom_post_tags custom_post_date
0 How to query the Google Analytics Data API for GA4 using Python Data Science@@Google Analytics@@Pandas@@Web analytics Matt Clarke, Wednesday, June 22, 2022
1 How to create a Naive Bayes text classification model using scikit-learn Machine Learning@@Natural Language Processing@@scikit-learn Matt Clarke, Sunday, May 08, 2022
2 How to classify customer support tickets using Naive Bayes Machine Learning@@Customer experience@@Natural Language Processing@@Technical ecommerce@@scikit-... Matt Clarke, Friday, August 13, 2021
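Note that when a selector matches multiple elements, Advertools joins the values into a single string with the @@ separator, as in the custom_post_tags column above. You can split these back into Python lists with Pandas; here’s a quick sketch using a hypothetical row mimicking the output above.

```python
import pandas as pd

# Hypothetical dataframe mimicking Advertools' '@@'-joined crawl output
df = pd.DataFrame({
    'custom_post_tags': ['Data Science@@Google Analytics@@Pandas@@Web analytics']
})

# Split each '@@'-joined string back into a list per row
df['custom_post_tags'] = df['custom_post_tags'].str.split('@@')
print(df['custom_post_tags'].iloc[0])
# ['Data Science', 'Google Analytics', 'Pandas', 'Web analytics']
```

From there you can use Pandas functions such as explode() if you want one tag per row.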

Create a custom extraction using XPath

XPath uses “path expressions” to select nodes or node sets within an XML or HTML document, and is a W3C standard also used within XSLT. It takes a bit of effort to get used to the syntax, but it’s quite powerful and lets you do things you can’t easily achieve using CSS identifiers.

For example, XPath predicates allow you to create path expressions that select the last element of a type, or items where a scraped value falls within a certain range. They’re really handy when you know how to use them effectively.

XPath path expressions are, I think, much harder to come up with than CSS identifiers. Web browsers such as Google Chrome and Firefox do have a “Copy XPath” option, but it rarely produces an expression that works consistently across pages. Selector Gadget may be of use, but invariably I find myself writing my own custom XPath expressions. Here’s how we’d scrape the same elements as above using XPath path expressions.

adv.crawl(url_list, 
          'output_custom_xpath.jl', 
          follow_links=False, 
          xpath_selectors={
              'custom_post_tags': '//div[@class="post-tags"]/a/text()', 
              'custom_post_date': '//p/span/text()'
          })
df_xpath = pd.read_json('output_custom_xpath.jl', lines=True)
df_xpath[['title', 'custom_post_tags', 'custom_post_date']].head()
title custom_post_tags custom_post_date
0 How to query the Google Analytics Data API for GA4 using Python Data Science@@Google Analytics@@Pandas@@Web analytics Matt Clarke, Wednesday, June 22, 2022
1 How to classify customer support tickets using Naive Bayes Machine Learning@@Customer experience@@Natural Language Processing@@Technical ecommerce@@scikit-... Matt Clarke, Friday, August 13, 2021
2 How to create a Naive Bayes text classification model using scikit-learn Machine Learning@@Natural Language Processing@@scikit-learn Matt Clarke, Sunday, May 08, 2022

Matt Clarke, Friday, July 15, 2022

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.