Web scraping is a programming technique that uses a script or bot to visit one or more websites and extract specific elements or HTML tags from the source code of the page, so the data can be analysed, visualised, or used in models.
Web scrapers have a multitude of uses, especially in SEO, and learning how to build them can give you access to data that can help improve your ecommerce or marketing team’s results.
If you’re using Python for SEO, then you’ll be pleased to know that there are a range of excellent open source tools that make web scraping projects very simple for technically-minded SEOs, or data scientists supporting internal SEO teams.
Python is the most widely used programming language for web scraping projects, and the Python community has created some incredible packages that are well-suited to those working in SEO, marketing, or ecommerce who have reasonable Python programming skills to apply to their work.
The web scraping process involves two main steps: web crawling and web scraping. The web crawling step is the action of visiting a website and following every URL found, either by using a pre-generated list of URLs to crawl (such as those obtained when you scrape a sitemap.xml file), or by being given the domain as a starting point and then visiting every URL discovered - a process also known as web spidering.
On each URL found by the web crawler (or web spider), some custom code then runs to “scrape” the desired content from the page’s underlying source code (usually HTML), using rules that identify specific HTML tags in the page, such as the title or meta description. The parsed data are then saved in a CSV file or database.
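The crawl step described above boils down to a breadth-first traversal of the site's link graph. Here's a minimal sketch using only the standard library, where the hypothetical `site` dict stands in for the fetch-and-extract-links work a real spider would do:

```python
from collections import deque

def crawl(site: dict, start: str) -> list:
    """Breadth-first spider: visit the start URL, then every URL found."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        # In a real crawler this lookup would fetch the page and
        # extract its links; here a dict stands in for that step.
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

site = {"/": ["/a", "/b"], "/a": ["/b", "/c"]}
print(crawl(site, "/"))  # ['/', '/a', '/b', '/c']
```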
The web scraping process therefore includes two different elements - one to make an HTTP request to the server to fetch the page, and one to parse the page’s source code to extract the elements of interest, usually using regular expressions or Document Object Model (DOM) technologies such as XPath or CSS selectors.
While these can both be relatively complex to do by hand, Python does make the web scraping process much easier and there are now a wide range of excellent Python packages available to help you scrape websites and extract useful information to use in your analyses, visualisations, or models.
Web scraping packages can be loosely divided into those that crawl websites, those that scrape or parse the content from the crawled pages, and those that combine the two.
Some web scraping packages available are quite basic and easy to use for small projects, but are slower at scale, so you may need a more complex solution depending on the size of the sites you want to scrape. There’s usually no need to pay to access a costly web scraping API for most projects. You can usually build a custom web scraper for free, if you have some intermediate Python skills.
| Package | Description |
|---|---|
| Requests | Requests is a Python HTTP client package for making HTTP requests. It is very simple to use and gives you access to the source code of an HTML document so it can be parsed using a web parsing package, such as Beautiful Soup. Requests is really good, but is best suited to smaller projects, as it only handles a single connection at a time, making it slower on larger sites. |
| Beautiful Soup | Beautiful Soup is a parser package that can read HTML, XML, RSS, and various other markup languages, letting you extract specific page elements using CSS selectors or its own search methods. The vast majority of web scraping packages use this Python library to provide their web parsing functionality, but you can also use it on its own to extract data from chunks of HTML code - there's no requirement to pass it a whole HTML document. |
| Requests-HTML | Requests-HTML is a Python library that essentially combines the Requests library and Beautiful Soup in a single package, allowing it to both fetch an HTML document and parse its source code to extract the data of interest. It has some really neat features and is a great choice for smaller Python web scraping projects. |
| Screaming Frog | Screaming Frog SEO Spider is a commercial application that runs as a desktop application on Windows, Mac, and Linux machines, and as a command line program. This web scraper is preconfigured to scrape many common page elements, has a stack of features to scrape other content, and integrates with platforms such as Google Analytics and Google Search Console, so it's immensely powerful. It's not an expensive product, so if you're regularly doing web scraping projects it's worth considering, especially if you want less technical staff to assist or handle scraping tasks on their own. |
| Scrapy | Scrapy is probably the most sophisticated package for web scraping with Python. Unlike most others, it supports threading, so it can create multiple connections to a website and scrape several pages at once, making it by far the quickest. However, the downside is that it's much more time-consuming to set up and typically requires a lot more code than other scrapers. The learning curve is also the steepest of those here, as you usually need to build a custom web scraper for each site. |
| Advertools | Finally, there's Advertools. Advertools is a Python package built on top of Scrapy and is aimed at SEOs who want to perform web scraping projects. It gives you all the benefits of Scrapy but with a simpler structure and far less code to write (plus a shed-load of other handy features unrelated to web scraping). Unlike Scrapy, it fetches a bunch of common page elements by default, saving you the hassle of writing that code yourself. As it supports threading, it's much quicker than un-threaded scrapers like Requests and Selenium. It's a really good choice. |
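As a rough illustration of the Requests and Beautiful Soup pairing described above, the sketch below parses a title and meta description from a chunk of HTML. In practice, the `html` string would come from a call such as `requests.get(url).text`; the `scrape_page` name and sample markup are my own:

```python
from bs4 import BeautifulSoup

def scrape_page(html: str) -> dict:
    """Parse the title and meta description from a page's source code."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    tag = soup.find("meta", attrs={"name": "description"})
    description = tag.get("content", "") if tag else ""
    return {"title": title, "description": description}

sample = ('<html><head><title>Widgets | Example</title>'
          '<meta name="description" content="Buy widgets online."></head></html>')
print(scrape_page(sample))
```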
If you work in ecommerce, one of the most common web scraping projects you will want to undertake is to build a price scraper. Price scrapers crawl a selected list of your competitors’ websites and extract prices, SKUs, product names, and other useful information to help retailers compare their product prices and check that their goods are competitively priced against those of their rivals.
While scraping product prices from an individual ecommerce website isn’t particularly difficult, it becomes laborious when you need to scrape numerous websites. It usually requires you to develop site-specific scrapers that extract product prices based on each site’s bespoke HTML markup. Since most scrapers break when the underlying source code of the scraped pages is changed, this becomes a time-consuming and expensive process and introduces lots of technical debt.
To work around this problem, my preferred approach is to instead scrape product prices from metadata or microdata embedded within the page whenever possible. This structured data is added to most ecommerce product pages to help search engines extract product data to enrich search engine results pages or allow searchers to compare products via Google Shopping and other price comparison platforms.
By developing a price scraper that extracts microdata or JSON-LD using schema.org markup, a single scraper can extract prices from many sites, avoiding the need to build a bespoke one for every competitor.
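As a rough sketch of that approach, assuming the product page embeds a schema.org `Product` block as JSON-LD (real pages vary - some nest the data in lists or `@graph` arrays, which a dedicated parsing library handles better):

```python
import json
import re

def extract_offers(html: str) -> list:
    """Pull the product name and price from schema.org JSON-LD blocks."""
    offers = []
    pattern = r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, re.DOTALL | re.IGNORECASE):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than crash the crawl
        if isinstance(data, dict) and data.get("@type") == "Product":
            offer = data.get("offers", {})
            offers.append({"name": data.get("name"),
                           "price": offer.get("price"),
                           "currency": offer.get("priceCurrency")})
    return offers

sample = '''<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "GBP"}}
</script>'''
print(extract_offers(sample))
```

Because every compliant site exposes the same fields, this one function can price-check many competitors without site-specific code.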
The other major complexity with ecommerce price scraping is product matching. Simply scraping the product and price information is easy enough, but the trickiest bit is working out which prices are a like-for-like match for the products you sell. To do this, you’ll need to first build a product matching dataset and then create a machine learning product matching model.
Another insightful web scraping project is to scrape competitor reviews. This can help you benchmark your business performance against theirs, see how or if they respond to negative reviews, understand what customers like and dislike about the service of your rivals, and see what products they’re selling and in what volumes. You can even use it to estimate their sales.
While you could just scrape product reviews directly from their websites, or extract them from the JSON-LD or microdata stored within each product page, the easiest way to access these reviews in bulk is to obtain them from reviews platforms such as Trustpilot and Feefo. The Feefo API also lets you download reviews directly so you can analyse product or service feedback.
While scraping competitor technology data can be interesting, the data aren’t always that reliable, and the Python Builtwith package seems a bit hit-and-miss. However, your mileage may vary.
Another useful thing you can do with Python web scraping packages is use them to crawl your websites to look for things that cause problems for SEO, such as 404 or “page not found” errors and 301 redirect chains. 404 errors, caused by the inclusion of broken links or images, harm the user experience and can send a signal to search engines that the site is poorly maintained.
Redirect chains impact your “crawl budget”, which can mean that visiting search engine spiders examine fewer pages than they otherwise would, potentially impacting how many new pages are found, and how many updated pages get refreshed in the SERPs.
By scraping the URLs from your site and then using the Requests package to retrieve the HTTP status code for each page, you can quickly run a bulk scan of your site and identify pages that return a 404 or sit in a 301 redirect chain, so you can fix the problem.
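A minimal sketch of that bulk scan, using Requests (the `audit_urls` helper and the issue labels are my own naming):

```python
import requests

def classify_status(code: int) -> str:
    """Label an HTTP status code with the SEO issue it suggests."""
    if code == 404:
        return "broken page"
    if code in (301, 302, 307, 308):
        return "redirect"
    return "ok"

def audit_urls(urls: list) -> list:
    """Request each URL without following redirects, so 301 hops are visible."""
    report = []
    for url in urls:
        response = requests.head(url, allow_redirects=False, timeout=10)
        report.append({"url": url,
                       "status": response.status_code,
                       "issue": classify_status(response.status_code)})
    return report
```

HEAD requests are lighter than GET, though a few servers mishandle them; swap in `requests.get` if results look wrong.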
The robots.txt file sits at the document root of every website and is used to tell compliant crawlers what they can crawl and what they can’t. It can also be used to tell crawlers where the sitemap.xml file is located, and throttle or ban aggressive bots that may bring a site to its knees by crawling pages too quickly.
You can use Python to scrape and parse robots.txt files and put that data into a Pandas dataframe so you can analyse it separately, removing the need to visit the site, view the robots.txt file and transfer the content to a file yourself.
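A simple sketch of that parsing step, splitting each directive into rows that can be passed straight to `pandas.DataFrame()` (the `parse_robots` name is my own):

```python
def parse_robots(text: str) -> list:
    """Split a robots.txt file into directive/value rows for analysis."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            directive, _, value = line.partition(":")
            rows.append({"directive": directive.strip().lower(),
                         "value": value.strip()})
    return rows

robots = """User-agent: *
Disallow: /checkout
Sitemap: https://example.com/sitemap.xml"""
print(parse_robots(robots))
```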
Schema.org was founded by the world’s largest search engine providers - Google, Microsoft, Yahoo, and Yandex - to help improve the user experience on search engines by encouraging website owners to create “structured data” that was much easier for them to crawl and parse.
This structured data comes in various forms, but is usually microdata (embedded in the page’s HTML attributes), JSON-LD, or, more rarely, RDFa. There are now a huge range of schema.org schemas, covering everything from products, reviews, and promotions, to people, organizations, and recipes.
While it’s often overlooked, it can save you a huge amount of time and effort to scrape and parse microdata instead of scraping page content directly. Schema.org microdata should adhere to the same format, so you can create a single scraper that can work across multiple sites, which massively reduces development and maintenance overheads.
The first step is to identify schema.org metadata usage, so you can see which dialect and schemas are in use on the sites you want to scrape. Then you can use Extruct to scrape schema.org metadata from the page and store it in a Pandas dataframe, or write it to CSV or database.
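Detecting which dialect a site uses can be as simple as scanning the source for each format's telltale markers - a rough heuristic sketch:

```python
import re

def detect_dialects(html: str) -> list:
    """Spot which schema.org dialects a page appears to use."""
    found = []
    if re.search(r'type=["\']application/ld\+json["\']', html, re.IGNORECASE):
        found.append("json-ld")
    if re.search(r'\bitemscope\b', html):
        found.append("microdata")
    if re.search(r'\b(vocab|typeof)=', html):
        found.append("rdfa")
    return found
```

Once you know what's present, Extruct can parse the matching dialects into Python dicts ready for a dataframe.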
Open Graph was designed by Facebook to help web pages become rich objects in the social graph. Basically, it’s just another way for site owners to improve the user experience on Facebook and other social media platforms by structuring data to make it easier for Facebook to scrape and display in widgets and posts on users’ feeds.
Since Open Graph data is embedded directly in the <head> of the HTML document, you can scrape it and store it just like any other data embedded in the code. Scraping Open Graph data can give you quick access to information such as the page title, description, image, or videos present.
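A rough sketch of that extraction using a regular expression (this assumes `property` appears before `content` in each tag; a parser such as Beautiful Soup is more robust when attribute order varies):

```python
import re

def parse_open_graph(html: str) -> dict:
    """Collect og:* meta tags from a page into a dict."""
    pattern = (r'<meta\s+property=["\']og:([^"\']+)["\']\s+'
               r'content=["\']([^"\']*)["\']')
    return dict(re.findall(pattern, html, re.IGNORECASE))

sample = ('<meta property="og:title" content="My Post">'
          '<meta property="og:image" content="https://example.com/img.png">')
print(parse_open_graph(sample))
```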
XML sitemaps have been used on websites for well over a decade now. They are structured documents written in a recognised XML format and are designed to help search engines identify the pages present on a website so they can be crawled and indexed for search engine users to find.
When undertaking web scraping projects in Python, scraping XML sitemaps is generally one of the most useful first steps, since it provides your crawler with an initial URL list to crawl and scrape.
You can also use data scraped from XML sitemaps to analyse the site’s information architecture or IA and understand more about what content or products are present, and where the site owner is focusing its efforts. By parsing URL structures in Python you can build up a map of the site and its overall structure. My EcommerceTools package makes scraping the sitemap.xml file a one-line task.
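Sitemaps follow a standard XML namespace, so a few lines of standard library code can pull out every URL (this is essentially what EcommerceTools wraps into a single call):

```python
import xml.etree.ElementTree as ET

NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list:
    """Extract every <loc> URL from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NAMESPACE)]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/widget</loc></url>
</urlset>"""
print(sitemap_urls(sitemap))
```

Splitting the returned URLs on `/` then gives you the path segments needed for an information architecture analysis.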
Scraping page titles and descriptions is one of the most useful SEO tasks you can perform in Python. You can perform simple checks, such as ensuring the title and description are neither too long nor too short, or you can combine the data with other sources and identify a range of other things you can change to improve SEO.
One practical and simple project I’ve been doing for years is to identify the keywords each page is ranking for via the Google Search Console API, selecting the top keyword phrase, and then checking whether the words are present in the page title or meta description - effectively allowing you to identify keyword opportunities for which you already rank.
Many SEO tools will perform this check for you. In Ahrefs, this feature is called “Page and SERP titles do not match”, which is found under the Site Audit > All issues section. However, it’s easy to do in Python (or even PHP).
By identifying the keywords you already rank for, but which are missing from either your page title or meta description, you can add the phrases and get quick and easy improvements in both rankings and click-through rate, because Google will put the phrases in bold, helping them to stand out.
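The check itself is a few lines of Python once you have the data. Here's a sketch where each page dict carries a hypothetical `top_query` field, which in practice you'd populate from the Google Search Console API:

```python
def keyword_opportunities(pages: list) -> list:
    """Flag pages whose top-ranking query is missing from both the
    page title and the meta description."""
    flagged = []
    for page in pages:
        query = page["top_query"].lower()
        if (query not in page["title"].lower()
                and query not in page["description"].lower()):
            flagged.append(page["url"])
    return flagged

pages = [
    {"url": "/widgets", "top_query": "blue widgets",
     "title": "Widgets | Example", "description": "All our widgets."},
    {"url": "/gadgets", "top_query": "gadgets",
     "title": "Gadgets | Example", "description": "All our gadgets."},
]
print(keyword_opportunities(pages))  # ['/widgets']
```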
For decades, most SEO tools have scraped the Google search engine result pages (or SERPs) to help SEOs understand how their content is ranking for given keywords. While you can get similar information by querying the Google Search Console API with Python, you can get additional information by scraping the SERPs themselves.
While tools that scrape the SERPs are ubiquitous, Google doesn’t like you doing it, so you’ll find that it’s a fiddly process, and you’ll only be able to scrape a small volume of pages before you’re temporarily blocked. You can work around these temporary blocks by using proxies, but your underlying code may also require regular updates, since Google often changes the HTML of its results pages, which can break hard-coded scrapers.
If you want to learn the underlying web scraping techniques, I’d recommend trying to build your own Google SERP scraper with Python. However, for a really quick and easy solution, my EcommerceTools Python package lets you scrape Google search results in just three lines of code.
Besides just scraping the title, description, and URL shown in the search results, you can also extract a whole load of other potentially useful information from Google’s SERPs. This includes featured snippets, People Also Ask questions, related searches, and the words that Google is highlighting in bold (which often reveal useful synonyms you should be using in your pages).
The Google Autocomplete suggestions are also a very useful thing to scrape and analyse. By scraping Google autocomplete suggestions for search terms you can create a simple keyword suggestion tool that shows you a bunch of related search terms.
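Google exposes an unofficial autocomplete endpoint that returns suggestions as a JSON array. It's undocumented, so it may change, block, or rate-limit you at any time - treat this sketch accordingly:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

SUGGEST_URL = "https://suggestqueries.google.com/complete/search?client=firefox&q="

def parse_suggestions(payload: str) -> list:
    """The endpoint returns a JSON array: [query, [suggestion, ...], ...]."""
    return json.loads(payload)[1]

def autocomplete(term: str) -> list:
    """Fetch live suggestions for a search term (unofficial endpoint)."""
    with urlopen(SUGGEST_URL + quote(term)) as response:
        return parse_suggestions(response.read().decode("utf-8"))

# Parsing a canned payload, so the example runs without a network call
print(parse_suggestions('["python seo", ["python seo tools", "python seo course"]]'))
```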
The People Also Ask widget on the Google search engine results is also a great source of potential keyword ideas for content writers.
Including questions and answers in your content, or clearly defining things that users are searching for, can increase your chances of appearing in these value slots or just help you rank higher. Like the autocomplete suggestions, it’s dead easy to scrape using Python.
By scraping a site’s internal and external links, you can analyse them to see which ones are orphans (with no links pointing to them), and which ones could be good candidates for linking from your other pages.
One really useful technique is to use the scraped links to create a network graph showing how the pages are linked to each other. I use the excellent NetworkX package for this. It creates superb visualisations showing internal linking structures and, when combined with Bokeh, allows you to click on the nodes and edges to reveal further information.
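A minimal sketch of that graph-building step with NetworkX, using a hypothetical handful of scraped link edges:

```python
import networkx as nx

# Edges scraped from the site's internal links: (source page, destination)
edges = [("/", "/blog"), ("/", "/about"),
         ("/blog", "/blog/post-1"), ("/about", "/blog")]
graph = nx.DiGraph(edges)
graph.add_node("/old-landing-page")  # found in the sitemap but never linked

# Orphan pages: nothing links to them (ignoring the homepage itself)
orphans = [page for page in graph.nodes
           if graph.in_degree(page) == 0 and page != "/"]
print(orphans)  # ['/old-landing-page']
```

From here, `nx.draw` (or Bokeh, for interactivity) turns the graph object into the internal-linking visualisation.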
RSS feeds are used on many content-led websites, such as blogs, to provide a structured list of post titles, descriptions, authors, and other data that can be retrieved in RSS feed readers or read aloud by voice assistants.
Therefore, just as with sitemap.xml and schema.org data, reading RSS feeds in Python can yield useful structured data for analysis or use in other projects. It’s particularly useful for constructing Natural Language Processing datasets.
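RSS is just XML, so a basic feed can be read with the standard library (a dedicated package such as feedparser copes better with Atom feeds and malformed markup):

```python
import xml.etree.ElementTree as ET

def parse_rss(xml_text: str) -> list:
    """Extract the title and link from each <item> in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [{"title": item.findtext("title"), "link": item.findtext("link")}
            for item in root.iter("item")]

feed = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>https://example.com/first</link></item>
</channel></rss>"""
print(parse_rss(feed))
```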
Following on from the SERP scraper mentioned above, one similar application is to create a simple Google rank tracking tool with Python. These take a Python list of target keywords, fetch the Google search engine results, and return the top ranking page for the domain you want to track.
They’re useful for basic monitoring, but you’ll likely find you quickly get blocked temporarily, as Google isn’t a fan of being scraped itself - which is ironic, given that it obtains all its own data using the exact same techniques. That said, set up a cron job or Airflow data pipeline and you can collect and report on a small number of keywords quickly and easily.
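Once your SERP scraper returns an ordered list of result URLs, the rank lookup itself is trivial - a sketch with hypothetical results:

```python
def domain_rank(results: list, domain: str):
    """Return the 1-based position of the first result matching the
    tracked domain, or None if it isn't ranking on the page."""
    for position, url in enumerate(results, start=1):
        if domain in url:
            return position
    return None

serp = ["https://rival.com/widgets", "https://example.com/widgets",
        "https://other.com/widgets"]
print(domain_rank(serp, "example.com"))  # 2
```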
Finally, if you’re building machine learning models, web scraping is one of the most effective ways to create your own machine learning datasets. Whether you’re scraping a dataset based on text data, such as jobs, or scraping an image dataset to train a machine learning model, web scraping with Python will give you the tools you need to make this a fairly simple task.
Matt Clarke, Wednesday, November 03, 2021