Although I have never really considered myself a technical SEO, I do need to do quite a bit of SEO work as part of my role as an Ecommerce Director. Unsurprisingly, like many others, I’m now using Python for the vast majority of the work I undertake in this field.
Using Python for SEO makes a lot of sense. Many processes can be automated, saving you loads of time. There are Python tools for almost everything, from web scraping to machine learning, and it’s easy to integrate data from multiple sources using tools such as Pandas.
Here are a few of the Python SEO projects I’ve undertaken for my sites, and at work, to give you some inspiration on how you can apply it to SEO. If you’re new to Python, I think these show why it’s worth taking the time to learn!
There have been some mind-blowing improvements in the performance of Natural Language Generation models in recent years, largely thanks to the development of so-called “transformer” models. These are pre-trained on massive datasets and can then be fine-tuned to perform other tasks, such as text summarisation.
I recently used this approach to generate surprisingly high-quality short descriptions for ecommerce product pages, but the same technique could easily be applied to the creation of meta descriptions, if you’re faced with a task too large or inefficient for humans to handle. It works well, but you’ll likely need a human editor to make fine adjustments.
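As a sketch of the approach, here’s roughly how a pre-trained summarisation model can be applied via the Hugging Face transformers package. The model name is just one common choice, not necessarily the one I used, and the `tidy` helper is my own illustrative post-processing step.

```python
def summarise(text, max_length=60, min_length=20):
    """Generate a short abstractive summary of a longer product description."""
    from transformers import pipeline  # deferred: downloads a large model
    summariser = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    result = summariser(text, max_length=max_length, min_length=min_length)
    return tidy(result[0]["summary_text"])

def tidy(text):
    """Collapse whitespace and remove stray spaces before punctuation."""
    text = " ".join(text.split())
    return text.replace(" .", ".").replace(" ,", ",")

# Example (slow on first run while the model downloads):
# summarise(long_product_description)
```

Even with a good model, outputs vary in quality, which is why the human editing pass mentioned above matters.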
While there are loads of excellent keyword generator tools available commercially, they’re also relatively simple to create yourself. Two excellent sources of the data to power these are the suggested keywords from Google Autocomplete and the questions from the People Also Ask section of the Google SERPs.
You can use Python to create simple tools that allow you to bulk generate and extract both suggested keywords (ranked by relevance) and a list of all the questions people also ask about your topic of choice, giving you a plethora of potential keywords to include in your content.
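As an illustration, here’s a minimal sketch that pulls suggested keywords from Google’s undocumented autocomplete endpoint. Because it’s unofficial, the response format could change without notice.

```python
import requests

SUGGEST_URL = "https://suggestqueries.google.com/complete/search"

def suggest_params(query, client="chrome", lang="en"):
    """Build the query-string parameters for the (unofficial) suggest endpoint."""
    return {"client": client, "hl": lang, "q": query}

def get_suggestions(query):
    """Return Google Autocomplete suggestions for a seed keyword (live request)."""
    response = requests.get(SUGGEST_URL, params=suggest_params(query))
    response.raise_for_status()
    # The chrome client returns JSON shaped like [query, [suggestions], ...]
    return response.json()[1]

# Example:
# for keyword in get_suggestions("python seo"):
#     print(keyword)
```

Seeding the same function with variations (“python seo for”, “python seo vs” and so on) is a quick way to expand the list further.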
SEO has now moved far beyond just the inclusion and density of keywords on your page, and Google is using sophisticated Natural Language Understanding models, such as BERT, to read and understand your content to answer any questions searchers may have.
I recently used the BERT model for Extractive Question Answering (or EQA) to assess how well my content worked for answering certain questions. My theory is, if I can write the content so my more simplistic BERT model can find the answers adequately, then Google should have no problems and the content should rank higher. Here’s how I did it.
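A minimal sketch of the idea, assuming the Hugging Face transformers package. The confidence threshold is an arbitrary value I’ve picked for illustration, not something prescribed.

```python
def answer_question(question, context, threshold=0.3):
    """Ask a question of a page's content and return the extracted answer,
    or None when the model isn't confident enough."""
    from transformers import pipeline  # deferred: downloads a large model
    qa = pipeline("question-answering")
    result = qa(question=question, context=context)
    return result["answer"] if is_confident(result["score"], threshold) else None

def is_confident(score, threshold=0.3):
    """True when the model's score clears the chosen (arbitrary) cut-off."""
    return score >= threshold

# Example:
# answer_question("How long does delivery take?", page_copy)
```

If the model returns None, that’s a prompt to rewrite the content so the answer is stated more directly.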
One common issue I encounter in my job running ecommerce websites is that writers often copy and paste content from suppliers without rewriting it. Worse still, they’ll also share the same content over multiple pages on the same site, which results in widespread “near-duplicate” content.
This can be a bit tricky to identify, but I’ve gained good results using an algorithm called Longest Matching Subsequence (LMS), which returns the length of the longest string shared by two pieces of content. It’s a great way to identify which content needs to be rewritten to avoid duplicate content harming rankings.
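The same idea can be sketched with Python’s built-in difflib, whose SequenceMatcher finds the longest string two documents share. The sample strings below are invented for illustration.

```python
from difflib import SequenceMatcher

def longest_match_size(a, b):
    """Length of the longest string the two texts share verbatim."""
    matcher = SequenceMatcher(None, a, b)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size

def similarity_pct(a, b):
    """Longest shared string as a percentage of the shorter text."""
    return round(longest_match_size(a, b) / min(len(a), len(b)) * 100, 1)

# Invented example: supplier copy pasted into a page almost verbatim
supplier = "free delivery on all orders"
page = "enjoy free delivery on all orders today"
size = longest_match_size(supplier, page)
pct = similarity_pct(supplier, page)
```

Running this across every pair of pages (or page versus supplier copy) quickly surfaces the worst near-duplicates.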
Another really common issue, especially on larger sites, is that multiple pages have been optimised for, or rank for, the same keywords. Given that Google (usually) limits the number of times a site can appear for the same phrase within the results, it pays to mix things up a bit.
You can use Python, Pandas, and the Google Search Console API to extract data on the keywords each URL is ranking for, and then identify the amount of keyword cannibalisation across pages to help reduce this by making adjustments to the content. Here’s how it’s done.
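The aggregation step can be sketched in a few lines of Pandas, using an invented dataframe in place of a real Search Console export:

```python
import pandas as pd

# Invented stand-in for a Search Console export: one row per query/page pair
df = pd.DataFrame({
    "query": ["red shoes", "red shoes", "blue shoes", "red shoes"],
    "page": ["/red-shoes", "/red-trainers", "/blue-shoes", "/sale"],
    "clicks": [120, 45, 80, 10],
})

# Count how many distinct pages rank for each query
pages_per_query = df.groupby("query")["page"].nunique()

# Queries where more than one page competes are cannibalisation candidates
cannibalised = pages_per_query[pages_per_query > 1].sort_values(ascending=False)
```

With the real export, sorting the candidates by total clicks or impressions helps you prioritise which pages to de-optimise first.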
Google Search Console data is a rich source of information on potential site issues, and checking it can help identify various ways you can improve your search engine rankings.
One useful analysis to undertake is to identify your non-ranking pages, so you can go back and improve internal linking, add them to the index if they’re missing, or try to determine why they may have been excluded. Similarly, index bloat (too many pages in the index) can also be harmful in some cases, and is easy to analyse using Python.
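Both checks boil down to a set difference, which can be sketched in Pandas using invented URL lists in place of real sitemap and Search Console data:

```python
import pandas as pd

# Invented stand-ins: every indexable URL (e.g. from the XML sitemap),
# and the URLs that record impressions in Google Search Console
sitemap_urls = pd.Series(["/a", "/b", "/c", "/d"])
ranking_urls = pd.Series(["/a", "/c"])

# Pages on the site that never appear in search results
non_ranking = sitemap_urls[~sitemap_urls.isin(ranking_urls)]

# Indexed pages missing from the sitemap - a possible sign of index bloat
bloat = ranking_urls[~ranking_urls.isin(sitemap_urls)]
```

The non-ranking list is where to focus internal linking and content improvements; the bloat list is a candidate for noindexing or consolidation.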
If you’re looking for those keywords which are going to be the next big thing in your market, then Google Trends is worth using. There’s no official Google API for Google Trends, but it’s possible to extract the data using Python and analyse it in Pandas (or Excel if that’s your bag - I won’t judge).
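One way to pull the data is the unofficial pytrends package - it scrapes an undocumented endpoint, so it can break without warning. The `trend_direction` helper is my own crude illustration of spotting a rising keyword.

```python
def interest_over_time(keywords, timeframe="today 12-m"):
    """Fetch Google Trends interest data for a list of keywords (live request)."""
    from pytrends.request import TrendReq  # deferred third-party import
    pytrends = TrendReq(hl="en-GB")
    pytrends.build_payload(keywords, timeframe=timeframe)
    return pytrends.interest_over_time()

def trend_direction(values):
    """Crude signal: compare the mean of the last quarter to the first."""
    n = max(len(values) // 4, 1)
    start = sum(values[:n]) / n
    end = sum(values[-n:]) / n
    return "rising" if end > start else "falling"

# Example:
# df = interest_over_time(["air fryer"])
# trend_direction(df["air fryer"].tolist())
```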
Core Web Vitals are a set of site performance metrics that examine how quickly your site loads and renders on various devices, and how stable and responsive it is for users. These are soon to become a ranking factor for Google (albeit probably a fairly minor one).
You can examine Core Web Vitals in Chrome using the built-in Lighthouse tool. However, it’s worth using Python and the Core Web Vitals API to perform these checks in bulk, allowing you to simultaneously check multiple pages and multiple sites in just a few seconds.
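A sketch of a bulk check using the PageSpeed Insights API (the v5 `runPagespeed` endpoint). The sample report below is a heavily trimmed, invented stand-in for the real JSON response, just to show the extraction step.

```python
import requests

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def fetch_report(url, strategy="mobile"):
    """Run a PageSpeed Insights audit for a single URL (live network call)."""
    response = requests.get(PSI_ENDPOINT, params={"url": url, "strategy": strategy})
    response.raise_for_status()
    return response.json()

def core_web_vitals(report, audits=("largest-contentful-paint",
                                    "cumulative-layout-shift",
                                    "total-blocking-time")):
    """Pull selected audit values out of a Lighthouse report."""
    results = report["lighthouseResult"]["audits"]
    return {key: results[key]["displayValue"] for key in audits}

# Trimmed, invented stand-in for the real response, for illustration:
sample_report = {"lighthouseResult": {"audits": {
    "largest-contentful-paint": {"displayValue": "2.1 s"},
    "cumulative-layout-shift": {"displayValue": "0.02"},
    "total-blocking-time": {"displayValue": "150 ms"},
}}}
vitals = core_web_vitals(sample_report)
```

Looping `fetch_report` over a list of URLs and feeding each result into a dataframe gives you the bulk comparison in one go.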
Internal linking remains an important factor in SEO, and also helps reduce bounce rate and improve the user experience. Python makes it relatively straightforward to create web scraping tools that let you examine where internal and external links are present, so you can improve internal linking.
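A minimal sketch of the link-classification step using BeautifulSoup; the sample HTML snippet is invented.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def classify_links(html, base_url):
    """Split the links in a page's HTML into internal and external lists."""
    soup = BeautifulSoup(html, "html.parser")
    domain = urlparse(base_url).netloc
    internal, external = [], []
    for anchor in soup.find_all("a", href=True):
        href = urljoin(base_url, anchor["href"])  # resolve relative links
        (internal if urlparse(href).netloc == domain else external).append(href)
    return internal, external

# Invented snippet; fetch live pages with requests.get(url).text instead
sample = '<a href="/about">About</a> <a href="https://other.com/x">Out</a>'
internal, external = classify_links(sample, "https://example.com/")
```

Crawling a whole site this way and tallying which pages receive the fewest internal links quickly shows where to add them.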
If you’re an SEO, you’ll likely spend a lot of your time using Google Analytics and Google Search Console data. You’ll be pleased to know that you can access both data sources computationally in Python using the official APIs.
These are a bit fiddly to use and require lots of code. However, I’ve written a couple of handy Python packages - GAPandas and EcommerceTools - that make the process much easier and require very little code. You can even blend data from the two sources together and do sophisticated SEO testing in just a few lines of code. They both integrate with Pandas too.
While there are loads of off-the-shelf commercial SEO tools that do the same thing, it’s fairly easy to create a Python script to scan your sites for 404 errors and 301 redirect chains, both of which will harm your rankings and the user experience. Here’s how to do it.
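A sketch of the core check using Requests. The chain-length cut-off is an arbitrary choice; pick whatever threshold suits your site.

```python
import requests

def check_url(url, max_chain=2):
    """Fetch a URL and report its final status and redirect chain length."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    redirects = len(response.history)
    return {
        "url": url,
        "status": response.status_code,
        "redirects": redirects,
        "issue": diagnose(response.status_code, redirects, max_chain),
    }

def diagnose(status, redirects, max_chain=2):
    """Label the two common problems: broken pages and long redirect chains."""
    if status == 404:
        return "broken"
    if redirects > max_chain:
        return "redirect chain"
    return None

# Example:
# for url in urls_to_check:
#     print(check_url(url))
```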
Python has some superb anomaly detection modelling packages available. These can be applied to pretty much any kind of time series data and are great for automating the process of poring over data to look for potential issues.
I’ve previously covered how to create anomaly detection models that can be used on both Google Analytics and Google Search Console data. These work well, but do require some prior knowledge of machine learning, so are the more sophisticated end of Python SEO.
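For a flavour of the idea without any machine learning, here’s a simple rolling z-score detector in Pandas - far cruder than a proper model, and run on invented daily click data.

```python
import pandas as pd

def flag_anomalies(series, window=7, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from
    the mean of the preceding `window` observations."""
    rolling = series.shift(1).rolling(window, min_periods=window)
    z_scores = (series - rolling.mean()) / rolling.std()
    return series[z_scores.abs() > threshold]

# Invented daily clicks with one obvious spike on day 7
clicks = pd.Series([100, 98, 103, 101, 99, 102, 100, 500, 101, 97])
anomalies = flag_anomalies(clicks)
```

Using the shifted window means the spike itself doesn’t inflate the baseline it’s compared against.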
If reporting is a big part of your job, then you’ll likely benefit from automating, or semi-automating, some of this work to free up your time to focus on more interesting tasks. I’ve created a couple of Python packages to do this.
GAPandas can be used to automate reports from Google Analytics. EcommerceTools lets you do the same with Google Search Console data, while Gilfoyle turns the Pandas dataframes of data into attractive PDF reports. They can all be set up to run automatically, so you can put your feet up.
If you run multilingual sites, or want to test what would happen if you did, then machine translation is worth considering. While arguably not as good as a human, the results are often surprisingly good. Python makes it easy to do this in bulk, and for free.
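A sketch using the third-party deep-translator package, which wraps free web translation services, so expect rate limits on very large batches. The length cap is an assumption, not a documented limit.

```python
def translate_all(texts, target="fr"):
    """Bulk-translate a list of strings via the deep-translator package."""
    from deep_translator import GoogleTranslator  # deferred third-party import
    translator = GoogleTranslator(source="auto", target=target)
    return [translator.translate(text) for text in truncate_long(texts)]

def truncate_long(texts, max_chars=4500):
    """Trim strings to an assumed per-request length limit."""
    return [text[:max_chars] for text in texts]

# Example:
# translate_all(["Free delivery on all orders"], target="de")
```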
Most SEOs who use Python use it for web scraping in some form. There are some absolutely amazing web scraping packages available for Python (check out Scrapy, Requests-HTML, and Advertools). These vary in complexity, and you’ll benefit from some HTML and CSS knowledge, but you can use them for pretty much anything. Here are some examples.
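As a simple example, here’s a BeautifulSoup-based extractor for the usual on-page SEO elements, run against an invented HTML snippet:

```python
from bs4 import BeautifulSoup

def extract_seo_elements(html):
    """Pull the SEO-relevant elements out of a page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    description = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": description["content"] if description else None,
        "h1": [h.get_text(strip=True) for h in soup.find_all("h1")],
    }

# Fetch live pages with requests (or Requests-HTML for JavaScript rendering):
# html = requests.get("https://example.com/").text

sample = """<html><head><title>Widgets</title>
<meta name="description" content="Buy widgets online."></head>
<body><h1>Widgets</h1></body></html>"""
page = extract_seo_elements(sample)
```

Scrapy is the better choice once you need to crawl thousands of pages, but for ad hoc checks this pattern covers a lot of ground.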
Rather than just scraping HTML, it’s also worth scraping metadata. Site owners often add schema.org or OpenGraph metadata to their sites to help search engines find structured content and this can usually be extracted using more sophisticated web scraping tools, such as Extruct.
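A sketch using Extruct; the `og_properties` helper and its flattening logic are my own additions, and the sample dict is an invented stand-in for Extruct’s OpenGraph output shape.

```python
def extract_metadata(html, url, syntaxes=("json-ld", "opengraph")):
    """Extract structured metadata from a page using the extruct package."""
    import extruct  # deferred third-party import
    return extruct.extract(html, base_url=url, syntaxes=list(syntaxes))

def og_properties(metadata):
    """Flatten extruct's OpenGraph output into a plain dict."""
    properties = {}
    for item in metadata.get("opengraph", []):
        for key, value in item.get("properties", []):
            properties[key] = value
    return properties

# Invented example of extruct's OpenGraph output shape:
sample = {"opengraph": [{"properties": [("og:title", "Widgets"),
                                        ("og:type", "product")]}]}
flattened = og_properties(sample)
```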
XML sitemaps have many uses for SEOs, and in web scraping projects. They can be scraped to give you the initial list of pages to scrape, and they can be analysed to identify the spread of keywords or other factors on your site, or those of your competitors. Here’s how you can access them using Python.
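A minimal sketch using the standard library’s ElementTree, run against an invented two-URL sitemap:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return every <loc> URL found in an XML sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]

# Invented two-URL sitemap; fetch a live one with requests.get(...).text
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/shop</loc></url>
</urlset>"""
urls = sitemap_urls(sample)
```

The resulting list drops straight into a dataframe or a scraping queue.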
Similarly, the robots.txt found at the root of pretty much every website can tell you a lot about the site structure, and reveal the location of any sitemaps. These can be scraped using Python and parsed with Pandas, allowing you to see how a site is configured.
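A minimal sketch of a hand-rolled parser (the standard library’s urllib.robotparser handles Disallow rules too); the sample file is invented.

```python
def parse_robots(text):
    """Extract sitemap locations and disallowed paths from a robots.txt file."""
    sitemaps, disallowed = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("disallow:"):
            disallowed.append(line.split(":", 1)[1].strip())
    return {"sitemaps": sitemaps, "disallowed": disallowed}

# Invented example; fetch a live file with requests.get(url + "/robots.txt").text
sample = """User-agent: *
Disallow: /checkout
Sitemap: https://example.com/sitemap.xml"""
rules = parse_robots(sample)
```

The two lists load straight into Pandas if you want to compare configuration across competitor sites.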
While it’s not strictly permitted, pretty much every SEO likely uses content that’s been scraped from Google in some shape or form. Since Google doesn’t really allow this, it can be a cat and mouse game: the obfuscated code in the results pages changes regularly, so scrapers require constant updating to keep working. Here are a few ways you can utilise this powerful Python SEO technique. If you want to get fancy you can even try things like search intent classification.
Matt Clarke, Saturday, May 22, 2021