How to identify SEO keyword opportunities with Python

Learn how to use Python to scrape and parse an XML sitemap, crawl and scrape a site, connect the results to your Google Search Console data and identify ways to rank higher on Google.

One of the most useful Python SEO projects you can undertake is to identify the top keywords for which each of your site’s pages ranks. Sometimes, these keywords might not appear within the title of the page or its meta description, so adding them there can help increase visibility in search engine results pages (SERPs), boost impressions and click-through rate (CTR), and often improve the average ranking position for the query.

In this project, I’ll show you how to use a combination of web scraping techniques and the Google Search Console API to create a dataset of site pages, keywords, and organic search metrics to let you identify the pages your SEO team needs to optimise first to help improve their search performance. We’ll be using my EcommerceTools package which makes Python SEO tasks like this much more straightforward than writing all the code by hand.

Install the packages

For this project, we’ll need three Python packages: Pandas for manipulating dataframes, Numpy for some data munging, and my EcommerceTools package for the Python SEO tasks. You’ll probably already have Pandas and Numpy installed, but you may need to install EcommerceTools. You can do this from PyPI by entering the command !pip3 install --upgrade ecommercetools into a cell in your Jupyter notebook. Once installed, load Pandas and Numpy using the usual pd and np aliases, and load the seo module from ecommercetools.

!pip3 install --upgrade ecommercetools
import pandas as pd
import numpy as np
from ecommercetools import seo

Scrape and parse the sitemap

Next, we need to obtain a list of URLs to scrape. While you could scrape and parse the XML sitemap manually in Python, the easiest way is to use the get_sitemap() function from EcommerceTools. Simply define the URL of the XML sitemap and pass it to get_sitemap(), and it will return a Pandas dataframe containing the site URLs found. The XML sitemap is usually located at /sitemap.xml, but this varies from site to site; the site’s robots.txt file usually reveals its location.

sitemap_url = 'https://practicaldatascience.co.uk/sitemap.xml'
df_sitemap = seo.get_sitemap(sitemap_url)
df_sitemap.head()
loc changefreq priority domain sitemap_name
0 https://practicaldatascience.co.uk/data-scienc... practicaldatascience.co.uk https://practicaldatascience.co.uk/sitemap.xml
1 https://practicaldatascience.co.uk/data-scienc... practicaldatascience.co.uk https://practicaldatascience.co.uk/sitemap.xml
2 https://practicaldatascience.co.uk/machine-lea... practicaldatascience.co.uk https://practicaldatascience.co.uk/sitemap.xml
3 https://practicaldatascience.co.uk/machine-lea... practicaldatascience.co.uk https://practicaldatascience.co.uk/sitemap.xml
4 https://practicaldatascience.co.uk/machine-lea... practicaldatascience.co.uk https://practicaldatascience.co.uk/sitemap.xml
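
If you’d prefer not to use EcommerceTools, or you want to see roughly what get_sitemap() is doing under the hood, here’s a minimal sketch of the manual approach using requests and ElementTree. It assumes the sitemap is a single, standard XML file rather than a sitemap index, so treat it as a starting point rather than a drop-in replacement.

import requests
from xml.etree import ElementTree

# Fetch the sitemap and parse the XML (assumes a single sitemap, not an index)
response = requests.get(sitemap_url)
tree = ElementTree.fromstring(response.content)

# Standard sitemap namespace, used to find each <url><loc> element
namespace = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
urls = [loc.text for loc in tree.findall('sm:url/sm:loc', namespace)]

df_manual = pd.DataFrame({'loc': urls})
df_manual.head()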

Scrape the site

Now that we have our list of URLs, we need to use web scraping to obtain each page’s title and meta description, since we’ll be checking whether the page’s top-ranking keyword is present in one, both, or neither of these. There is a wide range of excellent web scraping tools for Python that make it easy to scrape and parse the content of a website.

If you have a larger site, I’d advise you to use a proper crawler or scraper, such as Screaming Frog, Scrapy, or Advertools, or write your own web scraper from scratch. However, if you only have a small site, you can use the scrape_site() function from EcommerceTools.

This will crawl each page and scrape the title and description, but it’s not multi-threaded, so expect it to take about one second per page. If you’re happy to leave it running, or you only have a small site, this will probably be fine. If not, check out Scrapy or Advertools if you want to stay in Python, or Screaming Frog if you’d prefer a desktop application.
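
To give a sense of what such a scraper does, here’s a minimal sketch that fetches a single page and extracts its title and meta description with requests and BeautifulSoup. It’s a simplified illustration under those assumptions, not a substitute for a proper crawler, and it doesn’t handle retries, redirects, or robots.txt.

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Return the title and meta description of a single page."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.get_text(strip=True) if soup.title else ''
    meta = soup.find('meta', attrs={'name': 'description'})
    description = meta.get('content', '') if meta else ''
    return title, description

scrape_page('https://practicaldatascience.co.uk/')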

To scrape the site, we pass in the df_sitemap dataframe above and define the name of the column that contains the URLs, which is called loc in our case. If you want to watch the scraper’s progress, you can set verbose to True to see it visiting each of the URLs in your sitemap. Once it’s done, the function will return a dataframe containing the URL, title, and meta description we need for the subsequent steps.

df_pages = seo.scrape_site(df_sitemap, 'loc', verbose=False)
df_pages.head()
url title description
0 https://practicaldatascience.co.uk/data-scienc... How to create a Python virtual environment for... How to create a Python virtual environment for...
1 https://practicaldatascience.co.uk/data-scienc... How to engineer date features using Pandas How to engineer date features using Pandas
2 https://practicaldatascience.co.uk/machine-lea... How to impute missing numeric values in your d... How to impute missing numeric values in your d...
3 https://practicaldatascience.co.uk/machine-lea... How to interpret the confusion matrix How to interpret the confusion matrix
4 https://practicaldatascience.co.uk/machine-lea... How to use mean encoding in your machine learn... How to use mean encoding in your machine learn...

Query the Google Search Console API

The next step is to fetch our site performance data from Google Search Console, which we can do with the query_google_search_console() function. Before you can use this, you will need to create a JSON client secrets keyfile in the Google API Console to allow you to authenticate against Google Search Console and query your data. There’s a detailed guide to doing that in my article on accessing the Google Search Console API with Python.

Once you have your client secrets JSON keyfile, save it to your machine and define some variables to hold the key path, the site_url, and the start_date and end_date for your GSC API query. Then, create a simple payload dictionary to pass to the API along with your key and site_url to obtain your GSC data in a Pandas dataframe.

Since my site is set up at the domain level in Google Search Console, rather than entering the site URL alone, I need to prefix it with sc-domain:, otherwise GSC will throw a 403 permissions error. You would not believe the amount of time I spent trying to figure this one out, as it seems to be very poorly documented.

key = "pds-client-secrets.json"
site_url = "sc-domain:practicaldatascience.co.uk"
start_date = '2021-01-01'
end_date = '2021-06-30'
payload = {
    'startDate': start_date, 
    'endDate': end_date,
    'dimensions': ['page', 'query'],  
    'rowLimit': 10000,
    'startRow': 0
}

df = seo.query_google_search_console(key, site_url, payload)

If you use sort_values() to order the Pandas dataframe by the page column, you’ll notice that we have a number of queries appearing against each page URL. We’re only interested, at this stage, in the queries that generate the highest volume of impressions, so we need to separate the data and keep just the queries that generate the bulk of impressions on each page.

df.sort_values(by='page').head()
page query clicks impressions ctr position
446 https://practicaldatascience.co.uk/ practical data science with python 0 5 0.0 55.4
451 https://practicaldatascience.co.uk/ vast data 0 1 0.0 9.0
450 https://practicaldatascience.co.uk/ site:practicaldatascience.co.uk 0 10 0.0 1.0
449 https://practicaldatascience.co.uk/ seo and data science 0 1 0.0 53.0
448 https://practicaldatascience.co.uk/ recommender systems 0 1 0.0 29.0

To identify the keyword that generates the most impressions for each page, we can use sort_values() with both the page and impressions column names and set ascending to False. Once we’ve done that, we can use drop_duplicates() with the subset set to the page column and keep='first' to retain only the query that generates the most impressions. That gives us a dataframe containing every URL and its top performing keyword, based on impressions.

df = df.sort_values(by=['page', 'impressions'], ascending=False)
df = df.drop_duplicates(subset='page', keep='first')
df.sort_values(by='page').head()
page query clicks impressions ctr position
179 https://practicaldatascience.co.uk/ practical data science 1 97 1.03 65.52
455 https://practicaldatascience.co.uk/about site:practicaldatascience.co.uk 0 10 0.00 6.00
464 https://practicaldatascience.co.uk/data-engine... site:practicaldatascience.co.uk 0 6 0.00 13.00
476 https://practicaldatascience.co.uk/data-engine... airflow data pipeline 0 156 0.00 21.04
596 https://practicaldatascience.co.uk/data-engine... bigquery autocomplete 0 15 0.00 57.73

Merge the site crawl data to the GSC data

Next, we’ll merge the dataframe containing the page URLs, page titles, and page meta descriptions with our Google Search Console metrics. Since some pages on a site may not be generating any traffic at all, we’ll split the data into two dataframes, one for pages that generate traffic and one for pages that don’t.

We’ll do this using the Pandas merge() function, setting left_on to the url column from our df_pages dataframe from the site crawl and right_on to the page column from the Google Search Console dataframe.

df_all = df_pages.merge(df, how='left', left_on='url', right_on='page')

Most sites, especially new ones, have some pages that generate no traffic or impressions at all, so one useful tactic is to separate these and examine what else can be done to get them to rank better. Usually, internal linking can help massively.

To identify these pages, we can filter the df_all dataframe by looking for any pages for which the query column is null or None, using df_all[df_all['query'].isnull()]. We can then fill those null values with more meaningful zero values by using fillna(0). We’ll assign them to df_no_traffic and save the data to a CSV so it can be passed over to your SEO or content team.

df_no_traffic = df_all[df_all['query'].isnull()].fillna(0)
df_no_traffic.to_csv('no_traffic.csv', index=False)
df_no_traffic.head()
url title description page query clicks impressions ctr position
8 https://practicaldatascience.co.uk/data-scienc... How to use GAPandas to view your Google Analyt... How to use GAPandas to view your Google Analyt... 0 0 0.0 0.0 0.0 0.0
16 https://practicaldatascience.co.uk/data-scienc... How to scrape JSON-LD competitor reviews using... How to scrape JSON-LD competitor reviews using... 0 0 0.0 0.0 0.0 0.0
25 https://practicaldatascience.co.uk/machine-lea... A quick guide to machine learning A quick guide to machine learning 0 0 0.0 0.0 0.0 0.0
35 https://practicaldatascience.co.uk/machine-lea... How to use NLP to identify what drives custome... How to use NLP to identify what drives custome... 0 0 0.0 0.0 0.0 0.0
47 https://practicaldatascience.co.uk/machine-lea... How to speed up the NLP text annotation process How to speed up the NLP text annotation process 0 0 0.0 0.0 0.0 0.0

Next we’ll perform a similar process but for the pages which are generating traffic. Instead of using isnull() to look for a query value set to None, we’ll use notnull() to find those rows where the query column is populated with a keyword.

We’ll save the results to a dataframe called df_traffic and save that out to CSV so it can be passed on to the right people in our team. Since we now have a duplicate column, we can also use del to delete the extra page column.

df_traffic = df_all[df_all['query'].notnull()].copy()
del df_traffic['page']
df_traffic.sort_values(by='impressions', ascending=False).head()
url title description query clicks impressions ctr position
28 https://practicaldatascience.co.uk/data-scienc... Dell Precision 7750 mobile data science workst... Dell Precision 7750 mobile data science workst... dell precision 7750 15.0 3489.0 0.43 10.48
116 https://practicaldatascience.co.uk/data-scienc... How to resize and compress images with TinyPNG How to resize and compress images with TinyPNG tinypng 0.0 3344.0 0.00 50.11
147 https://practicaldatascience.co.uk/data-scienc... How to scrape Google search results using Python How to scrape Google search results using Python scrape google search results 40.0 1884.0 2.12 15.23
177 https://practicaldatascience.co.uk/machine-lea... How to create a basic Marketing Mix Model in s... How to create a basic Marketing Mix Model in s... marketing mix modeling 0.0 1394.0 0.00 54.79
128 https://practicaldatascience.co.uk/data-scienc... How to access the Google Knowledge Graph Searc... How to access the Google Knowledge Graph Searc... google knowledge graph 0.0 1351.0 0.00 59.38

Identify whether the title and description contain the top query

Our next step is to take the df_traffic dataframe, which contains the URL of each page and its top performing query based on impressions, and identify whether that keyword appears in the title and description columns.

There are a few ways to do this. I’ve gone with a pair of small functions that first cast the query value to a string (if it isn’t already) and then check whether it appears within the lowercased title and description.

def in_title(row):
    """Return 1 if the page's top query appears in its title."""
    if str(row['query']) in row['title'].lower():
        return 1
    else:
        return 0


def in_description(row):
    """Return 1 if the page's top query appears in its meta description."""
    if str(row['query']) in row['description'].lower():
        return 1
    else:
        return 0

To run the functions, I’ve used assign() to avoid the Pandas SettingWithCopyWarning, and then used apply() to run each function on every row of the dataframe. This is not the most efficient approach, but it works plenty fast enough for my needs.

Once we have the binary value for the in_title and in_description columns, we can then use a bit of Numpy code to calculate whether the query is present in both or not, with the outcome assigned to a column called in_both.

df_traffic = df_traffic.assign(in_title=df_traffic.apply(in_title, axis=1))
df_traffic = df_traffic.assign(in_description=df_traffic.apply(in_description, axis=1))
df_traffic['in_both'] = np.where(df_traffic['in_title'] + df_traffic['in_description'] == 2, 1, 0)

Running the code reveals a neat dataframe containing the URL, title, description, query, clicks, impressions, CTR, and position, plus our new columns showing whether the highest ranking keyword for each page is present within the title, the description, both, or neither.

Now that we have this more useful data appended, we’ll save it to CSV so it can be passed over to the SEO team to handle. They can work through the list, identify any pages where the highest ranking keyword could be added to the title or description, and make their changes. After they’ve completed their work, you can re-run the analysis in a week or two to check the results.

df_traffic.to_csv('traffic.csv', index=False)
df_traffic.head()
url title description query clicks impressions ctr position in_title in_description in_both
0 https://practicaldatascience.co.uk/data-scienc... How to create a Python virtual environment for... How to create a Python virtual environment for... python virtual environment 0.0 2.0 0.0 99.00 1 1 1
1 https://practicaldatascience.co.uk/data-scienc... How to engineer date features using Pandas How to engineer date features using Pandas pandas business days 0.0 40.0 0.0 7.97 0 0 0
2 https://practicaldatascience.co.uk/machine-lea... How to impute missing numeric values in your d... How to impute missing numeric values in your d... scikit learn imputer 0.0 6.0 0.0 53.33 0 0 0
3 https://practicaldatascience.co.uk/machine-lea... How to interpret the confusion matrix How to interpret the confusion matrix confusion matrix 0.0 73.0 0.0 82.99 1 1 1
4 https://practicaldatascience.co.uk/machine-lea... How to use mean encoding in your machine learn... How to use mean encoding in your machine learn... mean encoding 0.0 9.0 0.0 21.33 1 1 1
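
When you do re-run the analysis after the changes have been made, one simple way to check progress is to compare the fresh export against the original. The filenames below are just examples: traffic.csv is the file we saved above, and traffic_new.csv is assumed to be the export from your later re-run.

# Compare the original export with a later re-run (traffic_new.csv is a
# hypothetical filename for the second export)
df_before = pd.read_csv('traffic.csv')
df_after = pd.read_csv('traffic_new.csv')

df_compare = df_before[['url', 'impressions', 'ctr', 'position']].merge(
    df_after[['url', 'impressions', 'ctr', 'position']],
    on='url',
    suffixes=('_before', '_after'))

# A positive value means the page now ranks higher (i.e. a lower position number)
df_compare['position_change'] = df_compare['position_before'] - df_compare['position_after']
df_compare.sort_values(by='position_change', ascending=False).head()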

Create a prioritised list of pages to optimise

To create your prioritised list of pages to optimise, all you need to do is sort the data by the number of impressions being generated. Simply check whether the top-ranking query is present in the title and description, edit the content, and you should hopefully see that your SEO metrics improve.

You might also want to look at striking distance keywords - those that rank just outside the first page of results - as these can also aid prioritisation, especially on larger sites. There’s a quick sketch for surfacing them after the table below.

df_traffic.sort_values(by='impressions', ascending=False).head(20)
url title description query clicks impressions ctr position in_title in_description in_both
28 https://practicaldatascience.co.uk/data-scienc... Dell Precision 7750 mobile data science workst... Dell Precision 7750 mobile data science workst... dell precision 7750 15.0 3489.0 0.43 10.48 1 1 1
116 https://practicaldatascience.co.uk/data-scienc... How to resize and compress images with TinyPNG How to resize and compress images with TinyPNG tinypng 0.0 3344.0 0.00 50.11 1 1 1
147 https://practicaldatascience.co.uk/data-scienc... How to scrape Google search results using Python How to scrape Google search results using Python scrape google search results 40.0 1884.0 2.12 15.23 1 1 1
177 https://practicaldatascience.co.uk/machine-lea... How to create a basic Marketing Mix Model in s... How to create a basic Marketing Mix Model in s... marketing mix modeling 0.0 1394.0 0.00 54.79 0 0 0
128 https://practicaldatascience.co.uk/data-scienc... How to access the Google Knowledge Graph Searc... How to access the Google Knowledge Graph Searc... google knowledge graph 0.0 1351.0 0.00 59.38 1 1 1
38 https://practicaldatascience.co.uk/machine-lea... How to annotate training data for NLP models u... How to annotate training data for NLP models u... doccano 2.0 1217.0 0.16 14.08 1 1 1
160 https://practicaldatascience.co.uk/data-scienc... How to create content recommendations using TF... How to create content recommendations using TF... tf idf 0.0 1212.0 0.00 59.97 1 1 1
175 https://practicaldatascience.co.uk/machine-lea... How to make time series forecasts with Neural ... How to make time series forecasts with Neural ... neural prophet 30.0 1017.0 2.95 8.51 1 1 1
13 https://practicaldatascience.co.uk/machine-lea... How to separate audio source data using Spleeter How to separate audio source data using Spleeter spleeter 0.0 977.0 0.00 29.26 1 1 1
100 https://practicaldatascience.co.uk/data-scienc... How to use Screaming Frog from the command line How to use Screaming Frog from the command line screaming frog 0.0 806.0 0.00 62.40 1 1 1
167 https://practicaldatascience.co.uk/machine-lea... How to use k means clustering for customer seg... How to use k means clustering for customer seg... customer segmentation machine learning 1.0 737.0 0.14 44.20 0 0 0
168 https://practicaldatascience.co.uk/machine-lea... How to use the Isolation Forest model for outl... How to use the Isolation Forest model for outl... isolation forest sklearn 5.0 671.0 0.75 10.46 0 0 0
164 https://practicaldatascience.co.uk/data-scienc... How to machine translate product descriptions How to machine translate product descriptions google_trans_new 3.0 620.0 0.48 8.98 0 0 0
165 https://practicaldatascience.co.uk/data-scienc... How to perform a customer cohort analysis in P... How to perform a customer cohort analysis in P... customer cohort analysis 0.0 600.0 0.00 71.80 1 1 1
61 https://practicaldatascience.co.uk/machine-lea... How to use transform categorical variables usi... How to use transform categorical variables usi... category encoders 6.0 582.0 1.03 8.31 0 0 0
62 https://practicaldatascience.co.uk/machine-lea... How to use model selection and hyperparameter ... How to use model selection and hyperparameter ... xgbregressor hyperparameter tuning 12.0 409.0 2.93 8.44 0 0 0
52 https://practicaldatascience.co.uk/machine-lea... How to create a linear regression model using ... How to create a linear regression model using ... sklearn linear regression 1.0 407.0 0.25 38.59 0 0 0
49 https://practicaldatascience.co.uk/data-scienc... How to create an ABC XYZ inventory classificat... How to create an ABC XYZ inventory classificat... abc inventory 0.0 393.0 0.00 76.77 0 0 0
231 https://practicaldatascience.co.uk/tag/technic... Articles and tutorials on technical SEO with P... Articles and tutorials on technical SEO with P... seo data science 0.0 379.0 0.00 72.87 0 0 0
151 https://practicaldatascience.co.uk/data-scienc... How to use Spintax to create content and ad co... How to use Spintax to create content and ad co... spintax 0.0 373.0 0.00 59.41 1 1 1
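
As mentioned above, striking distance keywords can also help with prioritisation. Here’s a minimal sketch for surfacing them; the position range of 11 to 20 is an assumption, so adjust it to whatever definition of striking distance you use.

# Keywords ranking just off the first page (positions 11-20 assumed here)
df_striking = df_traffic[df_traffic['position'].between(11, 20)]
df_striking.sort_values(by='impressions', ascending=False).head()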

Matt Clarke, Thursday, August 12, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.