One of the most useful Python SEO projects you can undertake is identifying the top keywords for which each of your site’s pages ranks. Sometimes these keywords don’t appear in the page title or meta description, so adding them there can help increase visibility in search engine results pages (SERPs), increase impressions and click-through rate (CTR), and often improve the average ranking position for the query.
In this project, I’ll show you how to use a combination of web scraping techniques and the Google Search Console API to create a dataset of site pages, keywords, and organic search metrics to let you identify the pages your SEO team needs to optimise first to help improve their search performance. We’ll be using my EcommerceTools package which makes Python SEO tasks like this much more straightforward than writing all the code by hand.
For this project we’ll need three Python packages: Pandas for manipulating dataframes, Numpy for some data munging, and my EcommerceTools for handling the Python SEO tasks on the site. You’ll probably already have Pandas and Numpy installed, but you may need to install EcommerceTools. You can do this from PyPI by entering the command `!pip3 install --upgrade ecommercetools` into a cell in your Jupyter notebook. Once installed, import Pandas and Numpy using the usual `pd` and `np` aliases, and import the `seo` module from `ecommercetools`.
!pip3 install --upgrade ecommercetools
import pandas as pd
import numpy as np
from ecommercetools import seo
Next we need to obtain a list of URLs to scrape. While you could fetch and parse the XML sitemap manually in Python, the easiest way is to use the `get_sitemap()` function from EcommerceTools. Simply define the URL of the XML sitemap and pass it to `get_sitemap()`, and it will return a Pandas dataframe containing the site URLs it finds. The XML sitemap is usually located at `/sitemap.xml`, but this varies from site to site; the site’s robots.txt file usually reveals its location.
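If you’re unsure where a site keeps its sitemap, you can fetch its robots.txt and look for `Sitemap:` directives. A minimal sketch, assuming the hypothetical helper name `find_sitemaps()` and a robots.txt already downloaded as a string:

```python
def find_sitemaps(robots_txt):
    """Return any sitemap URLs declared in a robots.txt document.

    The Sitemap: directive is case-insensitive and may appear
    multiple times, once per sitemap.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps
```

You could fetch the robots.txt itself with `urllib.request.urlopen()` or `requests.get()` before passing its text to the function.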
sitemap_url = 'https://practicaldatascience.co.uk/sitemap.xml'
df_sitemap = seo.get_sitemap(sitemap_url)
df_sitemap.head()
  | loc | changefreq | priority | domain | sitemap_name
---|---|---|---|---|---
0 | https://practicaldatascience.co.uk/data-scienc... | | | practicaldatascience.co.uk | https://practicaldatascience.co.uk/sitemap.xml
1 | https://practicaldatascience.co.uk/data-scienc... | | | practicaldatascience.co.uk | https://practicaldatascience.co.uk/sitemap.xml
2 | https://practicaldatascience.co.uk/machine-lea... | | | practicaldatascience.co.uk | https://practicaldatascience.co.uk/sitemap.xml
3 | https://practicaldatascience.co.uk/machine-lea... | | | practicaldatascience.co.uk | https://practicaldatascience.co.uk/sitemap.xml
4 | https://practicaldatascience.co.uk/machine-lea... | | | practicaldatascience.co.uk | https://practicaldatascience.co.uk/sitemap.xml
Now we have our list of URLs, we need to use web scraping to obtain the page title and meta description, since we’ll be checking whether the page’s top-ranking keyword is present in one, both, or neither of these. There are a wide range of excellent web scraping tools for Python that make it easy to scrape and parse the content of a website.
If you have a larger site, I’d advise you to use a proper crawler or scraper, such as Screaming Frog, Scrapy, or Advertools, or write your own web scraper from scratch. However, if you only have a small site, you can use the `scrape_site()` function from EcommerceTools. This will crawl each page and scrape the title and description, but it’s not multi-threaded, so expect it to take about one second per page. If you’re happy to leave it running, or you just have a small site, this will probably be just fine. If not, check out Scrapy or Advertools if you want to stay in Python, or Screaming Frog if you’d prefer a desktop application.
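Under the hood, extracting a title and meta description from a page is straightforward. Here’s a minimal, stdlib-only sketch of that parsing step using `html.parser` — the class name and sample markup are my own illustration, not EcommerceTools code:

```python
from html.parser import HTMLParser

class TitleDescriptionParser(HTMLParser):
    """Collect the <title> text and meta description from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element
        if self._in_title:
            self.title += data
```

To use it, create a parser, `feed()` it the page HTML, then read the `title` and `description` attributes.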
To scrape the site, we pass in the `df_sitemap` dataframe above and define the name of the column that contains the URLs, which is `loc` in our case. If you want to watch the scraper’s progress, you can set `verbose` to `True` to see it visiting each of the URLs in your sitemap. Once it’s done, the function returns a dataframe containing the URL, title, and meta description we need in subsequent steps.
df_pages = seo.scrape_site(df_sitemap, 'loc', verbose=False)
df_pages.head()
  | url | title | description
---|---|---|---|
0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | How to create a Python virtual environment for... |
1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | How to engineer date features using Pandas |
2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | How to impute missing numeric values in your d... |
3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | How to interpret the confusion matrix |
4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | How to use mean encoding in your machine learn... |
The next step is to fetch our site performance data from Google Search Console, which we can do with the `query_google_search_console()` function. Before you can use this, you will need to create a JSON client secrets keyfile in the Google API Console to allow you to authenticate against Google Search Console and query your data. There’s a detailed guide to doing that in my article on accessing the Google Search Console API with Python.
Once you have your client secrets JSON keyfile, save it to your machine and define some variables to hold the `key` path, the `site_url`, and the `start_date` and `end_date` for your GSC API query. Then, create a simple `payload` dictionary to pass to the API along with your `key` and `site_url` to obtain your GSC data in a Pandas dataframe.
Since my site is set up at the domain level in Google Search Console, rather than entering the site URL I need to prefix it with `sc-domain:`, otherwise GSC will throw a 403 permissions error. You would not believe the amount of time I spent trying to figure this one out, as it seems to be very poorly documented.
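In other words, the property identifier you pass depends on how the property was verified. A tiny helper (the function name is my own) makes the rule explicit:

```python
def gsc_property(site, domain_property=True):
    """Build the Search Console property identifier.

    Domain properties need the sc-domain: prefix; URL-prefix
    properties are passed as the full URL, protocol included.
    """
    return "sc-domain:" + site if domain_property else site
```

So `gsc_property("practicaldatascience.co.uk")` returns `"sc-domain:practicaldatascience.co.uk"`, while a URL-prefix property such as `"https://example.com/"` is passed through unchanged with `domain_property=False`.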
key = "pds-client-secrets.json"
site_url = "sc-domain:practicaldatascience.co.uk"
start_date = '2021-01-01'
end_date = '2021-06-30'
payload = {
'startDate': start_date,
'endDate': end_date,
'dimensions': ['page', 'query'],
'rowLimit': 10000,
'startRow': 0
}
df = seo.query_google_search_console(key, site_url, payload)
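One thing to note: the API returns at most `rowLimit` rows per request, so larger sites need to page through the results by advancing `startRow`. A sketch of that offset logic — the helper name is hypothetical, and in practice you’d keep requesting until a call returns fewer rows than `rowLimit`, then concatenate the dataframes:

```python
def paged_payloads(base_payload, pages):
    """Yield copies of a GSC payload with startRow advanced one
    page (of rowLimit rows) at a time."""
    for page in range(pages):
        payload = dict(base_payload)
        payload['startRow'] = page * base_payload['rowLimit']
        yield payload
```

With a `rowLimit` of 10000, three pages would query start rows 0, 10000, and 20000.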
If you use `sort_values()` to order the Pandas dataframe by the `page` column, you’ll notice that we have a number of queries appearing against each page URL. We’re only interested, at this stage, in examining the queries that generate the highest volume of impressions, so we need to separate these data and just keep the query that generates the bulk of impressions on each page.
df.sort_values(by='page').head()
  | page | query | clicks | impressions | ctr | position
---|---|---|---|---|---|---|
446 | https://practicaldatascience.co.uk/ | practical data science with python | 0 | 5 | 0.0 | 55.4 |
451 | https://practicaldatascience.co.uk/ | vast data | 0 | 1 | 0.0 | 9.0 |
450 | https://practicaldatascience.co.uk/ | site:practicaldatascience.co.uk | 0 | 10 | 0.0 | 1.0 |
449 | https://practicaldatascience.co.uk/ | seo and data science | 0 | 1 | 0.0 | 53.0 |
448 | https://practicaldatascience.co.uk/ | recommender systems | 0 | 1 | 0.0 | 29.0 |
To identify the keyword that generates the most impressions for each page, we can use `sort_values()` with both the `page` and `impressions` column names and set `ascending` to `False`. Once we’ve done that, we can use the `drop_duplicates()` function with `subset` set to the `page` column and `keep='first'` to retain only the query that generates the most impressions. That gives us a dataframe containing every URL and its top-performing keyword, based on impressions.
df = df.sort_values(by=['page', 'impressions'], ascending=False)
df = df.drop_duplicates(subset='page', keep='first')
df.sort_values(by='page').head()
  | page | query | clicks | impressions | ctr | position
---|---|---|---|---|---|---|
179 | https://practicaldatascience.co.uk/ | practical data science | 1 | 97 | 1.03 | 65.52 |
455 | https://practicaldatascience.co.uk/about | site:practicaldatascience.co.uk | 0 | 10 | 0.00 | 6.00 |
464 | https://practicaldatascience.co.uk/data-engine... | site:practicaldatascience.co.uk | 0 | 6 | 0.00 | 13.00 |
476 | https://practicaldatascience.co.uk/data-engine... | airflow data pipeline | 0 | 156 | 0.00 | 21.04 |
596 | https://practicaldatascience.co.uk/data-engine... | bigquery autocomplete | 0 | 15 | 0.00 | 57.73 |
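To see why the sort direction and `keep='first'` matter, here’s the same two-step dedupe on a hypothetical miniature dataframe:

```python
import pandas as pd

# Hypothetical miniature of the GSC data: two pages, two queries each
df_demo = pd.DataFrame({
    'page': ['/a', '/a', '/b', '/b'],
    'query': ['q1', 'q2', 'q3', 'q4'],
    'impressions': [10, 250, 40, 5],
})

# Sorting descending puts the highest-impression query first within
# each page, so keep='first' retains exactly that row per page
df_top = (df_demo
          .sort_values(by=['page', 'impressions'], ascending=False)
          .drop_duplicates(subset='page', keep='first'))
```

For `/a` that keeps `q2` (250 impressions) and for `/b` it keeps `q3` (40 impressions).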
Next, we’ll merge the dataframe containing the page URLs, page titles, and page meta descriptions with our Google Search Console metrics. Since some pages on a site may not be generating any traffic at all, we’ll split the data into two dataframes, one for pages that generate traffic and one for pages that don’t.
We’ll do this using the Pandas `merge()` function, setting `left_on` to the `url` column from our `df_pages` dataframe from the site crawl, and `right_on` to the `page` column from the Google Search Console dataframe.
df_all = df_pages.merge(df, how='left', left_on='url', right_on='page')
Most sites, especially new ones, have pages that generate no traffic or impressions, so one useful tactic is to separate these out and examine what else can be done to get them to rank better. Internal linking can often help massively.
To identify these pages, we can filter the `df_all` dataframe for rows where the `query` column is null, using `df_all[df_all['query'].isnull()]`. We can then fill those null values with more meaningful zero values using `fillna(0)`. We’ll assign the result to `df_no_traffic` and save the data to a CSV so it can be passed over to your SEO or content team.
df_no_traffic = df_all[df_all['query'].isnull()].fillna(0)
df_no_traffic.to_csv('no_traffic.csv', index=False)
df_no_traffic.head()
  | url | title | description | page | query | clicks | impressions | ctr | position
---|---|---|---|---|---|---|---|---|---|
8 | https://practicaldatascience.co.uk/data-scienc... | How to use GAPandas to view your Google Analyt... | How to use GAPandas to view your Google Analyt... | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
16 | https://practicaldatascience.co.uk/data-scienc... | How to scrape JSON-LD competitor reviews using... | How to scrape JSON-LD competitor reviews using... | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
25 | https://practicaldatascience.co.uk/machine-lea... | A quick guide to machine learning | A quick guide to machine learning | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
35 | https://practicaldatascience.co.uk/machine-lea... | How to use NLP to identify what drives custome... | How to use NLP to identify what drives custome... | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
47 | https://practicaldatascience.co.uk/machine-lea... | How to speed up the NLP text annotation process | How to speed up the NLP text annotation process | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
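Incidentally, Pandas’ `merge()` can also flag unmatched rows for you via its `indicator` argument, which labels each merged row `'both'` or `'left_only'`. A sketch on hypothetical miniature dataframes:

```python
import pandas as pd

df_pages_demo = pd.DataFrame({'url': ['/a', '/b'], 'title': ['A', 'B']})
df_gsc_demo = pd.DataFrame({'page': ['/a'], 'query': ['q1'], 'impressions': [10]})

merged = df_pages_demo.merge(df_gsc_demo, how='left',
                             left_on='url', right_on='page', indicator=True)

# Rows labelled left_only had no matching Search Console data at all
no_traffic_urls = merged.loc[merged['_merge'] == 'left_only', 'url'].tolist()
```

This gives the same no-traffic split without needing to test `query` for nulls.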
Next we’ll perform a similar process, but for the pages that are generating traffic. Instead of using `isnull()` to find rows where the `query` value is missing, we’ll use `notnull()` to find those rows where the `query` column is populated with a keyword. We’ll assign the results to a dataframe called `df_traffic` and save that out to CSV so it can be passed on to the right people in the team. Since we now have a duplicate column, we can also use `del` to delete the extra `page` column.
df_traffic = df_all[df_all['query'].notnull()].copy()
del df_traffic['page']
df_traffic.sort_values(by='impressions', ascending=False).head()
  | url | title | description | query | clicks | impressions | ctr | position
---|---|---|---|---|---|---|---|---|
28 | https://practicaldatascience.co.uk/data-scienc... | Dell Precision 7750 mobile data science workst... | Dell Precision 7750 mobile data science workst... | dell precision 7750 | 15.0 | 3489.0 | 0.43 | 10.48 |
116 | https://practicaldatascience.co.uk/data-scienc... | How to resize and compress images with TinyPNG | How to resize and compress images with TinyPNG | tinypng | 0.0 | 3344.0 | 0.00 | 50.11 |
147 | https://practicaldatascience.co.uk/data-scienc... | How to scrape Google search results using Python | How to scrape Google search results using Python | scrape google search results | 40.0 | 1884.0 | 2.12 | 15.23 |
177 | https://practicaldatascience.co.uk/machine-lea... | How to create a basic Marketing Mix Model in s... | How to create a basic Marketing Mix Model in s... | marketing mix modeling | 0.0 | 1394.0 | 0.00 | 54.79 |
128 | https://practicaldatascience.co.uk/data-scienc... | How to access the Google Knowledge Graph Searc... | How to access the Google Knowledge Graph Searc... | google knowledge graph | 0.0 | 1351.0 | 0.00 | 59.38 |
Our next step is to take the `df_traffic` dataframe, which contains the URL of each page and its top-performing `query` based on impressions, and identify whether the keyword appears in the `title` and `description` columns. There are a few ways to do this; I’ve gone with a pair of functions that first cast the `query` column to a string (if it isn’t already) and then check whether it is present in the lowercased title and description.
def in_title(row):
if str(row['query']) in row['title'].lower():
return 1
else:
return 0
def in_description(row):
if str(row['query']) in row['description'].lower():
return 1
else:
return 0
To run the functions, I’ve used `assign()` to avoid the Pandas `SettingWithCopyWarning`, and then used `apply()` to run each function on every row in the dataframe. This is not the most efficient way to do this, but it works plenty fast enough for my needs. Once we have the binary values in the `in_title` and `in_description` columns, we can then use a bit of Numpy code to calculate whether the `query` is present in both, with the outcome assigned to a column called `in_both`.
df_traffic = df_traffic.assign(in_title=df_traffic.apply(in_title, axis=1))
df_traffic = df_traffic.assign(in_description=df_traffic.apply(in_description, axis=1))
df_traffic['in_both'] = np.where(df_traffic['in_title'] + df_traffic['in_description'] == 2, 1, 0)
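`apply(axis=1)` is fine at this scale; on much larger frames, a plain list comprehension over the zipped columns does the same substring check with less overhead. A sketch on a hypothetical miniature dataframe:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'query': ['pandas merge', 'confusion matrix'],
    'title': ['How to use Pandas merge', 'Reading model outputs'],
    'description': ['A Pandas merge guide', 'How to interpret the confusion matrix'],
})

# Same substring checks as in_title()/in_description(), row by row
df_demo['in_title'] = [int(str(q) in t.lower())
                       for q, t in zip(df_demo['query'], df_demo['title'])]
df_demo['in_description'] = [int(str(q) in d.lower())
                             for q, d in zip(df_demo['query'], df_demo['description'])]
# Both flags are 0/1, so their product is 1 only when both are 1
df_demo['in_both'] = df_demo['in_title'] * df_demo['in_description']
```

Multiplying the two flag columns replaces the `np.where()` sum check with the same result.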
Running the code reveals a neat dataframe containing the URL, title, description, query, impressions, ctr, and position, plus our metrics showing whether the highest ranking keyword for each page is present within the title, description, neither, or both.
Now we have this more useful data appended, we’ll save it to CSV so it can be passed over to the SEO team to handle. They can work through the list, identify any pages where the highest-ranking keyword could be added to the title or description, and make their changes. After they’ve completed their work, you can re-run the analysis in a week or two to check the results.
df_traffic.to_csv('traffic.csv', index=False)
df_traffic.head()
  | url | title | description | query | clicks | impressions | ctr | position | in_title | in_description | in_both
---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://practicaldatascience.co.uk/data-scienc... | How to create a Python virtual environment for... | How to create a Python virtual environment for... | python virtual environment | 0.0 | 2.0 | 0.0 | 99.00 | 1 | 1 | 1 |
1 | https://practicaldatascience.co.uk/data-scienc... | How to engineer date features using Pandas | How to engineer date features using Pandas | pandas business days | 0.0 | 40.0 | 0.0 | 7.97 | 0 | 0 | 0 |
2 | https://practicaldatascience.co.uk/machine-lea... | How to impute missing numeric values in your d... | How to impute missing numeric values in your d... | scikit learn imputer | 0.0 | 6.0 | 0.0 | 53.33 | 0 | 0 | 0 |
3 | https://practicaldatascience.co.uk/machine-lea... | How to interpret the confusion matrix | How to interpret the confusion matrix | confusion matrix | 0.0 | 73.0 | 0.0 | 82.99 | 1 | 1 | 1 |
4 | https://practicaldatascience.co.uk/machine-lea... | How to use mean encoding in your machine learn... | How to use mean encoding in your machine learn... | mean encoding | 0.0 | 9.0 | 0.0 | 21.33 | 1 | 1 | 1 |
To create your prioritised list of pages to optimise, all you need to do is sort the data by the number of impressions being generated. Simply check whether the top-ranking query is present in the title and description, edit the content where it’s missing, and you should hopefully see your SEO metrics improve.
You might also want to look at striking distance keywords - those which rank within striking distance of the first page of results - as these can also aid prioritisation, especially on larger sites.
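As a sketch of that idea, assuming striking distance means average positions just off page one (roughly 11 to 20), you can filter the results frame on the `position` column, shown here on a hypothetical miniature dataframe:

```python
import pandas as pd

# Hypothetical miniature of the traffic dataframe
df_demo = pd.DataFrame({
    'query': ['q1', 'q2', 'q3'],
    'impressions': [500, 1200, 90],
    'position': [14.2, 45.0, 11.8],
})

# Keep keywords ranking in positions 11-20, then sort by impressions
# so the biggest potential wins come first
striking = (df_demo[df_demo['position'].between(11, 20)]
            .sort_values(by='impressions', ascending=False))
```

Here `q2` drops out (position 45), leaving `q1` ahead of `q3` on impressions.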
df_traffic.sort_values(by='impressions', ascending=False).head(20)
  | url | title | description | query | clicks | impressions | ctr | position | in_title | in_description | in_both
---|---|---|---|---|---|---|---|---|---|---|---|
28 | https://practicaldatascience.co.uk/data-scienc... | Dell Precision 7750 mobile data science workst... | Dell Precision 7750 mobile data science workst... | dell precision 7750 | 15.0 | 3489.0 | 0.43 | 10.48 | 1 | 1 | 1 |
116 | https://practicaldatascience.co.uk/data-scienc... | How to resize and compress images with TinyPNG | How to resize and compress images with TinyPNG | tinypng | 0.0 | 3344.0 | 0.00 | 50.11 | 1 | 1 | 1 |
147 | https://practicaldatascience.co.uk/data-scienc... | How to scrape Google search results using Python | How to scrape Google search results using Python | scrape google search results | 40.0 | 1884.0 | 2.12 | 15.23 | 1 | 1 | 1 |
177 | https://practicaldatascience.co.uk/machine-lea... | How to create a basic Marketing Mix Model in s... | How to create a basic Marketing Mix Model in s... | marketing mix modeling | 0.0 | 1394.0 | 0.00 | 54.79 | 0 | 0 | 0 |
128 | https://practicaldatascience.co.uk/data-scienc... | How to access the Google Knowledge Graph Searc... | How to access the Google Knowledge Graph Searc... | google knowledge graph | 0.0 | 1351.0 | 0.00 | 59.38 | 1 | 1 | 1 |
38 | https://practicaldatascience.co.uk/machine-lea... | How to annotate training data for NLP models u... | How to annotate training data for NLP models u... | doccano | 2.0 | 1217.0 | 0.16 | 14.08 | 1 | 1 | 1 |
160 | https://practicaldatascience.co.uk/data-scienc... | How to create content recommendations using TF... | How to create content recommendations using TF... | tf idf | 0.0 | 1212.0 | 0.00 | 59.97 | 1 | 1 | 1 |
175 | https://practicaldatascience.co.uk/machine-lea... | How to make time series forecasts with Neural ... | How to make time series forecasts with Neural ... | neural prophet | 30.0 | 1017.0 | 2.95 | 8.51 | 1 | 1 | 1 |
13 | https://practicaldatascience.co.uk/machine-lea... | How to separate audio source data using Spleeter | How to separate audio source data using Spleeter | spleeter | 0.0 | 977.0 | 0.00 | 29.26 | 1 | 1 | 1 |
100 | https://practicaldatascience.co.uk/data-scienc... | How to use Screaming Frog from the command line | How to use Screaming Frog from the command line | screaming frog | 0.0 | 806.0 | 0.00 | 62.40 | 1 | 1 | 1 |
167 | https://practicaldatascience.co.uk/machine-lea... | How to use k means clustering for customer seg... | How to use k means clustering for customer seg... | customer segmentation machine learning | 1.0 | 737.0 | 0.14 | 44.20 | 0 | 0 | 0 |
168 | https://practicaldatascience.co.uk/machine-lea... | How to use the Isolation Forest model for outl... | How to use the Isolation Forest model for outl... | isolation forest sklearn | 5.0 | 671.0 | 0.75 | 10.46 | 0 | 0 | 0 |
164 | https://practicaldatascience.co.uk/data-scienc... | How to machine translate product descriptions | How to machine translate product descriptions | google_trans_new | 3.0 | 620.0 | 0.48 | 8.98 | 0 | 0 | 0 |
165 | https://practicaldatascience.co.uk/data-scienc... | How to perform a customer cohort analysis in P... | How to perform a customer cohort analysis in P... | customer cohort analysis | 0.0 | 600.0 | 0.00 | 71.80 | 1 | 1 | 1 |
61 | https://practicaldatascience.co.uk/machine-lea... | How to use transform categorical variables usi... | How to use transform categorical variables usi... | category encoders | 6.0 | 582.0 | 1.03 | 8.31 | 0 | 0 | 0 |
62 | https://practicaldatascience.co.uk/machine-lea... | How to use model selection and hyperparameter ... | How to use model selection and hyperparameter ... | xgbregressor hyperparameter tuning | 12.0 | 409.0 | 2.93 | 8.44 | 0 | 0 | 0 |
52 | https://practicaldatascience.co.uk/machine-lea... | How to create a linear regression model using ... | How to create a linear regression model using ... | sklearn linear regression | 1.0 | 407.0 | 0.25 | 38.59 | 0 | 0 | 0 |
49 | https://practicaldatascience.co.uk/data-scienc... | How to create an ABC XYZ inventory classificat... | How to create an ABC XYZ inventory classificat... | abc inventory | 0.0 | 393.0 | 0.00 | 76.77 | 0 | 0 | 0 |
231 | https://practicaldatascience.co.uk/tag/technic... | Articles and tutorials on technical SEO with P... | Articles and tutorials on technical SEO with P... | seo data science | 0.0 | 379.0 | 0.00 | 72.87 | 0 | 0 | 0 |
151 | https://practicaldatascience.co.uk/data-scienc... | How to use Spintax to create content and ad co... | How to use Spintax to create content and ad co... | spintax | 0.0 | 373.0 | 0.00 | 59.41 | 1 | 1 | 1 |
Matt Clarke, Thursday, August 12, 2021