How to identify keyword cannibalisation using Python

Learn how to use Python to identify keyword cannibalisation which occurs when multiple pages compete for the same keywords.


Keyword cannibalisation occurs when you have several pages ranking for the same phrase, effectively putting them in competition with each other for search engine rankings. Since Google generally only shows one or two results from each site, keyword cannibalisation can reduce your chances of ranking highly for the shared phrase.

Keyword cannibalisation happens for perfectly innocuous reasons, especially on ecommerce websites. Ecommerce copywriters and merchandisers rightly want the product they’re launching to rank on search engines, and appear in the results of the on-site search engine, so they’ll try to optimise the content accordingly by including the target keywords in all the usual places.

However, what can happen is that you end up with several pages all optimised for similar phrases, and all competing for one of the limited number of available slots on the search engine results pages. This ends up diluting your clicks, CTR, and even backlinks.

Identifying keyword cannibalisation

To identify which queries are suffering from keyword cannibalisation you can use a number of technical SEO tools, including Google Search Console. The idea is that you extract the data on the queries your site ranks for, and then identify the terms which have several pages ranking for the same phrase.

I’ll leave it for the technical SEO people to guide you on the best approach for dealing with the issue, but as I understand it, the solution is to re-work your content so that it’s optimised for different phrases, or merge the content together. Obviously, this is a technique better suited to content sites than ecommerce ones, but there’s usually a solution somewhere.

In this tutorial I’ll show you how you can create a Python script using the Google Search Console API to extract your search traffic data and identify which pages are suffering from keyword cannibalisation. You’ll be able to generate a list of problem phrases and pages to give to your content team to see if they can improve the situation. Here’s how it’s done.

Load the packages

First, open a Jupyter notebook or Python script and import the google.oauth2, googleapiclient, requests, json, and pandas packages. Any packages you don’t have can be installed by entering pip3 install package-name in your terminal (the first two are provided by the google-auth and google-api-python-client packages).

from google.oauth2 import service_account
from googleapiclient.discovery import build
import requests
import json
import pandas as pd
pd.set_option('display.max_colwidth', 100)

Create a connection

Next, we’ll create a function to connect to the Google Search Console API and return a service object. Check out my guide to accessing the Google Search Console API using Python for more details on how to set this up, and the specifics on obtaining the required JSON key.

key = 'google-search-console.json'

def connect(key):
    """Create a connection to the Google Search Console API and return service object.

    Args:
        key (string): Google Search Console JSON client secrets path.

    Returns:
        service (object): Google Search Console service object.
    """

    scope = ['https://www.googleapis.com/auth/webmasters']
    credentials = service_account.Credentials.from_service_account_file(key, scopes=scope)
    service = build('webmasters', 'v3', credentials=credentials)
    return service

service = connect(key)

Convert the API response to a dataframe

Now we have a connection to the API, we’ll create another function to pass our query to the API. This takes a dictionary payload containing the API request details, the site_url defining the property you want to access, and the service object returned by connect().

After executing the API query, the function extracts the data from the response and neatly formats a new dictionary containing the data requested. It re-formats each row of data and appends it to a list of results which is then passed to the Pandas from_dict() function to return the API results in a dataframe.

def query(service, site_url, payload):
    """Run a query on the Google Search Console API and return a dataframe of results.

    Args:
        service (object): Service object from connect()
        site_url (string): URL of Google Search Console property
        payload (dict): API query payload dictionary

    Returns:
        df (dataframe): Pandas dataframe containing requested data.
    """

    response = service.searchanalytics().query(siteUrl=site_url, body=payload).execute()

    results = []
    for row in response['rows']:
        data = {}

        # Map each requested dimension to the corresponding key value
        for i in range(len(payload['dimensions'])):
            data[payload['dimensions'][i]] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)  # convert to a percentage
        data['position'] = round(row['position'], 2)
        results.append(data)

    return pd.DataFrame.from_dict(results)
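Since the real function needs API credentials to run, here’s the same flattening logic applied to a mocked response, which shows how each row is reshaped. The queries, URLs, and metric values below are invented purely for illustration:

```python
import pandas as pd

# A mocked response in the shape returned by searchanalytics().query().
# All values here are invented for illustration.
mock_response = {
    'rows': [
        {'keys': ['zonker fly', 'https://example.com/zonker'],
         'clicks': 42, 'impressions': 529, 'ctr': 0.0794, 'position': 3.38},
        {'keys': ['blob fly', 'https://example.com/blob'],
         'clicks': 19, 'impressions': 208, 'ctr': 0.0464, 'position': 4.29},
    ]
}
dimensions = ['query', 'page']

results = []
for row in mock_response['rows']:
    data = {}

    # Map each requested dimension to the corresponding key value
    for i in range(len(dimensions)):
        data[dimensions[i]] = row['keys'][i]

    data['clicks'] = row['clicks']
    data['impressions'] = row['impressions']
    data['ctr'] = round(row['ctr'] * 100, 2)  # convert to a percentage
    data['position'] = round(row['position'], 2)
    results.append(data)

df = pd.DataFrame.from_dict(results)
print(df)
```

Each dictionary in the list becomes one dataframe row, with one column per dimension plus the four metrics.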

Fetch data by query and page

To put the function to work we’ll create a simple query which requests the query and page data for the month of January 2020 from my fly fishing website, and then returns the output in a dataframe, which is saved to a file called all.csv.

payload = {
    'startDate': "2020-01-01",
    'endDate': "2020-01-31",
    'dimensions': ["query", "page"],
    'rowLimit': 10000,
    'startRow': 0
}

site_url = ""  # your Search Console property URL goes here

df = query(service, site_url, payload)
df.to_csv('all.csv', index=False)
query page clicks impressions ctr position
0 fly and lure 75 260 28.85 1.53
1 zonker fly 42 529 7.94 3.38
2 simms freestone waders review 32 146 21.92 3.14
3 klinkhammer fly pattern 24 224 10.71 3.03
4 emerger fly patterns 20 344 5.81 2.46

Examine the overall data

To get an overview of the data on the site for the period examined, we can calculate a bunch of summary statistics on specific columns within the dataframe. We’ll stick these in a dictionary and use the from_dict() function to churn out a dataframe of results.

data = {
    'Total pages': [int(df['page'].nunique())],
    'Total queries': [int(df['query'].nunique())],
    'Total clicks': [int(df['clicks'].sum())],
    'Total impressions': [int(df['impressions'].sum())],
    'Average CTR': [round(df['ctr'].mean(), 2)],
    'Average position': [round(df['position'].mean(), 2)],
    'Average queries per page': [round(df['query'].nunique() / df['page'].nunique(), 2)],
}

df_stats = pd.DataFrame.from_dict(data)
Total pages 1240.00
Total queries 8648.00
Total clicks 3033.00
Total impressions 131322.00
Average CTR 1.72
Average position 36.69
Average queries per page 6.97
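If you want to check the arithmetic, the same pattern works on any small dataframe. Here it’s applied to a few invented rows (the queries, pages, and numbers are made up for illustration):

```python
import pandas as pd

# Invented example rows for illustration only.
df = pd.DataFrame({
    'query': ['blob fly', 'blob fly', 'zonker fly'],
    'page': ['/blob-guide', '/blob-how-to', '/zonker'],
    'clicks': [19, 0, 42],
    'impressions': [205, 3, 529],
    'ctr': [9.27, 0.0, 7.94],
    'position': [3.25, 5.33, 3.38],
})

# Build a dictionary of summary statistics, then turn it into a dataframe
data = {
    'Total pages': [int(df['page'].nunique())],
    'Total queries': [int(df['query'].nunique())],
    'Total clicks': [int(df['clicks'].sum())],
    'Average queries per page': [round(df['query'].nunique() / df['page'].nunique(), 2)],
}

df_stats = pd.DataFrame.from_dict(data)
print(df_stats)
```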

Calculate summary statistics on queries

To get an overview on the summary statistics for each query we can use groupby() to group the data on the query column and then use the agg() function to calculate statistics based on the grouped data.

The average click-through rates on my site’s top terms look OK, but as shown in the unique pages column, several pages are often ranking for the same phrase, so cannibalisation is definitely an issue.

df_summary = df.groupby('query').agg(
    unique_pages=('page', 'nunique'),
    total_clicks=('clicks', 'sum'),
    total_impressions=('impressions', 'sum'),
    avg_ctr=('ctr', 'mean'),
    avg_position=('position', 'mean'),    
).sort_values(by='total_clicks', ascending=False)

unique_pages total_clicks total_impressions avg_ctr avg_position
fly and lure 12 102 1604 3.505833 1.605833
zonker fly 1 42 529 7.940000 3.380000
simms freestone waders review 1 32 146 21.920000 3.140000
klinkhammer fly pattern 1 24 224 10.710000 3.030000
fly fishing near me 46 23 216 7.148696 4.649348
swinsty reservoir fishing 2 20 67 29.410000 1.515000
emerger fly patterns 1 20 344 5.810000 2.460000
how to fish a blob fly 2 19 85 22.410000 2.350000
blob fly 2 19 208 4.635000 4.290000
airflo superflo review 1 18 41 43.900000 1.340000

Identify the cannibalised queries

To examine the queries which are suffering from keyword cannibalisation we can use the same groupby() and agg() approach, but then filter the dataframe so it only includes queries where the unique_pages value is greater than one. We can then use to_csv() to save the output.

df_cannibalised = df.groupby('query').agg(
    unique_pages=('page', 'nunique'),
    total_clicks=('clicks', 'sum'),
    total_impressions=('impressions', 'sum'),
    avg_ctr=('ctr', 'mean'),
    avg_position=('position', 'mean'),    
).sort_values(by='unique_pages', ascending=False)

df_cannibalised = df_cannibalised[df_cannibalised['unique_pages'] > 1]
df_cannibalised.to_csv('cannibalised.csv', index=False)
unique_pages total_clicks total_impressions avg_ctr avg_position
fly fishing near me 46 23 216 7.148696 4.649348
trout fishing near me 34 3 101 4.264706 4.034706
fly fishing clubs near me 19 11 84 12.243158 3.392632
trout fisheries near me 18 2 27 6.944444 4.402778
fly fishing 15 0 157 0.000000 17.474667
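The group-and-filter step is easy to sanity-check on a toy dataframe with invented rows, where two pages rank for “blob fly” but only one for “zonker fly”:

```python
import pandas as pd

# Invented example rows: two pages rank for 'blob fly', one for 'zonker fly'.
df = pd.DataFrame({
    'query': ['blob fly', 'blob fly', 'zonker fly'],
    'page': ['/blob-guide', '/blob-how-to', '/zonker'],
    'clicks': [19, 0, 42],
    'impressions': [205, 3, 529],
    'ctr': [9.27, 0.0, 7.94],
    'position': [3.25, 5.33, 3.38],
})

# Aggregate per query, then keep only queries with more than one ranking page
summary = df.groupby('query').agg(
    unique_pages=('page', 'nunique'),
    total_clicks=('clicks', 'sum'),
)

cannibalised = summary[summary['unique_pages'] > 1]
print(cannibalised)
```

Only “blob fly” survives the filter, since it is the only query ranking on more than one page.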

Examine some specific queries

To get a better idea of the specific issues, and identify whether they’re fixable, we can filter the original dataframe to include the specific cannibalised query. For example, the phrase “loop q rod review” appears on two pages, both of which rank fairly well.

This one might be unavoidable, since the “Loop Q Fly Fishing Kit” page uses the name of the product and the “Loop Q fly rod” is a component within the kit. I suspect this one’s not fixable for me…

df[df['query']=='loop q rod review'].sort_values(by='impressions', ascending=False).head()
query page clicks impressions ctr position
10 loop q rod review 17 57 29.82 3.67
6734 loop q rod review 0 10 0.00 3.50

The “blob fly” query also returns two pages, both for “how to” guides on fishing this specific trout fly.

df[df['query']=='blob fly'].sort_values(by='impressions', ascending=False).head()
query page clicks impressions ctr position
6 blob fly 19 205 9.27 3.25
2145 blob fly 0 3 0.00 5.33

The “fly reel reviews” phrase suffers from a similar issue. The words appear in each of my reviews of fly reels, as you’d expect, but it means that my page showing all “fly reel reviews” doesn’t rank at all for this phrase.

df[df['query']=='fly reel reviews'].sort_values(by='impressions', ascending=False).head()
query page clicks impressions ctr position
5005 fly reel reviews 0 107 0.0 1.81
5004 fly reel reviews 0 77 0.0 65.05
5006 fly reel reviews 0 4 0.0 1.75
5007 fly reel reviews 0 3 0.0 66.00
5002 fly reel reviews 0 1 0.0 68.00

Worst of all (can you tell I am not an SEO?) is the phrase “fly fishing”, which arguably should be my site’s top phrase overall. The pages I’d expect to rank best are actually performing worst of all, probably because the phrase is being cannibalised by dozens of other pages.

It looks like some of these pages have minimal content, which may be a contributing factor and something I could address. Alternatively, I could try toning down the use of the term “fly fishing” on all the other pages. (Please let me know if you have any tips!)

df[df['query']=='fly fishing'].sort_values(by='impressions', ascending=False).head(20)
query page clicks impressions ctr position
4108 fly fishing 0 70 0.0 67.34
4107 fly fishing 0 47 0.0 63.28
4116 fly fishing 0 10 0.0 12.60
4118 fly fishing 0 5 0.0 15.40
4109 fly fishing 0 4 0.0 15.50
4117 fly fishing 0 4 0.0 15.50
4119 fly fishing 0 4 0.0 14.50
4121 fly fishing 0 4 0.0 7.50
4112 fly fishing 0 2 0.0 6.00
4115 fly fishing 0 2 0.0 14.50
4110 fly fishing 0 1 0.0 2.00
4111 fly fishing 0 1 0.0 6.00
4113 fly fishing 0 1 0.0 10.00
4114 fly fishing 0 1 0.0 6.00
4120 fly fishing 0 1 0.0 6.00

There’s a wide range of similar Python SEO techniques you can apply to help improve search rankings, from identifying striking distance keywords to spotting keyword opportunities in your title and description data.

Matt Clarke, Friday, March 12, 2021

Matt Clarke Matt is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.