Keyword cannibalisation occurs when several of your pages rank for the same phrase, effectively putting them in competition with each other for search engine rankings. Since Google generally only shows one or two results from each site, keyword cannibalisation reduces your chances of ranking highly for the shared phrase.
Keyword cannibalisation happens for perfectly innocuous reasons, especially on ecommerce websites. Ecommerce copywriters and merchandisers rightly want the product they’re launching to rank on search engines, and appear in the results of the on-site search engine, so they’ll try to optimise the content accordingly by including the target keywords in all the usual places.
However, what can happen is that you end up with several pages all optimised for similar phrases, and all competing for one of the limited number of available slots on the search engine results pages. This ends up diluting your clicks, CTR, and even backlinks.
To identify which queries are suffering from keyword cannibalisation, you can use a number of technical SEO tools, including Google Search Console. The idea is that you extract the data on the queries your site ranks for, and then identify the terms which have several pages ranking for the same phrase.
I’ll leave it for the technical SEO people to guide you on the best approach for dealing with the issue, but as I understand it, the solution is to re-work your content so that it’s optimised for different phrases, or merge the content together. Obviously, this is a technique better suited to content sites than ecommerce ones, but there’s usually a solution somewhere.
In this tutorial I’ll show you how you can create a Python script using the Google Search Console API to extract your search traffic data and identify which pages are suffering from keyword cannibalisation. You’ll be able to generate a list of problem phrases and pages to give to your content team to see if they can improve the situation. Here’s how it’s done.
First, open a Jupyter notebook or Python script and import the `google.oauth2`, `googleapiclient`, `requests`, `json`, and `pandas` packages. Any packages you don’t have can be installed by entering `pip3 install package-name` in your terminal.
```python
from google.oauth2 import service_account
from googleapiclient.discovery import build
import requests
import json
import pandas as pd

pd.set_option('display.max_colwidth', 100)
```
Next, we’ll create a function to connect to the Google Search Console API and return a service object. Check out my guide to accessing the Google Search Console API using Python for more details on how to set this up, and the specifics on obtaining the required JSON key.
```python
key = 'google-search-console.json'

def connect(key):
    """Create a connection to the Google Search Console API and return service object.

    Args:
        key (string): Google Search Console JSON client secrets path.

    Returns:
        service (object): Google Search Console service object.
    """

    scope = ['https://www.googleapis.com/auth/webmasters']
    credentials = service_account.Credentials.from_service_account_file(key,
                                                                        scopes=scope)
    service = build(
        'webmasters',
        'v3',
        credentials=credentials
    )
    return service

service = connect(key)
```
Now we have a connection to the API, we’ll create another function to pass our query to the API. This takes a dictionary `payload` containing the API request details, the `site_url` defining the property you want to access, and the `service` object returned by `connect()`.
After executing the API query, the function extracts the data from the `response` and neatly formats a new dictionary containing the data requested. It re-formats each row of data and appends it to a list of `results`, which is then passed to the Pandas `DataFrame.from_dict()` function to return the API results in a dataframe.
```python
def query(service, site_url, payload):
    """Run a query on the Google Search Console API and return a dataframe of results.

    Args:
        service (object): Service object from connect()
        site_url (string): URL of Google Search Console property
        payload (dict): API query payload dictionary

    Returns:
        df (dataframe): Pandas dataframe containing requested data.
    """

    response = service.searchanalytics().query(siteUrl=site_url, body=payload).execute()

    results = []
    for row in response['rows']:
        data = {}

        for i in range(len(payload['dimensions'])):
            data[payload['dimensions'][i]] = row['keys'][i]

        data['clicks'] = row['clicks']
        data['impressions'] = row['impressions']
        data['ctr'] = round(row['ctr'] * 100, 2)
        data['position'] = round(row['position'], 2)
        results.append(data)

    return pd.DataFrame.from_dict(results)
```
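One thing worth noting: the Search Console API returns a limited number of rows per request, so on larger properties you may need to paginate by incrementing `startRow` until a short or empty page comes back. Here’s a minimal sketch of that loop, using a hypothetical `fetch_page()` stub in place of the real `query()` call so the pagination logic is easy to see:

```python
import pandas as pd

def fetch_page(start_row, row_limit):
    """Hypothetical stand-in for a query() call: returns one 'page' of rows.
    Here it simulates a property with 25 rows of data in total."""
    all_rows = [{'query': f'q{i}', 'clicks': i} for i in range(25)]
    return pd.DataFrame(all_rows[start_row:start_row + row_limit])

def fetch_all(row_limit=10):
    """Keep requesting pages, bumping the start row, until the last page arrives."""
    frames = []
    start_row = 0
    while True:
        df_page = fetch_page(start_row, row_limit)
        if df_page.empty:
            break
        frames.append(df_page)
        if len(df_page) < row_limit:
            break  # a short page means we've reached the end
        start_row += row_limit
    return pd.concat(frames, ignore_index=True)

df_all = fetch_all()
print(len(df_all))  # 25
```

In practice you’d swap `fetch_page()` for a call to `query()`, updating `payload['startRow']` on each iteration.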
To put the function to work we’ll create a simple query which requests the `query` and `page` data for the month of January 2020 from my fly fishing website, and then returns the output in a dataframe, which is saved to a file called `all.csv`.
```python
payload = {
    'startDate': "2020-01-01",
    'endDate': "2020-01-31",
    'dimensions': ["query", "page"],
    'rowLimit': 10000,
    'startRow': 0
}

site_url = "http://flyandlure.org"

df = query(service, site_url, payload)
df.to_csv('all.csv', index=False)
df.head()
```
|   | query | page | clicks | impressions | ctr | position |
|---|---|---|---|---|---|---|
| 0 | fly and lure | http://flyandlure.org/ | 75 | 260 | 28.85 | 1.53 |
| 1 | zonker fly | http://flyandlure.org/articles/fly_tying/9_zonker_fly_patterns_to_tie | 42 | 529 | 7.94 | 3.38 |
| 2 | simms freestone waders review | http://flyandlure.org/articles/fly_fishing_gear_reviews/simms_freestone_waders_review | 32 | 146 | 21.92 | 3.14 |
| 3 | klinkhammer fly pattern | http://flyandlure.org/articles/fly_tying/tying_the_klinkhammer_fly_pattern | 24 | 224 | 10.71 | 3.03 |
| 4 | emerger fly patterns | http://flyandlure.org/articles/fly_tying/5_emerger_fly_patterns | 20 | 344 | 5.81 | 2.46 |
To get an overview of the data on the site for the period examined, we can calculate a bunch of summary statistics on specific columns within the dataframe. We’ll stick these in a dictionary and use the `from_dict()` function to churn out a dataframe of results.
```python
data = {
    'Total pages': [int(df['page'].nunique())],
    'Total queries': [int(df['query'].nunique())],
    'Total clicks': [int(df['clicks'].sum())],
    'Total impressions': [int(df['impressions'].sum())],
    'Average CTR': [round(df['ctr'].mean(), 2)],
    'Average position': [round(df['position'].mean(), 2)],
    'Average queries per page': [round(int(df['query'].nunique()) / int(df['page'].nunique()), 2)],
}

df_stats = pd.DataFrame.from_dict(data)
df_stats.T.head(10)
```
|   | 0 |
|---|---|
| Total pages | 1240.00 |
| Total queries | 8648.00 |
| Total clicks | 3033.00 |
| Total impressions | 131322.00 |
| Average CTR | 1.72 |
| Average position | 36.69 |
| Average queries per page | 6.97 |
To get an overview of the summary statistics for each query we can use `groupby()` to group the data on the `query` column and then use the `agg()` function to calculate statistics based on the grouped data.
The average click-through rates on my site’s top terms look OK, but as shown in the unique pages column, several pages are often ranking for the same phrase, so cannibalisation is definitely an issue.
```python
df_summary = df.groupby('query').agg(
    unique_pages=('page', 'nunique'),
    total_clicks=('clicks', 'sum'),
    total_impressions=('impressions', 'sum'),
    avg_ctr=('ctr', 'mean'),
    avg_position=('position', 'mean'),
).sort_values(by='total_clicks', ascending=False)

df_summary.head(10)
```
| query | unique_pages | total_clicks | total_impressions | avg_ctr | avg_position |
|---|---|---|---|---|---|
| fly and lure | 12 | 102 | 1604 | 3.505833 | 1.605833 |
| zonker fly | 1 | 42 | 529 | 7.940000 | 3.380000 |
| simms freestone waders review | 1 | 32 | 146 | 21.920000 | 3.140000 |
| klinkhammer fly pattern | 1 | 24 | 224 | 10.710000 | 3.030000 |
| fly fishing near me | 46 | 23 | 216 | 7.148696 | 4.649348 |
| swinsty reservoir fishing | 2 | 20 | 67 | 29.410000 | 1.515000 |
| emerger fly patterns | 1 | 20 | 344 | 5.810000 | 2.460000 |
| how to fish a blob fly | 2 | 19 | 85 | 22.410000 | 2.350000 |
| blob fly | 2 | 19 | 208 | 4.635000 | 4.290000 |
| airflo superflo review | 1 | 18 | 41 | 43.900000 | 1.340000 |
To examine the queries which are suffering from keyword cannibalisation we can use the same `groupby()` and `agg()` approach, but then filter the dataframe so it only includes queries where the `unique_pages` value is greater than one. We can then use `to_csv()` to save the output.
```python
df_cannibalised = df.groupby('query').agg(
    unique_pages=('page', 'nunique'),
    total_clicks=('clicks', 'sum'),
    total_impressions=('impressions', 'sum'),
    avg_ctr=('ctr', 'mean'),
    avg_position=('position', 'mean'),
).sort_values(by='unique_pages', ascending=False)

df_cannibalised = df_cannibalised[df_cannibalised['unique_pages'] > 1]
df_cannibalised.to_csv('cannibalised.csv', index=False)
df_cannibalised.head()
```
| query | unique_pages | total_clicks | total_impressions | avg_ctr | avg_position |
|---|---|---|---|---|---|
| fly fishing near me | 46 | 23 | 216 | 7.148696 | 4.649348 |
| trout fishing near me | 34 | 3 | 101 | 4.264706 | 4.034706 |
| fly fishing clubs near me | 19 | 11 | 84 | 12.243158 | 3.392632 |
| trout fisheries near me | 18 | 2 | 27 | 6.944444 | 4.402778 |
| fly fishing | 15 | 0 | 157 | 0.000000 | 17.474667 |
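Before drilling into individual queries, it can be useful to quantify how widespread the issue is, i.e. what share of all queries have more than one ranking page. Here’s a quick sketch of the calculation on a tiny hand-made dataframe standing in for the real Search Console data (the queries and pages here are made up for illustration):

```python
import pandas as pd

# Hypothetical sample of query/page rows in the same shape as the API output.
df = pd.DataFrame({
    'query': ['fly fishing', 'fly fishing', 'blob fly', 'blob fly', 'zonker fly'],
    'page': ['/a', '/b', '/c', '/d', '/e'],
})

# Count unique ranking pages per query, then keep queries with more than one.
pages_per_query = df.groupby('query')['page'].nunique()
cannibalised = pages_per_query[pages_per_query > 1]
share = round(len(cannibalised) / len(pages_per_query) * 100, 2)

print(len(cannibalised), share)  # 2 queries affected, i.e. 66.67% of all queries
```

Run against the real dataframe, this gives a single headline figure you can track over time as fixes go in.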
To get a better idea of the specific issues, and identify whether they’re fixable, we can filter the original dataframe to include the specific cannibalised query. For example, the phrase “loop q rod review” appears on two pages - both of which rank fairly well.
This one might be unavoidable, since the “Loop Q Fly Fishing Kit” page uses the name of the product and the “Loop Q fly rod” is a component within the kit. I suspect this one’s not fixable for me…
```python
df[df['query']=='loop q rod review'].sort_values(by='impressions', ascending=False).head()
```
|   | query | page | clicks | impressions | ctr | position |
|---|---|---|---|---|---|---|
| 10 | loop q rod review | http://flyandlure.org/articles/fly_fishing_gear_reviews/loop_q_series_fly_rod_review | 17 | 57 | 29.82 | 3.67 |
| 6734 | loop q rod review | http://flyandlure.org/articles/fly_fishing_gear_reviews/loop_q_fly_fishing_kit_review | 0 | 10 | 0.00 | 3.50 |
The “blob fly” query also returns two pages, both for “how to” guides on fishing this specific trout fly.
```python
df[df['query']=='blob fly'].sort_values(by='impressions', ascending=False).head()
```
|   | query | page | clicks | impressions | ctr | position |
|---|---|---|---|---|---|---|
| 6 | blob fly | http://flyandlure.org/articles/fly_fishing/how_to_fish_the_blob_fly | 19 | 205 | 9.27 | 3.25 |
| 2145 | blob fly | http://flyandlure.org/articles/fly_fishing/how_to_fish_the_fab_or_foam_arsed_blob_fly | 0 | 3 | 0.00 | 5.33 |
The “fly reel reviews” phrase suffers from a similar issue. The words appear in each of my reviews of fly reels, as you’d expect, but it means that my page showing all fly reel reviews doesn’t rank at all for this phrase.
```python
df[df['query']=='fly reel reviews'].sort_values(by='impressions', ascending=False).head()
```
|   | query | page | clicks | impressions | ctr | position |
|---|---|---|---|---|---|---|
| 5005 | fly reel reviews | http://flyandlure.org/articles/fly_fishing_gear_reviews/orvis_battenkill_fly_reel_review | 0 | 107 | 0.0 | 1.81 |
| 5004 | fly reel reviews | http://flyandlure.org/articles/fly_fishing_gear_reviews/loop_q_fly_reel_review | 0 | 77 | 0.0 | 65.05 |
| 5006 | fly reel reviews | http://flyandlure.org/articles/fly_fishing_gear_reviews/sage_2200_series_fly_reel_review | 0 | 4 | 0.0 | 1.75 |
| 5007 | fly reel reviews | http://flyandlure.org/articles/fly_fishing_gear_reviews/wychwood_flow_fly_reel_review | 0 | 3 | 0.0 | 66.00 |
| 5002 | fly reel reviews | http://flyandlure.org/articles/fly_fishing_gear_reviews/guideline_favo_fly_reel_review | 0 | 1 | 0.0 | 68.00 |
Worst of all (can you tell I am not an SEO?) is the phrase “fly fishing”, which arguably should be my site’s top phrase overall. The pages I’d expect to rank best are actually performing worst of all, probably because the phrase is being cannibalised by dozens of other pages.
It looks like some of these have minimal content, so that may be part of the problem and something I could address. Alternatively, I could try toning down the use of the term “fly fishing” on all the other pages. (Please let me know if you have any tips!)
```python
df[df['query']=='fly fishing'].sort_values(by='impressions', ascending=False).head(20)
```
|   | query | page | clicks | impressions | ctr | position |
|---|---|---|---|---|---|---|
| 4108 | fly fishing | http://flyandlure.org/articles/fly_fishing | 0 | 70 | 0.0 | 67.34 |
| 4107 | fly fishing | http://flyandlure.org/ | 0 | 47 | 0.0 | 63.28 |
| 4116 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/nottinghamshire | 0 | 10 | 0.0 | 12.60 |
| 4118 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/south_yorkshire | 0 | 5 | 0.0 | 15.40 |
| 4109 | fly fishing | http://flyandlure.org/articles/fly_fishing_destinations/how_to_fish_westlow_mere | 0 | 4 | 0.0 | 15.50 |
| 4117 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/somerset | 0 | 4 | 0.0 | 15.50 |
| 4119 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/staffordshire | 0 | 4 | 0.0 | 14.50 |
| 4121 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/wales/neath_port_talbot | 0 | 4 | 0.0 | 7.50 |
| 4112 | fly fishing | http://flyandlure.org/listings/fly_fishing_instructors/wales/neath_port_talbot/lee_watts_fly_fis... | 0 | 2 | 0.0 | 6.00 |
| 4115 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/lincolnshire | 0 | 2 | 0.0 | 14.50 |
| 4110 | fly fishing | http://flyandlure.org/listings/fly_fishing_clubs/wales/neath_port_talbot | 0 | 1 | 0.0 | 2.00 |
| 4111 | fly fishing | http://flyandlure.org/listings/fly_fishing_instructors/england/west_yorkshire/fishing_with_style | 0 | 1 | 0.0 | 6.00 |
| 4113 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/lancashire | 0 | 1 | 0.0 | 10.00 |
| 4114 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/england/lancashire/cowpe_reservoir | 0 | 1 | 0.0 | 6.00 |
| 4120 | fly fishing | http://flyandlure.org/listings/places_to_fly_fish/scotland/south_lanarkshire/newmill_trout_fishery | 0 | 1 | 0.0 | 6.00 |
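When you hand the cannibalised list to a content team, it can also help to flag which URL currently performs best for each query, since that’s usually the page to keep or consolidate into. One possible approach, sketched here on a small hypothetical dataframe with the same columns as the API output, is to sort by clicks and impressions and keep the first row per query:

```python
import pandas as pd

# Hypothetical query/page rows in the same shape as the API output.
df = pd.DataFrame({
    'query': ['blob fly', 'blob fly', 'fly fishing', 'fly fishing'],
    'page': ['/how_to_fish_the_blob_fly', '/fab_blob', '/articles/fly_fishing', '/'],
    'clicks': [19, 0, 0, 0],
    'impressions': [205, 3, 70, 47],
})

# Sort so the strongest page per query comes first, then keep one row per query.
top_pages = (df.sort_values(['clicks', 'impressions'], ascending=False)
               .drop_duplicates('query')
               .set_index('query')['page'])

print(top_pages.to_dict())
# {'blob fly': '/how_to_fish_the_blob_fly', 'fly fishing': '/articles/fly_fishing'}
```

Merging this back onto the cannibalised dataframe gives your content team a suggested canonical page alongside each problem phrase.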
There’s a wide range of similar Python SEO techniques you can apply to help improve search rankings, from identifying striking distance keywords to spotting keyword opportunities in your title and description data.
Matt Clarke, Friday, March 12, 2021