There’s often a lot of faffing around required to get marketing and ecommerce data from various systems into Pandas so you can analyse it, or use it within more complex models. I built the EcommerceTools Python package to take the hassle out of this process and make it quick and easy to fetch and analyse data.
At the moment, it only does the basics, but it’s still useful for my daily ecommerce and marketing work. The aim is to eventually create a package that includes all the tools I need to analyse ecommerce, marketing, and SEO data and create models. In this article I’ll explain how you can use it to analyse technical SEO data.
To get started, open a Jupyter notebook and install the EcommerceTools Python package from PyPI by entering `!pip3 install ecommercetools` in a code cell and then executing it. Then import `pandas` and the `seo` module from `ecommercetools`.
from ecommercetools import seo
import pandas as pd
First, we’ll take a look at the XML sitemap features. The `get_sitemaps()` function takes the location of a `robots.txt` file (always stored at the root of a domain) and returns a Python list containing the URL of each XML sitemap listed within.
from ecommercetools import seo
sitemaps = seo.get_sitemaps("http://www.bbc.co.uk/robots.txt")
sitemaps
['http://www.bbc.co.uk/sitemaps/index-uk-archive.xml',
'http://www.bbc.co.uk/sitemaps/index-uk-news.xml',
'http://www.bbc.co.uk/video_sitemap.xml',
'http://www.bbc.co.uk/sitemap.xml',
'https://www.bbc.co.uk/food/sitemap.xml',
'http://www.bbc.co.uk/sitemap.xml',
'http://www.bbc.co.uk/mobile_sitemap.xml',
'http://www.bbc.co.uk/sitemap.xml',
'https://www.bbc.co.uk/ideas/sitemap.xml']
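Notice that the list above contains the same sitemap URL more than once. As a rough illustration of the idea (not the package’s actual implementation), the sketch below pulls the `Sitemap:` lines out of some robots.txt text and drops the repeats while keeping the original order. The helper name `extract_sitemaps` and the sample text are made up for this example.

```python
def extract_sitemaps(robots_txt):
    """Return unique sitemap URLs from robots.txt text, in first-seen order."""
    seen = {}
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive
        name, sep, value = line.partition(":")
        # Directive names are matched case-insensitively
        if sep and name.strip().lower() == "sitemap":
            seen.setdefault(value.strip(), None)
    return list(seen)


# Hand-written sample text, not the real BBC robots.txt
sample = """\
User-agent: *
Disallow: /search
Sitemap: http://www.bbc.co.uk/sitemap.xml
Sitemap: https://www.bbc.co.uk/food/sitemap.xml
sitemap: http://www.bbc.co.uk/sitemap.xml
"""
print(extract_sitemaps(sample))
```

If you only need to deduplicate a list you already have, `list(dict.fromkeys(sitemaps))` does the same job in one line.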
The `get_dataframe()` function allows you to download the URLs in an XML sitemap to a Pandas dataframe. If the sitemap contains child sitemaps, each of these will be retrieved. You can save the Pandas dataframe to CSV in the usual way.
from ecommercetools import seo
df = seo.get_dataframe("http://flyandlure.org/sitemap.xml")
print(df.head())
| | loc | changefreq | priority | domain | sitemap_name |
---|---|---|---|---|---|
0 | http://flyandlure.org/ | hourly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
1 | http://flyandlure.org/about | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
2 | http://flyandlure.org/terms | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
3 | http://flyandlure.org/privacy | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
4 | http://flyandlure.org/copyright | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
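To give a feel for what sits behind a dataframe like this, here is a minimal sketch that parses sitemap XML into one dictionary per URL using only the standard library. It is an assumption about the general approach rather than the package’s own code: the `parse_sitemap` name and the sample XML are invented, and it handles a plain `<urlset>` only, not a sitemap index with child sitemaps.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with its loc, changefreq and priority."""
    root = ET.fromstring(xml_text)
    rows = []
    for url in root.findall("sm:url", NS):
        rows.append({
            # findtext returns None when the optional element is missing
            "loc": url.findtext("sm:loc", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": url.findtext("sm:priority", default=None, namespaces=NS),
        })
    return rows


# Hand-written sample shaped like a small sitemap
sample = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://flyandlure.org/</loc><changefreq>hourly</changefreq><priority>1.0</priority></url>
  <url><loc>http://flyandlure.org/about</loc><changefreq>monthly</changefreq></url>
</urlset>"""

rows = parse_sitemap(sample)
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, and the result written out with `df.to_csv("sitemap.csv", index=False)`.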
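You can also obtain site performance data. The `get_core_web_vitals()` function retrieves the Core Web Vitals metrics for a list of URLs from the Google PageSpeed Insights API and returns the results in a Pandas dataframe, with one row per URL and form factor. The function requires a Google PageSpeed Insights API key. Note that the sample output below was generated from pages on practicaldatascience.co.uk, not the BBC URLs used in the snippet.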
from ecommercetools import seo
pagespeed_insights_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer']
df = seo.get_core_web_vitals(pagespeed_insights_key, urls)
print(df.head())
| | final_url | fetch_time | form_factor | overall_score | speed_index | first_meaningful_paint | first_contentful_paint | time_to_interactive | total_blocking_time | cumulative_layout_shift |
---|---|---|---|---|---|---|---|---|---|---|
0 | https://practicaldatascience.co.uk/ | 2021-03-27T10:56:26.497Z | mobile | 74.0 | 79.0 | 57.0 | 61.0 | 90.0 | 100 | 100 |
3 | https://practicaldatascience.co.uk/ | 2021-03-27T10:57:03.226Z | desktop | 95.0 | 97.0 | 86.0 | 87.0 | 100.0 | 100 | 100 |
1 | https://practicaldatascience.co.uk/about | 2021-03-27T10:56:37.058Z | mobile | 69.0 | 85.0 | 61.0 | 61.0 | 82.0 | 100 | 62 |
4 | https://practicaldatascience.co.uk/about | 2021-03-27T10:57:16.035Z | desktop | 94.0 | 96.0 | 86.0 | 87.0 | 100.0 | 100 | 67 |
2 | https://practicaldatascience.co.uk/machine-lea... | 2021-03-27T10:56:48.098Z | mobile | 33.0 | 52.0 | 57.0 | 61.0 | 19.0 | 32 | 82 |
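The output has a mobile and a desktop row for each page, which implies one API request per URL and form factor. As a hedged sketch of what those requests might look like, the code below builds PageSpeed Insights v5 request URLs for each combination. The `build_psi_requests` helper and the placeholder key are assumptions for illustration, and I’m not claiming the package builds its requests this way.

```python
from urllib.parse import urlencode

# Public endpoint of the PageSpeed Insights API (version 5)
PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"


def build_psi_requests(api_key, urls, strategies=("mobile", "desktop")):
    """Return one request URL per strategy and page."""
    request_urls = []
    for strategy in strategies:
        for url in urls:
            # urlencode percent-encodes the page URL so it is safe as a parameter
            query = urlencode({"url": url, "strategy": strategy, "key": api_key})
            request_urls.append(f"{PSI_ENDPOINT}?{query}")
    return request_urls


request_urls = build_psi_requests("YOUR_API_KEY", ["https://www.bbc.co.uk"])
for request_url in request_urls:
    print(request_url)
```

Two strategies and one page give two request URLs, so a list of 50 pages would need 100 API calls.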
The `get_knowledge_graph()` function returns the Google Knowledge Graph data for a given search term. This requires the use of a Google Knowledge Graph API key. By default, the function returns output in a Pandas dataframe, but you can pass the `output="json"` argument if you wish to receive the JSON data back.
from ecommercetools import seo
knowledge_graph_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
knowledge_graph = seo.get_knowledge_graph(knowledge_graph_key, "python", output="dataframe")
print(knowledge_graph)
| | resultScore | @type | result.name | result.@id | result.detailedDescription.articleBody | result.detailedDescription.url | result.detailedDescription.license | result.description | result.@type |
---|---|---|---|---|---|---|---|---|---|
0 | 15315.625977 | EntitySearchResult | Python | kg:/m/05z1_ | Python is an interpreted, high-level and gener... | https://en.wikipedia.org/wiki/Python_(programm... | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | High-level programming language | [Thing, Brand] |
1 | 1671.793579 | EntitySearchResult | Python family | kg:/m/05tb5 | The Pythonidae, commonly known as pythons, are... | https://en.wikipedia.org/wiki/Pythonidae | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | Snake | [Thing] |
2 | 1301.166748 | EntitySearchResult | Pythons | kg:/m/0cv6_m | Python is a genus of constricting snakes in th... | https://en.wikipedia.org/wiki/Python_(genus) | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | Snake | [Thing] |
3 | 497.687103 | EntitySearchResult | CPython | kg:/m/06bxxb | CPython is the reference implementation of the... | https://en.wikipedia.org/wiki/CPython | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | NaN | [Thing, SoftwareApplication] |
4 | 378.672913 | EntitySearchResult | Python | kg:/m/0l8ry | In Greek mythology, Python was the serpent, so... | https://en.wikipedia.org/wiki/Python_(mythology) | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | NaN | [Thing] |
5 | 312.430939 | EntitySearchResult | Reticulated python | kg:/m/0m5qz | The reticulated python is a python species nat... | https://en.wikipedia.org/wiki/Reticulated_python | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | Snake | [Thing] |
6 | 283.799957 | EntitySearchResult | Python | kg:/m/02rg562 | Python is a double-loop corkscrew roller coast... | https://en.wikipedia.org/wiki/Python_(Efteling) | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | Roller coaster in Kaatsheuvel, Netherlands | [Thing, TouristAttraction] |
7 | 203.535995 | EntitySearchResult | Requests | kg:/m/012hn1l3 | Requests is a Python HTTP library, released un... | https://en.wikipedia.org/wiki/Requests_(software) | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | NaN | [Thing, SoftwareApplication] |
8 | 171.786148 | EntitySearchResult | Python | kg:/m/01v25c | The Rafael Python is a family of air-to-air mi... | https://en.wikipedia.org/wiki/Python_(missile) | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | NaN | [Thing] |
9 | 160.946594 | EntitySearchResult | Python Imaging Library | kg:/m/06rx86 | Python Imaging Library is a free and open-sour... | https://en.wikipedia.org/wiki/Python_Imaging_L... | https://en.wikipedia.org/wiki/Wikipedia:Text_o... | NaN | [Thing, SoftwareApplication] |
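Column names such as `result.detailedDescription.url` are what you get when nested JSON is flattened with dotted keys. The sketch below shows that flattening step on a single hand-written item shaped like a Knowledge Graph result. The `flatten` helper is hypothetical and the item is abridged. In Pandas, `pd.json_normalize()` does a similar job, though I’m not asserting that is what the package uses internally.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with a separator."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as they are
            flat[full_key] = value
    return flat


# Hand-written, abridged item based on the first row of the table above
item = {
    "@type": "EntitySearchResult",
    "resultScore": 15315.625977,
    "result": {
        "@id": "kg:/m/05z1_",
        "name": "Python",
        "@type": ["Thing", "Brand"],
        "description": "High-level programming language",
    },
}

row = flatten(item)
for key, value in row.items():
    print(key, "=", value)
```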
The `query_google_search_console()` function runs a search query on the Google Search Console API and returns data in a Pandas dataframe. This function requires a JSON client secrets key with access to the Google Search Console API.
from ecommercetools import seo
key = "google-search-console.json"
site_url = "http://flyandlure.org"
payload = {
'startDate': "2019-01-01",
'endDate': "2019-12-31",
'dimensions': ["page", "device", "query"],
'rowLimit': 100,
'startRow': 0
}
df = seo.query_google_search_console(key, site_url, payload)
print(df.head())
| | page | device | query | clicks | impressions | ctr | position |
---|---|---|---|---|---|---|---|
0 | http://flyandlure.org/articles/fly_fishing_gea... | MOBILE | simms freestone waders review | 56 | 217 | 25.81 | 3.12 |
1 | http://flyandlure.org/ | MOBILE | fly and lure | 37 | 159 | 23.27 | 3.81 |
2 | http://flyandlure.org/articles/fly_fishing_gea... | DESKTOP | orvis encounter waders review | 35 | 134 | 26.12 | 4.04 |
3 | http://flyandlure.org/articles/fly_fishing_gea... | DESKTOP | simms freestone waders review | 35 | 200 | 17.50 | 3.50 |
4 | http://flyandlure.org/ | DESKTOP | fly and lure | 32 | 170 | 18.82 | 3.09 |
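The payload above asks for 100 rows starting at row 0, so fetching more means repeating the query with a larger `startRow`. Here is a small sketch of generating those follow-up payloads. The `paged_payloads` helper is hypothetical and not part of EcommerceTools.

```python
import copy


def paged_payloads(payload, pages):
    """Yield copies of the payload with startRow advanced by rowLimit per page."""
    row_limit = payload["rowLimit"]
    first_row = payload.get("startRow", 0)
    for page in range(pages):
        # Deep copy so the caller's payload and its lists are never modified
        next_payload = copy.deepcopy(payload)
        next_payload["startRow"] = first_row + page * row_limit
        yield next_payload


payload = {
    "startDate": "2019-01-01",
    "endDate": "2019-12-31",
    "dimensions": ["page", "device", "query"],
    "rowLimit": 100,
    "startRow": 0,
}

start_rows = [p["startRow"] for p in paged_payloads(payload, 3)]
print(start_rows)
```

Each generated payload could then be passed to `seo.query_google_search_console(key, site_url, p)` and the resulting dataframes combined with `pd.concat()`. A sensible stopping rule is to quit once a page comes back with fewer rows than `rowLimit`.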
The `get_indexed_pages()` function uses the “site:” prefix to search Google for the number of pages “indexed”. The figure Google reports is only an estimate, but it’s usually a good guide to site “size” in the absence of other data.
from ecommercetools import seo
urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer', 'http://flyandlure.org']
df = seo.get_indexed_pages(urls)
print(df.head())
| | url | indexed_pages |
---|---|---|
2 | http://flyandlure.org | 2090 |
1 | https://www.bbc.co.uk/iplayer | 215000 |
0 | https://www.bbc.co.uk | 12700000 |
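For illustration only, the sketch below shows the two small steps a “site:” lookup involves: turning a URL into a query and turning a result-count string into an integer. Both helper names are invented, and the “About 2,090 results” wording is an assumption about how Google phrases the count, which can vary by locale and over time.

```python
import re
from urllib.parse import urlparse


def site_query(url):
    """Build a Google 'site:' query from a URL by dropping the scheme."""
    parts = urlparse(url)
    # Keep the host and any path so sections such as /iplayer stay scoped
    return "site:" + parts.netloc + parts.path.rstrip("/")


def parse_result_count(text):
    """Return the first number in the text as an int, or 0 if there is none."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else 0


print(site_query("https://www.bbc.co.uk/iplayer"))
print(parse_result_count("About 2,090 results"))
```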
The `google_autocomplete()` function returns a set of keyword suggestions from Google Autocomplete. The `include_expanded=True` argument allows you to expand the number of suggestions shown by appending prefixes and suffixes to the search terms.
from ecommercetools import seo
suggestions = seo.google_autocomplete("data science", include_expanded=False)
print(suggestions)
suggestions = seo.google_autocomplete("data science", include_expanded=True)
print(suggestions)
| | term | relevance |
---|---|---|
0 | data science jobs | 650 |
1 | data science jobs chester | 601 |
2 | data science course | 600 |
3 | data science masters | 554 |
4 | data science salary | 553 |
5 | data science internship | 552 |
6 | data science jobs london | 551 |
7 | data science graduate scheme | 550 |
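To make the expansion idea concrete, here is a minimal sketch that generates extra queries by adding prefixes and suffixes to a seed term. The `expand_queries` helper and the word lists are assumptions chosen for the example. The package’s own prefixes and suffixes may well differ.

```python
def expand_queries(term, prefixes, suffixes):
    """Return the term plus prefixed and suffixed variants, without repeats."""
    queries = [term]
    queries += [f"{prefix} {term}" for prefix in prefixes]
    queries += [f"{term} {suffix}" for suffix in suffixes]
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(queries))


queries = expand_queries("data science", ["how", "best"], ["for", "vs"])
print(queries)
```

Each expanded query would then be sent to autocomplete separately and the suggestions merged, which is why the expanded mode returns many more terms.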
The `get_robots()` function returns the contents of a `robots.txt` file in a Pandas dataframe so it can be parsed and analysed.
from ecommercetools import seo
robots = seo.get_robots("http://www.flyandlure.org/robots.txt")
print(robots)
| | directive | parameter |
---|---|---|
0 | User-agent | * |
1 | Disallow | /signin |
2 | Disallow | /signup |
3 | Disallow | /users |
4 | Disallow | /contact |
5 | Disallow | /activate |
6 | Disallow | /*/page |
7 | Disallow | /articles/search |
8 | Disallow | /search.php |
9 | Disallow | *q=* |
10 | Disallow | *category_slug=* |
11 | Disallow | *country_slug=* |
12 | Disallow | *county_slug=* |
13 | Disallow | *features=* |
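Once the rules are in a dataframe, one obvious analysis is checking which paths they block. The sketch below is a deliberately simplified matcher, written under the assumption that `*` matches any run of characters and a trailing `$` anchors the end of the path. It ignores `Allow` rules, user-agent groups and longest-match precedence, and the helper names are made up.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal parts and join them with '.*' where '*' appeared
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path, rules):
    """Return True if any non-empty rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in rules if rule)


# A few of the Disallow parameters from the table above
rules = ["/signin", "/*/page", "/articles/search", "*q=*"]

print(is_disallowed("/articles/page", rules))
print(is_disallowed("/articles/fly_fishing", rules))
print(is_disallowed("/search.php?q=trout", rules))
```

With the dataframe from `get_robots()`, the rules list could be built with something like `robots[robots["directive"] == "Disallow"]["parameter"].tolist()`.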
The `get_serps()` function is a quick and easy way to scrape Google search results using Python. It takes a search term and returns a Pandas dataframe containing the title, link and text of each result on the Google search engine results page.
It’s only designed for infrequent use and doesn’t include any features to prevent it from being blocked. If you want to perform large-scale web scraping of Google SERPs then you’ll need a much more sophisticated solution.
from ecommercetools import seo
serps = seo.get_serps("data science blog")
print(serps)
| | title | link | text |
---|---|---|---|
0 | 10 of the best data science blogs to follow - ... | https://www.tableau.com/learn/articles/data-sc... | 10 of the best data science blogs to follow. T... |
1 | Best Data Science Blogs to Follow in 2020 | by... | https://towardsdatascience.com/best-data-scien... | 14 Jul 2020 — 1. Towards Data Science · Joined... |
2 | Top 20 Data Science Blogs And Websites For Dat... | https://medium.com/@exastax/top-20-data-scienc... | Top 20 Data Science Blogs And Websites For Dat... |
3 | Data Science Blog – Dataquest | https://www.dataquest.io/blog/ | Browse our data science blog to get helpful ti... |
4 | 51 Awesome Data Science Blogs You Need To Chec... | https://365datascience.com/trending/51-data-sc... | Blog name: DataKind · datakind data science bl... |
5 | Blogs on AI, Analytics, Data Science, Machine ... | https://www.kdnuggets.com/websites/blogs.html | Individual/small group blogs · Ai4 blog, featu... |
6 | Data Science Blog – Applied Data Science | https://data-science-blog.com/ | ... an Bedeutung – DevOps for Data Science. De... |
7 | Top 10 Data Science and AI Blogs in 2020 - Liv... | https://livecodestream.dev/post/top-data-scien... | Some of the best data science and AI blogs for... |
8 | Data Science Blogs: 17 Must-Read Blogs for Dat... | https://www.thinkful.com/blog/data-science-blogs/ | Data scientists could be considered the magici... |
9 | rushter/data-science-blogs: A curated list of ... | https://github.com/rushter/data-science-blogs | A curated list of data science blogs. Contribu... |
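A common next step with SERP data is to reduce each link to its domain so you can see who ranks. Here is a short sketch using only the links that appear untruncated in the output above. The `domain_of` helper is an invention for this example.

```python
from collections import Counter
from urllib.parse import urlparse

# Links copied from the rows above that were not truncated
links = [
    "https://www.dataquest.io/blog/",
    "https://www.kdnuggets.com/websites/blogs.html",
    "https://data-science-blog.com/",
    "https://www.thinkful.com/blog/data-science-blogs/",
    "https://github.com/rushter/data-science-blogs",
]


def domain_of(link):
    """Return the host of a URL in lower case, without a leading 'www.'."""
    host = urlparse(link).netloc.lower()
    return host[4:] if host.startswith("www.") else host


domains = Counter(domain_of(link) for link in links)
print(domains.most_common())
```

On a real dataframe the same helper could be applied with `serps["link"].apply(domain_of)` to add a domain column.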
Matt Clarke, Saturday, March 20, 2021