How to use EcommerceTools for technical SEO

The EcommerceTools package lets you check SERPs, examine robots.txt files, analyse Core Web Vitals, and much more. Here's how to use it.

Picture by Stephen Phillips, Unsplash.

There’s often a lot of faffing around required to get marketing and ecommerce data from various systems into Pandas so you can analyse it, or use it within more complex models. I built the EcommerceTools Python package to take the hassle out of this process and make it quick and easy to fetch and analyse data.

At the moment, it only does the basics, but it’s still useful for my daily ecommerce and marketing work. The aim is to eventually create a package that includes all the tools I need to analyse ecommerce, marketing, and SEO data and create models. In this article I’ll explain how you can use it to analyse technical SEO data.

Install the packages

To get started, open a Jupyter notebook and install the EcommerceTools Python package from PyPI by entering !pip3 install ecommercetools in a code cell and executing it. Then import Pandas and the seo module from ecommercetools.

from ecommercetools import seo
import pandas as pd

1. Discover XML sitemap locations

First, we’ll take a look at the XML sitemap features. The get_sitemaps() function takes the location of a robots.txt file (always stored at the root of a domain) and returns a Python list containing the URL of each XML sitemap referenced within it.

from ecommercetools import seo

sitemaps = seo.get_sitemaps("http://www.bbc.co.uk/robots.txt")
sitemaps
['http://www.bbc.co.uk/sitemaps/index-uk-archive.xml',
 'http://www.bbc.co.uk/sitemaps/index-uk-news.xml',
 'http://www.bbc.co.uk/video_sitemap.xml',
 'http://www.bbc.co.uk/sitemap.xml',
 'https://www.bbc.co.uk/food/sitemap.xml',
 'http://www.bbc.co.uk/sitemap.xml',
 'http://www.bbc.co.uk/mobile_sitemap.xml',
 'http://www.bbc.co.uk/sitemap.xml',
 'https://www.bbc.co.uk/ideas/sitemap.xml']
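As the BBC output above shows, the list can include the same sitemap more than once. If you want each URL only once before fetching them, a one-liner sketch (using a shortened stand-in list, not a live call to seo.get_sitemaps()):

```python
# Stand-in for the list returned by seo.get_sitemaps(), which can contain duplicates.
sitemaps = [
    'http://www.bbc.co.uk/sitemaps/index-uk-archive.xml',
    'http://www.bbc.co.uk/sitemap.xml',
    'http://www.bbc.co.uk/video_sitemap.xml',
    'http://www.bbc.co.uk/sitemap.xml',
]

# dict.fromkeys() removes duplicates while preserving the original order.
unique_sitemaps = list(dict.fromkeys(sitemaps))
print(unique_sitemaps)
```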

2. Read an XML sitemap into Pandas

The get_sitemap() function allows you to download the URLs in an XML sitemap to a Pandas dataframe. If the sitemap contains child sitemaps, each of these will be retrieved too. You can save the Pandas dataframe to CSV in the usual way.

from ecommercetools import seo

df = seo.get_sitemap("http://flyandlure.org/sitemap.xml")
print(df.head())
loc changefreq priority domain sitemap_name
0 http://flyandlure.org/ hourly 1.0 flyandlure.org http://www.flyandlure.org/sitemap.xml
1 http://flyandlure.org/about monthly 1.0 flyandlure.org http://www.flyandlure.org/sitemap.xml
2 http://flyandlure.org/terms monthly 1.0 flyandlure.org http://www.flyandlure.org/sitemap.xml
3 http://flyandlure.org/privacy monthly 1.0 flyandlure.org http://www.flyandlure.org/sitemap.xml
4 http://flyandlure.org/copyright monthly 1.0 flyandlure.org http://www.flyandlure.org/sitemap.xml

3. Get Core Web Vitals from PageSpeed Insights

You can also obtain site performance data. The get_core_web_vitals() function retrieves the Core Web Vitals metrics for a list of sites from the Google PageSpeed Insights API and returns results in a Pandas dataframe. The function requires a Google PageSpeed Insights API key.

from ecommercetools import seo

pagespeed_insights_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer']
df = seo.get_core_web_vitals(pagespeed_insights_key, urls)
print(df.head())
final_url fetch_time form_factor overall_score speed_index first_meaningful_paint first_contentful_paint time_to_interactive total_blocking_time cumulative_layout_shift
0 https://practicaldatascience.co.uk/ 2021-03-27T10:56:26.497Z mobile 74.0 79.0 57.0 61.0 90.0 100 100
3 https://practicaldatascience.co.uk/ 2021-03-27T10:57:03.226Z desktop 95.0 97.0 86.0 87.0 100.0 100 100
1 https://practicaldatascience.co.uk/about 2021-03-27T10:56:37.058Z mobile 69.0 85.0 61.0 61.0 82.0 100 62
4 https://practicaldatascience.co.uk/about 2021-03-27T10:57:16.035Z desktop 94.0 96.0 86.0 87.0 100.0 100 67
2 https://practicaldatascience.co.uk/machine-lea... 2021-03-27T10:56:48.098Z mobile 33.0 52.0 57.0 61.0 19.0 32 82
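Because the function returns mobile and desktop rows for each URL, a pivot makes it easy to compare form factors side by side. A sketch with stand-in rows (a hypothetical example.com domain, not the live API):

```python
import pandas as pd

# Stand-in rows in the shape returned by seo.get_core_web_vitals()
df = pd.DataFrame({
    "final_url": ["https://example.com/", "https://example.com/",
                  "https://example.com/about", "https://example.com/about"],
    "form_factor": ["mobile", "desktop", "mobile", "desktop"],
    "overall_score": [74.0, 95.0, 69.0, 94.0],
})

# Pivot so each URL has its mobile and desktop score side by side
scores = df.pivot(index="final_url", columns="form_factor", values="overall_score")
scores["gap"] = scores["desktop"] - scores["mobile"]
print(scores)
```

The "gap" column highlights pages where the mobile experience lags furthest behind desktop, which is usually where optimisation effort pays off first.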

4. Get Google Knowledge Graph data

The get_knowledge_graph() function returns the Google Knowledge Graph data for a given search term. This requires the use of a Google Knowledge Graph API key. By default, the function returns output in a Pandas dataframe, but you can pass the output="json" argument if you wish to receive the JSON data back.

from ecommercetools import seo

knowledge_graph_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
knowledge_graph = seo.get_knowledge_graph(knowledge_graph_key, "python", output="dataframe")
print(knowledge_graph)
resultScore @type result.name result.@id result.detailedDescription.articleBody result.detailedDescription.url result.detailedDescription.license result.description result.@type
0 15315.625977 EntitySearchResult Python kg:/m/05z1_ Python is an interpreted, high-level and gener... https://en.wikipedia.org/wiki/Python_(programm... https://en.wikipedia.org/wiki/Wikipedia:Text_o... High-level programming language [Thing, Brand]
1 1671.793579 EntitySearchResult Python family kg:/m/05tb5 The Pythonidae, commonly known as pythons, are... https://en.wikipedia.org/wiki/Pythonidae https://en.wikipedia.org/wiki/Wikipedia:Text_o... Snake [Thing]
2 1301.166748 EntitySearchResult Pythons kg:/m/0cv6_m Python is a genus of constricting snakes in th... https://en.wikipedia.org/wiki/Python_(genus) https://en.wikipedia.org/wiki/Wikipedia:Text_o... Snake [Thing]
3 497.687103 EntitySearchResult CPython kg:/m/06bxxb CPython is the reference implementation of the... https://en.wikipedia.org/wiki/CPython https://en.wikipedia.org/wiki/Wikipedia:Text_o... NaN [Thing, SoftwareApplication]
4 378.672913 EntitySearchResult Python kg:/m/0l8ry In Greek mythology, Python was the serpent, so... https://en.wikipedia.org/wiki/Python_(mythology) https://en.wikipedia.org/wiki/Wikipedia:Text_o... NaN [Thing]
5 312.430939 EntitySearchResult Reticulated python kg:/m/0m5qz The reticulated python is a python species nat... https://en.wikipedia.org/wiki/Reticulated_python https://en.wikipedia.org/wiki/Wikipedia:Text_o... Snake [Thing]
6 283.799957 EntitySearchResult Python kg:/m/02rg562 Python is a double-loop corkscrew roller coast... https://en.wikipedia.org/wiki/Python_(Efteling) https://en.wikipedia.org/wiki/Wikipedia:Text_o... Roller coaster in Kaatsheuvel, Netherlands [Thing, TouristAttraction]
7 203.535995 EntitySearchResult Requests kg:/m/012hn1l3 Requests is a Python HTTP library, released un... https://en.wikipedia.org/wiki/Requests_(software) https://en.wikipedia.org/wiki/Wikipedia:Text_o... NaN [Thing, SoftwareApplication]
8 171.786148 EntitySearchResult Python kg:/m/01v25c The Rafael Python is a family of air-to-air mi... https://en.wikipedia.org/wiki/Python_(missile) https://en.wikipedia.org/wiki/Wikipedia:Text_o... NaN [Thing]
9 160.946594 EntitySearchResult Python Imaging Library kg:/m/06rx86 Python Imaging Library is a free and open-sour... https://en.wikipedia.org/wiki/Python_Imaging_L... https://en.wikipedia.org/wiki/Wikipedia:Text_o... NaN [Thing, SoftwareApplication]
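Since the result.@type column holds a list of entity types, you can filter the dataframe to a single entity class. A sketch with a few stand-in rows from the output above:

```python
import pandas as pd

# Stand-in rows in the shape returned by seo.get_knowledge_graph()
df = pd.DataFrame({
    "result.name": ["Python", "Python family", "CPython"],
    "resultScore": [15315.6, 1671.8, 497.7],
    "result.@type": [["Thing", "Brand"], ["Thing"], ["Thing", "SoftwareApplication"]],
})

# Keep only the software-related entities
software = df[df["result.@type"].apply(lambda types: "SoftwareApplication" in types)]
print(software["result.name"].tolist())
```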

5. Get Google Search Console API data

The query_google_search_console() function runs a search query on the Google Search Console API and returns data in a Pandas dataframe. This function requires a JSON client secrets key with access to the Google Search Console API.

from ecommercetools import seo

key = "google-search-console.json"
site_url = "http://flyandlure.org"
payload = {
    'startDate': "2019-01-01",
    'endDate': "2019-12-31",
    'dimensions': ["page", "device", "query"],
    'rowLimit': 100,
    'startRow': 0
}

df = seo.query_google_search_console(key, site_url, payload)
print(df.head())

page device query clicks impressions ctr position
0 http://flyandlure.org/articles/fly_fishing_gea... MOBILE simms freestone waders review 56 217 25.81 3.12
1 http://flyandlure.org/ MOBILE fly and lure 37 159 23.27 3.81
2 http://flyandlure.org/articles/fly_fishing_gea... DESKTOP orvis encounter waders review 35 134 26.12 4.04
3 http://flyandlure.org/articles/fly_fishing_gea... DESKTOP simms freestone waders review 35 200 17.50 3.50
4 http://flyandlure.org/ DESKTOP fly and lure 32 170 18.82 3.09
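Once the Search Console data is in a dataframe, a common next step is splitting branded from non-branded queries. A sketch using stand-in rows from the output above (the brand phrase is just an illustrative assumption):

```python
import pandas as pd

# Stand-in for the dataframe returned by seo.query_google_search_console()
df = pd.DataFrame({
    "query": ["fly and lure", "simms freestone waders review", "fly and lure jobs"],
    "clicks": [37, 56, 5],
})

# Flag branded queries so brand and non-brand performance can be reported separately
brand_phrase = "fly and lure"  # hypothetical brand phrase
df["branded"] = df["query"].str.contains(brand_phrase)

branded_clicks = df.loc[df["branded"], "clicks"].sum()
print(branded_clicks)
```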

6. Get the number of “indexed” pages

The get_indexed_pages() function uses the “site:” search operator to ask Google how many pages it has indexed for each URL. The figure is very approximate, but in the absence of other data it’s usually a reasonable guide to a site’s size.

from ecommercetools import seo

urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer', 'http://flyandlure.org']
df = seo.get_indexed_pages(urls)
print(df.head())
url indexed_pages
2 http://flyandlure.org 2090
1 https://www.bbc.co.uk/iplayer 215000
0 https://www.bbc.co.uk 12700000

7. Scrape keyword suggestions from Google Autocomplete

The google_autocomplete() function returns a set of keyword suggestions from Google Autocomplete. The include_expanded=True argument allows you to expand the number of suggestions shown by appending prefixes and suffixes to the search terms.

from ecommercetools import seo

suggestions = seo.google_autocomplete("data science", include_expanded=False)
print(suggestions)

suggestions = seo.google_autocomplete("data science", include_expanded=True)
print(suggestions)
term relevance
0 data science jobs 650
1 data science jobs chester 601
2 data science course 600
3 data science masters 554
4 data science salary 553
5 data science internship 552
6 data science jobs london 551
7 data science graduate scheme 550
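When include_expanded=True is used, the same term can come back more than once with different relevance scores. A sketch that keeps each term’s highest score, using stand-in rows rather than a live call:

```python
import pandas as pd

# Stand-in for the dataframe returned by seo.google_autocomplete()
suggestions = pd.DataFrame({
    "term": ["data science jobs", "data science course", "data science jobs"],
    "relevance": [650, 600, 640],
})

# Sort by relevance, then keep only the first (highest-scoring) row per term
deduped = (suggestions.sort_values("relevance", ascending=False)
                      .drop_duplicates("term")
                      .reset_index(drop=True))
print(deduped)
```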

8. Retrieve robots.txt content

The get_robots() function returns the contents of a robots.txt file in a Pandas dataframe so it can be parsed and analysed.

from ecommercetools import seo

robots = seo.get_robots("http://www.flyandlure.org/robots.txt")
print(robots)
directive parameter
0 User-agent *
1 Disallow /signin
2 Disallow /signup
3 Disallow /users
4 Disallow /contact
5 Disallow /activate
6 Disallow /*/page
7 Disallow /articles/search
8 Disallow /search.php
9 Disallow *q=*
10 Disallow *category_slug=*
11 Disallow *country_slug=*
12 Disallow *county_slug=*
13 Disallow *features=*
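Once you have the directives, you can check whether a given URL is crawlable using Python’s standard library. A sketch using urllib.robotparser with a few of the plain-path rules above (note that robotparser doesn’t understand the wildcard rules like *q=*, so those are omitted here):

```python
from urllib.robotparser import RobotFileParser

# A few of the directives shown above, reassembled into robots.txt format
robots_txt = """User-agent: *
Disallow: /signin
Disallow: /signup
Disallow: /users
"""

# The standard library's parser answers "may this agent crawl this URL?"
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "http://www.flyandlure.org/signin"))    # blocked
print(parser.can_fetch("*", "http://www.flyandlure.org/articles"))  # allowed
```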

9. Scrape Google search engine results

The get_serps() function is one of the quickest and easiest ways to scrape Google search results using Python. It takes a search term and returns a Pandas dataframe containing the Google search engine results.

This is only designed for infrequent use and doesn’t include any features to prevent it from being blocked. If you want to perform large-scale web scraping of Google SERPs then you’ll need a much more sophisticated solution.

from ecommercetools import seo

serps = seo.get_serps("data science blog")
print(serps)
title link text
0 10 of the best data science blogs to follow - ... https://www.tableau.com/learn/articles/data-sc... 10 of the best data science blogs to follow. T...
1 Best Data Science Blogs to Follow in 2020 | by... https://towardsdatascience.com/best-data-scien... 14 Jul 2020 — 1. Towards Data Science · Joined...
2 Top 20 Data Science Blogs And Websites For Dat... https://medium.com/@exastax/top-20-data-scienc... Top 20 Data Science Blogs And Websites For Dat...
3 Data Science Blog – Dataquest https://www.dataquest.io/blog/ Browse our data science blog to get helpful ti...
4 51 Awesome Data Science Blogs You Need To Chec... https://365datascience.com/trending/51-data-sc... Blog name: DataKind · datakind data science bl...
5 Blogs on AI, Analytics, Data Science, Machine ... https://www.kdnuggets.com/websites/blogs.html Individual/small group blogs · Ai4 blog, featu...
6 Data Science Blog – Applied Data Science https://data-science-blog.com/ ... an Bedeutung – DevOps for Data Science. De...
7 Top 10 Data Science and AI Blogs in 2020 - Liv... https://livecodestream.dev/post/top-data-scien... Some of the best data science and AI blogs for...
8 Data Science Blogs: 17 Must-Read Blogs for Dat... https://www.thinkful.com/blog/data-science-blogs/ Data scientists could be considered the magici...
9 rushter/data-science-blogs: A curated list of ... https://github.com/rushter/data-science-blogs A curated list of data science blogs. Contribu...
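A handy follow-up is reducing each result URL to its domain, so you can see which sites dominate the SERP. A sketch using stand-in links from the output above:

```python
from urllib.parse import urlparse

import pandas as pd

# Stand-in for the dataframe returned by seo.get_serps()
serps = pd.DataFrame({
    "link": [
        "https://www.tableau.com/learn/articles/data-science-blogs",
        "https://towardsdatascience.com/best-data-science-blogs",
        "https://www.dataquest.io/blog/",
    ],
})

# Reduce each result URL to its domain to see who ranks
serps["domain"] = serps["link"].apply(lambda url: urlparse(url).netloc)
print(serps["domain"].tolist())
```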

Matt Clarke, Saturday, March 20, 2021

Matt Clarke is a Digital Director who uses data science in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
