The Screaming Frog SEO Spider Tool is widely used in digital marketing and ecommerce. It provides a user-friendly interface to a powerful site crawler and scraper that can be used to analyse technical SEO and content issues on sites of all sizes.
While Screaming Frog is most commonly used via its graphical user interface, you can also access the spider via the command line, which can allow you to automate crawls, scrape or fetch specific data, and export the spider’s output to CSV, so it can be used in other applications.
In this web scraping project I’ll show you how to set up Screaming Frog to run from the command line in Ubuntu, so you can crawl your site, look for issues, store the data in CSV files, and analyse the results using Pandas.
Screaming Frog is available free for sites with fewer than 500 URLs, but you’ll need to buy a licence to use it on larger sites. It costs £149 per year, but I think this is well worth it if you have a monetised website to maintain. On Ubuntu, cd into the hidden .ScreamingFrogSEOSpider directory in your home directory and view the contents.
cd ~/.ScreamingFrogSEOSpider/
ls
If you’ve used the application already, and have added your licence key, you should find that licence.txt is populated already. If it’s empty, open the file and add your credentials. I used gedit licence.txt to check.
matt@SonOfAnton:~/.ScreamingFrogSEOSpider$ ls
analytics log4j2.xml safety.lck
chromium machine-id.txt spellCheckIgnoreWords.txt
crash.txt pid_info spider.config
jxbrowser-crash-reports prefs temp
jxbrowser.log ProjectInstanceData trace.txt
jxbrowser.log.1 renderinit.lck
licence.txt running.lck
Add your username to the first line of the file and your licence key to the second line, then save the file. Next, open up the spider.config file using gedit spider.config. Check that a line saying eula.accepted=10 or eula.accepted=9 is present. If not, add it.
yourusername
XXXXXXXX-XXXXXXXX-XXXXXXX
While you’re inside the spider.config file, add a line to the bottom to set the storage.mode property. By default, Screaming Frog stores data in memory, but it’s recommended that you store data in the built-in database instead by adding this parameter.
storage.mode=DB
If you have a powerful desktop machine (or server) you can assign extra memory too. Adding the line below to spider.config will assign 8 GB of memory to the spider, which can make things a bit quicker.
-Xmx8g
Screaming Frog is very powerful and lets you connect to various APIs, such as Google Analytics, Google PageSpeed, Google Search Console, Moz, and Ahrefs, to augment your crawl data and discover new issues. You can configure the command line crawler to use these APIs, but we’ll skip this for now and just get started with a basic crawl.
To access the Screaming Frog command line, open an Ubuntu terminal and enter screamingfrogseospider --help. This will return a page of information containing all the available crawl options.
screamingfrogseospider --help
usage: ScreamingFrogSEOSpider [crawl-file|options]
Positional arguments:
crawl-file
Specify a crawl to load. This argument will be ignored if there
are any other options specified
Options:
--crawl <url>
Start crawling the supplied URL
--crawl-list <list file>
Start crawling the specified URLs in list mode
--config <config>
Supply a config file for the spider to use
--task-name <task name>
Option to name this invocation of the SEO Spider. Will be used
as the crawl name when in DB storage mode
--project-name <project name>
Db Storage Mode option sets project name of crawl. This argument
will be ignored if in Memory storage mode
--use-majestic
Use Majestic API during crawl
--use-mozscape
Use Mozscape API during crawl
--use-ahrefs
Use Ahrefs API during crawl
--use-pagespeed
Use PageSpeed API during crawl
--use-google-analytics <google account> <account> <property> <view> <segment>
Use Google Analytics API during crawl
--use-google-search-console <google account> <website>
Use Google Search Console API during crawl
--headless
Run in silent mode without a user interface
--output-folder <output>
Where to store saved files. Default: current working directory
--google-drive-account <google account>
Google Drive Account for export
--export-format <csv|xls|xlsx|gsheet>
Supply a format to be used for all exports
--overwrite
Overwrite files in output directory
--timestamped-output
Create a timestamped folder in the output directory, and store
all output there
--save-crawl
Save the completed crawl
--export-tabs <tab:filter,...>
Supply a comma separated list of tabs to export. You need to
specify the tab name and the filter name separated by a colon
--bulk-export <[submenu:]export,...>
Supply a comma separated list of bulk exports to perform. The
export names are the same as in the Bulk Export menu in the UI.
To access exports in a submenu, use <submenu-name:export-name>
--save-report <[submenu:]report,...>
Supply a comma separated list of reports to save. The report
names are the same as in the Report menu in the UI. To access
reports in a submenu, use <submenu-name:report-name>
--create-sitemap
Creates a sitemap from the completed crawl
--create-images-sitemap
Creates an images sitemap from the completed crawl
-h, --help
Print this message and exit
To run a basic crawl we first call the screamingfrogseospider application and pass in the --crawl argument with the domain of the site we wish to scrape, plus the --headless argument, which ensures the crawl runs on the command line without opening the desktop application.
The --save-crawl and --output-folder /home/matt/Development/crawls/pds arguments save the completed crawl to a folder in my home directory (which needs to exist already), while the --export-tabs argument defines which of the tabs from the interface we want to export to CSV. Passing the --timestamped-output argument creates a timestamped folder inside the output directory on each crawl, to prevent previous data being overwritten.
screamingfrogseospider --crawl https://www.practicaldatascience.co.uk --headless \
--save-crawl --output-folder /home/matt/Development/crawls/pds \
--export-tabs "Internal:All" \
--timestamped-output
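If you want to automate crawls, for example to run one each week, one option is to wrap the same command in a short Python script and schedule it with cron. The sketch below is just that, a rough sketch based on the command above rather than an official approach, so swap the domain and folder for your own.
import subprocess

# Run the same headless crawl from a script so it can be scheduled with cron
# or triggered from another program. The domain and output folder are taken
# from the example above - replace them with your own.
subprocess.run([
    "screamingfrogseospider",
    "--crawl", "https://www.practicaldatascience.co.uk",
    "--headless",
    "--save-crawl",
    "--output-folder", "/home/matt/Development/crawls/pds",
    "--export-tabs", "Internal:All",
    "--timestamped-output",
], check=True)  # check=True raises an error if the spider exits abnormally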
After running the crawl, if you look inside the output folder you defined you’ll find two files: the crawl.seospider file and internal_all.csv. Clicking the crawl.seospider file should open it in the desktop app, so you can examine the data there. Your crawl results are stored in internal_all.csv, so you can view and manipulate them in Pandas if you wish.
import pandas as pd

# Load the Internal:All export and preview the first few rows
df = pd.read_csv('internal_all.csv')
df.head()
| Address | Content Type | Status Code | Status | Indexability | Indexability Status | Title 1 | Title 1 Length | Title 1 Pixel Width | Title 2 | ... | Spelling Errors | Grammar Errors | Hash | Response Time | Last Modified | Redirect URL | Redirect Type | Cookies | HTTP Version | URL Encoded Address |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.practicaldatascience.co.uk/ | text/plain | 301 | Moved Permanently | Non-Indexable | Redirected | NaN | 0 | 0 | NaN | ... | NaN | NaN | NaN | 0.233 | NaN | https://practicaldatascience.co.uk/ | HTTP Redirect | NaN | HTTP/1.1 | https://www.practicaldatascience.co.uk/ |
1 | https://practicaldatascience.co.uk/ | text/html; charset=UTF-8 | 200 | OK | Indexable | NaN | Home | Practical Data Science | 29 | 268 | Home | Practical Data Science | ... | NaN | NaN | 95216013995b2777477420f8d83e373c | 0.140 | NaN | NaN | NaN | NaN | HTTP/1.1 | https://practicaldatascience.co.uk/ |
2 | https://practicaldatascience.co.uk/tag/web-ana... | text/html; charset=UTF-8 | 301 | Moved Permanently | Non-Indexable | Redirected | NaN | 0 | 0 | NaN | ... | NaN | NaN | NaN | 0.143 | NaN | https://practicaldatascience.co.uk/tag/web-ana... | HTTP Redirect | NaN | HTTP/1.1 | https://practicaldatascience.co.uk/tag/web-ana... |
3 | https://practicaldatascience.co.uk/assets/css/... | text/css; charset=UTF-8 | 200 | OK | Indexable | NaN | NaN | 0 | 0 | NaN | ... | NaN | NaN | NaN | 0.440 | NaN | NaN | NaN | NaN | HTTP/1.1 | https://practicaldatascience.co.uk/assets/css/... |
4 | https://practicaldatascience.co.uk/tag/keras | text/html; charset=UTF-8 | 301 | Moved Permanently | Non-Indexable | Redirected | NaN | 0 | 0 | NaN | ... | NaN | NaN | NaN | 0.459 | NaN | https://practicaldatascience.co.uk/tag/keras/ | HTTP Redirect | NaN | HTTP/1.1 | https://practicaldatascience.co.uk/tag/keras |
5 rows × 58 columns
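From here you can start analysing the crawl however you like. As a rough sketch, the snippet below uses a few of the columns shown above, such as Status Code, Indexability and Title 1 Length, to summarise response codes, list non-indexable URLs and flag pages with short titles. The 30 character threshold is just an illustrative assumption, and column names can vary between Screaming Frog versions, so it’s worth checking df.columns first.
# Count URLs by response code to spot redirects and errors
df['Status Code'].value_counts()

# List non-indexable URLs and the reason they can't be indexed
non_indexable = df[df['Indexability'] == 'Non-Indexable']
non_indexable[['Address', 'Status Code', 'Indexability Status']]

# Flag HTML pages with short titles (30 characters is just an illustrative threshold)
html_pages = df[df['Content Type'].str.contains('text/html', na=False)]
short_titles = html_pages[html_pages['Title 1 Length'] < 30]
short_titles[['Address', 'Title 1', 'Title 1 Length']]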
In addition to the "Internal:All" data we exported above, we can also export the data from each of the horizontal tabs on the main Screaming Frog interface to a CSV file, so we can access them in Pandas. These use the format Tab Name:Filter, so URL:All gives you all the data in the URL tab. Unfortunately, you can’t export the useful metrics shown in the right hand column of the interface, so you’ll need to recalculate them manually. Here they all are in one command.
screamingfrogseospider --crawl https://www.practicaldatascience.co.uk --headless \
--save-crawl --output-folder /home/matt/Development/crawls/pds \
--timestamped-output \
--export-tabs "Internal:All,\
External:All,\
Security:All,\
Response Codes:All,\
URL:All,\
Page Titles:All,\
Meta Description:All,\
Meta Keywords:All,\
H1:All,\
H2:All,\
Content:All,\
Images:All,\
Canonicals:All,\
Pagination:All,\
Directives:All,\
Hreflang:All,\
AJAX:All,\
AMP:All,\
Structured Data:All,\
Sitemaps:All,\
PageSpeed:All,\
Custom Search:All,\
Custom Extraction:All,\
Analytics:All,\
Search Console:All,\
Link Metrics:All"
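Each export ends up as a separate CSV in the timestamped folder, so one convenient way to work with them is to load every CSV into a dictionary of Pandas dataframes. The sketch below assumes the output folder used in the command above; the timestamped subfolder name depends on when your crawl ran, so replace the placeholder with the directory that was actually created.
import glob
import os
import pandas as pd

# Replace this with the timestamped folder your crawl created
crawl_folder = '/home/matt/Development/crawls/pds/your-timestamped-folder'

# Load every exported CSV into a dictionary of dataframes keyed by filename,
# e.g. dataframes['internal_all'] for the Internal:All export
dataframes = {}
for path in glob.glob(os.path.join(crawl_folder, '*.csv')):
    name = os.path.splitext(os.path.basename(path))[0]
    dataframes[name] = pd.read_csv(path)

print(sorted(dataframes.keys()))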
Matt Clarke, Thursday, March 11, 2021