How to use Screaming Frog from the command line

The Screaming Frog SEO Spider is widely used in digital marketing and ecommerce and has a powerful command line interface to complement its GUI.

The Screaming Frog SEO Spider Tool is widely used in digital marketing and ecommerce. It provides a user-friendly interface to a powerful site crawler and scraper that can be used to analyse technical SEO and content issues on sites of all sizes.

While Screaming Frog is most commonly used via its graphical user interface, you can also access the spider via the command line, which can allow you to automate crawls, scrape or fetch specific data, and export the spider’s output to CSV, so it can be used in other applications.

In this project I’ll show you how to set up Screaming Frog to run from the command line in Ubuntu, so you can crawl your site, look for issues, store the data in CSV files, and analyse the results using Pandas.

Configure your licence key

Screaming Frog is available free for sites with fewer than 500 URLs, but you’ll need to buy a licence to use it on larger sites. It costs £149 per year, but I think this is well worth it if you have a monetised website to maintain.

On Ubuntu, cd into the hidden .ScreamingFrogSEOSpider directory in your home directory and view the contents.

cd ~/.ScreamingFrogSEOSpider/
ls

If you’ve used the application already, and have added your licence key, you should find that the licence.txt is populated already. If it’s empty, open the file and add your credentials. I used gedit licence.txt to check.

matt@SonOfAnton:~/.ScreamingFrogSEOSpider$ ls
analytics                log4j2.xml           safety.lck
chromium                 machine-id.txt       spellCheckIgnoreWords.txt
crash.txt                pid_info             spider.config
jxbrowser-crash-reports  prefs                temp
jxbrowser.log            ProjectInstanceData  trace.txt
jxbrowser.log.1          renderinit.lck
licence.txt              running.lck

Add your username to the first line of the file and your licence key to the second line, then save the file.

yourusername
XXXXXXXX-XXXXXXXX-XXXXXXX

Next, open the spider.config file using gedit spider.config. Check that a line saying eula.accepted=10 or eula.accepted=9 is present. If not, add it.
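As a quick sanity check before running headless, both files can be inspected from Python. This is a minimal sketch: the helper names licence_looks_valid, eula_accepted and check_setup are made up for illustration, and the code only checks the shape described in this section (two non-empty lines in licence.txt and an eula.accepted line in spider.config), not whether the key itself is valid.

```python
from pathlib import Path

# Assumed location, taken from the directory listing in this section.
CONFIG_DIR = Path.home() / ".ScreamingFrogSEOSpider"


def licence_looks_valid(text: str) -> bool:
    """Return True if the text holds exactly two non-empty lines (username, key)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return len(lines) == 2


def eula_accepted(config_text: str) -> bool:
    """Return True if the config text contains an eula.accepted=<number> line."""
    for line in config_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if key.strip() == "eula.accepted" and sep and value.strip().isdigit():
            return True
    return False


def check_setup(config_dir: Path = CONFIG_DIR) -> dict:
    """Read both files (if present) and report which checks pass."""
    licence = config_dir / "licence.txt"
    config = config_dir / "spider.config"
    return {
        "licence": licence.is_file() and licence_looks_valid(licence.read_text()),
        "eula": config.is_file() and eula_accepted(config.read_text()),
    }
```

Calling check_setup() returns a small dictionary, so a scheduled job could refuse to start a crawl when either value is False.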

Configure storage and memory

While you’re inside the spider.config file, add a line to the bottom to set the storage.mode property. By default, Screaming Frog stores data in memory, but it is recommended to store data in the built-in database by adding this parameter.

storage.mode=DB

If you have a powerful desktop machine (or server) you can assign extra memory too. On Linux, the memory setting lives in a separate file named .screamingfrogseospider in your home directory, rather than in spider.config. Adding the line below to that file (create it if it doesn’t exist) will assign 8 GB of memory to the spider, and can make things a bit quicker.

-Xmx8g
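If you would rather script the storage.mode change than make it by hand, something like the following could work. It is a sketch that assumes spider.config is a plain key=value file, as the eula.accepted and storage.mode lines suggest. The helpers set_property and enable_db_storage are hypothetical and not part of Screaming Frog.

```python
from pathlib import Path


def set_property(config_text: str, key: str, value: str) -> str:
    """Return config_text with key=value set, replacing any existing line for key."""
    new_line = f"{key}={value}"
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Compare only the part before the first "=" so values are ignored.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


def enable_db_storage(config_path: Path) -> None:
    """Rewrite the config file in place so that storage.mode=DB is present."""
    text = config_path.read_text() if config_path.exists() else ""
    config_path.write_text(set_property(text, "storage.mode", "DB"))
```

A call such as enable_db_storage(Path.home() / ".ScreamingFrogSEOSpider" / "spider.config") would apply the change. It is worth copying the file somewhere safe first, since the function overwrites it.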

Finding command line crawl options

Screaming Frog is very powerful and lets you connect to various APIs, such as Google Analytics, Google PageSpeed, Google Search Console, Moz, and Ahrefs, to augment your crawl data and discover new issues. You can configure the command line crawler to use these APIs, but we’ll skip this for now and just get started with a basic crawl.

To access the Screaming Frog command line, open an Ubuntu terminal and enter screamingfrogseospider --help. This will return a page of information containing all the available crawl options.

screamingfrogseospider --help

usage: ScreamingFrogSEOSpider [crawl-file|options]

Positional arguments:
    crawl-file
                         Specify a crawl to load. This argument will be ignored if there
                         are any other options specified

Options:
    --crawl <url>
                         Start crawling the supplied URL

    --crawl-list <list file>
                         Start crawling the specified URLs in list mode

    --config <config>
                         Supply a config file for the spider to use

    --task-name <task name>
                         Option to name this invocation of the SEO Spider. Will be used
                         as the crawl name when in DB storage mode

    --project-name <project name>
                         Db Storage Mode option sets project name of crawl. This argument
                         will be ignored if in Memory storage mode

    --use-majestic
                         Use Majestic API during crawl

    --use-mozscape
                         Use Mozscape API during crawl

    --use-ahrefs
                         Use Ahrefs API during crawl

    --use-pagespeed
                         Use PageSpeed API during crawl

    --use-google-analytics <google account> <account> <property> <view> <segment>
                         Use Google Analytics API during crawl

    --use-google-search-console <google account> <website>
                         Use Google Search Console API during crawl

    --headless
                         Run in silent mode without a user interface

    --output-folder <output>
                         Where to store saved files. Default: current working directory

    --google-drive-account <google account>
                         Google Drive Account for export

    --export-format <csv|xls|xlsx|gsheet>
                         Supply a format to be used for all exports

    --overwrite
                         Overwrite files in output directory

    --timestamped-output
                         Create a timestamped folder in the output directory, and store
                         all output there

    --save-crawl
                         Save the completed crawl

    --export-tabs <tab:filter,...>
                         Supply a comma separated list of tabs to export. You need to
                         specify the tab name and the filter name separated by a colon

    --bulk-export <[submenu:]export,...>
                         Supply a comma separated list of bulk exports to perform. The
                         export names are the same as in the Bulk Export menu in the UI.
                         To access exports in a submenu, use <submenu-name:export-name>

    --save-report <[submenu:]report,...>
                         Supply a comma separated list of reports to save. The report
                         names are the same as in the Report menu in the UI. To access
                         reports in a submenu, use <submenu-name:report-name>

    --create-sitemap
                         Creates a sitemap from the completed crawl

    --create-images-sitemap
                         Creates an images sitemap from the completed crawl

    -h, --help
                         Print this message and exit

Running a basic crawl

To run a basic crawl we call the screamingfrogseospider application and pass in the --crawl argument followed by the URL of the site we wish to crawl, plus the --headless argument, which ensures the crawl runs on the command line without opening the desktop application.

The --save-crawl argument saves the completed crawl, and --output-folder /home/matt/Development/crawls/pds sets the folder in my home directory where the files are written (it needs to exist already). The --export-tabs argument defines which tabs from the interface we want to export; since no --export-format is given here, the export is written as CSV. Passing the --timestamped-output argument creates a new timestamped directory inside the output folder on each crawl, which prevents previous data being overwritten.

screamingfrogseospider --crawl https://www.practicaldatascience.co.uk --headless \
--save-crawl --output-folder /home/matt/Development/crawls/pds \
--export-tabs "Internal:All" \
--timestamped-output
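If you want to trigger the same crawl from a Python script, for example as part of a scheduled job, the command can be assembled as an argument list and passed to subprocess. This is only a sketch: build_crawl_command and run_crawl are hypothetical helper names, and the flags are simply the ones used in the command above.

```python
import subprocess


def build_crawl_command(url, output_folder, export_tabs=("Internal:All",)):
    """Build the argument list for a headless crawl, mirroring the shell command."""
    return [
        "screamingfrogseospider",
        "--crawl", url,
        "--headless",
        "--save-crawl",
        "--output-folder", output_folder,
        "--export-tabs", ",".join(export_tabs),
        "--timestamped-output",
    ]


def run_crawl(url, output_folder, export_tabs=("Internal:All",)):
    """Run the crawl; raises CalledProcessError if the process exits non-zero."""
    command = build_crawl_command(url, output_folder, export_tabs)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with tab names that contain spaces, such as Response Codes:All.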

Examining command line crawl data

After running the crawl, if you look inside the output-folder you defined (in the timestamped subfolder created for this crawl) you’ll find two files: crawl.seospider and internal_all.csv. Opening the crawl.seospider file should load it in the desktop app, so you can examine the data there. Your crawl results are stored in internal_all.csv, so you can view and manipulate them in Pandas if you wish.

import pandas as pd
df = pd.read_csv('internal_all.csv')
df.head()

   Address                                            Content Type              Status Code  Status
0  https://www.practicaldatascience.co.uk/            text/plain                301          Moved Permanently
1  https://practicaldatascience.co.uk/                text/html; charset=UTF-8  200          OK
2  https://practicaldatascience.co.uk/tag/web-ana...  text/html; charset=UTF-8  301          Moved Permanently
3  https://practicaldatascience.co.uk/assets/css/...  text/css; charset=UTF-8   200          OK
4  https://practicaldatascience.co.uk/tag/keras       text/html; charset=UTF-8  301          Moved Permanently

   Indexability   Indexability Status  Title 1                        Title 1 Length  Title 1 Pixel Width  Title 2
0  Non-Indexable  Redirected           NaN                            0               0                    NaN
1  Indexable      NaN                  Home | Practical Data Science  29              268                  Home | Practical Data Science
2  Non-Indexable  Redirected           NaN                            0               0                    NaN
3  Indexable      NaN                  NaN                            0               0                    NaN
4  Non-Indexable  Redirected           NaN                            0               0                    NaN

   ...

   Spelling Errors  Grammar Errors  Hash                              Response Time  Last Modified
0  NaN              NaN             NaN                               0.233          NaN
1  NaN              NaN             95216013995b2777477420f8d83e373c  0.140          NaN
2  NaN              NaN             NaN                               0.143          NaN
3  NaN              NaN             NaN                               0.440          NaN
4  NaN              NaN             NaN                               0.459          NaN

   Redirect URL                                       Redirect Type  Cookies  HTTP Version  URL Encoded Address
0  https://practicaldatascience.co.uk/                HTTP Redirect  NaN      HTTP/1.1      https://www.practicaldatascience.co.uk/
1  NaN                                                NaN            NaN      HTTP/1.1      https://practicaldatascience.co.uk/
2  https://practicaldatascience.co.uk/tag/web-ana...  HTTP Redirect  NaN      HTTP/1.1      https://practicaldatascience.co.uk/tag/web-ana...
3  NaN                                                NaN            NaN      HTTP/1.1      https://practicaldatascience.co.uk/assets/css/...
4  https://practicaldatascience.co.uk/tag/keras/      HTTP Redirect  NaN      HTTP/1.1      https://practicaldatascience.co.uk/tag/keras

5 rows × 58 columns
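Because --timestamped-output writes each crawl to its own subfolder, a script first has to locate the newest one before it can read the export. The sketch below uses only the standard library so it runs without Pandas installed. The helper names are hypothetical, it assumes the newest crawl is the most recently modified subfolder, and it relies on the Status Code column shown in the output above.

```python
import csv
from collections import Counter
from pathlib import Path


def latest_crawl_dir(output_folder):
    """Return the most recently modified subfolder of output_folder."""
    subdirs = [p for p in Path(output_folder).iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"No crawl folders found in {output_folder}")
    return max(subdirs, key=lambda p: p.stat().st_mtime)


def status_code_counts(csv_path):
    """Count rows per value of the 'Status Code' column in an export file."""
    # utf-8-sig is a precaution: it strips a byte order mark if one is present
    # and reads plain UTF-8 files unchanged.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return Counter(row["Status Code"] for row in csv.DictReader(handle))
```

With these in place, status_code_counts(latest_crawl_dir("/home/matt/Development/crawls/pds") / "internal_all.csv") gives a quick tally of redirects and errors, and the same path can be handed to pd.read_csv() for fuller analysis.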

Exporting other report tabs

In addition to the "Internal:All" data we exported above, we can also export the data from each of the horizontal tabs on the main Screaming Frog interface to a CSV file, so we can access them in Pandas. These use the format Tab Name:Filter, so URL:All gives you all the data in the URL tab. Unfortunately, you can’t export the useful metrics in the right-hand column, so you’ll need to recalculate them manually. Here are all of the tabs in one command.

screamingfrogseospider --crawl https://www.practicaldatascience.co.uk --headless \
--save-crawl --output-folder /home/matt/Development/crawls/pds \
--timestamped-output \
--export-tabs "Internal:All,\
External:All,\
Security:All,\
Response Codes:All,\
URL:All,\
Page Titles:All,\
Meta Description:All,\
Meta Keywords:All,\
H1:All,\
H2:All,\
Content:All,\
Images:All,\
Canonicals:All,\
Pagination:All,\
Directives:All,\
Hreflang:All,\
AJAX:All,\
AMP:All,\
Structured Data:All,\
Sitemaps:All,\
PageSpeed:All,\
Custom Search:All,\
Custom Extraction:All,\
Analytics:All,\
Search Console:All,\
Link Metrics:All"
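Typing that list by hand is error prone, so it can help to generate the --export-tabs value from a list of tab names. The sketch below does that, and also guesses the CSV filename for each tab. The filename rule is an assumption extrapolated from Internal:All producing internal_all.csv, so check it against the files your own crawl writes. Both helper names are hypothetical.

```python
def export_tabs_argument(tabs, filter_name="All"):
    """Join tab names into the Tab Name:Filter list that --export-tabs expects."""
    return ",".join(f"{tab}:{filter_name}" for tab in tabs)


def guessed_csv_name(tab, filter_name="All"):
    """Guess the export filename: lowercase, with spaces replaced by underscores.

    Assumption: extrapolated from Internal:All producing internal_all.csv.
    """
    return f"{tab}_{filter_name}.csv".lower().replace(" ", "_")


# A few of the tab names used in the command above.
tabs = ["Internal", "External", "Response Codes", "Page Titles"]
argument = export_tabs_argument(tabs)
expected_files = [guessed_csv_name(tab) for tab in tabs]
```

The resulting argument string can be passed as a single value after --export-tabs, and expected_files gives a list of names to look for in the timestamped output folder.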

Matt Clarke, Thursday, March 11, 2021

Matt Clarke is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
