Internal linking helps improve the user experience by recommending related content to users, which both reduces bounce rate and helps search engine optimisation efforts. While there are no hard and fast rules on how many internal links each post ought to include, it’s sensible to include them whenever they’re relevant.
It pays to go over your internal links periodically, because newly added content creates new opportunities to connect related content together. In this project, we’ll use web scraping to crawl this website and identify all the internal and external links in each post to help see where the gaps are.
For this project we’ll be using some standard Python libraries, plus a couple of more specialist ones. For the web scraping component, we’re using Requests-HTML, which is built on top of the powerful BeautifulSoup package, while for the sitemap crawling we’re using my EcommerceTools package. Open a Jupyter notebook and enter the commands below to install the packages.
!pip3 install requests_html
!pip3 install ecommercetools
To load the packages, enter the import statements below into a cell in your Jupyter notebook. You may wish to pass in a couple of additional Pandas set_option() commands so Jupyter doesn’t truncate the columns, allowing you to scan long URLs more easily.
import requests
import urllib
from urllib.parse import urlparse
import pandas as pd
from requests_html import HTML
from requests_html import HTMLSession
from ecommercetools import seo
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_colwidth', 1000)
Before we can run our crawler, we first need to create a list of site URLs. The easiest way to do this is via EcommerceTools, using the get_sitemap() function. This takes the URL of the XML sitemap and returns a Pandas dataframe of URLs. Since we only need the URL, stored in the loc column, we can filter out the rest.
df = seo.get_sitemap("https://www.practicaldatascience.co.uk/sitemap.xml")
df = df[['loc']].head(100)
df.head()
| | loc |
|---|---|
| 0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas |
| 1 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter |
| 2 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas |
| 3 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset |
| 4 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix |
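EcommerceTools handles this in one call, but if you’d rather avoid the dependency, a sitemap can also be parsed with nothing but the Python standard library. A minimal sketch, using an inline XML string in place of a fetched sitemap (the namespace URI is defined by the sitemap protocol):

```python
import xml.etree.ElementTree as ET

# A tiny inline stand-in for a real sitemap.xml file
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://practicaldatascience.co.uk/page-one</loc></url>
  <url><loc>https://practicaldatascience.co.uk/page-two</loc></url>
</urlset>"""

# Sitemap elements live in the sitemap protocol namespace, so we map a
# prefix to it and use that prefix in the findall() path expression
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

For a live site you would fetch the XML with requests first; the parsing step is the same.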
In the next steps we’ll create a bunch of helper functions that we can use in a final function to scrape each of the URLs in the sitemap dataframe generated above. The first function, get_source(), uses a [try except block](/data-science/how-to-try-except-for-python-exception-handling) and takes the URL of a page from the sitemap and fetches the page source code. We can pass this source code to our other functions.
def get_source(url):
    """Return the source code for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)
It will be useful to see the page title, so we’ll create a get_title() function to parse the title out of the page. This uses the html.find() function from Requests-HTML, locates the first <title> tag it finds (there should only be one), and then extracts the text from within the tags.
def get_title(response):
    return response.html.find('title', first=True).text
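The same extraction can be sketched with the standard library’s html.parser when you already have the raw HTML as a string and want to avoid the extra dependency. This is an alternative to the Requests-HTML approach above, not part of the original project:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only capture character data while inside the <title> tags
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>How to scrape a site</title></head></html>")
print(parser.title)
```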
Typical web pages have links in the header, footer, sidebar, and body of the article. In this project we’re only interested in the links within the body of the article, so we’ll create a function to extract these. This uses the Requests-HTML html.find() function to look for the article-post class, which contains the article body, and then returns all the anchor a tags present in a list. We’ll loop through these in a later step.
def get_post_links(response):
    return response.html.find('.article-post a')
The other thing we’ll want to know is whether a link is internal or external. Internal links can come in two formats: relative URLs, which lack a protocol and domain (e.g. /about), and absolute URLs, which start with the site’s own domain (e.g. https://practicaldatascience.co.uk/about). The function below returns True or False depending on whether a link matches either of these criteria.
def is_internal(url, domain):
    if not bool(urlparse(url).netloc):
        return True
    elif url.startswith(domain):
        return True
    else:
        return False
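To check the behaviour, the function can be exercised against a few representative URLs. It is repeated here so the snippet is self-contained; relative URLs have an empty netloc, while absolute URLs only count as internal when they start with the site’s domain.

```python
from urllib.parse import urlparse

def is_internal(url, domain):
    if not bool(urlparse(url).netloc):
        return True   # relative URL, e.g. /about
    elif url.startswith(domain):
        return True   # absolute URL on our own domain
    else:
        return False  # absolute URL on another domain

domain = "https://practicaldatascience.co.uk"
print(is_internal("/about", domain))                                    # True
print(is_internal("https://practicaldatascience.co.uk/about", domain))  # True
print(is_internal("https://pypi.org/project/gapandas/", domain))        # False
```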
Finally, we can put these helper functions together. The scrape_links() function takes the dataframe of URLs from our sitemap, the name of the column containing the URL (i.e. loc), and the domain name of the site. It loops through each URL, scrapes the page, and parses the content, returning a Pandas dataframe of results. If it finds any links, it records the page linked to, the text used for the link, and whether the link was internal or external.
def scrape_links(df, url_column, domain):
    """Scrape each URL in df and return one row per link found.

    DataFrame.append() was deprecated and later removed from Pandas,
    so rows are collected in a list and converted in a single step.
    """

    rows = []
    for index, row in df.iterrows():
        response = get_source(row[url_column])
        data = {'url': row[url_column], 'title': get_title(response)}
        post_links = get_post_links(response)
        if post_links:
            for link in post_links:
                rows.append({**data,
                             'post_link_href': link.attrs['href'],
                             'post_link_text': link.text,
                             'internal': is_internal(link.attrs['href'], domain)})
        else:
            # Record the page even when it contains no links
            rows.append(data)
    return pd.DataFrame(rows, columns=['url', 'title', 'post_link_href',
                                       'post_link_text', 'internal'])
Running the function takes a few minutes, depending on the size of the site, and returns a Pandas dataframe with all the data we need to analyse the site’s internal and external linking on a per-page basis. As you can see, some pages have multiple external links, some have multiple internal links, and some have no links whatsoever.
df_pages = scrape_links(df, 'loc', 'https://practicaldatascience.co.uk')
df_pages.head(10)
| | url | title | post_link_href | post_link_text | internal |
|---|---|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False |
| 1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False |
| 2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False |
| 3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False |
| 4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN |
| 5 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | How to engineer date features using Pandas | https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects | Pandas documentation | False |
| 6 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset | How to impute missing numeric values in your dataset | NaN | NaN | NaN |
| 7 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | How to interpret the confusion matrix | NaN | NaN | NaN |
| 8 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-learning-models | How to use mean encoding in your machine learning models | NaN | NaN | NaN |
| 9 | https://practicaldatascience.co.uk/data-science/how-to-use-python-regular-expressions-to-extract-information | How to use Python regular expressions to extract information | NaN | NaN | NaN |
To examine the pages with internal links, we can simply filter the dataframe on whether the internal column is set to True. Running the count() and nunique() functions on the url column of the filtered dataframe shows us we have 45 internal links spread across just 15 unique pages. Given that there are currently 175 articles on the site, I’m clearly not linking to many of them often enough.
df_internal = df_pages[df_pages['internal']==True]
df_internal.head()
| | url | title | post_link_href | post_link_text | internal |
|---|---|---|---|---|---|
| 31 | https://practicaldatascience.co.uk/data-science/how-to-use-the-apriori-algorithm-for-market-basket-analysis | How to use the Apriori algorithm for Market Basket Analysis | https://practicaldatascience.co.uk/python/accessing-google-analytics-data-in-pandas | here | True |
| 45 | https://practicaldatascience.co.uk/machine-learning/a-quick-guide-to-product-attribute-extraction-models | A quick guide to Product Attribute Extraction models | /data-science/how-to-scrape-json-ld-competitor-reviews-using-extruct | check out my tutorial | True |
| 46 | https://practicaldatascience.co.uk/data-science/dell-precision-7750-mobile-data-science-workstation-review | Dell Precision 7750 mobile data science workstation review | /data-science/how-to-install-the-nvidia-data-science-stack-on-ubuntu-20-04 | NVIDIA Data Science Stack | True |
| 59 | https://practicaldatascience.co.uk/machine-learning/how-to-use-nlp-to-identify-what-drives-customer-satisfaction | How to use NLP to identify what drives customer satisfaction | /machine-learning/how-to-a-create-a-neural-network-for-sentiment-analysis | Long Short-Term Memory recurrent neural network | True |
| 61 | https://practicaldatascience.co.uk/machine-learning/ecommerce-and-marketing-data-sets-for-machine-learning-projects | Ecommerce and marketing data sets for machine learning | /data-science/how-to-group-and-aggregate-transactional-data-using-pandas | used to construct secondary datasets | True |
df_internal['url'].count()
45
df_internal['url'].nunique()
15
We can use the same approach to find external links, but instead filter the internal column on those rows where the value is False. That gives us 123 external links spread across 36 different pages.
df_external = df_pages[df_pages['internal']==False]
df_external.head()
| | url | title | post_link_href | post_link_text | internal |
|---|---|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False |
| 1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False |
| 2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False |
| 3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False |
| 5 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | How to engineer date features using Pandas | https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects | Pandas documentation | False |
df_external['url'].count()
123
df_external['url'].nunique()
36
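A useful extension, not covered in the original analysis, is counting which external domains you link to most often; heavy reliance on one domain is worth knowing about. A sketch using a small hand-made list of hrefs in place of the real post_link_href column:

```python
import pandas as pd
from urllib.parse import urlparse

# Stand-in for df_external['post_link_href']
hrefs = pd.Series([
    "https://pypi.org/project/gapandas/",
    "https://pypi.org/project/ecommercetools/",
    "https://analytics.google.com/analytics/web/",
])

# Reduce each URL to its domain, then count occurrences per domain
domain_counts = hrefs.apply(lambda u: urlparse(u).netloc).value_counts()
print(domain_counts)
```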
From the data above, it’s obvious that some pages don’t have any internal links at all. By filtering out the rows where internal is True and using nunique() again, we can see that 91 pages are affected. I’d be able to reduce my bounce rate, and improve the user experience, by going over these pages and adding some helpful links to related content from within each post.
df_no_internal_links = df_pages[df_pages['internal']!=True]
df_no_internal_links.head()
| | url | title | post_link_href | post_link_text | internal |
|---|---|---|---|---|---|
| 0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False |
| 1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False |
| 2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False |
| 3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False |
| 4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN |
df_no_internal_links['url'].nunique()
91
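One caveat: the internal != True filter also keeps rows for external links, so a page that has both internal and external links still appears in df_no_internal_links. A stricter per-page check groups by URL and keeps only the pages where no row is an internal link. A sketch, using a small synthetic dataframe in place of the real df_pages:

```python
import pandas as pd

# Synthetic stand-in for df_pages: /a has both link types, /b has only
# an external link, and /c has no links at all (internal is None/NaN)
df_pages = pd.DataFrame({
    'url': ['/a', '/a', '/b', '/c'],
    'internal': [True, False, False, None],
})

# A page has no internal links when none of its rows has internal == True
no_internal = df_pages.groupby('url')['internal'].apply(
    lambda s: not (s == True).any()
)
pages_without_internal = no_internal[no_internal].index.tolist()
print(pages_without_internal)  # /a is excluded; /b and /c remain
```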
Finally, we can do the same thing but seek out only those pages that have no links at all - internal or external - by using isnull() to show the pages where the internal column contains a NaN value. There are 55 pages affected here.
df_no_links = df_pages[df_pages['internal'].isnull()]
df_no_links.head()
| | url | title | post_link_href | post_link_text | internal |
|---|---|---|---|---|---|
| 4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN |
| 6 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset | How to impute missing numeric values in your dataset | NaN | NaN | NaN |
| 7 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | How to interpret the confusion matrix | NaN | NaN | NaN |
| 8 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-learning-models | How to use mean encoding in your machine learning models | NaN | NaN | NaN |
| 9 | https://practicaldatascience.co.uk/data-science/how-to-use-python-regular-expressions-to-extract-information | How to use Python regular expressions to extract information | NaN | NaN | NaN |
df_no_links['url'].nunique()
55
On their own, these data are quite useful, because they allow you to identify which pages have links and which do not, so you can add some and try to reduce your bounce rate. You could manually go through the posts to identify potential internal linking opportunities. However, in the next part of this article, I’ll show you how to identify them using Python to speed up the process.
Matt Clarke, Sunday, May 09, 2021