How to identify internal and external links using Python

Learn how to identify internal and external links through web scraping in Python and help identify internal linking opportunities and improve SEO.


Internal linking helps improve the user experience by recommending related content to users, which can both reduce bounce rate and help search engine optimisation efforts. While there are no hard and fast rules on how many internal links each post ought to include, it’s sensible to include them when they’re relevant.

It pays to go over your internal links periodically, because newly added content creates new opportunities to connect related content together. In this project, we’ll use web scraping to crawl this website and identify all the internal and external links in each post to help see where the gaps are.

Install the packages

For this project we’ll be using some standard Python libraries, plus a couple of more specialist ones. For the web scraping component, we’re using Requests-HTML, which combines Requests with HTML parsing libraries including PyQuery, lxml and BeautifulSoup, while for the sitemap crawling we’re using my EcommerceTools package. Open a Jupyter notebook and enter the commands below to install the packages.

!pip3 install requests_html
!pip3 install ecommercetools

Load the packages

To load the packages, enter the below import statements into a cell in your Jupyter notebook. You may wish to pass in a couple of additional Pandas set_option() commands so Jupyter doesn’t truncate the columns, allowing you to scan long URLs more easily.

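import requests
from urllib.parse import urlparse
import pandas as pd
from requests_html import HTMLSession
from ecommercetools import seo

# Stop Jupyter truncating rows and long URLs. The full option names are needed
# because the short form 'max_rows' is ambiguous in recent versions of Pandas.
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_colwidth', 1000)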

Scrape the URLs from the sitemap

Before we can run our crawler, we first need to create a list of site URLs. The easiest way to do this is via EcommerceTools, using the get_sitemap() function. This takes the URL of the XML sitemap and returns a Pandas dataframe of URLs. Since we only need the URL, stored in the loc column, we can filter out the rest. To keep the crawl short, the code also limits the dataframe to the first 100 URLs.

df = seo.get_sitemap("https://www.practicaldatascience.co.uk/sitemap.xml")
df = df[['loc']].head(100)
df.head()
loc
0 https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas
1 https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter
2 https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas
3 https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset
4 https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix
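
If you’re curious what pulling URLs out of a sitemap involves, here is a minimal sketch that uses only the standard library. It is not how EcommerceTools implements get_sitemap(); the function name extract_sitemap_urls() and the sample XML are assumptions made purely for illustration, and the XML is inlined so the example runs without a network call.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element found inside a <url> entry."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Hypothetical sitemap content with two pages.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/data-science/post-one</loc></url>
  <url><loc>https://example.com/machine-learning/post-two</loc></url>
</urlset>"""

urls = extract_sitemap_urls(sample_sitemap)
print(urls)
```

The resulting list could be passed to pd.DataFrame(urls, columns=['loc']) to get the same single-column shape used in the rest of this project.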

Scrape the source of each page

In the next steps we’ll create a bunch of helper functions that we can use in a final function to scrape each of the URLs in the sitemap dataframe generated above. The first function, get_source(), takes the URL of a page from the sitemap and fetches the page source code. We can pass this source code to our other functions.

def get_source(url):
    """Return the response for the provided URL.

    Args:
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html,
        or None if the request failed.
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        # Report the error and signal the failure to the caller.
        print(e)
        return None

Scrape the page title

It will be useful to see the page title, so we’ll create a get_title() function to parse the title out of the page. This uses the html.find() function from Requests-HTML to locate the first title element it finds (there should only be one), and then extracts the text from within the <title> tags.

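def get_title(response):
    """Return the text of the page's <title> element, or None if there isn't one."""
    title = response.html.find('title', first=True)
    return title.text if title is not None else None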

Typical web pages have links in the header, footer, sidebar, and body of the article. In this project we’re only interested in the links within the body of the article, so we’ll create a function to extract these. This uses the Requests-HTML html.find() function to look for the class article-post, which contains the article body, and returns a list of all the anchor (a) tags inside it. We’ll loop through these in a later step.

def get_post_links(response):
    return response.html.find('.article-post a')
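
To make it clearer what a selector like '.article-post a' picks out, here is a rough standard-library sketch that collects only the anchors nested inside an element with a given class. It is not how Requests-HTML works internally; the ArticleLinkParser class and the sample HTML are hypothetical, and the depth counting assumes well-nested markup in which every non-void tag is explicitly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ArticleLinkParser(HTMLParser):
    """Collect (href, text) pairs for anchors inside a container class."""

    def __init__(self, container_class="article-post"):
        super().__init__()
        self.container_class = container_class
        self.depth = 0        # nesting depth inside the container (0 = outside)
        self.current = None   # anchor being read, stored as [href, text parts]
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            # Already inside the container: track nesting and watch for anchors.
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.current = [attrs["href"], []]
        elif self.container_class in (attrs.get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "a" and self.current:
            href, parts = self.current
            self.links.append((href, "".join(parts).strip()))
            self.current = None
        self.depth -= 1

    def handle_data(self, data):
        if self.current:
            self.current[1].append(data)


# Hypothetical page: only the two links in the article body should be returned.
sample_html = """
<nav><a href="/about">About</a></nav>
<div class="article-post">
  <p>Read the <a href="/data-science/related-post">related post</a><br>
  or the <a href="https://pandas.pydata.org/">Pandas docs</a>.</p>
</div>
<footer><a href="/contact">Contact</a></footer>
"""

parser = ArticleLinkParser()
parser.feed(sample_html)
print(parser.links)
```

The navigation and footer links are ignored because they sit outside the article-post container, which is the behaviour we want from get_post_links().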

The other thing we’ll want to know is whether a link is internal or external. Internal links can come in two formats: relative URLs, which lack a protocol and domain (e.g. /about), and absolute URLs, which use the site’s domain (e.g. https://practicaldatascience.co.uk). The function below returns True or False depending on whether a link matches these criteria or not.

def is_internal(url, domain):
    """Return True for relative URLs and for absolute URLs that start with domain."""
    if not urlparse(url).netloc:
        return True
    return url.startswith(domain)
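
One limitation of a prefix check is that a URL such as https://practicaldatascience.co.uk.evil.example/ also starts with the site’s domain, and links like mailto: addresses have no hostname, so they look relative. As a hedged alternative, the sketch below compares hostnames instead. The function name is_internal_link() is hypothetical, and treating the www and non-www hosts as the same site is an assumption you may not want on every site.

```python
from urllib.parse import urlparse


def _strip_www(host):
    """Drop a leading 'www.' so both forms of a hostname compare equal."""
    return host[4:] if host.startswith("www.") else host


def is_internal_link(url, domain):
    """Classify a link as internal by comparing hostnames, not string prefixes."""
    parsed = urlparse(url)

    # Schemes such as mailto: or tel: are not links to pages at all.
    if parsed.scheme and parsed.scheme not in ("http", "https"):
        return False

    # No hostname means a relative URL such as /about or #section.
    if not parsed.netloc:
        return True

    # Accept the domain with or without a scheme (e.g. https://example.com or example.com).
    site_host = urlparse(domain).hostname or domain
    link_host = parsed.hostname or ""
    return _strip_www(link_host.lower()) == _strip_www(site_host.lower())


site = "https://practicaldatascience.co.uk"
for candidate in [
    "/about",
    "https://www.practicaldatascience.co.uk/data-science",
    "https://practicaldatascience.co.uk.evil.example/",
    "mailto:someone@example.com",
]:
    print(candidate, is_internal_link(candidate, site))
```

This could be swapped in for is_internal() without changing the arguments passed by the rest of the code.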

Scrape the pages

Finally, we can put these helper functions together. The scrape_links() function takes the dataframe of URLs from our sitemap, the name of the column containing the URL (i.e. loc), and the domain name of the site. It creates a Pandas dataframe called df_pages, loops through each URL, scrapes the page, and parses the content. If it finds any links it will add the information on the page linked to, the text used for the link, and whether the link was internal or external.

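def scrape_links(df, url_column, domain):
    """Scrape every URL in df and return one row per link found in the post body."""

    columns = ['url', 'title', 'post_link_href', 'post_link_text', 'internal']
    rows = []

    for index, row in df.iterrows():
        url = row[url_column]
        response = get_source(url)
        if response is None:
            # The request failed (get_source() has printed the error), so skip this page.
            continue

        data = {'url': url, 'title': get_title(response)}

        # Ignore anchors that have no href attribute.
        post_links = [link for link in get_post_links(response) if 'href' in link.attrs]
        if post_links:
            for link in post_links:
                rows.append({
                    **data,
                    'post_link_href': link.attrs['href'],
                    'post_link_text': link.text,
                    'internal': is_internal(link.attrs['href'], domain),
                })
        else:
            # Keep a row for pages without links; the link columns will be NaN.
            rows.append(data)

    # DataFrame.append() was removed in Pandas 2.0, so build the dataframe from a list of rows.
    df_pages = pd.DataFrame(rows, columns=columns)
    return df_pages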

Running the function takes a few minutes, depending on the size of the site, and returns a Pandas dataframe with all the data we need to analyse the site’s internal and external linking on a per page basis. As you can see, some pages have multiple external links, some have multiple internal links, and some have no links whatsoever.

df_pages = scrape_links(df, 'loc', 'https://practicaldatascience.co.uk')
df_pages.head(10)
  | url | title | post_link_href | post_link_text | internal
0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False
1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False
2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False
3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False
4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN
5 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | How to engineer date features using Pandas | https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects | Pandas documentation | False
6 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset | How to impute missing numeric values in your dataset | NaN | NaN | NaN
7 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | How to interpret the confusion matrix | NaN | NaN | NaN
8 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-learning-models | How to use mean encoding in your machine learning models | NaN | NaN | NaN
9 | https://practicaldatascience.co.uk/data-science/how-to-use-python-regular-expressions-to-extract-information | How to use Python regular expressions to extract information | NaN | NaN | NaN

To examine the internal links, we can filter the dataframe to the rows where the internal column is set to True. Running the count() and nunique() functions on the url column of the result shows us we have 45 internal links spread across 15 unique pages. Note that url holds the page each link was found on, so the 15 counts the pages that contain internal links, not the pages being linked to. Given that there are currently 175 articles on the site, I’m clearly not using internal links often enough.

df_internal = df_pages[df_pages['internal']==True]
df_internal.head()
   | url | title | post_link_href | post_link_text | internal
31 | https://practicaldatascience.co.uk/data-science/how-to-use-the-apriori-algorithm-for-market-basket-analysis | How to use the Apriori algorithm for Market Basket Analysis | https://practicaldatascience.co.uk/python/accessing-google-analytics-data-in-pandas | here | True
45 | https://practicaldatascience.co.uk/machine-learning/a-quick-guide-to-product-attribute-extraction-models | A quick guide to Product Attribute Extraction models | /data-science/how-to-scrape-json-ld-competitor-reviews-using-extruct | check out my tutorial | True
46 | https://practicaldatascience.co.uk/data-science/dell-precision-7750-mobile-data-science-workstation-review | Dell Precision 7750 mobile data science workstation review | /data-science/how-to-install-the-nvidia-data-science-stack-on-ubuntu-20-04 | NVIDIA Data Science Stack | True
59 | https://practicaldatascience.co.uk/machine-learning/how-to-use-nlp-to-identify-what-drives-customer-satisfaction | How to use NLP to identify what drives customer satisfaction | /machine-learning/how-to-a-create-a-neural-network-for-sentiment-analysis | Long Short-Term Memory recurrent neural network | True
61 | https://practicaldatascience.co.uk/machine-learning/ecommerce-and-marketing-data-sets-for-machine-learning-projects | Ecommerce and marketing data sets for machine learning | /data-science/how-to-group-and-aggregate-transactional-data-using-pandas | used to construct secondary datasets | True
df_internal['url'].count()
45
df_internal['url'].nunique()
15
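
If you also want to know how many different pages receive internal links, you need to count the link targets rather than the source pages. Since the same target can appear as both a relative and an absolute URL, one option is to resolve each href against the page it was found on before counting. The sketch below uses a small hypothetical dataframe shaped like df_internal; the example.com URLs and the target column name are assumptions for illustration.

```python
from urllib.parse import urljoin

import pandas as pd

# Hypothetical rows shaped like df_internal. The first two links point to the
# same page, once as a relative URL and once as an absolute URL.
df_demo = pd.DataFrame({
    'url': [
        'https://example.com/data-science/post-a',
        'https://example.com/machine-learning/post-b',
        'https://example.com/machine-learning/post-e',
    ],
    'post_link_href': [
        '/data-science/post-c',
        'https://example.com/data-science/post-c',
        '/data-science/post-d',
    ],
})

# Resolve each href against its source page so both URL forms compare equal.
df_demo['target'] = [
    urljoin(page, href)
    for page, href in zip(df_demo['url'], df_demo['post_link_href'])
]

linking_pages = df_demo['url'].nunique()     # pages that contain internal links
linked_pages = df_demo['target'].nunique()   # pages that receive internal links
print(linking_pages, linked_pages)
```

Counting post_link_href directly would treat the relative and absolute forms as different pages, which is why the hrefs are normalised first.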

We can use the same approach to find external links, this time keeping the rows where the internal column is False. That gives us 123 external links spread across 36 different pages.

df_external = df_pages[df_pages['internal']==False]
df_external.head()
  | url | title | post_link_href | post_link_text | internal
0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False
1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False
2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False
3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False
5 | https://practicaldatascience.co.uk/data-science/how-to-engineer-date-features-using-pandas | How to engineer date features using Pandas | https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects | Pandas documentation | False
df_external['url'].count()
123
df_external['url'].nunique()
36

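From the data above, it’s obvious that some pages don’t have any internal links at all. Filtering on internal != True and using nunique() again returns 91 pages, but this is an upper bound: the filter keeps every row that isn’t an internal link, so a page that contains both internal and external links is still counted. Since 15 of the 100 crawled pages contain internal links, the true figure is at most 85. Either way, I’d be able to reduce my bounce rate, and improve the user experience, by going over these pages and adding some helpful links to related content from within each post.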

df_no_internal_links = df_pages[df_pages['internal']!=True]
df_no_internal_links.head()
  | url | title | post_link_href | post_link_text | internal
0 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://ga-dev-tools.appspot.com/query-explorer/ | Query Explorer | False
1 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://pypi.org/project/gapandas/ | GAPandas | False
2 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://developers.google.com/analytics/devguides/config/mgmt/v3/quickstart/installed-py | how it’s done | False
3 | https://practicaldatascience.co.uk/data-science/how-to-access-google-analytics-data-in-pandas-using-gapandas | How to use GAPandas to view your Google Analytics data | https://analytics.google.com/analytics/web/ | analytics.google.com | False
4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN
df_no_internal_links['url'].nunique()
91
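
To isolate only the pages that contain no internal links at all, one approach is to list the pages that do have them and then exclude those URLs. The sketch below demonstrates this on a small hypothetical dataframe shaped like df_pages; the page paths are made up for illustration.

```python
import pandas as pd

# Hypothetical data: page-a has both link types, page-b has only an external
# link, and page-c has no links at all.
df_demo = pd.DataFrame({
    'url': ['/page-a', '/page-a', '/page-b', '/page-c'],
    'internal': [True, False, False, None],
})

# URLs of pages that contain at least one internal link.
has_internal = df_demo.loc[df_demo['internal'] == True, 'url'].unique()

# Keep only the rows for pages that never appear in that list.
df_no_internal = df_demo[~df_demo['url'].isin(has_internal)]

strict_count = df_no_internal['url'].nunique()
loose_count = df_demo[df_demo['internal'] != True]['url'].nunique()
print(strict_count, loose_count)
```

Here the row-level filter counts page-a because one of its rows is an external link, while the exclusion approach correctly leaves it out.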

Finally, we can do the same thing but seek out only those pages that have no links at all, internal or external, by using isnull() to select the rows where internal contains a NaN value. There are 55 pages affected here.

df_no_links = df_pages[df_pages['internal'].isnull()]
df_no_links.head()
  | url | title | post_link_href | post_link_text | internal
4 | https://practicaldatascience.co.uk/data-science/how-to-create-a-python-virtual-environment-for-jupyter | How to create a Python virtual environment for Jupyter | NaN | NaN | NaN
6 | https://practicaldatascience.co.uk/machine-learning/how-to-impute-missing-numeric-values-in-your-dataset | How to impute missing numeric values in your dataset | NaN | NaN | NaN
7 | https://practicaldatascience.co.uk/machine-learning/how-to-interpret-the-confusion-matrix | How to interpret the confusion matrix | NaN | NaN | NaN
8 | https://practicaldatascience.co.uk/machine-learning/how-to-use-mean-encoding-in-your-machine-learning-models | How to use mean encoding in your machine learning models | NaN | NaN | NaN
9 | https://practicaldatascience.co.uk/data-science/how-to-use-python-regular-expressions-to-extract-information | How to use Python regular expressions to extract information | NaN | NaN | NaN
df_no_links['url'].nunique()
55
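
Rather than running a separate filter for each question, you could also build a single per-page summary of internal and external link counts. This is only a sketch on hypothetical data shaped like df_pages; the internal_links and external_links column names are my own choice.

```python
import pandas as pd

# Hypothetical data: page-c has no links, so its internal value is missing.
df_demo = pd.DataFrame({
    'url': ['/page-a', '/page-a', '/page-a', '/page-b', '/page-c'],
    'internal': [True, False, False, False, None],
})

summary = (
    df_demo.assign(
        internal_links=df_demo['internal'].eq(True),    # True only for internal links
        external_links=df_demo['internal'].eq(False),   # True only for external links
    )
    .groupby('url')[['internal_links', 'external_links']]
    .sum()   # summing booleans gives the number of links of each type
)
print(summary)
```

Pages where both counts are zero have no links at all, and sorting on internal_links brings the posts most in need of attention to the top.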

Identifying internal linking opportunities

On their own, these data are quite useful, because they allow you to identify which pages have links and which do not, so you can add some and try to reduce your bounce rate.

You could manually go through the posts to identify potential internal linking opportunities. However, in the next part of this article I’ll show you how you can use Python to speed up the process.

Matt Clarke, Sunday, May 09, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
