How to parse URL structures using Python

When analysing web data, it’s common to need to parse URLs and extract the domain, directories, query string and other components. Here’s how it’s done.


URLs often contain useful information that can be used to analyse a website, a user’s search, or the breakdown of content present in each section. While they often look pretty complicated, Python includes a number of useful packages to allow you to parse URLs and extract components from within.

In this project, we’ll use urllib to parse a set of URLs from a sitemap and extract various elements from them, including the scheme, the domain (or netloc), the path, the query string, the fragment (or anchor), and the directory structure. Here’s how it’s done.

Load the packages

First, open a Jupyter notebook or a Python script, and import pandas and the urlparse function from urllib.parse. The urllib package is part of the Python standard library, and pandas can be installed with pip if you don’t already have it.

import pandas as pd
from urllib.parse import urlparse

Extract the main parts from a URL

Next we’ll use the urlparse() function to return the component parts of a regular URL. The ParseResult returned contains the scheme (e.g. http or https); the netloc or domain (e.g. flyandlure.org); the path containing the directory structure or file path; any params attached to the last path segment; the query string, if there is one; and the fragment or anchor, if present.

url = "http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word&b=something#anchor"
parts = urlparse(url)
parts
ParseResult(scheme='http', 
netloc='flyandlure.org', 
path='/articles/fly_fishing/fly_fishing_diary_july_2020', 
params='', 
query='q=word&b=something', 
fragment='anchor')
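Because ParseResult is a named tuple, each component can be read by attribute or by position, and the object also exposes convenience attributes such as hostname and port. The snippet below is a minimal sketch of this; the URL with a port number is made up purely for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port, used only for illustration
example_url = "https://www.example.com:8080/articles/page?q=word#anchor"
example = urlparse(example_url)

print(example.scheme)     # https
print(example.netloc)     # www.example.com:8080
print(example.hostname)   # www.example.com
print(example.port)       # 8080
print(example[2])         # /articles/page (the path, read by position)
print(example.geturl())   # rebuilds the URL string from its parts
```

Note that netloc includes the port when one is present, so hostname is usually the better choice when you only want the domain.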

Split up the path and return the directories

The urlparse() function gives us most of what we need, but the path is inconvenient to work with because it’s provided as one continuous string. It would be better if this were broken up into its directories or files. We can do that with strip('/'), which removes the leading and trailing slashes, and split('/'), which turns the remaining string into a list of component directories.

directories = parts.path.strip('/').split('/')
directories
['articles', 'fly_fishing', 'fly_fishing_diary_july_2020']
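One edge case worth knowing about: for a root URL the path is empty or just a slash, and strip('/').split('/') then returns a list containing a single empty string rather than an empty list. If that matters for your analysis, one possible approach is to filter out empty segments. The helper name split_path() below is hypothetical, not part of urllib.

```python
from urllib.parse import urlparse

def split_path(url):
    """Return the non-empty path segments of a URL as a list."""
    path = urlparse(url).path
    # Dropping empty strings copes with root URLs, trailing slashes
    # and accidental double slashes
    return [segment for segment in path.split('/') if segment]

print(split_path("http://flyandlure.org/articles/fly_fishing/"))  # ['articles', 'fly_fishing']
print(split_path("http://flyandlure.org/"))                       # []
```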

Return all URL components

To combine the two approaches, we can create a small function called url_parser(). Passing a URL to it returns the component parts, plus the directories and queries broken into chunks, all merged into a Python dictionary that we can manipulate easily. Note that a URL with no query string returns [''] for queries, since splitting an empty string yields a list containing one empty string.

def url_parser(url):
    
    parts = urlparse(url)
    directories = parts.path.strip('/').split('/')
    queries = parts.query.strip('&').split('&')
    
    elements = {
        'scheme': parts.scheme,
        'netloc': parts.netloc,
        'path': parts.path,
        'params': parts.params,
        'query': parts.query,
        'fragment': parts.fragment,
        'directories': directories,
        'queries': queries,
    }
    
    return elements
elements = url_parser(url)
elements
{'scheme': 'http',
 'netloc': 'flyandlure.org',
 'path': '/articles/fly_fishing/fly_fishing_diary_july_2020',
 'params': '',
 'query': 'q=word&b=something',
 'fragment': 'anchor',
 'directories': ['articles', 'fly_fishing', 'fly_fishing_diary_july_2020'],
 'queries': ['q=word', 'b=something']}
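If you need the query parameters as key and value pairs rather than as 'key=value' strings, urllib.parse also provides parse_qs(). The sketch below shows one way it could be used alongside urlparse(); the query_to_dict() name and the Google search URL are illustrative assumptions, not part of the function above.

```python
from urllib.parse import urlparse, parse_qs

def query_to_dict(url):
    """Return the query string of a URL as a dict of parameter -> list of values."""
    parts = urlparse(url)
    # keep_blank_values=True retains parameters such as 'utm_campaign='
    return parse_qs(parts.query, keep_blank_values=True)

url = "http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word&b=something#anchor"
print(query_to_dict(url))  # {'q': ['word'], 'b': ['something']}

# Made-up search URL: parse_qs also decodes '+' into spaces
print(query_to_dict("https://www.google.com/search?q=fly+fishing+wales")['q'])  # ['fly fishing wales']
```

Each value is a list because the same parameter can appear more than once in a query string. A URL with no query string returns an empty dictionary.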

Test a selection of URLs

To test it out, we can loop through a list of URLs and print the output for each one. These are a mixture of Google search queries and ecommerce website URLs, so they include a mixture of directories and queries to parse.

urls = [
    'https://www.google.com/search?q=how+to+dispose+of+a+corpse&oq=how+to+dispose+of+a+corpse&aqs=chrome..69i57j69i64.4925j1j7&sourceid=chrome&ie=UTF-8',
    'https://tales.as/101-things-to-do-with-a-dead-body_9780997711639?utm_source=google-shopping&utm_medium=cpc&utm_campaign=',
    'https://www.worldofbooks.com/en-gb/books/hugh-whitemore/disposing-of-the-body/9781872868271#NGR9781872868271',
    'https://www.google.com/search?q=where+to+buy+hydrofluoric+acid&oq=where+to+buy+hydrofl&aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7&sourceid=chrome&ie=UTF-8'
]
for url in urls:
    print(url_parser(url))
    print('-------------------------------\n')
{'scheme': 'https', 'netloc': 'www.google.com', 'path': '/search', 'params': '', 'query': 'q=how+to+dispose+of+a+corpse&oq=how+to+dispose+of+a+corpse&aqs=chrome..69i57j69i64.4925j1j7&sourceid=chrome&ie=UTF-8', 'fragment': '', 'directories': ['search'], 'queries': ['q=how+to+dispose+of+a+corpse', 'oq=how+to+dispose+of+a+corpse', 'aqs=chrome..69i57j69i64.4925j1j7', 'sourceid=chrome', 'ie=UTF-8']}
-------------------------------

{'scheme': 'https', 'netloc': 'tales.as', 'path': '/101-things-to-do-with-a-dead-body_9780997711639', 'params': '', 'query': 'utm_source=google-shopping&utm_medium=cpc&utm_campaign=', 'fragment': '', 'directories': ['101-things-to-do-with-a-dead-body_9780997711639'], 'queries': ['utm_source=google-shopping', 'utm_medium=cpc', 'utm_campaign=']}
-------------------------------

{'scheme': 'https', 'netloc': 'www.worldofbooks.com', 'path': '/en-gb/books/hugh-whitemore/disposing-of-the-body/9781872868271', 'params': '', 'query': '', 'fragment': 'NGR9781872868271', 'directories': ['en-gb', 'books', 'hugh-whitemore', 'disposing-of-the-body', '9781872868271'], 'queries': ['']}
-------------------------------

{'scheme': 'https', 'netloc': 'www.google.com', 'path': '/search', 'params': '', 'query': 'q=where+to+buy+hydrofluoric+acid&oq=where+to+buy+hydrofl&aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7&sourceid=chrome&ie=UTF-8', 'fragment': '', 'directories': ['search'], 'queries': ['q=where+to+buy+hydrofluoric+acid', 'oq=where+to+buy+hydrofl', 'aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7', 'sourceid=chrome', 'ie=UTF-8']}
-------------------------------
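Once a list of URLs can be parsed in a loop, simple summaries become easy to produce without any extra packages. As a rough sketch, the snippet below counts how often each domain appears using collections.Counter. The sample_urls list is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up URLs, used only for illustration
sample_urls = [
    'https://www.google.com/search?q=fly+fishing',
    'https://www.google.com/search?q=trout+flies',
    'https://www.example.com/articles/fly_fishing',
]

# Count the number of URLs seen for each netloc
domain_counts = Counter(urlparse(u).netloc for u in sample_urls)
print(domain_counts.most_common())  # [('www.google.com', 2), ('www.example.com', 1)]
```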

Create a dataframe of URL structure components

Finally, since it’s generally easier to work with such data in Pandas, we can create a function to iterate over the URLs in a Pandas sitemap dataframe and return a new dataframe containing the URL components and parameters.

def url_components_to_df(df, url='url'):
    """Parses a dataframe of URLs and returns a dataframe of URL components.
    
    Args:
        df (object): Pandas dataframe containing URLs.
        url (string, optional): Name of the column containing the URLs, if not 'url'.
        
    Returns:
        df (object): Pandas dataframe containing URL components. 
    """
    
    columns = ['scheme', 'netloc', 'path', 
               'params', 'query', 'fragment',
               'directories', 'queries']
    
    # Parse every URL in the chosen column. Collecting the dictionaries in a
    # list and building the dataframe once avoids DataFrame.append(), which
    # was removed in pandas 2.0.
    pages = [url_parser(value) for value in df[url]]
    
    return pd.DataFrame(pages, columns=columns)

If we load up a dataframe containing our URLs, which are stored in a column called url, we can pass that to our function, and it will return a new dataframe containing all the directory structure components identified. We can then analyse the data from within Pandas!

df_output = url_components_to_df(df_sitemap)
df_output.tail()
      scheme  netloc          path                                               params  query  fragment  directories                                        queries
1669  http    flyandlure.org  /listings/places_to_fly_fish/wales/wrexham/lla...                           [listings, places_to_fly_fish, wales, wrexham,...  []
1670  http    flyandlure.org  /listings/places_to_fly_fish/wales/wrexham/pen...                           [listings, places_to_fly_fish, wales, wrexham,...  []
1671  http    flyandlure.org  /listings/places_to_fly_fish/wales/wrexham/pen...                           [listings, places_to_fly_fish, wales, wrexham,...  []
1672  http    flyandlure.org  /listings/places_to_fly_fish/wales/wrexham/tre...                           [listings, places_to_fly_fish, wales, wrexham,...  []
1673  http    flyandlure.org  /listings/places_to_fly_fish/wales/wrexham/ty_...                           [listings, places_to_fly_fish, wales, wrexham,...  []
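As an example of the kind of analysis this enables, the sketch below derives the top-level section and the directory depth of each URL from the directories column, then counts the URLs in each section. It uses a small made-up dataframe, df_example, as a stand-in for df_output, and the section and depth column names are assumptions chosen for illustration.

```python
import pandas as pd

# Small stand-in for df_output; the rows are made up for illustration
df_example = pd.DataFrame({
    'path': ['/articles/fly_fishing/diary',
             '/listings/places/wales',
             '/listings/places/england'],
    'directories': [['articles', 'fly_fishing', 'diary'],
                    ['listings', 'places', 'wales'],
                    ['listings', 'places', 'england']],
})

# First directory of each URL, i.e. the top-level section of the site
df_example['section'] = df_example['directories'].str[0]

# Number of directory levels in each URL
df_example['depth'] = df_example['directories'].str.len()

# Number of URLs in each top-level section
print(df_example['section'].value_counts())
```

The .str accessor works here because it can index into and measure the lists stored in each cell, not just strings.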

Matt Clarke, Friday, March 12, 2021

Matt Clarke Matt is a Digital Director who uses data science to help in his work. He has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.
