URLs often contain useful information that can be used to analyse a website, a user’s search, or the breakdown of content present in each section. While they often look pretty complicated, Python includes a number of useful packages to allow you to parse URLs and extract components from within.
In this web scraping project, we'll be using urllib to parse a bunch of URLs from a sitemap and extract various elements from them, including the scheme, the domain (or netloc), the path, the fragment (or anchor), the query string, and the directory structure. Here's how it's done.
First, open up a Jupyter notebook, or a Python script, and import pandas and the urlparse module from urllib.parse. You'll likely have both of these pre-installed already.
import pandas as pd
from urllib.parse import urlparse
Next we'll use the urlparse() function to return the component parts of a regular URL. The ParseResult returned contains the scheme, or protocol (i.e. http or https); the domain, or netloc (i.e. flyandlure.org); the path containing the directory structure or file path; any params present; the query string, if there is one; and the fragment, which holds the anchor.
url = "http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word&b=something#anchor"
parts = urlparse(url)
parts
ParseResult(scheme='http',
netloc='flyandlure.org',
path='/articles/fly_fishing/fly_fishing_diary_july_2020',
params='',
query='q=word&b=something',
fragment='anchor')
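As an aside, the ParseResult is a named tuple, so each component can also be accessed by attribute, and its geturl() method reassembles the full URL from the parts:

```python
from urllib.parse import urlparse

url = "http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word&b=something#anchor"
parts = urlparse(url)

# ParseResult is a named tuple, so components are available as attributes
print(parts.scheme)  # http
print(parts.netloc)  # flyandlure.org

# geturl() reassembles the components back into the original URL
print(parts.geturl() == url)  # True
```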
The urlparse() function gives us most of the stuff we need, but the path is somewhat inconvenient to work with, as it's provided as a continuous string. It would be better if this were broken up into its component directories or files. We can do that by using strip() and split(), which take our long string and turn it into a list of component directories.
directories = parts.path.strip('/').split('/')
directories
['articles', 'fly_fishing', 'fly_fishing_diary_july_2020']
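One edge case to watch for, if your sitemap includes the site's root URL: the homepage path strips down to an empty string, so split() returns a list containing a single empty string rather than an empty list. A quick check illustrates this:

```python
from urllib.parse import urlparse

# The homepage has a path of just '/', so stripping and splitting yields ['']
root = urlparse('http://flyandlure.org/')
directories = root.path.strip('/').split('/')
print(directories)  # ['']
```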
To combine the two approaches, we can create a little function called url_parser(). Feeding the url to this returns the component parts, plus the directories and queries broken into chunks. The outputs are then merged together in a neat Python dictionary that we can manipulate easily.
def url_parser(url):
    """Return the component parts of a URL as a dictionary."""
    parts = urlparse(url)

    # Break the path and query string into their component chunks
    directories = parts.path.strip('/').split('/')
    queries = parts.query.strip('&').split('&')

    elements = {
        'scheme': parts.scheme,
        'netloc': parts.netloc,
        'path': parts.path,
        'params': parts.params,
        'query': parts.query,
        'fragment': parts.fragment,
        'directories': directories,
        'queries': queries,
    }
    return elements
elements = url_parser(url)
elements
{'scheme': 'http',
'netloc': 'flyandlure.org',
'path': '/articles/fly_fishing/fly_fishing_diary_july_2020',
'params': '',
'query': 'q=word&b=something',
'fragment': 'anchor',
'directories': ['articles', 'fly_fishing', 'fly_fishing_diary_july_2020'],
'queries': ['q=word', 'b=something']}
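If you only need the query string, the standard library's parse_qs() function goes one step further than splitting on ampersands: it decodes the parameters into a dictionary mapping each key to a list of values.

```python
from urllib.parse import urlparse, parse_qs

url = "http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word&b=something#anchor"

# parse_qs() decodes the query string into a dict of key -> list of values
params = parse_qs(urlparse(url).query)
print(params)  # {'q': ['word'], 'b': ['something']}
```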
To test it out we can loop through a bunch of URLs and print the output for each one. These are a mixture of Google search queries and ecommerce website URLs, so include a mixture of directories and queries to parse.
urls = [
'https://www.google.com/search?q=how+to+dispose+of+a+corpse&oq=how+to+dispose+of+a+corpse&aqs=chrome..69i57j69i64.4925j1j7&sourceid=chrome&ie=UTF-8',
'https://tales.as/101-things-to-do-with-a-dead-body_9780997711639?utm_source=google-shopping&utm_medium=cpc&utm_campaign=',
'https://www.worldofbooks.com/en-gb/books/hugh-whitemore/disposing-of-the-body/9781872868271#NGR9781872868271',
'https://www.google.com/search?q=where+to+buy+hydrofluoric+acid&oq=where+to+buy+hydrofl&aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7&sourceid=chrome&ie=UTF-8'
]
for url in urls:
    print(url_parser(url))
    print('-------------------------------\n')
{'scheme': 'https', 'netloc': 'www.google.com', 'path': '/search', 'params': '', 'query': 'q=how+to+dispose+of+a+corpse&oq=how+to+dispose+of+a+corpse&aqs=chrome..69i57j69i64.4925j1j7&sourceid=chrome&ie=UTF-8', 'fragment': '', 'directories': ['search'], 'queries': ['q=how+to+dispose+of+a+corpse', 'oq=how+to+dispose+of+a+corpse', 'aqs=chrome..69i57j69i64.4925j1j7', 'sourceid=chrome', 'ie=UTF-8']}
-------------------------------
{'scheme': 'https', 'netloc': 'tales.as', 'path': '/101-things-to-do-with-a-dead-body_9780997711639', 'params': '', 'query': 'utm_source=google-shopping&utm_medium=cpc&utm_campaign=', 'fragment': '', 'directories': ['101-things-to-do-with-a-dead-body_9780997711639'], 'queries': ['utm_source=google-shopping', 'utm_medium=cpc', 'utm_campaign=']}
-------------------------------
{'scheme': 'https', 'netloc': 'www.worldofbooks.com', 'path': '/en-gb/books/hugh-whitemore/disposing-of-the-body/9781872868271', 'params': '', 'query': '', 'fragment': 'NGR9781872868271', 'directories': ['en-gb', 'books', 'hugh-whitemore', 'disposing-of-the-body', '9781872868271'], 'queries': ['']}
-------------------------------
{'scheme': 'https', 'netloc': 'www.google.com', 'path': '/search', 'params': '', 'query': 'q=where+to+buy+hydrofluoric+acid&oq=where+to+buy+hydrofl&aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7&sourceid=chrome&ie=UTF-8', 'fragment': '', 'directories': ['search'], 'queries': ['q=where+to+buy+hydrofluoric+acid', 'oq=where+to+buy+hydrofl', 'aqs=chrome.1.69i57j0l2j0i10j0i10i395j0i395i457j0i10i395l2.8058j1j7', 'sourceid=chrome', 'ie=UTF-8']}
-------------------------------
Finally, since it’s generally easier to work with such data in Pandas, we can create a function to iterate over the URLs in a Pandas sitemap dataframe and return a new dataframe containing the URL components and parameters.
def url_components_to_df(df, url='url'):
    """Parses a dataframe of URLs and returns a dataframe of URL components.

    Args:
        df (object): Pandas dataframe containing URLs.
        url (string, optional): Optional name of column containing URL, if not 'url'.

    Returns:
        df (object): Pandas dataframe containing URL components.
    """

    pages = []
    for index, row in df.iterrows():
        elements = url_parser(row[url])
        pages.append({
            'scheme': elements['scheme'],
            'netloc': elements['netloc'],
            'path': elements['path'],
            'params': elements['params'],
            'query': elements['query'],
            'fragment': elements['fragment'],
            'directories': elements['directories'],
            'queries': elements['queries'],
        })

    # Build the output in one go; row-by-row DataFrame.append() was removed in pandas 2.0
    return pd.DataFrame(pages, columns=['scheme', 'netloc', 'path',
                                        'params', 'query', 'fragment',
                                        'directories', 'queries'])
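As an alternative to looping with iterrows(), the same result can be produced by applying url_parser() across the URL column and expanding the returned dictionaries into columns. A minimal sketch, using a tiny hypothetical dataframe in place of the real sitemap:

```python
import pandas as pd
from urllib.parse import urlparse

def url_parser(url):
    """Return the component parts of a URL as a dictionary."""
    parts = urlparse(url)
    return {
        'scheme': parts.scheme,
        'netloc': parts.netloc,
        'path': parts.path,
        'params': parts.params,
        'query': parts.query,
        'fragment': parts.fragment,
        'directories': parts.path.strip('/').split('/'),
        'queries': parts.query.strip('&').split('&'),
    }

# Hypothetical stand-in for the sitemap dataframe
df_sitemap = pd.DataFrame({'url': [
    'http://flyandlure.org/articles/fly_fishing/fly_fishing_diary_july_2020?q=word#anchor',
]})

# apply() yields a Series of dicts; tolist() + DataFrame expands them into columns
df_output = pd.DataFrame(df_sitemap['url'].apply(url_parser).tolist())
print(df_output.loc[0, 'netloc'])
```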
If we load up a dataframe containing our URLs, which are stored in a column called url, we can pass that to our function, and it will return a new dataframe containing all the directory structure components identified. We can then analyse the data from within Pandas!
df_output = url_components_to_df(df_sitemap)
df_output.tail()
| | scheme | netloc | path | params | query | fragment | directories | queries |
|---|---|---|---|---|---|---|---|---|
| 1669 | http | flyandlure.org | /listings/places_to_fly_fish/wales/wrexham/lla... | | | | [listings, places_to_fly_fish, wales, wrexham,... | [] |
| 1670 | http | flyandlure.org | /listings/places_to_fly_fish/wales/wrexham/pen... | | | | [listings, places_to_fly_fish, wales, wrexham,... | [] |
| 1671 | http | flyandlure.org | /listings/places_to_fly_fish/wales/wrexham/pen... | | | | [listings, places_to_fly_fish, wales, wrexham,... | [] |
| 1672 | http | flyandlure.org | /listings/places_to_fly_fish/wales/wrexham/tre... | | | | [listings, places_to_fly_fish, wales, wrexham,... | [] |
| 1673 | http | flyandlure.org | /listings/places_to_fly_fish/wales/wrexham/ty_... | | | | [listings, places_to_fly_fish, wales, wrexham,... | [] |
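As a hypothetical example of the kind of analysis this enables, the directories column can be used to count pages per top-level section of the site, by taking the first element of each list and running value_counts() over it:

```python
import pandas as pd

# Hypothetical stand-in for the parsed output dataframe
df_output = pd.DataFrame({'directories': [
    ['articles', 'fly_fishing'],
    ['articles', 'fly_tying'],
    ['listings', 'places_to_fly_fish', 'wales'],
]})

# .str[0] indexes into each list, giving the top-level directory per URL
section_counts = df_output['directories'].str[0].value_counts()
print(section_counts)  # articles: 2, listings: 1
```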
Matt Clarke, Friday, March 12, 2021