How to analyse Google Analytics demographics and interests with GAPandas

Pictures by Kampus, Pexels.


The demographics and interests data provided in Google Analytics can be a useful way to understand who is visiting your site or purchasing your products, without the need to perform a complex and expensive demographic segmentation project that might not deliver an ROI.

While there are some limitations to the demographic data provided in Google Analytics, they are still quite useful for developing a simple overview of your customer base and can work alongside market segmentation or customer segmentation and reveal patterns of which you may not have been aware.

Having worked in businesses that segmented their customers with expensive commercial tools such as Mosaic, which use demographic segmentation to report on the overall customer base, I’m not convinced they are worth the money compared with what you can get for free from Google Analytics data.

If you only want an overview of customer demographics, such as the age brackets, gender, geographic location, and market interests of your customers, then the demographics and interests data in Google Analytics should be enough to give you a high-level view at no cost.

They’re a great way to help your business owners understand who interacts with your website, how the demographics of your customer base may be changing over time, and can help you identify what products to launch, what markets to target, and what marketing activity might work best.

In this project, we’ll be using my GAPandas package to query the Google Analytics API and retrieve the demographics and interests data for the visitors to one of my (non-ecommerce) websites. This will cover the basic techniques you need to apply to understand the demographic segments of your site visitors at a high level, just as Mosaic segmentation does.

Understanding demographics and interests data in Google Analytics

Google Analytics demographics and interests data are collected using Google’s advertising cookies. Therefore, in order to see any demographic and interests data in GA, you will first need to enable a couple of optional settings.

Go to Google Analytics > Admin
Select the Google Analytics property for which you want to enable data collection
Set Remarketing to On
Set Advertising Reporting Features to On
Save

Once you’ve saved these settings, Google Analytics will use existing Google cookies for the DoubleClick, Android, and iOS advertising networks to capture some additional data on your site visitors and add it to your Google Analytics account in an anonymised form. Since the data is collected daily, you’ll need to leave the settings turned on for a period of time before running your analysis.

Data thresholds and anonymisation

One potential application of these data, you may be thinking, could be to capture demographic data on each of your customers to augment the customer profile you hold upon each of them. This would be a great way to let you target customers based on their demographics, and could work brilliantly alongside the aggregate demographics and interests data.

However, while this would be fascinating from a data perspective, Google doesn’t want you to be able to fetch data on the specific demographics of individual users for privacy reasons. For example, you’re unable to write an API query to fetch the userAgeBracket or userGender of each customer when you select a transactionId dimension.

In addition, if a query returns data on only a small number of people, which could inadvertently reveal information about individuals, Google applies a minimum-volume threshold and withholds the data. At the higher-level queries we’re examining below, this rarely kicks in, but it might if you create more granular API queries.


Install the packages

First, open a Jupyter notebook and install the Pandas and GAPandas packages. You’ll probably have Pandas already installed, but you can install GAPandas by entering the command pip3 install gapandas in your terminal or !pip3 install gapandas in a cell in your Jupyter notebook. Any other packages we need can be installed later in our notebook.

!pip3 install gapandas

import gapandas as gp
import pandas as pd

Connect to the Google Analytics API

Next, we’ll use GAPandas to connect to the Google Analytics API. To authenticate you will need to download a client secrets JSON keyfile from the Google API Console. This gives you access to a Google Cloud Service Account, via which we can extract our Google Analytics data. There’s a step-by-step guide to creating a client secrets JSON keyfile in my other guide to using GAPandas.

service = gp.get_service('client-secrets.json')

To avoid entering the same values throughout our code, we’ll create some variables to hold the view ID for the Google Analytics property we want to query, as well as the start and end dates for the date period we want to extract via the GA API.

view = '1234567890'
start_date = '2021-07-01'
end_date = '2021-07-31'

Age bracket

The first piece of demographic data we’ll extract is the age bracket for our users, which is stored in an API dimension called userAgeBracket. To obtain this demographic data, we simply assemble a GAPandas payload dictionary containing the ga:userAgeBracket dimension and our choice of metrics, and then run the query using the run_query() function, remembering to pass in the service object and the view ID for our GA property.

As the data below show, the site has the fewest visitors in the 18-24 bracket (there is no lower age bracket in the data), and older visitors tend to dominate on this site. Although there’s no e-commerce data for my site, I’ve included the e-commerce metrics in the payload so you can apply this to analysing an e-commerce site. These will often show differences in average order value and conversion rate for each age bracket, which can reveal how well you cater for customers of different ages, or how easy or difficult they find your site to use.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:userAgeBracket',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	userAgeBracket	sessions
0	65+	1301
1	55-64	1291
2	45-54	1255
3	25-34	1236
4	35-44	1055
5	18-24	574
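
Since the queries in this guide differ mainly in their dimensions, one way to cut the repetition is to wrap the payload in a small helper. The sketch below is not part of GAPandas: build_payload() and DEFAULT_METRICS are hypothetical names, and it only assumes the payload keys and comma-separated strings used above.

```python
# Hypothetical helper (not part of GAPandas): builds the payload dictionary
# used throughout this guide so only the dimensions change per query.
DEFAULT_METRICS = [
    'ga:sessions',
    'ga:transactions',
    'ga:transactionRevenue',
    'ga:transactionsPerSession',
    'ga:revenuePerTransaction',
]


def build_payload(dimensions, start_date, end_date,
                  metrics=None, sort='-ga:sessions'):
    """Return a payload dict with comma-separated metrics and dimensions."""
    metrics = DEFAULT_METRICS if metrics is None else metrics
    return {
        'start_date': start_date,
        'end_date': end_date,
        'metrics': ', '.join(metrics),
        'dimensions': ', '.join(dimensions),
        'sort': sort,
    }


payload = build_payload(['ga:userAgeBracket'], '2021-07-01', '2021-07-31')
print(payload['dimensions'])
```

The resulting dictionary can be passed to gp.run_query(service, view, payload) in the same way as a hand-written payload.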

Gender

We can repeat the process for the ga:userGender demographic data simply by adjusting the dimension we define in the payload. This reveals that the site has a strong male bias in the visitor profile. Again, there’s no e-commerce data for this site, but this also often shows big differences in customer behaviour according to gender.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:userGender',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	userGender	sessions	transactions	transactionRevenue	transactionsPerSession	revenuePerTransaction
0	male	5406	0	0.0	0.0	0.0
1	female	1427	0	0.0	0.0	0.0

Age and gender

To get more granular data on age and gender, you can define multiple dimensions. In the example below, I’ve included both the ga:userGender and ga:userAgeBracket dimensions, to get a more detailed breakdown of customer ages and genders. To analyse these more easily, you may wish to convert the raw metrics to percentages to better show the proportional split within each age bracket.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:userGender, ga:userAgeBracket',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)

df.sort_values(by='userAgeBracket').head(10)

	userGender	userAgeBracket	sessions
5	male	18-24	400
11	female	18-24	174
3	male	25-34	984
8	female	25-34	252
4	male	35-44	846
10	female	35-44	209
2	male	45-54	987
6	female	45-54	268
0	male	55-64	1051
9	female	55-64	240
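
To convert the raw session counts into percentages within each age bracket, one option is a grouped transform in Pandas. This is a minimal sketch rather than the only way to do it: the dataframe is rebuilt by hand from the output above so the example runs on its own, and pct_of_bracket is a column name chosen for illustration.

```python
import pandas as pd

# Sessions by gender and age bracket, copied from the output above.
df = pd.DataFrame({
    'userGender': ['male', 'female'] * 5,
    'userAgeBracket': ['18-24', '18-24', '25-34', '25-34', '35-44', '35-44',
                       '45-54', '45-54', '55-64', '55-64'],
    'sessions': [400, 174, 984, 252, 846, 209, 987, 268, 1051, 240],
})

# Total sessions for each row's age bracket, aligned to the original rows.
bracket_total = df.groupby('userAgeBracket')['sessions'].transform('sum')

# Each row's share of its own age bracket, as a percentage.
df['pct_of_bracket'] = (df['sessions'] / bracket_total * 100).round(1)

print(df)
```

Because the percentages are calculated within each bracket, the male and female rows for a bracket add up to roughly 100, which makes the split easier to compare across ages than the raw counts.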

Continent

You can also perform geographical segmentation using the same technique. There are various levels of dimension provided that allow you to drill down through your data to understand where site visitors are located. From an e-commerce perspective, this can be a useful way to judge whether it may be worth offering delivery services to specific countries, or even offering local language content.

Do be aware, though, that all geographic data in Google Analytics should be taken with a pinch of salt. While Google has the ability to geolocate users, it would appear that these data may often be based upon the location of the network through which you are routed, so VPNs and other network systems can potentially falsify some data. However, it’s generally sufficient to give you a rough idea. Here’s the data on the ga:continent dimension, which shows most of my visitors are in Europe.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:continent',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	continent	sessions
0	Europe	15924
1	Americas	4808
2	Oceania	382
3	Africa	202
4	Asia	171
5	(not set)	9

Country

To drill down through our geographic segmentation data we can use different dimensions. By running an API query with the ga:country dimension in the payload we can see that the UK dominates the site’s traffic, followed by the US and Canada.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:country',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	country	sessions
0	United Kingdom	13580
1	United States	3807
2	Canada	910
3	Ireland	414
4	Sweden	405
5	Norway	343
6	Australia	250
7	South Africa	173
8	Finland	170
9	New Zealand	131
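
To see how concentrated the traffic is, you could add a share and a running total to the country data. The sketch below rebuilds the top ten rows from the output above so it runs on its own. The shares are therefore relative to these ten countries only, not to the site total, and share_pct and cumulative_pct are illustrative column names.

```python
import pandas as pd

# Top ten countries by sessions, copied from the output above.
df = pd.DataFrame({
    'country': ['United Kingdom', 'United States', 'Canada', 'Ireland',
                'Sweden', 'Norway', 'Australia', 'South Africa',
                'Finland', 'New Zealand'],
    'sessions': [13580, 3807, 910, 414, 405, 343, 250, 173, 170, 131],
})

# Share of the sessions in these ten rows (not of the whole site).
df['share_pct'] = df['sessions'] / df['sessions'].sum() * 100

# Running total of the share, in the existing descending order.
df['cumulative_pct'] = df['share_pct'].cumsum()

print(df.round(1))
```

On a full result set, the cumulative column shows how few countries account for most of your sessions, which can help when deciding where delivery or local language content might be worthwhile.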

Region

You can also drill down to more granular levels. For example, by adding in the ga:region dimension we also get back the region (e.g. England, Wales, Scotland) for each country. To look at the regions for a specific country, you can perform a simple filter on the country column and set it to the country name.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:country, ga:region',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)

df[df['country']=='United Kingdom'].head(10)

	country	region	sessions
0	United Kingdom	England	9034
1	United Kingdom	Scotland	3005
2	United Kingdom	Wales	995
3	United Kingdom	Northern Ireland	510
89	United Kingdom	(not set)	19
95	United Kingdom	Isle of Man	17

City

To get even more detail, you can also add in the ga:city dimension and use the same approach. As you’ll notice, there’s a big chunk of (not set) data, which may have been withheld to avoid revealing too much information. London and Glasgow look like the main hotspots for my site.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:country, ga:region, ga:city',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)

df[df['country']=='United Kingdom'].head(10)

	country	region	city	sessions
0	United Kingdom	England	London	1655
1	United Kingdom	England	(not set)	1104
2	United Kingdom	Scotland	Glasgow	806
3	United Kingdom	Scotland	(not set)	477
4	United Kingdom	Scotland	Edinburgh	371
5	United Kingdom	England	Birmingham	283
6	United Kingdom	Wales	(not set)	276
7	United Kingdom	England	Manchester	261
8	United Kingdom	England	Newcastle upon Tyne	236
10	United Kingdom	England	Bristol	188

Designated Market Area (DMA) or Metro

You can also examine the Designated Market Area (DMA) or Metro. These are known as “media markets” or television market areas and define the TV or radio stations that a population receives. If you wanted to run TV or radio advertising, or advertise in other channels alongside your TV and radio ads, these DMAs could be very useful in helping you find the best places to advertise. On e-commerce sites, there’s sometimes a striking difference in customer behaviour between DMAs.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:country, ga:region, ga:metro',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)

df[df['country']=='United Kingdom'].sort_values(by='region', ascending=False).head(50)

	country	region	metro	sessions
9	United Kingdom	Wales	HTV Wales	490
7	United Kingdom	Wales	(not set)	505
8	United Kingdom	Scotland	(not set)	495
52	United Kingdom	Scotland	Border	44
11	United Kingdom	Scotland	North Scotland	462
1	United Kingdom	Scotland	Central Scotland	2004
10	United Kingdom	Northern Ireland	Ulster	463
48	United Kingdom	Northern Ireland	(not set)	47
120	United Kingdom	Isle of Man	(not set)	17
6	United Kingdom	England	North East	641
5	United Kingdom	England	Yorkshire	837
4	United Kingdom	England	North West	1208
12	United Kingdom	England	Meridian (exc. Channel Islands)	430
13	United Kingdom	England	HTV West	412
14	United Kingdom	England	East Of England	389
18	United Kingdom	England	South West	224
22	United Kingdom	England	Border	135
3	United Kingdom	England	Midlands	1355
2	United Kingdom	England	(not set)	1394
0	United Kingdom	England	London	2009
105	United Kingdom	(not set)	(not set)	19

Latitude and longitude

You can also get the approximate anonymised latitude and longitude coordinates for each visitor using the ga:latitude and ga:longitude dimensions. Judging from comparisons made to e-commerce data, where delivery postcodes are known, these aren’t 100% accurate as mentioned above, but they will give you a rough idea of where your visitors are physically located.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions',
    'dimensions': 'ga:latitude, ga:longitude',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)

df.head()

	latitude	longitude	sessions
0	0.0000	0.0000	2553
1	51.5074	-0.1278	1655
2	55.8642	-4.2518	806
3	55.9533	-3.1883	371
4	52.4862	-1.8904	283

Distance from store

Once you have the visitor’s coordinates, you can calculate their distance from a given location using the excellent GeoPy package. This can be a useful way to identify roughly (possibly very roughly) how many visitors you have within the radius of a particular store or event.

!pip3 install geopy

from geopy.distance import geodesic

start_latitude = 52.98
start_longitude = -3.36

df['distance'] = df.apply(
    lambda x: geodesic((start_latitude, start_longitude),
                       (x.latitude, x.longitude)).miles,
    axis=1)

df[df['distance'] < 10].head()

	latitude	longitude	sessions	distance
115	53.1149	-3.3103	26	9.555561
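
For a rough cross-check that avoids the extra dependency, the distance can be approximated with the haversine formula. Note that this is a swapped-in simplification, not the geodesic calculation GeoPy performs: it assumes a spherical Earth with a mean radius of 3958.8 miles, so the results will differ slightly. haversine_miles() is a hypothetical helper name, and the sample rows are copied from the outputs above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean radius; spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Same starting point as the GeoPy example above.
store = (52.98, -3.36)

# (latitude, longitude, sessions) rows copied from the outputs above.
rows = [
    (51.5074, -0.1278, 1655),
    (55.8642, -4.2518, 806),
    (55.9533, -3.1883, 371),
    (52.4862, -1.8904, 283),
    (53.1149, -3.3103, 26),
]

# Total sessions from coordinates within 10 miles of the starting point.
nearby_sessions = sum(
    sessions for lat, lon, sessions in rows
    if haversine_miles(store[0], store[1], lat, lon) < 10
)
print(nearby_sessions)
```

Summing sessions rather than counting rows matters here, because each row represents every session recorded at that coordinate pair.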

Interest category

Finally, we have the interest categories, which are handled via the ga:interestOtherCategory, ga:interestAffinityCategory, and ga:interestInMarketCategory dimensions. These can tell you what interests your website visitors have, and let you drill down to get increasing levels of detail. Most of my visitors are into the outdoors, sporting goods, cars, and travel, which totally makes sense given the subject matter.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:interestOtherCategory',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	interestOtherCategory	sessions
0	Hobbies & Leisure/Outdoors/Fishing	2831
1	Sports/Team Sports/Soccer	1588
2	News/Weather	1179
3	Arts & Entertainment/Celebrities & Entertainme...	1138
4	News/Sports News	1055
5	Arts & Entertainment/TV & Video/Online Video	787
6	Food & Drink/Cooking & Recipes	602
7	Travel & Transportation/Hotels & Accommodations	565
8	Home & Garden/Patio, Lawn & Garden/Gardening	494
9	Real Estate/Real Estate Listings/Residential S...	416

Interest affinity category

The ga:interestAffinityCategory dimension gives broader information. For my site, we can see that “Lifestyles & Hobbies/Outdoor Enthusiasts” is the top one, which again fits perfectly with the subject matter.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:interestAffinityCategory',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	interestAffinityCategory	sessions
0	Lifestyles & Hobbies/Outdoor Enthusiasts	4628
1	Home & Garden/Do-It-Yourselfers	3845
2	Food & Dining/Cooking Enthusiasts/30 Minute Chefs	3815
3	Sports & Fitness/Sports Fans	3769
4	Lifestyles & Hobbies/Green Living Enthusiasts	3294
5	News & Politics/Avid News Readers	2791
6	Banking & Finance/Avid Investors	2644
7	Sports & Fitness/Health & Fitness Buffs	2605
8	Shoppers/Value Shoppers	2601
9	Lifestyles & Hobbies/Business Professionals	2439

In market category

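The ga:interestInMarketCategory dimension works in the same way as the two above, and groups visitors by the product or service categories Google considers them likely to be in the market for.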

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:interestInMarketCategory',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(10)

	interestInMarketCategory	sessions
0	Sports & Fitness/Outdoor Recreational Equipmen...	2784
1	Sports & Fitness/Sporting Goods	1946
2	Autos & Vehicles/Motor Vehicles/Motor Vehicles...	1415
3	Travel/Hotels & Accommodations	1031
4	Real Estate/Residential Properties/Residential...	939
5	Real Estate/Residential Properties/Residential...	840
6	Home & Garden/Home & Garden Services/Landscape...	831
7	Travel/Trips by Destination/Trips to Europe/Tr...	818
8	Apparel & Accessories/Women's Apparel	620
9	Home & Garden/Home Improvement	590

Interest category by gender

The other thing you can do is analyse different combinations of dimensions. For example, in the code below I’m querying ga:interestOtherCategory with ga:userGender to reveal what gender differences exist within the interest categories. The topic of my site is male dominated, so the bulk of visitors are male in each category.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:interestOtherCategory, ga:userGender',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(20)

	interestOtherCategory	userGender	sessions
0	Hobbies & Leisure/Outdoors/Fishing	male	2461
1	Sports/Team Sports/Soccer	male	1319
2	News/Weather	male	934
3	News/Sports News	male	918
4	Arts & Entertainment/Celebrities & Entertainme...	male	846
5	Arts & Entertainment/TV & Video/Online Video	male	627
6	Food & Drink/Cooking & Recipes	male	436
7	Travel & Transportation/Hotels & Accommodations	male	374
8	Home & Garden/Patio, Lawn & Garden/Gardening	male	352
9	Autos & Vehicles/Boats & Watercraft	male	342
10	Real Estate/Real Estate Listings/Residential S...	male	312
11	News/Politics	male	302
12	Hobbies & Leisure/Outdoors/Fishing	female	270
13	Arts & Entertainment/Celebrities & Entertainme...	female	253
14	News/Weather	female	216
15	Internet & Telecom/Email & Messaging/Email	male	211
16	Autos & Vehicles/Vehicle Shopping/Used Vehicles	male	210
17	Pets & Animals/Pets/Dogs	male	207
18	Sports/Team Sports/Soccer	female	206
19	Internet & Telecom/Search Engines	male	203
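
Because the long format above interleaves male and female rows, it can be easier to compare genders after pivoting the data so each interest category has one row. The sketch below uses a hand-copied subset of the rows above so it runs on its own, and female_pct is an illustrative column name.

```python
import pandas as pd

# A subset of the rows from the output above.
df = pd.DataFrame({
    'interestOtherCategory': [
        'Hobbies & Leisure/Outdoors/Fishing', 'Sports/Team Sports/Soccer',
        'News/Weather', 'Hobbies & Leisure/Outdoors/Fishing',
        'News/Weather', 'Sports/Team Sports/Soccer',
    ],
    'userGender': ['male', 'male', 'male', 'female', 'female', 'female'],
    'sessions': [2461, 1319, 934, 270, 216, 206],
})

# One row per interest category, one column per gender.
wide = df.pivot_table(index='interestOtherCategory', columns='userGender',
                      values='sessions', aggfunc='sum', fill_value=0)

# Female share of sessions within each interest category.
wide['female_pct'] = (
    wide['female'] / (wide['female'] + wide['male']) * 100
).round(1)

print(wide.sort_values('female_pct', ascending=False))
```

Sorting by the share rather than the raw count highlights the categories where the gender split departs most from the site average, even when the category itself is small.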

Interest category, gender, and age bracket

You can also add the ga:userAgeBracket dimension to break the results down by age bracket as well. Collectively, the various queries above give you a really good general overview of the market segments this site serves, which could be useful when communicating to key stakeholders, can help you target marketing or advertising, and can help you spot potential opportunities with products or technologies. For free, the data are really quite useful.

payload = {
    'start_date': start_date, 
    'end_date': end_date, 
    'metrics': 'ga:sessions, ga:transactions, ga:transactionRevenue, '
               'ga:transactionsPerSession, ga:revenuePerTransaction',
    'dimensions': 'ga:interestOtherCategory, ga:userGender, ga:userAgeBracket',
    'sort': '-ga:sessions'
}

df = gp.run_query(service, view, payload)
df.head(20)

	interestOtherCategory	userGender	userAgeBracket	sessions
0	Hobbies & Leisure/Outdoors/Fishing	male	65+	606
1	Hobbies & Leisure/Outdoors/Fishing	male	55-64	569
2	Hobbies & Leisure/Outdoors/Fishing	male	45-54	433
3	Hobbies & Leisure/Outdoors/Fishing	male	35-44	364
4	Hobbies & Leisure/Outdoors/Fishing	male	25-34	351
5	Sports/Team Sports/Soccer	male	55-64	284
6	Sports/Team Sports/Soccer	male	45-54	259
7	Sports/Team Sports/Soccer	male	25-34	237
8	Sports/Team Sports/Soccer	male	35-44	235
9	News/Weather	male	65+	220
10	News/Sports News	male	55-64	213
11	Sports/Team Sports/Soccer	male	65+	213
12	News/Weather	male	55-64	208
13	Arts & Entertainment/Celebrities & Entertainme...	male	55-64	207
14	Arts & Entertainment/Celebrities & Entertainme...	male	65+	186
15	News/Sports News	male	45-54	176
16	News/Sports News	male	65+	168
17	News/Weather	male	45-54	165
18	News/Sports News	male	35-44	160
19	Arts & Entertainment/Celebrities & Entertainme...	male	45-54	154

Matt Clarke, Tuesday, August 10, 2021

Matt Clarke is an Ecommerce and Marketing Director who uses data science to help in his work. Matt has a Master's degree in Internet Retailing (plus two other Master's degrees in different fields) and specialises in the technical side of ecommerce and marketing.