Google recently announced that it will be sunsetting Universal Analytics and replacing it with Google Analytics 4. The news sent shock waves through the ecommerce and marketing world, as it meant we’ll lose all our previous data, and we’ll have to migrate to GA4 and start using it.
For data scientists like me, who’ve spent weeks building reporting systems and models that pull data from the Google Analytics Reporting API, this is also somewhat irksome. I wrote the GAPandas Python package to make it quicker and easier for me to query the GA API and use it heavily in my work.
By June 2023, GAPandas and other code that queries the Google Analytics Reporting API used by Universal Analytics will cease to work or have access to fresh data, as Universal Analytics is being killed off or sunsetted.
Seeing as I’ve now got lots of reports to rewrite, I figured it wise to create a new version of GAPandas that supports the Google Analytics Data API used by GA4. At the moment it’s very alpha, but does work and is a quick and easy way to access GA4 data in Python. Here’s how to use it.
GAPandas4 will work in Jupyter, Google Colab, or a Python script. First, you’ll need to install the Python package from my GitHub repository. You can do that in Jupyter or Colab using the command below. This will install the right version of Pandas, plus the Google Analytics Data package that GAPandas4 forms a helpful wrapper around. Once installed, import the package and give it an alias.
!pip3 install git+https://github.com/practical-data-science/gapandas4.git
import gapandas4 as gp
Next, create a couple of variables in which to store your settings. The service_account
variable needs to hold the filename and path to your Google Service Account client secrets JSON keyfile. The Service Account needs to be granted access to the Google Analytics Data API and you’ll need to add the email of the service account user as a user on your Google Analytics 4 property. You’ll also need the number of the Google Analytics 4 property you wish to query.
service_account = 'client_secrets.json'
property_id = '123456789'
Unlike the previous version, the GA4 API uses Protocol Buffers or Protobufs to accept requests. These are a bit clunkier to create and have a very specific syntax. You’ll need to pass in the property_id
with every request, plus the dimensions
and metrics
you want to return and the date_ranges
.
Once you’ve constructed your request
we’ll pass this to query()
with the service_account
and GAPandas4 will run the query and return the data in a Pandas dataframe, or a list of Pandas dataframes if you passed a batch report request.
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
)
df = gp.query(service_account, request)
df.head()
country | city | activeUsers | |
---|---|---|---|
0 | United Kingdom | London | 2191 |
1 | United Kingdom | (not set) | 805 |
2 | United Kingdom | Birmingham | 202 |
3 | United Kingdom | Manchester | 105 |
4 | United Kingdom | Leeds | 95 |
Here’s the full code example required to construct a simple GA4 API query and return the output in a Pandas dataframe. This should be all you need to get started with regular reports. You can also use GAPandas4 to run batch reports, pivot reports, batch pivot reports, and realtime reports.
import gapandas4 as gp
service_account = 'client_secrets.json'
property_id = '123456789'
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
)
df = gp.query(service_account, request)
df.head()
country | city | activeUsers | |
---|---|---|---|
0 | United Kingdom | London | 2191 |
1 | United Kingdom | (not set) | 805 |
2 | United Kingdom | Birmingham | 202 |
3 | United Kingdom | Manchester | 105 |
4 | United Kingdom | Leeds | 95 |
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
order_bys=[gp.OrderBy(metric=gp.OrderBy.MetricOrderBy(metric_name='activeUsers'), desc=True)]
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
order_bys=[gp.OrderBy(dimension=gp.OrderBy.DimensionOrderBy(dimension_name='country'), desc=False)]
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
dimension_filter=gp.FilterExpression(
filter=gp.Filter(
field_name="country",
string_filter=gp.Filter.StringFilter(value="Australia"),
)
),
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
dimension_filter=gp.FilterExpression(
not_expression=gp.FilterExpression(
filter=gp.Filter(
field_name="country",
string_filter=gp.Filter.StringFilter(value="United Kingdom"),
)
)
),
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
dimension_filter=gp.FilterExpression(
filter=gp.Filter(
field_name="country",
in_list_filter=gp.Filter.InListFilter(
values=[
"United Kingdom",
"Ireland",
]
),
)
),
)
df = gp.query(service_account, request)
print(df.head())
request = gp.RunReportRequest(
property=f"properties/{property_id}",
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")],
dimension_filter=gp.FilterExpression(
not_expression=gp.FilterExpression(
filter=gp.Filter(
field_name="country",
in_list_filter=gp.Filter.InListFilter(
values=[
"United Kingdom",
"Ireland",
]
),
)
)
),
)
df = gp.query(service_account, request)
print(df.head())
The Google Analytics Data API supports making batch requests where you can pass up to five API requests at once. GAPandas4 will process each one and return the Pandas dataframes in a Python list so you can access them.
import gapandas4 as gp
service_account = 'client_secrets.json'
property_id = 'xxxxxxxxx'
batch_report_request = gp.BatchRunReportsRequest(
property=f"properties/{property_id}",
requests=[
gp.RunReportRequest(
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-01", end_date="2022-06-01")]
),
gp.RunReportRequest(
dimensions=[
gp.Dimension(name="country"),
gp.Dimension(name="city")
],
metrics=[
gp.Metric(name="activeUsers")
],
date_ranges=[gp.DateRange(start_date="2022-06-02", end_date="2022-06-02")]
)
]
)
df = gp.query(service_account, batch_report_request, report_type="batch_report")
print(df[0].head())
print(df[1].head())
Constructing a report using RunPivotReportRequest()
will return pivoted data in a single Pandas dataframe.
import gapandas4 as gp
service_account = 'client_secrets.json'
property_id = 'xxxxxxxxx'
pivot_request = gp.RunPivotReportRequest(
property=f"properties/{property_id}",
dimensions=[gp.Dimension(name="country"),
gp.Dimension(name="browser")],
metrics=[gp.Metric(name="sessions")],
date_ranges=[gp.DateRange(start_date="2022-05-30", end_date="today")],
pivots=[
gp.Pivot(
field_names=["country"],
limit=5,
order_bys=[
gp.OrderBy(
dimension=gp.OrderBy.DimensionOrderBy(dimension_name="country")
)
],
),
gp.Pivot(
field_names=["browser"],
offset=0,
limit=5,
order_bys=[
gp.OrderBy(
metric=gp.OrderBy.MetricOrderBy(metric_name="sessions"), desc=True
)
],
),
],
)
df = gp.query(service_account, pivot_request, report_type="pivot")
print(df.head())
Constructing a payload using BatchRunPivotReportsRequest()
will allow you to run up to five pivot reports. These are returned as a list of Pandas dataframes.
import gapandas4 as gp
service_account = 'client_secrets.json'
property_id = 'xxxxxxxxx'
batch_pivot_request = gp.BatchRunPivotReportsRequest(
property=f"properties/{property_id}",
requests=[
gp.RunPivotReportRequest(
dimensions=[gp.Dimension(name="country"),
gp.Dimension(name="browser")],
metrics=[gp.Metric(name="sessions")],
date_ranges=[gp.DateRange(start_date="2022-05-30", end_date="today")],
pivots=[
gp.Pivot(
field_names=["country"],
limit=5,
order_bys=[
gp.OrderBy(
dimension=gp.OrderBy.DimensionOrderBy(dimension_name="country")
)
],
),
gp.Pivot(
field_names=["browser"],
offset=0,
limit=5,
order_bys=[
gp.OrderBy(
metric=gp.OrderBy.MetricOrderBy(metric_name="sessions"), desc=True
)
],
),
],
),
gp.RunPivotReportRequest(
dimensions=[gp.Dimension(name="country"),
gp.Dimension(name="browser")],
metrics=[gp.Metric(name="sessions")],
date_ranges=[gp.DateRange(start_date="2022-05-30", end_date="today")],
pivots=[
gp.Pivot(
field_names=["country"],
limit=5,
order_bys=[
gp.OrderBy(
dimension=gp.OrderBy.DimensionOrderBy(dimension_name="country")
)
],
),
gp.Pivot(
field_names=["browser"],
offset=0,
limit=5,
order_bys=[
gp.OrderBy(
metric=gp.OrderBy.MetricOrderBy(metric_name="sessions"), desc=True
)
],
),
],
)
]
)
df = gp.query(service_account, batch_pivot_request, report_type="batch_pivot")
print(df[0].head())
print(df[1].head())
Matt Clarke, Wednesday, June 22, 2022