; ;
Google Analytics 4 (GA4) is an industry-shaking update to Universal Analytics. In addition to GA4’s extensive list of new features, exporting GA4 data to BigQuery is free. However, analysts tasked with doing anything with the GA4 BigQuery export are in for a heartbreaking surprise: all the event data is nested (*big sad*).
But have no fear! I have gone through and done most of the hard work for you. So now, all you have to do is download a few files, follow along, and stick around for the ride.
You can find the full query at the end of this article. I recommend opening the file in another window or screen next to your current one and referencing the file while reading the article.
Caveat: This article assumes you have intermediate familiarity with SQL in BigQuery. I will not be teaching the basics of BigQuery. Now that we have selected for all users WHERE level >= “intermediate,” let’s get into it!
Before you start writing any SQL, you should have a plan for the variables you want to include in your analysis. Since jumping directly into BigQuery without a plan will most likely leave you even more flustered and confused than when you started, I suggest creating a Variable Mapping Reference. The columns should include items such as “Desired Data,” which are the data points you need, and “Source Column Name,” which maps the actual columns in the raw GA4 BigQuery export to your desired data points.
Here is what I came up with:
Alright! Now that you have your Variable Mapping Reference completed, you are one step closer to that beautiful flattened data you’ve been dreaming of.
Let’s write some SQL.
Let’s start by understanding the raw GA4 data export structure in GCP. Below is a simplified illustration of the structure of the export if you were to run a query like the following:
As you can see above, the event_params are a nested array. As mentioned before, this nested structure isn’t ideal if you want to do any analysis.
First, let’s write some pseudo-code to demonstrate the structure of our query. This query flattens the nested data structure of a GA4 data export using a combination of the UNNEST(), COALESCE(), and MAX() functions and a Common Table Expression (CTE) with two parts:
Now that we’ve planned our structure, let’s jump into some SQL.
In the following section, I will specifically walk through the trickier parts of the query.
The first part of the query, alpha, uses UNNEST() to unnest the event_params array column of your GA4 data export table.
This action creates a new row for each element in the event_params array, with columns for your chosen dimensions like event_date, user_pseudo_id, and event_name as shown in the illustration below:
Next, we want to consolidate our event_params values so that every event parameter key only has one value. You can use the COALESCE() function to compress the four event value columns into a single event value column. For any given event_params key value, there will only be one non-null event value in the string_value, int_value, double_value, or float_value columns. The COALESCE() function allows us to extract the non-null values from the value columns, and we concurrently CAST all values to strings such that all values in our resulting column event_value will be the same data type.
The second part of our query, beta, flattens the data by grouping the alpha table by the common columns and pivoting the event key and event_value columns into new columns using the MAX() function. This action creates a single row for each unique combination of your columns (in this instance, event_date, user_pseudo_id, event_name, ga_session_id, campaign_source, campaign_id, and campaign_content).
Finally, the main query selects the desired columns (event_date, user_pseudo_id, event_name, ga_session_id, source, campaign, content, session_id, event_number) from the flattened beta table. The CONCAT() function concatenates ga_session_id and user_pseudo_id to create a unique_session_id, and ROW_NUMBER() is used to generate an event number within each ga_session_id partitioned and ordered by event_timestamp.
After your query produces a flattened output, ensure it pulls the data as intended. For this result, I suggest creating high-level reports in the GA4 UI and comparing them to the summarized data in your query. Here is the quality assurance format that I came up with:
Input your summarized data into a Google Sheet or Excel Spreadsheet and compare. If the variance looks too large, double-check your query to ensure no issues in your code. However, some data points will likely have a higher variance than others (like “Sessions,” which is nicely explained by Tanelytics in the article here).
If you’ve made it this far, you (hopefully) now have some flattened data! Now you’re well on your way to the fun part: analysis.
The query I’ve walked through should enable you to recreate reports from the GA4 UI and empower you to create new, meaningful reports that would be far more complex for the UI alone.
So get creative! Push the boundaries! Happy analyzing!
My Template Variable Mapping Reference, QA Doc, and Query
Google Support | [GA4] BigQuery Export schema
Google Support | [GA4] Default channel group
Tanelytics | Defining and calculating sessions
GPT4SQL.com
Full Sample Query: