This dbt package connects to an exported GA4 dataset and provides useful transformations as well as report-ready dimensional models.
Features include:
- Flattened models to access common events and event parameters such as page_view, session_start, and purchase
- Conversion of sharded event tables into a single partitioned table
- Incremental loading of GA4 data into your staging tables
- Page, session and user dimensional models with conversion counts
- Last non-direct session attribution
- Simple methods for accessing query parameters (like UTM params) or filtering query parameters (like click IDs)
- Support for custom event parameters & user properties
- Mapping from source/medium to default channel grouping
model | description |
---|---|
stg_ga4__events | Contains cleaned event data that is enhanced with useful event and session keys. |
stg_ga4__event_* | 1 model per event (ex: page_view, purchase) which flattens event parameters specific to that event |
stg_ga4__event_items | Contains item data associated with e-commerce events (purchase, add to cart, etc.) |
stg_ga4__event_to_query_string_params | Mapping between each event and any query parameters & values that were contained in the event's page_location field |
stg_ga4__user_properties | Finds the most recent occurrence of specified user_properties for each user |
stg_ga4__derived_user_properties | Finds the most recent occurrence of specified event_params values and assigns them to a client_key. Derived user properties are specified as variables (see documentation below) |
stg_ga4__derived_session_properties | Finds the most recent occurrence of specified event_params or user_properties values and assigns them to a session's session_key. Derived session properties are specified as variables (see documentation below) |
stg_ga4__session_conversions_daily | Produces daily counts of conversions per session. The list of conversion events to include is configurable (see documentation below) |
stg_ga4__sessions_traffic_sources | Finds the first source, medium, campaign, content, paid search term (from UTM tracking), and default channel grouping for each session. |
stg_ga4__sessions_traffic_sources_daily | Same data as stg_ga4__sessions_traffic_sources, but partitioned by day to allow for efficient loading and querying of data. |
stg_ga4__sessions_traffic_sources_last_non_direct_daily | Finds the last non-direct source attributed to each session within a 30-day lookback window. Assumes each session is contained within a day. |
dim_ga4__client_keys | Dimension table for user devices as indicated by client_keys. Contains attributes such as first and last page viewed. |
dim_ga4__sessions | Dimension table for sessions which contains useful attributes such as geography, device information, and acquisition data. Can be expensive to run on large installs (see dim_ga4__sessions_daily) |
dim_ga4__sessions_daily | Query-optimized session dimension table that is incremental and partitioned on date. Assumes that each partition is contained within a single day |
fct_ga4__pages | Fact table for pages which aggregates common page metrics by page_location, date, and hour. |
fct_ga4__sessions_daily | Fact table for session metrics, partitioned by date. A single session may span multiple rows given that sessions can span multiple days. |
fct_ga4__sessions | Fact table that aggregates session metrics across days. This table is not partitioned, so be mindful of performance/cost when querying. |
seed file | description |
---|---|
ga4_source_categories.csv | Google's mapping between source and source_category. Downloaded from https://support.google.com/analytics/answer/9756891?hl=en |
Be sure to run dbt seed before you run dbt run.
To pull the latest stable release along with minor updates, add the following to your packages.yml file:
packages:
  - package: Velir/ga4
    version: [">=4.0.0", "<4.1.0"]
To install the latest code (may be unstable), add the following to your packages.yml file:
packages:
  - git: "https://github.com/Velir/dbt-ga4.git"
- Clone this repository to a folder in the same parent directory as your dbt project
- Update your project's packages.yml to include a reference to this package:
packages:
  - local: ../dbt-ga4
This package assumes that you have an existing dbt project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is defined using the following variables, which must be set in dbt_project.yml.
vars:
  ga4:
    project: "your_gcp_project"
    dataset: "your_ga4_dataset"
    start_date: "YYYYMMDD" # Earliest date to load
    frequency: "daily" # daily|streaming|daily+streaming. See 'Export Frequency' below.
Setting query_parameter_exclusions removes the listed query string parameters from the page_location and page_referrer fields for all downstream processing. The original values are preserved in the original_page_location and original_page_referrer fields. Ex:
vars:
  ga4:
    query_parameter_exclusions: ["gclid","fbclid","_ga"]
Within GA4, you can add custom parameters to any event. These custom parameters will be picked up by this package if they are defined as variables within your dbt_project.yml file using the following syntax:
[event name]_custom_parameters
  - name: "[name of custom parameter]"
    value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
  ga4:
    page_view_custom_parameters:
      - name: "clean_event"
        value_type: "string_value"
      - name: "country_code"
        value_type: "int_value"
You can optionally rename the output column:
vars:
  ga4:
    page_view_custom_parameters:
      - name: "country_code"
        value_type: "int_value"
        rename_to: "country"
If there are custom parameters you need on all events, you can define defaults using default_custom_parameters, for example:
vars:
  ga4:
    default_custom_parameters:
      - name: "country_code"
        value_type: "int_value"
User properties are provided by GA4 in the user_properties repeated field. The most recent user property for each user will be extracted and included in the dim_ga4__users model by configuring the user_properties variable in your project as follows:
vars:
  ga4:
    user_properties:
      - user_property_name: "membership_level"
        value_type: "int_value"
      - user_property_name: "account_status"
        value_type: "string_value"
Derived user properties are different from "User Properties" in that they are derived from event parameters. This provides additional flexibility in allowing users to turn any event parameter into a user property.
Derived User Properties are included in the dim_ga4__users model and contain the latest event parameter value per user.
derived_user_properties:
  - event_parameter: "[your event parameter]"
    user_property_name: "[a unique name for the derived user property]"
    value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
  ga4:
    derived_user_properties:
      - event_parameter: "page_location"
        user_property_name: "most_recent_page_location"
        value_type: "string_value"
      - event_parameter: "another_event_param"
        user_property_name: "most_recent_param"
        value_type: "string_value"
Derived session properties are similar to derived user properties, but on a per-session basis, for properties that change slowly over time. This provides additional flexibility in allowing users to turn any event parameter into a session property.
Derived Session Properties are included in the dim_ga4__sessions and dim_ga4__sessions_daily models and contain the latest event parameter or user property value per session.
derived_session_properties:
  - event_parameter: "[your event parameter]"
    session_property_name: "[a unique name for the derived session property]"
    value_type: "[string_value|int_value|float_value|double_value]"
  - user_property: "[your user property key]"
    session_property_name: "[a unique name for the derived session property]"
    value_type: "[string_value|int_value|float_value|double_value]"
For example:
vars:
  ga4:
    derived_session_properties:
      - event_parameter: "page_location"
        session_property_name: "most_recent_page_location"
        value_type: "string_value"
      - event_parameter: "another_event_param"
        session_property_name: "most_recent_param"
        value_type: "string_value"
      - user_property: "first_open_time"
        session_property_name: "first_open_time"
        value_type: "int_value"
See the README file at /dbt_packages/models/staging/recommended_events for instructions on enabling Google's recommended events.
Event names can be designated as conversions by setting the conversion_events variable in your dbt_project.yml file. These events will be counted against each session and included in the fct_sessions.sql dimensional model. Ex:
vars:
  ga4:
    conversion_events: ['purchase','download']
The stg_ga4__sessions_traffic_sources_last_non_direct_daily model provides last non-direct session attribution within a configurable lookback window. The default is 30 days, but this can be overridden with the session_attribution_lookback_window_days variable.
vars:
  ga4:
    session_attribution_lookback_window_days: 90
Custom events can be generated in your project using the create_custom_event macro. Simply create a new model in your project and enter the following:
{{ ga4.create_custom_event('my_custom_event') }}
Note, however, that any event-specific custom parameters or default custom parameters must be defined in the global variable space as shown below:
vars:
  default_custom_parameters:
    - name: "some_parameter"
      value_type: "string_value"
  my_custom_event_custom_parameters:
    - name: "some_other_parameter"
      value_type: "string_value"
By default, GA4 exports data into sharded event tables that use the event date as the table suffix in the format of events_YYYYMMDD or events_intraday_YYYYMMDD. This package incrementally loads data from these tables into base_ga4__events, which is partitioned on date. There are two incremental loading strategies available:
- Dynamic incremental partitions (Default) - This strategy queries the destination table to find the latest date available. Data beyond that date range is loaded in incrementally on each run.
- Static incremental partitions - This strategy is enabled when the static_incremental_days variable is set to an integer. It incrementally loads the last X days' worth of data regardless of what data is available. Google will update the daily event tables within the last 72 hours to handle late-arriving hits, so you should use this strategy if late-arriving hits are a concern. The 'dynamic incremental' strategy will not re-process past date tables. Ex: A static_incremental_days setting of 3 would load data from current_date - 1, current_date - 2, and current_date - 3. Note that current_date uses UTC as the timezone.
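As a minimal sketch of enabling the static strategy, the fragment below sets static_incremental_days to 3. It assumes the variable is nested under the ga4 key in dbt_project.yml like the package's other variables; the value 3 is only illustrative and was chosen to cover the 72-hour window mentioned above.

```yaml
# dbt_project.yml (sketch; the nesting under `ga4` is assumed to match the other package variables)
vars:
  ga4:
    # Illustrative value: each run reloads current_date - 1 through current_date - 3 (UTC)
    static_incremental_days: 3
```

Leaving the variable unset keeps the default dynamic strategy.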
The value of the frequency variable should match the "Frequency" setting on GA4's BigQuery Linking Admin page.
GA4 | dbt_project.yml |
---|---|
Daily | "daily" |
Streaming | "streaming" |
both Daily and Streaming | "daily+streaming" |
The daily option (default) is for sites that use just the daily, batch export. It can also be used as a substitute for the "daily+streaming" option where you don't care about including today's data so it doesn't strictly need to match the GA4 "Frequency" setting. The streaming option is for sites that only use the streaming export. The streaming export is not constrained by Google's one million event daily limit and so is the best option for sites that may exceed that limit. Selecting both "Daily" and "Streaming" in GA4 causes the streaming, intraday BigQuery tables to be deleted when the daily, batch tables are updated. The "daily+streaming" option uses the daily batch export and unions the streaming intraday tables. It is intended to append today's data from the streaming intraday to the batch tables.
Example:
vars:
  ga4:
    frequency: "daily+streaming"
This package assumes that BigQuery is the source of your GA4 data. Full instructions for connecting dbt to BigQuery are here: https://docs.getdbt.com/reference/warehouse-profiles/bigquery-profile
The easiest option is using OAuth with your Google Account. Summarized instructions are as follows:
- Download and initialize gcloud SDK with your Google Account (https://cloud.google.com/sdk/docs/install)
- Run the following command to provide default application OAuth access to BigQuery:
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/iam.test
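Once the gcloud login has completed, a dbt profile using the OAuth method might look like the sketch below. The profile name my_ga4_profile, the target name dev, and the dataset name are hypothetical placeholders; the profile name must match the profile key in your own dbt_project.yml.

```yaml
# profiles.yml (sketch; profile, target, and dataset names are placeholders)
my_ga4_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth                # uses the application-default credentials created by gcloud
      project: your_gcp_project    # GCP project that dbt runs queries in
      dataset: your_dbt_dataset    # dataset where dbt builds its models
      threads: 4
```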
This package uses pytest as a method of unit testing individual models. More details can be found in unit_tests/README.md.
By default, this package maps traffic sources to channel groupings using the macros/default_channel_grouping.sql macro. This macro closely adheres to Google's recommended channel groupings documented here: https://support.google.com/analytics/answer/9756891?hl=en.
Package users can override this macro and implement their own channel groupings by following these steps:
- Create a macro in your project named default__default_channel_grouping that accepts the same 3 arguments: source, medium, source_category
- Implement your custom logic within that macro. It may be easiest to first copy the code from the package macro and modify from there.
Overriding the package's default channel mapping makes use of dbt's dispatch override capability documented here: https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch#overriding-package-macros
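For the override to take effect, dbt generally needs to be told to search your project before the package when resolving macros in the ga4 namespace. A minimal sketch of that dispatch configuration follows; my_project is a hypothetical placeholder for your own project's name.

```yaml
# dbt_project.yml (sketch; `my_project` is a placeholder for your project name)
dispatch:
  - macro_namespace: ga4
    # Look for default__default_channel_grouping in your project first, then fall back to the package
    search_order: ["my_project", "ga4"]
```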
Multiple GA4 properties are supported by listing the property IDs in the property_ids variable. In this scenario, the static_incremental_days variable is required and the dataset variable defines the target dataset where source data will be copied.
vars:
  ga4:
    property_ids: [11111111, 22222222, 33333333]
    static_incremental_days: 3
    dataset: "my_combined_dataset"
With these variables set, the combine_property_data macro will run as a pre-hook to base_ga4__events and clone shards to the target dataset. The number of days' worth of data to clone during incremental runs will be based on the static_incremental_days variable.
When the frequency variable is set to daily or daily+streaming, the events_* tables will be copied and intraday tables will be ignored. When the frequency is set to streaming, only the events_intraday_* tables will be copied.
Jobs that run a large number of clone operations are prone to timing out. As a result, it is recommended that you increase the query timeout if you need to backfill or full-refresh the table, when first setting up or when the base model gets modified. Otherwise, it is best to prevent the base model from rebuilding on full refreshes unless needed to minimize timeouts.
models:
  ga4:
    staging:
      base:
        base_ga4__events:
          +full_refresh: false
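For the timeout increase recommended above, one option is to raise the job timeout in the BigQuery profile. The sketch below assumes a dbt-bigquery version that supports the job_execution_timeout_seconds setting (older versions used a different name); the profile name, target name, dataset name, and the 3600-second value are all illustrative placeholders.

```yaml
# profiles.yml (sketch; names and the timeout value are placeholders)
my_ga4_profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: your_gcp_project
      dataset: your_dbt_dataset
      # Allow long-running clone jobs during a backfill or full refresh
      job_execution_timeout_seconds: 3600
```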
This package attempts to adhere to the Brooklyn Data style guide found here. This work is in progress.