Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to represent mart data? Kimball dimensional modeling or other? #178

Open
adamribaudo-velir opened this issue Apr 10, 2023 · 6 comments

Comments

@adamribaudo-velir
Copy link
Collaborator

adamribaudo-velir commented Apr 10, 2023

As pointed out by @willbryant , the current dim and fct mart models don't provide much value in being separated as they each are unique on the same key.

Ex: dim_ga4__sessions and fct_ga4__sessions are unique on session_key and could be joined together without issue.

It may make sense to have a single ga4__sessions model, but I know this is a debate with a lot of facets so it seems worth researching first.

My opinion:

I like the idea of: dim models where there is 1 record for each entity and fct models that contain low-grain events that can be aggregated against the entity. I think our mistake was in pre-aggregating the fct models so now they are entity-grain rather than the grain of the facts themselves.

@dgitis
Copy link
Collaborator

dgitis commented Apr 11, 2023

I agree with this. In fact, this is what I did with the fork that I have deployed on my client with high-volume sites. Separating sessions into fct and dim models adds to the effort needed to create reports and it makes the data less accessible.

I also think that we should include fct_ga4__event_page_view and possibly a fct_ga4__event_scroll model by default because these are the two events that are used to build other models. This would provide package users an example of how we expect them to use the package. It would also make building session and user models more efficient (I know I have been asked to demonstrate this in issue #153 and haven't yet done so, but it does make building those models more efficient and I will demonstrate it when I have the time).

I also think we should discuss the fct_ga4__pages model. With the high-volume client, who is starting to ramp up the use of the production tables, I switched out the fct_ga4__pages model for a agg_ga4__page_location. I've seen others use agg as a prefix and I've been using it to indicate models that aren't intended to be joined. I have aggregate tables for a number of different dimensions like page_location as well as an agg_ga4__daily table (which is another model that I think we should consider making default) and naming the model after the dimension that it segments on makes it consistent with other single-dimension aggregate models like author (which is obviously custom).

Having agg models isn't strictly Kimballian (?), but storage was expensive when Kimball developed his modeling guidelines. We might want to ask the various communities to hear what others are doing in this respect. Before deploying agg models, reports for the high volume client, who gets 1-3 million events a day, were taking about 40 seconds to load. Now, no one remarks on the report load speed.

@willbryant
Copy link
Contributor

Yeah. The other aspect is that BigQuery doesn't index like a Kimball-era database would, so having normalized models doesn't pay off in the same way.

@dgitis
Copy link
Collaborator

dgitis commented Apr 14, 2023

Here's something like what I envision the sessions and users models looking like.

dbt-ga4

Green represents partitioned tables while yellow represents views.

The idea here is to minimize lookups that hit the base_ga4__events table in favor of fct_ga4__events_page_view and a sessions table that I called here int_ga4__sessions.

I'm not sure if dim_ga4__sessions needs to be a partitioned or if a view will suffice. If it's a view, then it would probably be best to partition all of its parents.

I'd also welcome a discussion on whether we need both the first_last_events and first_last_pageviews models. I've noticed some problems with session attribution in sessions with just sessions_start events because we filter them in stg_ga4__sessions_traffic_sources. These sorts of issues will get worse if we decide to do away with first_last_events and just use first_last_pageviews built later and more efficiently. Maybe that tradeoff is worth it.

For last_non_direct attribution, I'm planning on introducing a lookback_window variable to let people choose how many partitions to check before defaulting to Direct attribution.

If we were to introduce a lookback_window then we could also use it in the users tables. One pattern I've seen in Google Ad Manager exports, for example, is that GAM populates daily partitions with all of the relevant data for that day so, for example, the companies table includes all of the companies that bid on that day and repeating that data on the previous day. If we were to do that with the users table and a lookback window, then the user entry for that day would calculate metrics only using data in the lookback window. The cookie-driven traffic_source would remain the same but it would be the only data point outside the lookback window.

The lookback window would also be good for user stitching between device_key and user_id.

I personally like the idea of having a lookback window setting as it makes me a lot more comfortable with looking up past data which gets expensive quickly on large sites and most of the relevant data is usually in the last few days.

We also might want to leave the lifetime user tables in the package as well and let the people deploying the package decide whether they care more about the efficiency of a lookback window versus having perfect user data.

@Stephen986
Copy link
Contributor

Just discovered this package, it does a lot of things very well, but 100% the separation of dim_ga4__sessions and fct_ga4__sessions makes no sense to me.

It's also worth pointing out that most companies have been around much longer than they've been using GA 4. So the company's fact_sessions table is probably fed into by a table full of GA 4 sessions AND a table full of Universal Analytics sessions, and maybe they have Mixpanel or Snowplow or Amplitude sessions as well. Only the final sessions table is a real fact table. So I don't think that a sessions table that only has GA 4 sessions needs a fct prefix at all... your GA 4 sessions table is a building block towards a fact table.

@Stephen986
Copy link
Contributor

Fivetran's packages are probably the most used outside of dbt_utils, and they don't call their shopify orders table fct_orders, but just shopify__orders. A company might have the shopify_orders table join with a stripe_orders table to create their fct_orders. It makes sense to me to use the same approach here.

@dgitis
Copy link
Collaborator

dgitis commented May 5, 2023

There's definitely an argument for avoiding fct_ and dim_ prefixes entirely since those have a specific meaning that we are not following.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants