-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Improve incremental materialized views refresh time #668
Comments
Thanks for the submission @cmcnicoll! I think the two things you're comparing might not be exactly the same. When a
In order for dbt to know that all that's needed is a refresh statement, it needs to do this investigative work first. When you ran the refresh statement in DBeaver, you already subconsciously observed that your project structure has not changed, nor did the model definition or configuration change since the last run. Hence executing a refresh statement is the only required step. In other words, for I agree that the 7x-8x jump is larger than I would have expected. I'll talk through my thought process here and see where we wind up. When working with test sets, oftentimes the bottleneck is not the size of the data, the local application nor the server, it's the network time in getting your command to the server. I'm going to assume that the network lag is somewhere around the 4s that you're seeing for the refresh command to return since all the server did was determine that nothing needed to happen. That means a query costs a minimum of 4s. Walking through the logic flow above, we need to run the following queries (at least):
Let's assume I'm not forgetting anything (not a safe assumption). And I'll point out that these four queries run in series. That puts us at 17.5s (4.371s x 4) just in network lag time, leaving 15s (32.5s - 17.5s) for query execution of the first three queries (assume refresh == 0s). That feels a little higher than expected, but not alarming, especially considering that we have not accounted for any dbt work yet. At the moment, I'm not inclined to regard this as a bug. Hopefully this context helps. I'll leave this open in case you disagree though. |
Thanks @mikealfare for the context! I appreciate you taking the time to write that out. I wasn't sure if this should be a bug, feature or discussion. The main drivers for creating this issue are:
Based on the context you provided, should this issue be changed to a feature request? For example, speeding up the investigative queries and/or being able to bypass them when trying to refresh every 5 minutes. Or should this be moved to a discussion? I can expand on my use case, if that helps. |
In your post you have
This is interesting. I'd like to look into why that's the case as it should run in parallel if there are no dependencies between the materialized views.
Could you elaborate? Is this due to the time it takes to refresh, or something else?
If 2 is true, I would argue that's a bug. The intention is to run as many models in parallel as we can. I could see speeding up the investigative queries as being a performance enhancement down the road. Ideally that happens at the same time as building the cache. However, since only materialized views support If your intention is to bypass the investigative queries, I strongly suggest looking into turning on |
Thanks again @mikealfare for the detailed response! Here is a quick overview of my use case.
And here are some timings from
Sorry, this is not true. When I check
I was hoping that chaining incremental materialized views together would handle slowly changing dimensions and would be faster than incremental models. It seems like the best use case for incremental materialized views is for |
Ah ok, the context helps a lot. Agreed that there are some significant limitations on materialized views, and they vary by platform. Based on your context and use cases, would it be fair to reclassify this into two enhancements? Namely:
Cc: @Fleid |
Yes, that sounds good to me! Thanks! |
We are currently running hourly refreshes, but we are still interested in using incremental materialized views to do near real time refreshes. Thanks! |
Is this a new bug in dbt-redshift?
Current Behavior
Refreshing materialized views manually is significantly faster than refreshing via dbt.
Expected Behavior
Refreshing via dbt is faster, especially when the underlying data has not changed and the model configs have not changed.
Steps To Reproduce
dbt run --threads 8
to create incremental materialized views.dbt run --threads 8
to do initial refresh.dbt run --threads 8
to get timing for refreshing via dbt.Relevant log output
No response
Environment
Additional Context
I'm evaluating if dbt can be used for near real time reporting using incremental materialized views in staging and core models.
The text was updated successfully, but these errors were encountered: