Incremental Models - unique_key possibly has different behavior than stated #4355
Comments
@mirnawong1 Is this the right place for me to open this issue? Is there a different repo that would get this more traction? Trying to see if my organization can use this feature or if this is undefined behavior |
hey @fathom-parth this is def the right place. I'm going to consult with our developer experience team to see how to best address this for you. thank you for flagging this and apologies for the delay here! |
Take a look at the
|
@dbeatty10 I did look into the insert_overwrite strategy, but that wouldn't work for us since it overwrites entire partitions. We currently partition on the ingest timestamp column (a datetime); however, the partitioning is set to month granularity, in accordance with BQ's best practices, so that we don't run into the maximum number of allowed partitions and we follow BQ's guidelines around the optimal size per partition. Overwriting an entire month of data would be expensive and far from ideal. Ideally we'd only overwrite rows with the exact same timestamp. Lmk if I've grossly misunderstood insert_overwrite or if I'm missing something obvious!! Edited at 2023-12-13 11:13am ET for clarity (wrote the above half-asleep) |
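To make the cost difference concrete, here is a rough Python sketch (not dbt or BigQuery code) that compares how many existing rows fall in scope when a whole monthly partition is overwritten versus when only rows with the exact same ingest timestamp are replaced. The function names and sample timestamps are made up for illustration.

```python
from datetime import datetime


def rows_in_touched_months(existing_ts, incoming_ts):
    """Existing rows that sit in any monthly partition touched by the incoming batch."""
    months = {(ts.year, ts.month) for ts in incoming_ts}
    return [ts for ts in existing_ts if (ts.year, ts.month) in months]


def rows_with_same_timestamp(existing_ts, incoming_ts):
    """Existing rows whose ingest timestamp exactly matches one in the incoming batch."""
    wanted = set(incoming_ts)
    return [ts for ts in existing_ts if ts in wanted]


# Hypothetical ingest timestamps already in the table (one per row).
existing_ts = [
    datetime(2023, 12, 1, 8, 0),
    datetime(2023, 12, 1, 8, 0),
    datetime(2023, 12, 5, 9, 30),
    datetime(2023, 12, 20, 17, 45),
    datetime(2023, 11, 30, 23, 0),
]
# The new batch re-delivers a single ingest timestamp.
incoming_ts = [datetime(2023, 12, 1, 8, 0)]

month_scope = rows_in_touched_months(existing_ts, incoming_ts)
exact_scope = rows_with_same_timestamp(existing_ts, incoming_ts)
```

In this toy data the monthly scope covers every December row, while the exact-timestamp scope covers only the two rows that are actually being re-delivered.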
@dbeatty10 responding to your question from Slack: Using the Snowflake adapter
|
Interesting, I guess the BQ specific documentation doesn't mention the unique_key needing to be unique: |
Ah though looking through the code on how the |
There are two different use cases in which
For (1), the upstream source would benefit from adding For (2), there should not be any uniqueness check, because we'd expect many rows to have the same value for It would be nice to formally distinguish between each of those use-cases, and dbt-labs/dbt-core#9490 might be a good way to accomplish that. |
It appears this may have been resolved via other methods. If any updates to the docs are required, please feel free to re-open. |
Contributions
Link to the page on docs.getdbt.com requiring updates
https://docs.getdbt.com/docs/build/incremental-models#defining-a-unique-key-optional
What part(s) of the page would you like to see updated?
Context
I'd like to use incremental models to replace all records with a specific timestamp (an ingest timestamp) with the new rows coming in.
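As a minimal sketch of the behavior I'm after (plain Python, not dbt's actual implementation, with a hypothetical helper name and made-up rows): every existing row that shares a key value with the incoming batch is dropped, and the incoming rows are then appended. The key here is the ingest timestamp, which is deliberately not unique per record.

```python
from datetime import datetime


def replace_by_key(existing, incoming, key):
    """Drop every existing row whose key value appears in the incoming batch,
    then append all incoming rows. The key does not need to be unique per row."""
    incoming_keys = {row[key] for row in incoming}
    kept = [row for row in existing if row[key] not in incoming_keys]
    return kept + list(incoming)


t1 = datetime(2023, 12, 1, 8, 0)
t2 = datetime(2023, 12, 2, 8, 0)

# Hypothetical table contents: two rows share the ingest timestamp t1.
existing = [
    {"ingested_at": t1, "value": "a"},
    {"ingested_at": t1, "value": "b"},
    {"ingested_at": t2, "value": "c"},
]
# A re-delivered batch for t1, with a different number of rows than before.
incoming = [
    {"ingested_at": t1, "value": "a2"},
    {"ingested_at": t1, "value": "b2"},
    {"ingested_at": t1, "value": "d2"},
]

result = replace_by_key(existing, incoming, "ingested_at")
```

After the call, the row for t2 is untouched and all old t1 rows are gone, replaced by the three new t1 rows.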
When reading over the dbt incremental docs, it seems the merge strategy would be the best for this; however, this strategy requires a unique_key.
The Problem
In this case, there's no unique_key that I want to define per record, and I'd rather have the incremental strategy remove every row related to a specific column value (in my case, an ingest timestamp).
I asked this question on Slack here:
https://getdbt.slack.com/archives/CBSQTAPLG/p1698417729305499
Some people mentioned that they do this by defining a unique_key that's not actually unique per record, and that this works fine for them.
One of the users on Slack also mentioned:
However, the documentation seems to conflict with the above suggestion by stating:
and
This conflict is confusing to me and I'd like to know if the docs need to be updated or if the users' experience is undefined.
Additional information
For reference, I'm using the bigquery adapter.