[Feature] Multi-step materialization / supporting Update and Delete #4313
related to #184
@ismailsimsek Thanks for opening, and sorry for the delay getting back to you. The liquibase-inspired syntax you've outlined is pretty neat! My sense, from quickly reading through liquibase's docs, is that it primarily serves a different use case, one where change management for database migrations is essential, namely transactional databases powering live applications. I'm thinking about how it might align with dbt, which encodes a few strong opinions, all quite different from conventional approaches to managing application databases. That opinionated approach means putting certain guardrails and operating assumptions in place. So, two things are simultaneously true: the multi-step behavior you describe is already achievable with a custom materialization, and reaching for it cuts against dbt's opinionated approach:
{% materialization create_insert_then_sequence, default %}
-- this is the boilerplate DDL/DML, to be defined once and reused across
-- many many many models
-- everything below is demonstrative, untested pseudo-code
{% set columns = get_columns_in_query(sql) %}
{% set target_relation = this %}
{{ run_hooks(pre_hooks, inside_transaction=False) }}
-- `BEGIN` happens here, if this database supports transactions
{{ run_hooks(pre_hooks, inside_transaction=True) }}
{% set create_table_sql %}
-- what to do if the table already exists?
create table {{ model.name }} (
  {% for column in columns %}
  {{ column.name }} {{ column.data_type }}{{ "," if not loop.last }}
  {% endfor %}
);
{% endset %}
{% set insert_sql %}
insert into {{ model.name }} ({% for column in columns %}{{ column.name }}{{ ", " if not loop.last }}{% endfor %})
select * from ({{ sql }}) as model_subquery;
{% endset %}
{% call statement('main') -%}
{{ create_table_sql }};
{{ insert_sql }};
{%- endcall %}
{{ create_sequence(model.name) }};
-- `COMMIT` happens here
{{ adapter.commit() }}
{{ run_hooks(post_hooks, inside_transaction=True) }}
{{ return({'relations': [target_relation]}) }}
{% endmaterialization %}

-- then, in a model file:
{{ config(materialized = 'create_insert_then_sequence') }}
-- this is the model's actual transformation logic (SQL)
select 1 as id, 'name 1' as name
union all
select 2 as id, 'name 2' as name
-- this is expected to vary from model to model
-- but all of them can use the same strategy for materializing in the database

I'm going to close this issue, since the proposal is both a bit of an anti-pattern in dbt, and possible given the capabilities that exist today. My biggest takeaway from this issue is that we should aspire to make dbt's materialization syntax much more legible and intuitive, inspired by the "changelog"-style syntax of liquibase. In previous attempts, that's looked like wrapping much more of the repeated logic in thoughtfully named macros. Maybe someday we'll even get away from all the boilerplate.
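As a hedged sketch of what "wrapping repeated logic in thoughtfully named macros" could look like for the materialization above (macro names and structure are hypothetical, untested illustrations, not an existing dbt API):

```sql
{% macro build_create_table_sql(relation, columns) %}
  -- render a CREATE TABLE from the column list of the model's query
  create table {{ relation }} (
    {% for column in columns %}
    {{ column.name }} {{ column.data_type }}{{ "," if not loop.last }}
    {% endfor %}
  )
{% endmacro %}

{% macro build_insert_sql(relation, columns, select_sql) %}
  -- render an INSERT ... SELECT wrapping the model's transformation SQL
  insert into {{ relation }} ({% for column in columns %}{{ column.name }}{{ ", " if not loop.last }}{% endfor %})
  select * from ({{ select_sql }}) as source_rows
{% endmacro %}
```

With helpers like these, the materialization body shrinks to a short, readable sequence of named steps, which is closer in spirit to a changelog.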
@jtcohen6 why is this an antipattern? Say you have a model, sales, and that model has 5 different sources. Rather than building 5 models and unioning them, it would be nice to flow those 5 sources into the same sales model as separate tasks. That way, if any of the tasks fails, the final table is still usable, minus the failed partition/task. Pairing this with partition structures can keep it idempotent. This functionality should only exist when paired with technologies that support partitions; otherwise the idempotency is lost. The current hack we use is downstream unions. The problem is that while this appears idempotent, the disconnect between the materialization of the table and the final use case introduces risk that can only be managed logically: someone could change an upstream model, breaking the downstream model, and now one small change has broken the entire system. If the interim task and final model were more tightly coupled, the change would fail without breaking the final model.
@walked08860 In the "dbt way," as I see it, if you want the re-modeling of each source to be independent, each should be its own model. Then, it feels appropriate to have a subsequent model handle the unification. You could materialize that downstream model as a table (if any of the upstream models fails, the table remains but gets stale), or as a view (if one of the upstream models fails, the view remains, the other models succeed, so fresh data flows into the view from all except the problematic source). There's absolutely a trade-off to make there, between independence and consistency, but dbt gives you the ability to make that choice.
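A hedged sketch of that pattern, with hypothetical model and source names: one independent model per source, plus a downstream model that unions them.

```sql
-- models/staging/sales_source_a.sql (one model per source; fails independently)
select id, name from {{ source('erp_a', 'sales') }}

-- models/sales.sql (downstream unification)
-- materialized as a view: if one upstream model fails, the view still serves
-- fresh data from every source except the problematic one
{{ config(materialized = 'view') }}
select id, name from {{ ref('sales_source_a') }}
union all
select id, name from {{ ref('sales_source_b') }}
```

Swapping the config to `materialized = 'table'` trades freshness for consistency: an upstream failure leaves the last successful table in place, stale but complete.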
It sounds like you need to impose column standardization earlier in your model process. We're thinking a lot about constraints and "model contracts" right now (e.g. #6079), and that might be a mechanism to address this, more clearly and explicitly than by fuzzing the boundary between modeling ("logical") vs. materialization ("physical") code.
Is there an existing feature request for this?
Describe the Feature
Wanted to leave an idea here: supporting multi-step materialization. Similar to Liquibase SQL changelogs; I thought this solution could fit dbt.
This should enable users to run an insert followed by an update, or followed by any other logic they like. In some cases a data load needs to be followed by update statements, and currently this is not possible with a materialization.
The idea is to use an approach similar to what Liquibase SQL changelogs do.
The following example defines 3 changesets, with rollback statements for each one. When executed, all 3 SQL scripts run sequentially, with the configs defined in the SQL comment (changeset).
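The example referenced above did not survive extraction. As a hedged illustration of the Liquibase "formatted SQL" changelog style being described (table, column, and author names are hypothetical), three changesets with a rollback each might look like:

```sql
--liquibase formatted sql

--changeset author:1
create table sales (id int, name varchar(50));
--rollback drop table sales;

--changeset author:2
insert into sales (id, name) values (1, 'name 1');
--rollback delete from sales where id = 1;

--changeset author:3
update sales set name = 'name 1 updated' where id = 1;
--rollback update sales set name = 'name 1' where id = 1;
```

Each `--changeset` comment carries the step's config, and the steps run in order, which is the multi-step insert-then-update behavior the request asks for.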
Describe alternatives you've considered
No response
Who will this benefit?
All dbt users who need to run complex transformations (multiple steps of transformation) on a table.
Are you interested in contributing this feature?
No response
Anything else?
No response