
[Feature] Multi step materialization/ supporting Update and Delete #4313

Closed
ismailsimsek opened this issue Nov 20, 2021 · 4 comments
Labels
enhancement New feature or request

Comments


ismailsimsek commented Nov 20, 2021

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

I wanted to leave an idea here for supporting multi-step materialization, similar to Liquibase SQL changelogs. I thought this solution could fit dbt.

This should enable users to run an insert followed by an update, or by any other logic they like. In some cases a data load needs to be followed by update statements, and currently this is not possible with a materialization.

The idea is to use an approach similar to Liquibase's SQL changelogs. link 2

The following example defines three changesets, with optional rollback statements. When executed, all three SQL scripts run sequentially, with the configs defined in the SQL comments (the changesets):

--liquibase formatted sql

--changeset nvoxland:1
create table test1 (  
    id int primary key,
    name varchar(255)  
);  
--rollback drop table test1; 

--changeset nvoxland:2 
insert into test1 (id, name) values (1, 'name 1');
insert into test1 (id, name) values (2, 'name 2');

--changeset nvoxland:3 dbms:oracle
create sequence seq_test;

Describe alternatives you've considered

No response

Who will this benefit?

All dbt users who need to run complex, multi-step transformations on a table.

Are you interested in contributing this feature?

No response

Anything else?

No response

ismailsimsek (Author) commented

Related to #184.

@ismailsimsek ismailsimsek changed the title [Feature] Multi step materialization/ supporting Updates [Feature] Multi step materialization/ supporting Update and Delete Jan 12, 2022
jtcohen6 (Contributor) commented Jan 25, 2022

@ismailsimsek Thanks for opening, and sorry for the delay getting back to you.

The Liquibase-inspired syntax you've outlined is pretty neat! My sense, from quickly reading through Liquibase's docs, is that it primarily serves a different use case, where change management for database migrations is essential, namely transactional databases powering live applications.

I'm thinking about how it might align with dbt, which encodes a few strong opinions, all quite different from conventional approaches to managing application databases:

  • Transformation logic (SQL) should be separate from materialization logic (DDL/DML)
  • Transformations (opinionated, mutable) should live independently from raw data (unopinionated, immutable)
  • Transformations should be idempotent. That means transformed models can be recreated at will, without fear of losing any data (create or replace table). Granted, your willingness to drop and replace specific large tables may vary, based on performance considerations—but it's a powerful assumption, and as an operating principle it gets us a lot of mileage.
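
To make idempotency concrete: on a warehouse that supports it, rebuilding a model can be a single atomic statement, so re-running is always safe. A minimal sketch (table and schema names hypothetical, not dbt's actual table materialization):

create or replace table analytics.my_model as (
    -- the model's transformation logic; safe to re-run at any time
    select id, name
    from raw.source_names
);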

That opinionated approach means putting into place certain guardrails and operating assumptions:

  • We want users to define materialization types/strategies once (including all boilerplate DDL/DML), and reuse them across multiple models. (Most often, this looks like users leaning on the built-in types/strategies that ship with dbt's global project.)
  • Different database adapters implement transactional logic in different ways. However it's implemented, if any of the steps in a model's materialization fails, including any SQL statement that raises an error, dbt aborts that model's materialization and returns the error. That may look like rolling back the open transaction on Postgres/Redshift.
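
On Postgres or Redshift, for example, that guarantee comes from an enclosing transaction. A simplified sketch (object names hypothetical):

begin;
create table analytics.my_model (id int, name varchar);
insert into analytics.my_model select id, name from raw.source_names;
-- if either statement above errors, the transaction is rolled back
-- and no partial state is left behind
commit;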

So, two things are simultaneously true:

  1. The idempotent nature of dbt transformations makes the kind of "changelog"-based, migration-style approach to object creation less critical—we can atomically swap + drop tables without fear, and simply back out of the transaction if anything goes wrong. That's the viewpoint underpinning dbt's built-in view, table, and incremental materializations—and we think you should use them!—rather than a migration-style update + delete workflow.
  2. At the same time, it is possible to create your own custom materializations, for use in your own dbt project. That materialization logic can include whatever you'd like. It might not look quite as pretty as the Liquibase syntax, and there's no way to manually catch/handle errors via your own rollback mechanism (dbt will do that automatically as soon as it encounters a failing SQL statement), but you have full control to bring in your own functional, reusable components (via macros):
{% materialization create_insert_then_sequence, default %}

    -- this is the boilerplate DDL/DML, to be defined once and reused across
    -- many many many models
    -- everything below is demonstrative, untested pseudo-code

    -- use `this` as the target; a real materialization would build a proper Relation
    {% set target_relation = this %}

    -- note: get_columns_in_query returns plain column names; a real implementation
    -- would also need data types (e.g. via adapter.get_columns_in_relation)
    {% set columns = get_columns_in_query(sql) %}

    {{ run_hooks(pre_hooks, inside_transaction=False) }}

    -- `BEGIN` happens here, if this database supports transactions
    {{ run_hooks(pre_hooks, inside_transaction=True) }}

    {% set create_table_sql %}
      -- what to do if the table already exists?
      create table {{ target_relation }} (
        {% for column in columns %}
          {{ column.name }} {{ column.data_type }} {{ "," if not loop.last }}
        {% endfor %}
      )
    {% endset %}

    {% set insert_sql %}
      insert into {{ target_relation }} ({% for column in columns %}{{ column.name }}{{ ", " if not loop.last }}{% endfor %})
      select * from ({{ sql }}) as model_subquery
    {% endset %}

    {% call statement('main') -%}
      {{ create_table_sql }};
      {{ insert_sql }}
    {%- endcall %}

    -- `create_sequence` is a user-defined macro, not shown here
    {{ create_sequence(model.name) }}

    -- `COMMIT` happens here
    {{ adapter.commit() }}

    {{ run_hooks(post_hooks, inside_transaction=True) }}

    {{ return({'relations': [target_relation]}) }}

{% endmaterialization %}

-- in a separate model file (e.g. models/my_model.sql):

{{ config(materialized = 'create_insert_then_sequence') }}

-- this is the model's actual transformation logic (SQL)
select 1 as id, 'name 1' as name
union all
select 2 as id, 'name 2' as name

-- this is expected to vary from model to model
-- but all of them can use the same strategy for materializing in the database
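
(In practice, a custom materialization like this lives in a .sql file under your project's macros/ directory, and any model can opt into it via its materialized config.)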

I'm going to close this issue, since the proposal is both a bit of an anti-pattern in dbt and already possible with the capabilities that exist today. My biggest takeaway from this issue is that we should aspire to make dbt's materialization syntax much more legible and intuitive, inspired by the "changelog"-style syntax of Liquibase. In previous attempts, that's looked like wrapping much more of the repeated logic in thoughtfully named macros. Maybe someday we'll even get away from all the {% %} clutter...?

@jtcohen6 jtcohen6 removed the triage label Jan 25, 2022
walked08860 commented

@jtcohen6 why is this an anti-pattern? Say you have a model, sales, and that model has 5 different sources. Rather than building 5 models and unioning them, it would be nice to flow those 5 sources into the same sales model as separate tasks. That way, if any of the tasks fails, the final table is still usable, minus the affected partition/task. Pairing this with partition structures can keep it idempotent; this functionality should only exist when paired with technologies that support partitions, otherwise the idempotency is lost.

Our current workaround is downstream unions (sketched below). The problem is that while this appears idempotent, the disconnect between the materialization of the table and its final use case introduces risk that can only be managed logically: someone could change an upstream model and break the downstream model, so one small change breaks the entire system. If the interim task and the final model were more tightly coupled, the change itself would fail rather than breaking the final model.
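
For illustration, the downstream-union workaround described above usually looks something like the following model (model names hypothetical):

-- models/sales.sql: unions the per-source models built upstream
select * from {{ ref('sales_source_a') }}
union all
select * from {{ ref('sales_source_b') }}
-- ... one branch per source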

jtcohen6 (Contributor) commented Dec 5, 2022

@walked08860 In the "dbt way," as I see it, if you want the re-modeling of each source to be independent, each should be its own model. Then it feels appropriate to have a subsequent model handle the unification. You could materialize that downstream model as a table (if any of the upstream models fails, the table remains but goes stale), or as a view (if one upstream model fails while the others succeed, the view remains and fresh data flows into it from every source except the problematic one). There's absolutely a trade-off there, between independence and consistency, but dbt gives you the ability to make that choice.
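
Expressed as config on the unioning model, that trade-off is a single line (a sketch, with a hypothetical model file):

-- models/sales.sql
{{ config(materialized='table') }}  -- whole table goes stale if any upstream model fails
-- vs.
{{ config(materialized='view') }}   -- stays fresh for every source whose upstream model succeeded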

If the interim task and the final model were more tightly coupled, the change itself would fail rather than breaking the final model.

It sounds like you need to impose column standardization earlier in your modeling process. We're thinking a lot about constraints and "model contracts" right now (e.g. #6079), and that might be a mechanism to address this more clearly and explicitly than by blurring the boundary between modeling ("logical") and materialization ("physical") code.
