Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase #150

Open
2 tasks done
nfarinha opened this issue Dec 13, 2023 · 7 comments · May be fixed by ataft/dbt-core#1 or #151
Open
2 tasks done
Labels
feature:incremental Issues related to incremental materializations feature:performance Issues related to improving dbt's performance type:bug Something isn't working as documented

Comments

@nfarinha
Copy link

nfarinha commented Dec 13, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The scenario is as follows:
• A model that is loading 20k rows
• Is setup as incremental , delete+insert
• The unique_key is an array of 2 columns
• The 2 columns (unique_key) define a single “partition” with all columns (200k)
• The tmp table creation executes fast as expected
• The delete takes hours until we cancelled it
• The execution plan shows that the delete statement is combining 20k * 20k = 200 million rows

The configuration is

{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key=['version_code','bu_agg']
    )
}}

The delete statement is assuming that the key is really unique. That probably is not the case if you are using "delete+insert".

delete from big_table
            using big_table__dbt_tmp
            where (
                    big_table__dbt_tmp.version_code = big_table.version_code
                    and big_table__dbt_tmp.bu_agg = big_table.bu_agg

image

Expected Behavior

The delete statement should not assume the unique key.
My suggestion is to use :

delete from big_table
            where (version_code ,bu_agg) in (
                  select version_code ,bu_agg from big_table__dbt_tmp
              )

Steps To Reproduce

• Setup the model as incremental , delete+insert
• The unique_key is an array of 2 columns
• Build the model
• Check the execution plan

Relevant log output

No response

Environment

- dbt:Cloud 1.7

Which database adapter are you using with dbt?

No response

Additional Context

No response

@nfarinha nfarinha added type:bug Something isn't working as documented triage:product In Product's queue labels Dec 13, 2023
@github-actions github-actions bot changed the title [Bug] unique_key list incremental model has performance issues on the delete phase [CT-3493] [Bug] unique_key list incremental model has performance issues on the delete phase Dec 13, 2023
@dbeatty10 dbeatty10 self-assigned this Dec 13, 2023
@dbeatty10
Copy link
Contributor

Thanks for reporting this @nfarinha !

Based on your screenshots of the query plan, it looks like you are using dbt-snowflake. Let's dive into the details.

Non-unique unique_key

The delete statement is assuming that the key is really unique. That probably is not the case if you are using "delete+insert".

The folks that have contributed to dbt-labs/docs.getdbt.com#4355 are saying something similar about use-cases for the delete+insert incremental strategy.

Our documentation currently states that unique_key is expected to be truly unique here, but then it also gives guidance on using delete+insert within dbt-snowflake when it isn't truly unique.

For the sake of this discussion, let's assume that we want to support the use case of non-unique keys with delete+insert specifically, but expecting truly unique keys with the other incremental strategies.

The delete+insert strategy in dbt-snowflake

The delete+insert strategy in dbt-snowflake defers to the implementation of default__get_delete_insert_merge_sql within dbt-core).

The delete+insert strategy in dbt-core

It only does a delete portion if unique_key is defined, otherwise, it only does an insert.

When it does a delete, it has two different code paths introduced in dbt-labs/dbt-core#4858 depending on if unique_key is a list or not.

  1. If unique_key is a list:

https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L65-L77

  1. If unique_key is a single column:

https://github.com/dbt-labs/dbt-core/blob/c2bc2f009bbeeb46b3c69d082ab4d485597898af/core/dbt/include/global_project/macros/materializations/models/incremental/merge.sql#L79-L89

Your suggestion

Your suggestion is to use:

delete from big_table
            where (version_code ,bu_agg) in (
                  select version_code ,bu_agg from big_table__dbt_tmp
              )

Your suggestion looks really close to that 2nd code path!

So it would essentially involve eliminating the 1st code path in favor of the 2nd (with some small tweaks, of course).

Summary

To the best I can tell, the current logic is expected to give correct results on tiny data sets but it quickly runs into severe performance issues with normal-sized data sets.

We'd need to try out your suggestion with a similar-sized data set (20K pre-existing x 20K updates) to see if it executes in a reasonable amount of time.

@dbeatty10 dbeatty10 removed the triage:product In Product's queue label Dec 13, 2023
@dbeatty10 dbeatty10 removed their assignment Dec 13, 2023
@nfarinha
Copy link
Author

nfarinha commented Dec 13, 2023 via email

@ataft
Copy link

ataft commented Jan 25, 2024

I can absolutely confirm that the current multi-column insert+delete query strategy is beyond inefficient. My tests never finish, even with small amounts of data (~100K rows). The proposed strategy appears like the best approach:

delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)

Some initial/quick testing appears to be faster with distinct in there.

Any ideas on when this fix might be applied?

@ataft
Copy link

ataft commented Jan 25, 2024

I had to rewrite this macro for an urgent use-case, so here's the improved code (faster and cleaner):

{% macro default__get_delete_insert_merge_sql(target, source, unique_key, dest_columns, incremental_predicates) -%}

    {%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

    {% if unique_key %}
        {% if unique_key is string %}
        {% set unique_key = [unique_key] %}
        {% endif %}

        {%- set unique_key_str = unique_key|join(', ') -%}

        delete from {{ target }}
        where ({{ unique_key_str }}) in (
            select distinct {{ unique_key_str }}
            from {{ source }}
        )
        {%- if incremental_predicates %}
            {% for predicate in incremental_predicates %}
                and {{ predicate }}
            {% endfor %}
        {%- endif -%};

    {% endif %}

    insert into {{ target }} ({{ dest_cols_csv }})
    (
        select {{ dest_cols_csv }}
        from {{ source }}
    )

{%- endmacro %}

@dbeatty10 dbeatty10 added the feature:performance Issues related to improving dbt's performance label Apr 8, 2024
@dataders dataders transferred this issue from dbt-labs/dbt-core Apr 10, 2024
ataft added a commit to ataft/dbt-adapters that referenced this issue Apr 10, 2024
resolves dbt-labs#150

Problem
The delete query for the 'delete+insert' incremental_strategy with 2+ unique_key columns is VERY inefficient. In many cases, it will hang and never return for deleting small amounts of data (<100K rows).

Solution
Improve the query by switching to a much more efficient delete strategy:

```
delete from table1
where (col1, col2) in (
    select distinct col1, col2 from table1_tmp
)
```
@ataft ataft linked a pull request Apr 10, 2024 that will close this issue
4 tasks
@Fleid Fleid added triage:product In Product's queue tracking_pr and removed triage:product In Product's queue labels Apr 11, 2024
@dbeatty10 dbeatty10 added the feature:incremental Issues related to incremental materializations label May 30, 2024
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale Mark an issue or PR as stale, to be closed label Nov 27, 2024
@b-per b-per removed the Stale Mark an issue or PR as stale, to be closed label Nov 27, 2024
@b-per
Copy link
Contributor

b-per commented Nov 27, 2024

Removing the stale label as this is still relevant

@tsrp25
Copy link

tsrp25 commented Nov 28, 2024

The same issue presists in DBT cloud, but DBT support says that they are not planning to change delete + insert approach anytime in future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature:incremental Issues related to incremental materializations feature:performance Issues related to improving dbt's performance type:bug Something isn't working as documented
Projects
None yet
7 participants