-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automating Non Regression Test : How to get a ref() to the deferred version of the selected model ? #2740
Comments
Interesting idea, @fabrice-etanchaud! Based on how we've defined That said, I agree that this would be really compelling. The foundational work we've done around artifact/state comparison—the surface of which we've just barely scratched—could totally make something like this possible in the future. Maybe we add a "defer" argument to |
Hi @jtcohen6 , currently I am using the macro below to check for non regression. Best, {% macro non_regression_test(unmanaged_model, managed_model_database, managed_model_schema, managed_model_identifier, pk) %}
-- depends_on: {{ ref(unmanaged_model) }}
{% set unmanaged_model = ref(unmanaged_model) %}
{% set managed_model = api.Relation.create(database=managed_model_database, schema=managed_model_schema, identifier=managed_model_identifier) %}
{% set managed_cols = adapter.get_columns_in_relation(managed_model) %}
{% set managed_colnames = managed_cols | map(attribute='name') | list %}
{% set unmanaged_cols = adapter.get_columns_in_relation(unmanaged_model) %}
{% set unmanaged_colnames = unmanaged_cols | map(attribute='name') | list %}
{% set pk_cols = managed_cols | selectattr('name', 'in', pk) | list %}
{% set pk_colnames = pk_cols | map(attribute='name') | list %}
{% set nrt_cols = unmanaged_cols | rejectattr('name', 'in', pk) | selectattr('name', 'in', managed_colnames) | list %}
{% set nrt_colnames = nrt_cols | map(attribute='name') | list %}
select
md5(concat({{ columns_csv('managed', pk_cols, alias=false) }})) as tnr_id
,{{ columns_csv('managed', pk_cols) }}
,'difference' AS change_type
,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged') }}
,{{ columns_csv('managed', nrt_cols, prefix='managed') }}
from {{ managed_model }} as managed
join {{ unmanaged_model }} as unmanaged
on ({{ join_on_columns('managed', 'unmanaged', pk_cols) }})
where {{ columns_inequality('managed', 'unmanaged', nrt_cols) }}
union all
select
md5(concat({{ columns_csv('managed', pk_cols, alias=false) }}))
,{{ columns_csv('managed', pk_cols, alias=false) }}
,'missing'
,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged', nullify=true, alias=false) }}
,{{ columns_csv('managed', nrt_cols, prefix='managed', alias=false) }}
from
{{ managed_model }} as managed
where not exists (
select 1
from {{ unmanaged_model }} as unmanaged
where ({{ join_on_columns('unmanaged', 'managed', pk_cols) }})
)
union all
select
md5(concat({{ columns_csv('unmanaged', pk_cols, alias=false) }}))
,{{ columns_csv('unmanaged', pk_cols, alias=false) }}
,'unexpected'
,{{ columns_csv('unmanaged', nrt_cols, prefix='unmanaged', alias=false) }}
,{{ columns_csv('managed', nrt_cols, prefix='managed', nullify=true, alias=false) }}
from
{{ unmanaged_model }} as unmanaged
where not exists (
select 1
from {{ managed_model }} as managed
where ({{ join_on_columns('managed', 'unmanaged', pk_cols) }})
)
{% endmacro %}
{%- macro columns_csv(model, columns, alias=true, prefix=none, nullify=false) -%}
{%- for column in columns -%}
cast(
{%- if not nullify -%}
{{ model }}.{{ column.name }}
{%- else -%}
null
{%- endif %} as {{ column.data_type }}){% if alias %} as
{%- if prefix is not none %} {{ prefix }}_{%- else %} {% endif -%}{{ column.name }}{% endif %}
{%- if not loop.last %}, {% endif -%}
{% endfor -%}
{%- endmacro -%}
{%- macro join_on_columns(model1, model2, columns) -%}
{%- for column in columns -%}
{{ model1 }}.{{ column.name }} = {{ model2 }}.{{ column.name }}
{%- if not loop.last %} and {% endif -%}
{%- endfor -%}
{%- endmacro -%}
{%- macro columns_inequality(model1, model2, columns) -%}
{%- for column in columns -%}
{{ model1 }}.{{ column.name }} is distinct from {{ model2 }}.{{ column.name }}
{% if not loop.last %} or {% endif -%}
{%- endfor -%}
{%- endmacro -%}
|
Okay, here's what I'm thinking about: A builtin It would be a close cousin of
What do you think @fabrice-etanchaud? |
Hello @jtcohen6 , this is why I really love dbt : it's not only efficient, but also elegant ! Thinking out loud, couldn't this help implement model schema awareness ?
|
I'm also interested in non-regression tests (new term to me, thanks @fabrice-etanchaud!) and followed the rabbit-hole to here. :-) During a The strength of the Going a little off-topic from
And the data test would look like:
The downside of this being that |
@boxysean I really like the connections coming out of this issue—I'm finding this problem space very generative, even though I'm not yet sold on any one solution in particular (including my hypothetical I too was thinking about mock data today, having spent some time this morning writing out #2857 and thinking about improved UX for Imagine that, for each {% macro ref(model_name) %}
{% if var('mock', false) %}
{% set rel = builtins.ref('mock__' ~ model_name) %}
{% else %}
{% set rel = builtins.ref(model_name) %}
{% endif %}
{% do return(rel) %}
{% endmacro %} Then Even more interesting would be overriding the {% macro source(source_name, table_name) %}
{% if var('mock', false) %}
{% set src = builtins.ref('mock__' ~ source_name ~ '__' ~ table_name) %}
{% elif var('mock_update', false) %}
{% set src = builtins.ref('mock_update__' ~ source_name ~ '__' ~ table_name) %}
{% else %}
{% set src = builtins.source(source_name, table_name) %}
{% endif %}
{% do return(src) %}
{% endmacro %} You could then easily run an entire DAG of transformations and tests on top of fixture data, including In my personal view, I think that's more trouble than it's worth at most organizations—automatically generating realistic and up-to-date fake data is a massive undertaking—but I know that a rigorous and extensible unit testing framework is on the minds of several community members. Over in #2593, @bashyroger kindly linked me to a quite thoughtful post about what's missing from dbt testing today, and I've been mulling it over for the past month. In particular:
I think that piece would be possible with the custom ref/source override above. The requirement would be having mock data CSVs that are stable and stored in the Speaking of interesting connections, @fabrice-etanchaud, I'd be curious to hear more about how you see this tying into schema awareness. Is it the ability to compare |
@jtcohen6 , you are right, testing that schema is not evolving between prod and dev is not an model "invariant" test, but an ad hoc test. There are discussions elsewhere on how to build a schema repository, that's for sure the way to go. |
I have been playing around with a way to automate this and I have a working concept here: https://github.com/jmriego/dbt-bdd The tests are run with
The main trick is that the phrase " Then, it will replace all Before spending too much time on this, what do you all think of this approach? Does it make sense to continue in this direction? |
@jmriego This looks cool! We played around with using Out of curiosity, why the need for the scenario ID prefix? We do something very similar for dbt integration tests, so as to reuse the same set of models while avoiding database namespace collisions across scenarios or testing invocations. If that's your reason as well, could you simply pass Taking a step back, it would be great to move this line of conversation over to #2354, or into discourse. I think there's a lot of folks who are interested in the broader concept of unit testing / fixed data testing with dbt, and currently we've got various thoughts spread across long threads, tucked under niche feature requests. |
thanks @jtcohen6 ! I'll continue the conversation in those two places. It would be great to have dbt have this kind of capability |
@jmriego I think a custom {% macro generate_alias_name(custom_alias_name=none, node=none) -%}
{% set aliased = custom_alias_name | trim if custom_alias_name else node.name %}
{% set prefix = var('scenario') ~ '_' if var('scenario') else '' %}
{% set identifier = prefix ~ aliased %}
{{ return(identifier) }}
{%- endmacro %} |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers. |
Describe the feature
Based on the new --defer feature, one would like to automate non regression test (NRT), comparing local model data with the same deferred model data. How one could get the ref() of the deferred model to be given to a NRT macro comparing the common columns on the common primary key values ?
Describe alternatives you've considered
No alternative so far
Additional context
Who will this benefit?
All dbt users who want to automate non regression testing.
Are you interested in contributing this feature?
Yes !
The text was updated successfully, but these errors were encountered: