Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3121] [Implementation] get schema of inputs prior to executing unit test command #8649

Closed
1 task done
Tracked by #8283
graciegoheen opened this issue Sep 14, 2023 · 8 comments
Closed
1 task done
Tracked by #8283
Labels
enhancement New feature or request user docs [docs.getdbt.com] Needs better documentation

Comments

@graciegoheen
Copy link
Contributor

graciegoheen commented Sep 14, 2023

Housekeeping

  • I am a maintainer of dbt-core

Short description

Instead of requiring folks to run dbt run --select +my_model --exclude my_model before executing unit tests on my_model or requiring the use of --deferral (which would cause issues in a CI context when you're deferring to a production environment and may have changes to your inputs as part of the PR)...

we should be able to run dbt unit-test --select my_model in a new environment without having previously run dbt run --select +my_model --exclude my_model or dbt docs generate

Acceptance criteria

  • I can run dbt unit-test --select my_model in a new environment without having previously run dbt run --select +my_model --exclude my_model or dbt docs generate

Impact to Other Teams

None

Will backports be required?

No

Context

This is how the fixture sql is currently being generated, using the adapter.get_columns_in_relation to dynamically query the existing input table (or view) during input rendering: https://github.com/dbt-labs/dbt-core/blob/unit_testing_feature_branch/core/dbt/include/global_project/macros/unit_test_sql/get_fixture_sql.sql#L6-L9

@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label Sep 14, 2023
@github-actions github-actions bot changed the title [unit testing] get schema of inputs prior to executing unit test command [CT-3121] [unit testing] get schema of inputs prior to executing unit test command Sep 14, 2023
@graciegoheen graciegoheen added the enhancement New feature or request label Sep 14, 2023
@graciegoheen graciegoheen changed the title [CT-3121] [unit testing] get schema of inputs prior to executing unit test command [CT-3121] [Implementation] get schema of inputs prior to executing unit test command Sep 14, 2023
@graciegoheen
Copy link
Contributor Author

graciegoheen commented Sep 18, 2023

Idea 1

If the datatype is explicitly provided, let's use that:

  • data contracts
  • allow data type definition (similar to seeds) for fixtures & inputs (in yml, only relevant columns)

If there's no indication, we "guess" (similar to what we do for seeds - agate library; or whatever the data warehouse would guess for the inputed data).

But!! If you don't provide us anything (columns that you don't care about), we can't guess.

What if we said:

  • use contracts if you don't want to have to supply data for all of your columns
  • other wise supply mock data for all of your columns (or use NULL strings)

Future aspirations:

  • column level lineage propagated from sources (not currently possible)????

Idea 2

Can we use SQL parsing to get the types? With the schemas for sources from source freshness?

Idea 3

  • Cast everything as string by default, raise a warning
  • If the test fails, we show in the diffs the CAST-ing
  • Only way to get accurate data types is via explicit contracts for direct parents

Idea 4

What if we could somehow "clean" the SQL to only select inputted columns?

Idea 5

Can we run the catalog queries for the direct parents of the model selected for the unit test command and make those available in the unit testing jinja context for rendering inputs and expected outputs SQL? We could make use of the applied state work to filter catalog queries to only run the catalog queries for the relevant models #8648

But catalog queries will fail if you haven't previously built the relevant models...

@graciegoheen
Copy link
Contributor Author

graciegoheen commented Sep 19, 2023

Idea 6

cc: @jtcohen6

dbt ci command that in DAG order:

  • runs unit tests
  • builds models
  • runs data tests

Provided we:

  • include deferral
  • only selects the state:modified+ models
  • have retry for CI
  • with an optional --sample flag to limit your data
  • could exclude a piece of this ("don't run my data tests")

This is just dbt build + unit tests, is it better to just do dbt build --unit-tests

However - thing that you wouldn't be able to do:
modify a bunch of models in dev (specifically - adding/removing/retyping columns)
run a single unit test in the middle of several of them
without materializing any of them (not even sampled)

@graciegoheen
Copy link
Contributor Author

Idea 7

Can we

  • use deferral for data types of most of the models
  • and build a temporary --sample 0 of all +modified models to get the data types of the relevant changed models (and then drop them when we're done)

@graciegoheen
Copy link
Contributor Author

graciegoheen commented Sep 28, 2023

Our requirement -> "We need to know the data types of my current logical state (for direct parents of our unit tested models)"
This is because we want:

  • to be able to run unit testing in an empty warehouse (as the first step of a CI job, or in an empty development schema to save warehouse $$)
  • to only have to provide mock input data for the columns we care about (rather than all of the columns of our inputs)

Possible solutions:

  1. you give us mock data for all of your columns, and we let the warehouse "guess" their data types
  2. we use deferral + create temp tables for all of your +modified models to get the full picture of data types for your relevant models (maybe we "cache" to save time)
  3. we only support running unit tests in the context of running your dag (so we require that upstream models have been materialized before we run your unit test)
  4. we use column level lineage, SQL parsing from sources to get the data types of your inputs
  5. deferring to a catalog with your data types, instead of a manifest -> and require that you define the data types for all of your columns in yml (this would be a significant amount of yml that you would have to add to every single model in your project that's a direct parent of a model you want to add a unit test to)

@graciegoheen
Copy link
Contributor Author

What's the simplest thing we could do here:

  • we will use the data_types defined in the direct parent's yml config
  • if no data_type is provided...
    • we will have the warehouse "presume" the data type based on the mock data provided (users could put cast statements in their fixture)
    • if no mock data is provided for a column, we will guess that the data type is a string
    • raise a warning during unit test execution if an input fixture specifies columns that don’t have user-provided data_type values.

We will need to confirm when we should add '' around the mocked data.

Let's see how far this gets us, and if this is "good enough" or if we need to take one of the more complicated approaches listed above.

@graciegoheen
Copy link
Contributor Author

graciegoheen commented Oct 3, 2023

A few follow up questions:

  • Would we still have to know what all of the columns are? So would we be requiring that they also list all of the column names in the direct parent's yml config? That seems cumbersome.

    • response: we should be able to get this information from adapter.get_column_schema_from_query without requiring users to manually input
  • Does Snowflake require a type-d NULL?

  • What if someone puts the wrong data_type of column name, but the contract isn't enforced? The unit test could pass, but the build could fail...

  • Should we offer multiple approaches for those that don't want us to build "temp" tables in their warehouse?

  • Let's test running a large dbt project but replacing ref with (select * from ref limit 0) and see how long it takes

@graciegoheen
Copy link
Contributor Author

From a discussion with @dbeatty10 and @jtcohen6:

What if we supply folks 3 main options:

Option 1: supply mock data for all of your referenced columns
Option 2: supply mock data for only your relevant columns, build necessary upstream tables using "sample" mode / dry run to limit resluts
Option 3: dbt build dag order, run unit test of model before model is materialized

@graciegoheen
Copy link
Contributor Author

graciegoheen commented Nov 1, 2023

@graciegoheen graciegoheen closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

No branches or pull requests

1 participant