Doc (and potentially, Test) Inheritance #6527

balmasi · 2021-01-07T21:43:28Z

balmasi
Jan 7, 2021

Describe the feature

I'm a big believer in the power of documentation, and I love dbt's doc generation and testing, however, there is still a massive amount of manual work that goes into maintaining docs and tests, especially if you have many layers, where many of the column descriptions aren't actually changing.

I'm finding I'm consistently spending 70% of my development time just wiring up documentation through doc blocks. Same for tests, if you want to test every layer. At least upstream tests can catch some data issues, but you can't really do the same with docs without reading the source code SQL.

There are 3 cases that have to be accounted for with documentation (and tests, in fact):

The names of the columns have not changed from upstream models
There has been a renaming of a column compared to an upstream model
There has been a meaningful transformation or additional column, compared to an upstream model

One can either address each of those concerns incrementally or come up with a solution that solves all 3.

Ideal Solution

In an ideal world, issues 1 and 2 could be solved if dbt could parse the SQL in all models using something like sqlparse to get true field-level lineage.

I know your first thought is probably "What about all the different dialects?" but I would say to that, we don't need to parse everything. We only need the SELECT [fields] and AS in the bottom-most expression. In fact, it might even be possible with regex, and is most likely the same across all dialects.

This could

Inherit any docs in Case 1 or Case 2.
Optionally Scaffold the description yml keys for Case 3, to be filled in by the user (can be broken out to a separate feature)

I know this is a huge ask and involves major work, which is why I'm gonna put forward an incremental solution as well

Incremental Solution

An easier-to-implement, but not as valuable solution would be to make this config-driven on a field by field basis. You still have to write quite a bit of boilerplate, but at least you're not copy-pasting (which leads to inconsistencies), or spinning up hundreds of model.md files to make doc block references (still very error-prone to get the doc block names right)

This might be something like:

# schema.yml
version: 2

models:
  - name: model_2
    description: "Unique description for model 2"
    columns:
      - name: col2
        description:
           inherit: upstream1 # Case 1 - No change in column name, so we just give the upstream model name
        tests:  # Put this here for demonstration, cause it was juicy :) 
           inherit: upstream1

      - name: col1
        description:
           inherit:  src1.column_one  # Case 2 - Col name was renamed

      - name: col3
        description: "This is brand new documentation" # Case 3 - Meaningful transformation or addition in this model

Alternative syntax might be:

# schema.yml
version: 2

models:
  - name: model_2
    description: "Unique description for model 2"
    columns:
      - name: col2
        description: '{{ doc_inherit(ref("upstream1") }}' # Case 1 - No change in column name, so we exclude the second argument

      - name: col1
        description: '{{ doc_inherit(source("src1"), "column_one"  }}'  # Case 2 - Col name was renamed

      - name: col3
        description: "This is brand new documentation" # Case 3 - Meaningful transformation or addition in this model

Describe alternatives you've considered

Currently in order to solve this, I'm defining a new .md file for every model and using a doc reference in my schema.yml files. In order to keep things DRY and keep me sane, instead of referring to the immediate parent in every model, I refer to the ancestors (grandparents) of the model if possible.

For example, if source1 -> model1 -> model2 all share a field name , model2 refers to the doc of source1 instead of model1. This way, I'm only changing descriptions in 1 place if the need arises.

Who will this benefit?

I think implementing the Ideal Solutoin would have a major impact not only for speeding up documentation and keeping it up-to-date, but also allows you to apply the same strategy to tests, which is huge.

In the Incremental Solution, I'd say this would save me about 30-40% of my time for developing each feature. This is because I don't have to make additional doc.md files and don't have to worry about getting the doc() references correct, which can be a nightmare in a complex project.

I think if documentation (and maybe tests) are your thing, this solution would pave the way for saving tens of thousands of dollars over the life of a project in time saved.

Are you interested in contributing this feature?

Maybe if I had a better idea of how dbt works under the hood based on a guide or something.

balmasi · 2021-01-07T21:54:02Z

balmasi
Jan 7, 2021
Author

Just realized the incremental solution is very close to #1158

0 replies

jtcohen6 · 2021-01-11T12:32:19Z

jtcohen6
Jan 11, 2021
Maintainer

@balmasi Thanks for this detailed writeup—I agree with a lot of what you've got to say here!

Ideal Solution

I think this is what we're talking about when we're talking about column-level lineage, which is a sizzling topic these days. It would enable:

dbt to know a column's properties (description, tests, meta, tags, etc) based on its upstream antecedents
You to ask and answer questions about how one column moves through the DAG

I think you're right on in outlining the prerequisite here:

if dbt could parse the SQL in all models... We only need the SELECT [fields] and AS in the bottom-most expression

I think column-level lineage is the real payoff of investing in SQL linting / auto-formatting (#2356). It's a big part of the excitement around projects like sqlfluff. The obvious (though hardly trivial!) short-term benefit is better-formatted code. The long game is that, by establishing and validating a grammar of SQL dialects, we could simply examine the compiled SQL of all the models in a project to know exactly how a given column moves through it. This won't be without significant complexity; the distinctions between parsed vs. compiled vs. executed code would need to be a lot clearer.

Incremental Solution

I'm glad you came across #1158, which I've long thought of as "YAML anchors across files." Today, it's possible to do this at a basic level, using standard YAML anchors, within one .yml file:

version: 2

models:
  - name: model_1
    description: "Unique description for model 1"
    columns:
      - &col1
        name: col1
        description: "Unique description for column 1"
      - &col2
        name: col2
        description: "Unique description for column 2"
        tests:
          - unique
          - not_null

  - name: model_2
    description: "Unique description for model 2"
    columns:
      - <<: *col1
        name: column_one  # renamed
      - *col2  # inherits name, description, tests from model_1
      - name: col3
        description: "This is brand new documentation"

I see your doc_inherit() and the extends: proposal in the same vein here: a YAML anchor that can reference/alias/extend config defined in a different .yml file, elsewhere in the project.

0 replies

balmasi · 2021-01-13T01:31:44Z

balmasi
Jan 13, 2021
Author

Wow thanks for the anchor tip. Didn't know I could do that.

I realised I was ridiculously naive about the difficulty of parsing sql from scratch 😅 takesy backsies

0 replies

jtcohen6 · 2021-02-01T11:04:49Z

jtcohen6
Feb 1, 2021
Maintainer

Regarding this idea:

a YAML anchor that can reference/alias/extend config defined in a different .yml file, elsewhere in the project.

See also the discussion in #1790, which proposes that we use a PyYAML (though not pure-YAML) construct to enable external file linking via an !include constructor.

0 replies

pcasteran · 2021-04-05T21:15:47Z

pcasteran
Apr 5, 2021

I like the alternative syntax proposed by @balmasi. Until we have full support for column-level lineage, it allows to explicitly specify the column dependencies.

As I was looking for a way not to repeat the columns' documentation between all my models layers, I came to an almost identical solution (ref and source functions are not available in the templating context used for the documentation).

The idea is to add the inherited_doc method to the DocsRuntimeContext class (dbt/context/docs.py):

    @contextmember
    def inherited_doc(self, **kwargs: Dict[str, Any]) -> str:
        """The `inherited_doc` function is used to reference another column's
        description. The column's description resolution is transitive,
        i.e. the referenced column's descritpion can also use the
        `inherited_doc` function.
        """

        # Retrieve the referenced node (model or source).
        node = None
        if "model" in kwargs:
            package_name = kwargs.get("package")
            model_name = kwargs["model"]
            node = self.manifest.resolve_ref(
                model_name,
                package_name,
                self._project_name,
                self.node.package_name,
            )
            if not node:
                raise_compiler_error(
                    "inherited_doc() : unable to find the referenced model : package=[{}], model=[{}]"
                        .format(package_name, model_name),
                    self.node
                )
        elif "source" in kwargs and "table" in kwargs:
            source_name = kwargs["source"]
            table_name = kwargs["table"]
            node = self.manifest.resolve_source(
                source_name,
                table_name,
                self._project_name,
                self.node.package_name,
            )
            if not node:
                raise_compiler_error(
                    "inherited_doc() : unable to find the referenced source : source=[{}], table=[{}]"
                        .format(source_name, table_name),
                    self.node
                )
        else:
            raise_compiler_error("Invalid use of the inherited_doc() function.", self.node)

        # Check if the column exists.
        if not "column" in kwargs:
            raise_compiler_error("Invalid use of the inherited_doc() function.", self.node)

        column_name = kwargs["column"]
        if not column_name in node.columns:
            raise_compiler_error("inherited_doc() : column '{}' doesn't not exist in the referenced model/source.".format(column_name), self.node)

        # Perform the transitive resolution of the referenced column's description.
        column = node.columns[column_name]
        try:
            description = get_rendered(column.description, self._ctx)
        except RecursionError:
            raise_compiler_error("inherited_doc() : circular dependencies encountered while resolving the description of column '{}'.".format(column_name),
                                 self.node)

        return description

It can then be used in the YAML files, for example:

version: 2
models:
  - name: my_model
    description: Hello there
    columns:
      - name: id
        # Inheriting the description from another model.
        description: "{{ inherited_doc(model='my_other_model', column='id') }}"
      - name: name
        # Inheriting the description from a source.
        description: "{{ inherited_doc(source='my_source', table='my_table', column='name') }}"

It works fine for my project, but it has not been extensively tested (for example, models from another package).

Also, it correctly handles the transitive dependencies but without much intelligence about circular one : it just catches the RecursionError and raise a detailed dbt error.

What do you think about it @jtcohen6, is that something worth contributing ? I guess, it could easily be adapted to inherit column's tests.

0 replies

jtcohen6 · 2021-04-06T15:53:15Z

jtcohen6
Apr 6, 2021
Maintainer

@pcasteran Whoa, very cool! Nice job getting this to work in your project.

It does feel like we're at a fork in the road, given the options described above (and one bonus):

Should we super-charge the doc Jinja construct, as you've done with inherited_doc? This would require us to move toward a fuller dependency graph of description field inheritance. I'm not sure if/how this could work for inheriting multiple properties, i.e. docs and also tests.
Should we enable cross-file YAML anchors? Today, anchors make it possible to inherit and "extend" the full properties of one model/source/etc into another, including descriptions, tests, tags, and everything else. This would simply remove the limitation of using them cross-file.
There's a big effort we want to take a swing at before v1.0: reconciling node configs and resource properties (Set configs in schema.yml files #2401). If we succeed, it's conceivable that you could hierarchically define config-properties in dbt_project.yml, to apply to all models in a subfolder. That's a quite different approach from property inheritance/extension, and it likely makes more sense for some properties (meta) than for others (descriptions, tests, etc), but it's another compelling way to avoid repeating yourself.

At present, I'm leaning toward 2 + 3:

inheritance/extension via YAML anchor (model_a → model_b), with cross-file capabilities (new! we'll need to figure out the right approach here, a la Be able to !include external yaml files #1790)
hierarchical property configurations by path (dbt_project.yml → models/subfolder)

Those wouldn't give us the ability to build a graph of doc block inheritance—if that's something we want, we could open a separate issue for it (similar to #2860).

0 replies

pcasteran · 2021-04-07T21:00:59Z

pcasteran
Apr 7, 2021

Thanks for the answer @jtcohen6.
I understand not preferring to integrate the code I proposed that, albeit working for inheriting the documentation, is a partial solution regarding everything we could need to reuse between models.

Maybe a less intrusive and even more generic way to solve this problem would be to allow using macros in the model documentation. Currently only the doc function is put in the Jinja context used for the doc generation but if we could use any macro we could probably achieve description inheritance and many other great things.
I think it would be useful to allow using macros in the documentation, even if you decide to implement options 2 & 3 as you proposed.

I'm thinking of a simple macro like this one, that could be distributed in the dbt_utils package:

{% macro inherited_doc_macro(package, model, column) %}
  {{- log("inherited_doc_macro: package=[" ~ package ~ "], model=[" ~ model ~ "], column=[" ~ column ~ "]", True) -}}

  {%- if not package -%}
    {%- set package = project_name -%}
  {%- endif -%}
  {%- set model_id = "model." ~ package ~ "." ~ model -%}

  {#- Impossible to directly access the dict as per the documentation : using `graph.nodes[model_id]` directly produces an error. -#}
  {#- We have to iterate on the dict keys, maybe a recent regression ? :( -#}
  {%- for node in graph.nodes  -%}
    {%- if node == model_id -%}
      {{- graph.nodes[model_id].columns[column]['description'] -}}
    {%- endif -%}
  {%- endfor -%}

{%- endmacro -%}

Do you know if it is acceptable and feasible to add the project macros to the Jinja context used for the doc generation?
If yes, can you provide some pointers on how to do that so I can try if it works ?

0 replies

saraleon1 · 2022-02-17T10:24:37Z

saraleon1
Feb 17, 2022
Collaborator

+1 from a dbt cloud customer via dbt support

I am working on documenting my organization's code
So I was thinking of a way to do it at it's source and then it will flow to all my other tables
So I want to reduce the manual amount of work that needs to be done

I suggested they try using docs blocks to have one single source of truth to then reference elsewhere, but a more automated way of doing this would still be preferable for them

0 replies

ghost · 2022-02-18T16:54:22Z

ghost
Feb 18, 2022

+1 would love to see an elegant inheritance construct especially if we're moving toward column level lineage.

0 replies

fernandobrito · 2022-02-18T19:37:38Z

fernandobrito
Feb 18, 2022

Sharing our use-case here in case it might spark some more ideas from the community around this feature.

At my organization, I just developed a Python package to try to solve some of the issues described by the original poster (avoid duplicating documentation on a multi-layered data warehouse architecture). My approach was to develop an external Python package that parses the manifest and catalog and propagates the column documentation in case a column is not documented in a model but an upstream model has a column with the same name that is documented. The new inherited column descriptions are then written in a new manifest that we use in our "dbt docs release pipeline".

Some of the features we implemented and the rationale behind:

We decided to start with an opt-out approach. By default, any column that is not documented will have its description inherited if an upstream column with the same name is documented.
To reduce confusion on false positives (misleading descriptions that were inherited), we append a text like "[documentation propagated from XXXXX]" with a link to the parent model. This helps our dbt docs users, but also our analysts that actually write the documentation. Since we recursively propagate the descriptions, it might be that the original description was written in a model 4-levels up in the dependency tree. With this text, they know exactly where to find the original description if they want to update it
As we are doing it automatically in an opt-out fashion, we have a list of "blocklist" column names that we never propagate because they might be too generic (id, created_at, etc).
As the package reads both from the catalog and from the manifest, even columns that exist in the database but have not been described at all in schema.yml files (and thus, are not in the dbt manifest) are covered
We implement column renaming/aliasing (as suggested by the original poster) by using dbt meta tags on the columns. A column named is_active can have a meta tag stating "also inherit from upstream columns named active".
In case a column has 2 or more upstream models where the documentation can be inherited, all the unique descriptions will be concatenated in the description. As we are adding a "[propagated from ...]", we think this won't confuse our users too much
Part of the output of the script's execution (besides overwriting the manifest.json) is a list of which columns are currently not documented but would be propagated to most places if they were. This helps our analysts to add documentation to places where it might bring more value. Also, it shows how many columns in the entire project are documented, how many are not documented, and how many were propagated.

Here is a simple example, taken from a sample dbt project used as part of the test suite of the package:

We will actually release this into production next week. If there's interest from the others, I can try to come back here after a few weeks with our learnings. Also if there's enough interest (maybe by reacting to this comment?), I can try to convince management to let me spend time writing a blog post and open-sourcing this project.

4 replies

thekushpatel Jan 19, 2023

Very curious on how this worked out for you 👀 been thinking of how we can implement exactly this. If you have the bandwidth for a follow up to this post, would love to hear more about the approach you took.

alittlesliceoftom Jun 8, 2023

Seconded, would love to use this as a package.

omer-ginosar Jun 11, 2023

A package would be awesome!

fernandobrito Oct 4, 2023

Hi! It took me a long time, but I'm happy to announce that I was able to open-source the project above 🎉. We named it "dbt documentation propagation" and it is available in our Python package called "dbt-toolkit", together with other small utilities.

You can find more information about it here: https://github.com/voi-oss/dbt-toolkit/blob/main/src/dbttoolkit/documentation/README.md

ghost · 2022-02-18T20:34:34Z

ghost
Feb 18, 2022

@fernandobrito That would be fantastic if you're able to package it up and share. I was just starting to brainstorm how to write something similar.

0 replies

ducchetrongminh · 2022-07-08T04:18:09Z

ducchetrongminh
Jul 8, 2022

Hi guys. I created a quick python script to generate docs block from yml files. This script is independent with dbt manifest file, since dbt compile won't run if the docs block is not found.
https://gist.github.com/ducchetrongminh/c494d867feec925a5b59515714778279

Copy this file to your repo. Your workflow will change a little bit, but it saves time than creating md or copy your docs
python -m path.to.dbt_docsblock_autogenerator && dbt docs generate
(or dbt compile)

Then you just need to reference the docs block

Or even add an extra doc. Cool?

Hope this can save your time. Happy documenting :D

0 replies

2023-01-05T01:59:53Z

github-actions[bot]
bot Jan 5, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

0 replies

jtcohen6 · 2023-01-05T13:43:01Z

jtcohen6
Jan 5, 2023
Maintainer

Going to convert this to a discussion (I thought it already was one!)

0 replies

saraleon1 · 2023-01-24T10:49:30Z

saraleon1
Jan 24, 2023
Collaborator

+1 dbt Cloud enterprise customer - 'We have 4-5 layers of models with the same columns throughout and these fields maintain the same business meaning; Without an automated way of column description inheritance docs are hard to maintain. As a workaround docs blocks is still complex to maintain. Ideally we want a sort of ref function to upstream model column descriptions and the ability to confirm it’s the same field.'

0 replies

alison985 · 2023-07-07T20:28:11Z

alison985
Jul 7, 2023

I want to cross-reference two other topics to this thread.

a feature request to map a single dbt entity to multiple databases [CT-2149] [Feature] Add the possibility to map a single dbt entity to multiple databases #7021
a feature request to support identifier at a column level for sources [CT-2075] [Feature] Add column aliasing (identifiers) for source tables #6929

All of these are about re-using YML to enable more DRYness.

0 replies

SeppBerkner · 2024-02-29T11:04:37Z

SeppBerkner
Feb 29, 2024

As I understood in recent discussions this feature is implemented, but only for DBT Cloud users? If so, is there any plan to make it available to DBT Core?

3 replies

larsrinn Apr 17, 2024

Where did you find information on this being implemented for DBT cloud users?

alison985 Apr 24, 2024

@SeppBerkner I'd also like to know. Thanks in advance!

pranasziaukas Jul 2, 2024

Maybe referring to https://www.getdbt.com/blog/proactively-improve-your-dbt-projects-with-new-dbt-explorer-features#column-level-lineage

ismailsimsek · 2024-11-28T21:41:08Z

ismailsimsek
Nov 28, 2024

Hi all, want to leave link to opendbt project here. Where you can easily add this kind of features to dbt-core in an elegant way. Feel free to check it out and add the features you like to see in dbt-core

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doc (and potentially, Test) Inheritance #6527

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 18 comments 7 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Doc (and potentially, Test) Inheritance #6527

Describe the feature

Ideal Solution

Incremental Solution

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Replies: 18 comments · 7 replies

balmasi Jan 7, 2021 Author

jtcohen6 Jan 11, 2021 Maintainer

Ideal Solution

Incremental Solution

balmasi Jan 13, 2021 Author

jtcohen6 Feb 1, 2021 Maintainer

jtcohen6 Apr 6, 2021 Maintainer

saraleon1 Feb 17, 2022 Collaborator

github-actions[bot] bot Jan 5, 2023

jtcohen6 Jan 5, 2023 Maintainer

saraleon1 Jan 24, 2023 Collaborator

Replies: 18 comments 7 replies

balmasi
Jan 7, 2021
Author

jtcohen6
Jan 11, 2021
Maintainer

balmasi
Jan 13, 2021
Author

jtcohen6
Feb 1, 2021
Maintainer

jtcohen6
Apr 6, 2021
Maintainer

saraleon1
Feb 17, 2022
Collaborator

github-actions[bot]
bot Jan 5, 2023

jtcohen6
Jan 5, 2023
Maintainer

saraleon1
Jan 24, 2023
Collaborator