Replies: 18 comments 7 replies
-
Just realized the incremental solution is very close to #1158 |
Beta Was this translation helpful? Give feedback.
-
@balmasi Thanks for this detailed writeup—I agree with a lot of what you've got to say here! Ideal SolutionI think this is what we're talking about when we're talking about column-level lineage, which is a sizzling topic these days. It would enable:
I think you're right on in outlining the prerequisite here:
I think column-level lineage is the real payoff of investing in SQL linting / auto-formatting (#2356). It's a big part of the excitement around projects like sqlfluff. The obvious (though hardly trivial!) short-term benefit is better-formatted code. The long game is that, by establishing and validating a grammar of SQL dialects, we could simply examine the compiled SQL of all the models in a project to know exactly how a given column moves through it. This won't be without significant complexity; the distinctions between parsed vs. compiled vs. executed code would need to be a lot clearer. Incremental SolutionI'm glad you came across #1158, which I've long thought of as "YAML anchors across files." Today, it's possible to do this at a basic level, using standard YAML anchors, within one version: 2
models:
- name: model_1
description: "Unique description for model 1"
columns:
- &col1
name: col1
description: "Unique description for column 1"
- &col2
name: col2
description: "Unique description for column 2"
tests:
- unique
- not_null
- name: model_2
description: "Unique description for model 2"
columns:
- <<: *col1
name: column_one # renamed
- *col2 # inherits name, description, tests from model_1
- name: col3
description: "This is brand new documentation" I see your |
Beta Was this translation helpful? Give feedback.
-
Wow thanks for the anchor tip. Didn't know I could do that. I realised I was ridiculously naive about the difficulty of parsing sql from scratch 😅 takesy backsies |
Beta Was this translation helpful? Give feedback.
-
Regarding this idea:
See also the discussion in #1790, which proposes that we use a PyYAML (though not pure-YAML) construct to enable external file linking via an |
Beta Was this translation helpful? Give feedback.
-
I like the alternative syntax proposed by @balmasi. Until we have full support for column-level lineage, it allows to explicitly specify the column dependencies. As I was looking for a way not to repeat the columns' documentation between all my models layers, I came to an almost identical solution ( The idea is to add the @contextmember
def inherited_doc(self, **kwargs: Dict[str, Any]) -> str:
"""The `inherited_doc` function is used to reference another column's
description. The column's description resolution is transitive,
i.e. the referenced column's descritpion can also use the
`inherited_doc` function.
"""
# Retrieve the referenced node (model or source).
node = None
if "model" in kwargs:
package_name = kwargs.get("package")
model_name = kwargs["model"]
node = self.manifest.resolve_ref(
model_name,
package_name,
self._project_name,
self.node.package_name,
)
if not node:
raise_compiler_error(
"inherited_doc() : unable to find the referenced model : package=[{}], model=[{}]"
.format(package_name, model_name),
self.node
)
elif "source" in kwargs and "table" in kwargs:
source_name = kwargs["source"]
table_name = kwargs["table"]
node = self.manifest.resolve_source(
source_name,
table_name,
self._project_name,
self.node.package_name,
)
if not node:
raise_compiler_error(
"inherited_doc() : unable to find the referenced source : source=[{}], table=[{}]"
.format(source_name, table_name),
self.node
)
else:
raise_compiler_error("Invalid use of the inherited_doc() function.", self.node)
# Check if the column exists.
if not "column" in kwargs:
raise_compiler_error("Invalid use of the inherited_doc() function.", self.node)
column_name = kwargs["column"]
if not column_name in node.columns:
raise_compiler_error("inherited_doc() : column '{}' doesn't not exist in the referenced model/source.".format(column_name), self.node)
# Perform the transitive resolution of the referenced column's description.
column = node.columns[column_name]
try:
description = get_rendered(column.description, self._ctx)
except RecursionError:
raise_compiler_error("inherited_doc() : circular dependencies encountered while resolving the description of column '{}'.".format(column_name),
self.node)
return description It can then be used in the YAML files, for example: version: 2
models:
- name: my_model
description: Hello there
columns:
- name: id
# Inheriting the description from another model.
description: "{{ inherited_doc(model='my_other_model', column='id') }}"
- name: name
# Inheriting the description from a source.
description: "{{ inherited_doc(source='my_source', table='my_table', column='name') }}"
It works fine for my project, but it has not been extensively tested (for example, models from another package). Also, it correctly handles the transitive dependencies but without much intelligence about circular one : it just catches the What do you think about it @jtcohen6, is that something worth contributing ? I guess, it could easily be adapted to inherit column's tests. |
Beta Was this translation helpful? Give feedback.
-
@pcasteran Whoa, very cool! Nice job getting this to work in your project. It does feel like we're at a fork in the road, given the options described above (and one bonus):
At present, I'm leaning toward 2 + 3:
Those wouldn't give us the ability to build a graph of |
Beta Was this translation helpful? Give feedback.
-
Thanks for the answer @jtcohen6. Maybe a less intrusive and even more generic way to solve this problem would be to allow using macros in the model documentation. Currently only the I'm thinking of a simple macro like this one, that could be distributed in the
Do you know if it is acceptable and feasible to add the project macros to the Jinja context used for the doc generation? |
Beta Was this translation helpful? Give feedback.
-
+1 from a dbt cloud customer via dbt support
I suggested they try using docs blocks to have one single source of truth to then reference elsewhere, but a more automated way of doing this would still be preferable for them |
Beta Was this translation helpful? Give feedback.
-
+1 would love to see an elegant inheritance construct especially if we're moving toward column level lineage. |
Beta Was this translation helpful? Give feedback.
-
Beta Was this translation helpful? Give feedback.
-
@fernandobrito That would be fantastic if you're able to package it up and share. I was just starting to brainstorm how to write something similar. |
Beta Was this translation helpful? Give feedback.
-
Hi guys. I created a quick python script to generate docs block from yml files. This script is independent with dbt manifest file, since Copy this file to your repo. Your workflow will change a little bit, but it saves time than creating md or copy your docs Then you just need to reference the docs block Or even add an extra doc. Cool? Hope this can save your time. Happy documenting :D |
Beta Was this translation helpful? Give feedback.
-
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Beta Was this translation helpful? Give feedback.
-
Going to convert this to a discussion (I thought it already was one!) |
Beta Was this translation helpful? Give feedback.
-
+1 dbt Cloud enterprise customer - 'We have 4-5 layers of models with the same columns throughout and these fields maintain the same business meaning; Without an automated way of column description inheritance docs are hard to maintain. As a workaround docs blocks is still complex to maintain. Ideally we want a sort of ref function to upstream model column descriptions and the ability to confirm it’s the same field.' |
Beta Was this translation helpful? Give feedback.
-
I want to cross-reference two other topics to this thread.
All of these are about re-using YML to enable more DRYness. |
Beta Was this translation helpful? Give feedback.
-
As I understood in recent discussions this feature is implemented, but only for DBT Cloud users? If so, is there any plan to make it available to DBT Core? |
Beta Was this translation helpful? Give feedback.
-
Hi all, want to leave link to opendbt project here. Where you can easily add this kind of features to dbt-core in an elegant way. Feel free to check it out and add the features you like to see in dbt-core |
Beta Was this translation helpful? Give feedback.
-
Describe the feature
I'm a big believer in the power of documentation, and I love dbt's doc generation and testing, however, there is still a massive amount of manual work that goes into maintaining docs and tests, especially if you have many layers, where many of the column descriptions aren't actually changing.
I'm finding I'm consistently spending 70% of my development time just wiring up documentation through doc blocks. Same for tests, if you want to test every layer. At least upstream tests can catch some data issues, but you can't really do the same with docs without reading the source code SQL.
There are 3 cases that have to be accounted for with documentation (and tests, in fact):
One can either address each of those concerns incrementally or come up with a solution that solves all 3.
Ideal Solution
In an ideal world, issues 1 and 2 could be solved if dbt could parse the SQL in all models using something like
sqlparse
to get true field-level lineage.I know your first thought is probably "What about all the different dialects?" but I would say to that, we don't need to parse everything. We only need the
SELECT [fields]
andAS
in the bottom-most expression. In fact, it might even be possible with regex, and is most likely the same across all dialects.This could
I know this is a huge ask and involves major work, which is why I'm gonna put forward an incremental solution as well
Incremental Solution
An easier-to-implement, but not as valuable solution would be to make this config-driven on a field by field basis. You still have to write quite a bit of boilerplate, but at least you're not copy-pasting (which leads to inconsistencies), or spinning up hundreds of
model.md
files to make doc block references (still very error-prone to get the doc block names right)This might be something like:
Alternative syntax might be:
Describe alternatives you've considered
Currently in order to solve this, I'm defining a new
.md
file for every model and using adoc
reference in my schema.yml files. In order to keep things DRY and keep me sane, instead of referring to the immediate parent in every model, I refer to the ancestors (grandparents) of the model if possible.For example, if
source1 -> model1 -> model2
all share a field name , model2 refers to the doc ofsource1
instead of model1. This way, I'm only changing descriptions in 1 place if the need arises.Who will this benefit?
I think implementing the Ideal Solutoin would have a major impact not only for speeding up documentation and keeping it up-to-date, but also allows you to apply the same strategy to tests, which is huge.
In the Incremental Solution, I'd say this would save me about 30-40% of my time for developing each feature. This is because I don't have to make additional
doc.md
files and don't have to worry about getting thedoc()
references correct, which can be a nightmare in a complex project.I think if documentation (and maybe tests) are your thing, this solution would pave the way for saving tens of thousands of dollars over the life of a project in time saved.
Are you interested in contributing this feature?
Maybe if I had a better idea of how dbt works under the hood based on a guide or something.
Beta Was this translation helpful? Give feedback.
All reactions