3rd Party DBT Operations #2894
@impguard Really neat ideas. Thanks for the thorough writeup. Admittedly, the conventional ELT story is unidirectional. Here's how we've hacked this in the past: define a model (or multiple models) to "stand in" for the external process.
You'll see lineage in the DAG viz that is inclusive of the external processes being run, in the form of one or more model nodes named or tagged to indicate as much. Of course, depending on your database, you may be able to take this even further.
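For concreteness, here's a minimal sketch of that stand-in hack. All names here are made up; the `-- depends_on:` comment is dbt's documented pattern for forcing a dependency edge in the DAG:

```sql
-- models/external/spark_recs_stand_in.sql
-- Hypothetical stand-in for an external Spark job. The job reads
-- fct_orders (a dbt model) and writes spark.recommendations (declared
-- as a dbt source). The ref() in the comment below forces dbt to draw
-- the upstream edge even though the SQL never uses it:
-- depends_on: {{ ref('fct_orders') }}

{{ config(materialized='view', tags=['external']) }}

select * from {{ source('spark', 'recommendations') }}
```

Downstream models then `ref()` this stand-in instead of the raw source, so the external step shows up in docs and lineage.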
That's actually a fantastic option I hadn't considered; it essentially addresses the bare-minimum request. I've seen the listed packages (we operate on BigQuery).
I'll toss in some more notes: while the provided workaround is great, it still doesn't necessarily capture data sinks, where data is exported somewhere else (such as pushing data out of our data ecosystem with a Spark job for service use). Seems like DBT could definitely use two concepts: external sources and external sinks.
Honestly, this could all be done by expanding the concept of sources and allowing them to be sources, sinks, or just intermediary nodes, with the expectation that all three are functionally identical, except that sinks need not be tables and can have no further transforms referencing them. Perhaps this is expanding the scope of DBT, but it really feels close to the vision of a central framework for architecting and documenting data pipelines, with the biggest expectation being that 90% of your pipelines consist of SQL transformations.
Have you looked at exposures? They were just released in dbt v0.18.1 and are still quite new, but this is exactly what we have in mind.
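For reference, a minimal exposure definition looks something like the following (the name, URL, model, and owner are made up for illustration):

```yaml
# models/exposures.yml
version: 2

exposures:
  - name: spark_recommendations_job   # hypothetical name
    type: ml                          # dashboard, notebook, analysis, ml, or application
    maturity: medium
    url: https://example.com/jobs/spark-recs   # hypothetical URL
    description: >
      PySpark job that reads dbt-built tables and pushes
      recommendations out of the warehouse for service use.
    depends_on:
      - ref('fct_orders')             # hypothetical upstream dbt model
    owner:
      name: Data Platform
      email: data@example.com
```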
I think we have a few options here:
I share this vision completely. I think the operative questions are:
Ah! Yes, this exposure concept is indeed a great start. I personally think it's cleaner to combine the latter two concepts you've offered into one. Namely, all externally managed pieces are the same kind of thing; some are sources, some are sinks/exposures, and some are middle pieces. The main differences I can see among the three are:
To your questions, my personal opinions are:
I feel like many DBT pipelines will have external bits that aren't simply sources and sinks, because it's sometimes easier or more expressive to write a Python script or (in our case) a PySpark machine-learning job than SQL. Note, I can see one hiccup with the central pieces. EDIT: on quick consideration, making sources able to depend on DBT constructs (or other sources) sounds like a good middle ground, at least from a clarity perspective. A sketch of what that could look like follows.
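Purely as a strawman (this is not valid dbt syntax today), a source that declares which dbt models feed the external process producing it might look like:

```yaml
# Hypothetical syntax, not supported by dbt today.
version: 2

sources:
  - name: spark
    tables:
      - name: recommendations
        description: Output of the PySpark recommendations job
        depends_on:                  # hypothetical key
          - ref('fct_orders')
```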
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Describe the feature
I understand that DBT's mode of operation is essentially self-contained: other than external sources, everything happens within DBT. However, it would be nice if it were possible to add some more documentation for "sources" that aren't pure third-party sources, so that DBT docs and lineage work as intended.
In our case, we build Spark jobs that generate tables that feed back into the DBT ecosystem. These Spark jobs run off tables that were themselves generated by DBT. The entire pipeline might look like:
Raw Data => DBT Views/Tables, etc. => Spark Job => DBT Views/Tables
Today this is handled as isolated graphs in DBT: the Spark job's output is considered just another "source" to the DBT project; it merely happens to depend on outputs of that same DBT project.
The bare-minimum feature request would be a simple way to document this so that it shows up in DBT docs and can be verified by DBT (i.e., the source can't depend on a nonexistent DBT table or source).
The awesome additional feature request is a way for DBT to run a "materialization" that's just pure Python, so that DBT can trigger Spark jobs (or any other non-SQL task) as a normal DBT operation. This could be handled by allowing the user to write Python code while declaring the expected output table structure. DBT would still expect a table to be written; it would execute the Python code and treat success as meaning the table has been populated with data. A hypothetical sketch follows.
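As a strawman (none of these keys exist in dbt today), a Python-backed model might be declared like this:

```yaml
# Hypothetical syntax, not real dbt: a model backed by a Python script
# instead of SQL. dbt would run the script, then verify that the
# declared table exists and is populated.
version: 2

models:
  - name: recommendations
    config:
      materialized: python            # hypothetical materialization
      script: scripts/run_recs.py     # hypothetical key: script to execute
    columns:
      - name: user_id
        description: Customer the recommendation is for
      - name: item_id
        description: Recommended item
```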
Describe alternatives you've considered
The current alternatives we've considered are:
Option 1: Just use DBT as intended and build our own toolchain on top. This would involve adding meta tags to the DBT sources and parsing those to generate a lineage graph on top of DBT's lineage graph (a sketch follows after these options).
Option 2: Just stop using DBT docs completely, since they're not suitable for other jobs that generate tables. As a result, we would have to create our own lineage and documentation system.
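For Option 1, the convention might look like this, assuming dbt's standard `meta` field on source tables; the `upstream_models` key is our own convention, parsed by external tooling rather than by dbt:

```yaml
version: 2

sources:
  - name: spark_outputs
    tables:
      - name: recommendations
        description: Output of the PySpark recommendations job
        meta:
          # Our own convention; dbt ignores this, our tooling parses it:
          upstream_models: ['fct_orders', 'dim_users']
```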
Additional context
I think this would be useful for any team moving to DBT that can't make the full jump into DBT directly. Separately, there are many real-world cases that need a more expressive language than SQL, or a more performant tool like Spark, to build a table; say, building recommendations, or year-long machine-learning aggregate models that you want to put back into the warehouse for customers and data engineers to build on. It would be lovely if these outliers fit nicely within the DBT ecosystem.
Who will this benefit?
Many real-world pipelining scenarios that involve data science and data engineering merging together.
Are you interested in contributing this feature?
We have considered forking DBT internally just to add some sort of custom materialization option that lets DBT simply run Python code (and potentially test that a table was generated) instead of compiling a DDL to run, but the other solutions seemed lighter-weight. So, no, not at the moment.