3rd Party DBT Operations #2894

Closed
impguard opened this issue Nov 17, 2020 · 6 comments
Labels: enhancement (New feature or request), stale (Issues that have gone stale)

Comments

impguard commented Nov 17, 2020

Describe the feature

I understand that DBT's mode of operation is essentially self-contained: other than external sources, do everything within DBT. However, it would be nice if it were possible to add some documentation for "sources" that aren't pure third-party sources, so that DBT docs + lineage work as intended.

In our case, we build Spark jobs that generate tables that might feed back into the DBT ecosystem. These Spark jobs run off tables that were generated by the DBT ecosystem. In this case, the entire pipeline might look like:

Raw Data => DBT Views/Tables, etc. => Spark Job => DBT Views/Tables

This is essentially handled as isolated graphs in DBT. The Spark job output is considered another "source" to the DBT project; it just so happens that it depends on outputs within the DBT project.

The bare minimum feature request would be a simple way to document this so that it shows up in DBT docs and can be verified by DBT (i.e. the source can't depend on a nonexistent DBT table or source).

The awesome additional feature request is a way for DBT to run a "materialization" that's just pure Python. That way we can have DBT trigger Spark jobs (or whatever other non-SQL-oriented task) as a normal DBT operation. This could potentially be handled within DBT by allowing the user to write Python code while providing the expected output table structure. DBT would still generate the table (and expect a table to be written to) while it executes the Python code, with the expectation that success equates to the table being populated with data.

Describe alternatives you've considered

The current alternatives we've considered are:

Option 1: Just use DBT as intended and build our own toolchain on top. This would involve potentially adding meta tags to the DBT sources and parsing those to generate a lineage graph on top of DBT's lineage graph.
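
For illustration, a rough sketch of what those meta tags might look like on a source definition. The upstream_models key and the names here are purely hypothetical conventions of our own; dbt would store them in the manifest but wouldn't interpret them, so our own tooling would have to parse them:

version: 2

sources:
  - name: spark
    tables:
      - name: job
        meta:
          # hypothetical convention: dbt models the external Spark job reads from
          upstream_models:
            - some_dbt_model
          generated_by: spark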

Option 2: Just stop using DBT docs completely, since they aren't suitable for other jobs that generate tables. As a result, we would have to create our own lineage and documentation system.

Additional context

I think this would be useful for any team moving to DBT that can't take the full jump into DBT directly. Separately, there are many real-world examples of needing a more expressive language than SQL, or a more performant tool like Spark, in order to build a table: say, building recommendations or year-long machine learning aggregate models that you want to stick back into the warehouse for customers/data engineers to build on. It would be lovely if these outliers fit within the DBT ecosystem nicely.

Who will this benefit?

Many real-world pipelining scenarios where data science and data engineering merge together.

Are you interested in contributing this feature?

We have considered forking DBT internally just to add some sort of custom materialization option that lets DBT simply run the Python code and potentially verify that a table was generated, instead of compiling just a DDL to run, but the other solutions seemed lighter weight. So, no, not at the moment.

impguard added the enhancement and triage labels Nov 17, 2020
@jtcohen6 (Contributor)

@impguard Really neat ideas. Thanks for the thorough writeup.

Obviously, the conventional ELT story is unidirectional: source --> model --> exposure. But it's not always so simple; as you mention, there may be transformation or data science workloads that run on different hardware, with dbt models as both parents and children.

Here's how we've hacked this in the past: define a model (or multiple models) to "stand in" for the external process. That model (e.g. spark_job.sql) would look like:

-- depends on: {{ ref('input_to_spark_job') }}

select * from {{ source('spark', 'job') }}
-- you might also perform actual data cleaning (renaming, type-casting) here
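
For completeness, that source('spark', 'job') call assumes a source definition along these lines (the spark/job names simply mirror the snippet above):

version: 2

sources:
  - name: spark
    tables:
      - name: job  # table the external Spark job writes back into the warehouse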

You'll see lineage in the DAG viz that is inclusive of the external processes being run, in the form of one or more model nodes named or tagged to indicate as much.
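
As one possible sketch of the tagging approach, the stand-in model could declare a tag in its config so these nodes are easy to spot in docs and to select or exclude at run time (the tag name is arbitrary):

-- models/spark_job.sql
{{ config(tags=['external_process']) }}

-- depends on: {{ ref('input_to_spark_job') }}
select * from {{ source('spark', 'job') }}

Something like dbt run --exclude tag:external_process could then skip the stand-in models whenever the external process hasn't run yet.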

Of course, depending on your database, you may be able to take this even further:

  • On BigQuery, that dbt model could itself be an ML model, using (e.g.) the dbt_ml package
  • On Snowflake, that dbt model could call an external function that actually triggers the external process to run. (You'd have to figure out the complexity of async or partial deployment.)
  • You could have a dbt model, written in SparkSQL, that actually runs on Spark via the dbt-spark plugin. It would still need to run as part of a separate invocation/command, but the overall lineage graph would be accurate. Currently, it isn't possible to write dbt models in PySpark or SparkR, but that may someday be possible.

jtcohen6 removed the triage label Nov 17, 2020
@impguard (Author)

That's actually a fantastic option I hadn't considered, and it would essentially address the bare minimum request. I've seen the listed packages (we operate with BigQuery). The dbt_ml option works great if BigQuery's built-in ML is suitable, but we're aiming to provide a solution that allows users to build their own flexible Spark jobs if necessary. I hadn't thought of making an intermediary "injector" model that preserves lineage, but it makes sense. Thanks for the suggestion!


impguard commented Nov 30, 2020

I'll toss in some more notes: while the provided workaround is great, it still doesn't necessarily cover data sinks, where data is exported somewhere else (such as pushing data out of our data ecosystem with a Spark job for service use, etc.).

Seems like DBT could definitely use two concepts:

  • data sinks (opposite of sources, not SQL, purely meant for lineage and documentation)
  • external transforms (original question, not SQL, purely meant for lineage and documentation)

Honestly, this could all be done by expanding the concept of sources and allowing them to be sources, sinks, or just intermediary nodes, with the expectation that all three are functionally identical, except that sinks don't need to be tables and can have no further transforms referencing them.

Perhaps this is expanding the scope of DBT, but it really feels close to the vision of a central framework for architecting data pipelines and documenting them, with the biggest expectation that 90% of your pipelines consist of SQL transformations.


jtcohen6 commented Dec 1, 2020

data sinks (opposite of sources, not SQL, purely meant for lineage and documentation)

Have you looked at exposures? They were just released in dbt v0.18.1 and are still quite new, but this is exactly what we have in mind.
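
For anyone following along, an exposure is declared in YAML alongside models and sources. A minimal sketch (the name, type, and owner here are invented) might look like:

version: 2

exposures:
  - name: spark_export_job
    type: application   # dashboard, notebook, analysis, ml, and application are the supported types
    owner:
      email: data-team@example.com   # hypothetical owner
    depends_on:
      - ref('input_to_spark_job')
    description: >
      External Spark job that pushes warehouse data out for service use.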

external transforms (original question, not SQL, purely meant for lineage and documentation)

I think we have a few options here:

  1. "placeholder" models (as recommended, no change needed)
  2. allowing exposures to be ref-able (today they can only be leaf nodes)
  3. allowing sources to ref other sources (today they can only be root nodes)

Perhaps this is expanding the scope of DBT, but it really feels close to the vision of a central framework for architecting data pipelines and documenting them, with the biggest expectation that 90% of your pipelines consist of SQL transformations.

I share this vision completely. I think the operative questions are:

  • If dbt only ever executes SQL, how much should dbt try to know about non-SQL transformations?
  • Is it desirable for dbt to store metadata about pipelines that don't involve its models at all?


impguard commented Dec 1, 2020

Ah! Yes, this exposure concept is indeed a great start. I personally think it's cleaner to combine the latter two concepts you've offered into one. Namely, all externally managed pieces are the same thing, but some are sources, sinks/exposures, or middle pieces. The main differences I can see between the three pieces are:

  • sources are static - have a static location
  • middle pieces may have a dynamic schema (we're integrating them with the DBT profile workflow so that they run in "dev" similarly to DBT outputs)
  • sinks don't have to be tables anymore, they're essentially data products that go out to wherever (current exposure concept captures this beautifully)

To your questions, my personal opinions are:

  • Assuming you never want DBT to leave the SQL space (I'd argue it could be cool for it to support one non-SQL language as well), it should simply help document the existence of these transformations and let the user orchestrate how they occur.
  • I think the documentation feature of DBT is extremely powerful and a large value-add in and of itself. I don't necessarily see it as harmful if someone wants to leverage DBT just for the lineage + documentation with 100% of their graph being external. If anything, it also provides a migration path to slowly move bits into SQL.

I feel like many DBT pipelines will have external bits that aren't simply sources and sinks, because it's sometimes easier or more expressive to write a Python script or (in our case) a PySpark machine learning job than SQL.

Note: I can see one hiccup with the central pieces, in that dbt run would be a little more complex since it essentially stops when it hits an external component. I think that's completely reasonable, but it does present some UX complications.

EDIT: on quick consideration, allowing sources to depend on DBT constructs (or other sources) sounds like a good middle ground, at least from a clarity perspective.


github-actions bot commented Nov 1, 2021

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions bot added the stale label Nov 1, 2021
github-actions bot closed this as completed Nov 9, 2021