Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1732] [Feature] Add Native Persistent UDF Materialization #451

Closed
3 tasks done
anaghshineh opened this issue Dec 28, 2022 · 6 comments
Closed
3 tasks done

[CT-1732] [Feature] Add Native Persistent UDF Materialization #451

anaghshineh opened this issue Dec 28, 2022 · 6 comments
Assignees
Labels
enhancement New feature or request refinement Product or leadership input needed Stale

Comments

@anaghshineh
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-bigquery functionality, rather than a Big Idea better suited to a discussion

Describe the feature

There should be a native materialization for persistent UDFs in BigQuery. This materialization will enable users to create UDFs via dbt run and to track them appropriately within their DAGs and lineage graphs.

Previously, BigQuery only supported temporary UDFs. The community wrote several issues related to the handling of temporary BigQuery UDFs. For example:

However, BigQuery now supports persistent UDFs. We should have a native materialization for persistent BigQuery UDFs! This issue - dbt-labs/dbt-core#136 - was created a while ago for cross-database UDF handling. The issue at hand, however, specifically concerns persistent BigQuery UDFs.

Describe alternatives you've considered

A few popular alternatives discussed throughout the community:

  • Create UDFs via hooks/run-operations. For example, as documented here
  • Maintain UDF SQL files in the models directory with the CREATE OR REPLACE statement in the SQL. Use dbt to compile the model SQL and reference the UDF models via ref. This way, UDF models show up in the lineage graph. However, UDF creation is handled outside of dbt run

Who will this benefit?

dbt users who leverage persistent BigQuery UDFs and want to track them in their DAGs and lineage graphs. This is particularly useful for those who leverage UDFs to define reusable, core business logic.

Are you interested in contributing this feature?

Yes! I've started working on this. Interested to see what people think.

Anything else?

Some questions:

  • Does it make sense to implement this for BigQuery separately? I know there's interest in supporting a UDF materialization across different databases, but that seems like a much heavier lift to do all at once.
  • Should this touch on temporary UDFs as well? Perhaps a UDF materialization that also supports the notion of an "ephemeral" (i.e., temporary) UDF?
@anaghshineh anaghshineh added enhancement New feature or request triage labels Dec 28, 2022
@github-actions github-actions bot changed the title [Feature] Add Native Persistent UDF Materialization [CT-1732] [Feature] Add Native Persistent UDF Materialization Dec 28, 2022
@jtcohen6 jtcohen6 self-assigned this Jan 2, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 5, 2023

@anaghshineh Thanks for taking the initiative to open the issue, and the accompanying PR!

I am very supporting of adding native support for UDFs within dbt, and doing it in a consistent way (as much as possible!) on all adapters for data warehouses that support persistent UDFs. It seems like you've done your research, and have a good feel for what some of the existing patterns and sentiments are in the wider community.

This feels even more important now that dbt supports Python (dbt-labs/dbt-core#5741). On some data platforms (e.g. Snowpark), UDFs are more than just an ergonomic way to call Python functions from SQL — they're also an important tool for achieving performant, parallelizable Python.

On the flip side, this gets us over the traditional hesitation around UDFs: If it's not SQL (Python, JavaScript), it's probably gross & slow. And if it's just SQL, why not a macro, which dbt will clearly "compile" (template) to its "source code" representation? Nowadays, there are clearly good reasons to define and use Python UDFs, and we can give people the ability to do it in files with a .py extension — text editors rejoice :)

You've asked some specific questions, which are fair. I have an even bigger question down below.

I know there's interest in supporting a UDF materialization across different databases, but that seems like a much heavier lift to do all at once.

If we're going to implement this by building it into "dbt Core" (dbt-core + official adapter plugins), we would want to implement it in such a way that we're providing a consistent experience across all our adapters (or as many as possible), and such that we coordinate the timing of the feature launch. This is a tall order indeed, but it's also a big part of the value of dbt—we are constantly attempting the impossible, a good-enough abstraction across data platforms.

Should this touch on temporary UDFs as well? Perhaps a UDF materialization that also supports the notion of an "ephemeral" (i.e., temporary) UDF?

The way we initially "supported" temporary UDFs on dbt-bigquery was via sql_header — and I think that's probably the best we can manage, since temporary UDFs need to be created in the same connection/session/script/query in which they're being called. At that point, it is probably better to treat them as part of the source code for a given model, DRY'ed up with a macro if needed across multiple models. There's no persistent database object to track metadata, or map back to a dbt project resource.

The big question

Should "UDF" be a type of model materialization, or a new node type in its own right? And is this any different for scalar versus tabular UDFs? (Some prior discussion in this older issue: #132.)

The upside of implementing this as a custom materialization is expediency: You were able to do it just with mostly Jinja + SQL, and just in the adapter plugin, without having to modify dbt-core at all. To @dataders' point in #454 (comment), this could maybe even be an experimental package of custom macros that community members could try out and give feedback on.

I worry, though, that it breaks the mental model (!) of what a dbt model is: a query that returns a dataset, which can be compiled & previewed. (UDTFs are a bit closer to fitting the bill, if you ignore the fact that they can take arguments. They're almost like saved CTEs, or ephemeral models.) The appeal of separating "modeling" logic from "materialization" logic is that you can have the same basic SQL, and then just switch the materialization from table to view to ephemeral, and it all just works — a few more slight tweaks, and incremental just works too. Whereas the source code for defining a UDF (at least a non-tabular one) looks pretty different.

The upside of implementing this as a totally new node type: No risk of confusion between what's a model, and what's a function. Like seeds or snapshots, functions would still be "materialized" in the database, they could be referenced in other models/functions/etc, and all the same patterns about node selection + state:modified + etc would still apply. This is the direction I find myself leaning in. The downside: it would require changes within dbt-core, and it would be a heavier lift. we'd need to know that it's the right direction, before committing to it. We'd want to start by finding some lower-effort ways to experiment with / de-risk it ahead of time, which looks like discovery work. It wouldn't be a technical spike in the sense of, "Is it possible? Can we do it?", but rather a conceptual investigation of:

  • Does the user experience feel right? Does it feel dbt-onic?
  • Can we achieve a consistent UX across multiple adapters?
  • Is this the right direction to move in?

That, in turn, looks like starting by writing some throwaway code, without any expectation or guarantee that it's going to be merged.

Is continuing down this path, and doing some additional discovery work to figure out the answer to this question (model materialization vs. new node type), something you'd be interested in, over the next weeks/months? If so, let's talk more about what an ongoing collaboration might look like! If not, and this is as far as you want to take it for now — I completely understand.

(cc @lostmygithubaccount @ChenyuLInx - remember when we spiked this last August? internal Notion link)

@jtcohen6 jtcohen6 removed their assignment Jan 5, 2023
@jtcohen6 jtcohen6 added refinement Product or leadership input needed and removed triage labels Jan 5, 2023
@dreinon
Copy link

dreinon commented Mar 23, 2023

Hi! I have read through this issue and posted yesterday a question in the dbt Community Forum about this. What's the temporary way you recommend calling persistent functions from a dbt model? Just hardcoding project_id and dataset_id?

Thanks!

@Fleid Fleid self-assigned this Mar 24, 2023
@dreinon
Copy link

dreinon commented Mar 28, 2023

Question solved in the link!

@github-actions
Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 25, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Oct 2, 2023

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 2, 2023
@brabster
Copy link

brabster commented Feb 19, 2024

I think this is an independent experimental implementation in a real-world public project https://tempered.works/posts/2024-02-19-udf-dbt-models/ (mentioned on Slack channel #i-made-this)

eg. https://github.com/brabster/pypi_vulnerabilities/blob/main/models/published/udfs/matches_multi_spec.sql

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request refinement Product or leadership input needed Stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants