Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2149] [Feature] Add the possibility to map a single dbt entity to multiple databases #7021

Closed
3 tasks done
jochemvandooren opened this issue Feb 21, 2023 · 8 comments
Closed
3 tasks done
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@jochemvandooren
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently it is impossible to map a single dbt entity to multiple databases. Add the possibility to enable a single dbt entity on multiple databases.

Consider a setup where a table can exist in several databases. The tables, spread over the multiple databases, are exactly the same except for the data inside the table.

There is no way to have a single model/source enabled in multiple databases, during a single run. It is however possible to run dbt run multiple times (for each database), but for dbt docs generate it is impossible to get the the expected output since multiple versions of catalog files are generated and only one version can be served.

Describe alternatives you've considered

  • Executing dbt run multiple times; not a solution for dbt docs generate.
  • Specifying a resource multiple times; impossible because dbt expects the name of a resource to be unique. Also for resources it is impossible because the database is specified on the sources level:
sources:
  - name: foo
    database: bar
    tables:
  • Dynamically setting the database in configuration files of dbt entities; does not offer to enable multiple databases during a single run.

Who will this benefit?

Anyone who wants to have a single dbt entity mapped to multiple databases in a single dbt project.

Are you interested in contributing this feature?

I would be interested.

Anything else?

We have a setup where we are deploying a dbt project to multiple databases, as we want these structures to exist there, which are then sourced with different data. This allows us to have segmented databases for different deployments, which is how our business is structured. However, we would like to have a unique catalog to represent the multiple databases, which would mean a single dbt entity mapped to multiple database relationships, one per database.

@jochemvandooren jochemvandooren added enhancement New feature or request triage labels Feb 21, 2023
@github-actions github-actions bot changed the title [Feature] Add the possibility to map a single dbt entity to multiple databases [CT-2149] [Feature] Add the possibility to map a single dbt entity to multiple databases Feb 21, 2023
@dbeatty10 dbeatty10 self-assigned this Feb 21, 2023
@dbeatty10
Copy link
Contributor

Thanks for reaching out @jochemvandooren !

Historically, dbt has held the philosophy of "one model, one output (table)", and we're inclined to stick with it.

Reading about your use-case, it seems like one key thing you're after is being able to see all the different databases in a single catalog.

Let's pretend we could wave a magic wand and be able to merge the output of two different executions of dbt docs generate 🪄 ✨

Would that solve your problem? Or would there be other key design criteria that would still be unaddressed in that scenario?

@jochemvandooren
Copy link
Author

jochemvandooren commented Feb 22, 2023

Hi @dbeatty10, thanks for your quick response!

You are absolutely right that the problem is not being able to see all different databases in a single catalog. This magic wand:magic_wand: sounds great and indeed would fix the issue! The Project tab would have to show the entities that exist in several databases once and the Databases tab would show several databases, with the entity being part of each database (for each execution of dbt docs generate).

This is also something I already looked into myself, but turned out to be not too straightforward at this point in time. With the current structure of manifest.json and catalog.json merging several outputs is a challenge, because a single dbt entity that exists in several databases has the same unique identifier and key in different outputs.

But being able to merge the output of different executions of dbt docs generate would definitely solve it for me!

@matthieucan
Copy link

Hey @dbeatty10, good to see you across multiple repos ;)

I'm working with @jochemvandooren on the same project, and to add to his perspective:

  • It would be awesome to be able to merge those JSON files indeed! Would that preserve all relationships, e.g. between models and associated tests?
  • Would that work with "exceptions", i.e. models that are mapped to a single database?
  • How would the Project and Database tabs look like? I can think of something like
Project
\-- Sources
    \-- source_1 [db1, db2]
    \-- source_2 [db1]
    \-- source_3 [db2]
\-- Models
    \-- model_1 [db1, db2]


Database
\-- db1
    \-- model_1
    \-- source_1
    \-- source_2
\-- db2
    \-- model_1
    \-- source_1
    \-- source_3

Would that make sense?

Thank you!

@dbeatty10
Copy link
Contributor

@matthieucan good to see you again too!

Yep, we would want to preserve all relationships (like models and associated tests, etc). I think we'd also want it to act similar to a full outer join in SQL (rather than an inner join) and for it to include models that are only within a single database.

I don't know what the Project and Database tabs would end up looking like (or how exactly we'd want them to look). But what you laid out looks reasonable.

In terms of trying out a JSON merge, I'm wondering if jq might come in handy here?

Would you be willing to experiment with one or both of the following and see what they do?

(Of course you'll want to use the most simple project possible for any experiments.)

@jochemvandooren
Copy link
Author

jochemvandooren commented Mar 21, 2023

Thanks for the quick response and my apologies for the late reply! I finally found some time to do some experiments with jq.

You are right that it should act like a FULL OUTER JOIN, although not exactly the same. Take the following example of two catalog.json (as a result of two different dbt docs generate runs on different databases) where model1 is enabled in database1 and database2:

output1.json

{
  "nodes": {
    "model1": {
      "database": "database1",
      "fqn": [
        "dwh",
        "foobar",
        "model1"
      ],
      "unique_id": "model.dwh.model1",
      "name": "model1"
    }
  }
}

output2.json

{
  "nodes": {
    "model1": {
      "database": "database2",
      "fqn": [
        "dwh",
        "foobar",
        "model1"
      ],
      "unique_id": "model.dwh.model1",
      "name": "model1"
    }
  }
}

When combining the two outputs, the result should look something like:

merged.json

{
  "nodes": {
    "model1_A": {
      "database": "database2",
      "fqn": [
        "dwh",
        "foobar",
        "model1_A"
      ],
      "unique_id": "model.dwh.model1_A",
      "name": "model1"
    },
    "model1_B": {
      "database": "database2",
      "fqn": [
        "dwh",
        "foobar",
        "model1_B"
      ],
      "unique_id": "model.dwh.model1_B",
      "name": "model1"
    }
  }
}

As you can see, the keys of the models inside nodes is the same and need to be changed when merging the two outputs into one. Next to that, the fqn and unique_id probably need changing as well. For the unique_id I am 100% sure, since apparently this unique_id is used by dbt docs to determine which model is highlighted in the databases tab.

I was not able to figure out if this is possible with jq. It seems to me it is mostly used to merge files based on keys, where the values of the key that occur in multiple files are added together. Not saying it is impossible, but it probably requires more work.

These experiments were done with a very basic dbt project, only one model and no tests associated with them. So reality might be more complex than my examples!

@alison985
Copy link

I think there's overlap here with the inheritance conversation. #6527

There may be, at least, cross-dialect use cases here the request to add column level identifier support #6929 (comment)

@dbeatty10
Copy link
Contributor

Circling back to this...

The crux

As you know, the way dbt docs works is very model-centric:

  • given the models within a dbt project, show the associated database table & columns for each model

In contrast, this issue is proposing a web experience that is database-centric:

  • given the tables within a database, show the related dbt models.

Cardinality

When restricted to a single dbt run, there is a one-to-one relationship between dbt models and database tables, so it really doesn't matter if the web experience is based on one paradigm or the other.

But this issue describes multiple independent executions of dbt run that create one-to-many relationship between dbt models and database tables.

In the current paradigm for dbt docs, this would lead to multiple documentation websites.

In order to generate a single website, we'd need to invert the paradigm completely and turn everything inside-out. The catalog.json artifact might be able to handle this inversion similarly to how @jochemvandooren described here. But the dbt-docs website changes would be much more complicated.

Summary

We don't have appetite for those changes to the docs website at this time, so I'm going to close this issue as wont_fix.

But an alternative dbt-docs experience (like https://github.com/PicnicSupermarket/dbt-docs) could totally take multiple catalog.json files and offer a database-centric web experience.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Sep 20, 2023
@dbeatty10 dbeatty10 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Sep 20, 2023
@dbeatty10 dbeatty10 removed their assignment Sep 20, 2023
@jochemvandooren
Copy link
Author

Thanks @dbeatty10 for your explanation! I understand that making this work would require a big overhaul, which is probably not worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

4 participants