-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Datasharing: cross-database support #637
Comments
Q: What data sharing scenarios could feasibly be supported?
Q: The current & future best practice for fetching metadata? Q: The performance impact of using SVV_ tables for all SKUs (ra3, serverless, dc2)?
|
re: metadata over the API I think we're totally on the same page. Can discuss that more down the line.
What concerns me most is not the APIs not having enough info. Instead my issue is maintaining our users' performance expectation. @ryanwaldorf check out dbt-labs/dbt-redshift#741 (and my response). |
with respect to workloads, I totally buy the business case for cross-cluster reads & writes. My concern is the perspective that a dbt end-user. How a dbt user thinks about their dbt project
Mapping dbt perspective to Redshift DatasharingIf Redshift Datasharing experience is to map directly to the expectations communicated by the dbt framework, then the boundary between models in different clusters has to feel as invisible as models that share a database. Based on how much of the below is met, we have the following options:
One example caveat is that we already know that a view of a tables in another db is not supported. This should be clearly documented and communicated (at least until it is supported). If the full set of Datasharing cross-database capabilities results in intra-dbt-project experience that is not "invisible". Perhaps, instead we take that stance that "best practice" is to mirror the database boundary as as dbt project boundary. This aligns well with the concept of dbt Mesh Expectations of full-support cross-database queries in dbtFor example, creating the following dependencies in objects within the same database is easy, but what if they're in distinct databases? Assuming that the administration to connect the clusters has already been done, which of the following are possible today with Datasharing?
Adding on to that:
|
Saw the issue with the driver performance over time. I can't speak to what happened with those previous releases but we should be able to address this together going forward through better communication between our teams. All the key folks for this going forward will be included in our Friday meeting. |
For what's possible
|
I'm surprised to hear that this is supported because I literally got an error when I tried to do this. Will dig up the specific error message again. |
wait sorry I messed up my answer. you did get an error. it needs to be a late-binding view. updated my answer above and bolded the changes |
Any updates on this? AWS Data Sharing Writes have gone GA today https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-redshift-multi-data-warehouse-through-data-sharing/ Would be great and help us with further dbt adoption |
Describe the feature
What
I think this is a direct successor to dbt-labs/dbt-redshift#217. There's many other related issues listed below.
Today, the only supported cross-database scenario in dbt-redshift is to allow
source
tables to be in another database. There are considerable caveats to this narrow use case:ra3
orserverless
profiles.yml
profile that specifies:ra3_node: true
source
table is defined in adatabase
different than what's given in profilethere is a newer feature, still in preview, Datashare, that allows writes to external databases. For example, if I am logged into the
FOO
database on clusterFIZZ
, a configuration exists such that I can create tables within databaseBAR
on clusterBUZZ
(and vise versa).Ideal end-state
A dbt-redshift project could take any model in the DAG and have it materialized in another with a simply as providing
{{ config( database = 'BAR' ) }}
at the top of the model. That is, provided that the databases used have been properly configured as such.How
Challenges
The challenges in supporting this new feature are varied:
redshift__create_schema()
which invokespostgres__create_schema()
which only uses two-part names,schema.relation
, exlcuding a requireddatabase
redshift__get_columns_in_relation()
queriesinformation_schema."columns"
but should perhaps useSVV_ALL_COLUMNS
instead`ra3_node
andserverless
supports Datasharing/crossdb-writes,dc2
clusters do notpg_*
metadata tables andSVV_*
datashare-supporting Redshift system tablesinformation_schema."columns"
takes a few dozen millisecionds to return all column metadata in the current database.SVV_ALL_COLUMNS
can be more than 50X slower than thisWork required
Before the actual coding begins, it is imperative that we first learn the following:
SVV_*
tables for all SKUs (ra3, serverless, dc2)Possible work required
related
related issues
ra3_node: true
and ignores database name dbt-core#5297ra3_node
profile config dbt-redshift#94adapter.get_relation
for relations in other databases [RA3] #652adapter.get_columns_in_relation
does not work cross database #639relevant AWS Redshift docs
Who will this benefit?
dbt-redshift users who want to use Datasharing to federate a dbt project across databases and clusters
The text was updated successfully, but these errors were encountered: