New command: dbt clone
#7258
{% macro snowflake__get_clone_table_sql(this_relation, state_relation) %}
    create or replace
    {{ "transient" if config.get("transient", true) }}
Snowflake requires that table clones match the transience/permanence of the table they're cloning. We determine whether the relation is a table from the cache result for the other (prod) schema, but here we're just using the current (dev) configuration for `transient`. There's the possibility of a mismatch if a user has updated the `transient` config in development.
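One way to avoid that mismatch, sketched in Python rather than Jinja: take the transience from the relation being cloned instead of from the dev config. This assumes the prod relation's transience can be looked up somehow, which the cache may not support today. `ProdRelation` and `clone_table_sql` are hypothetical names for illustration, not part of dbt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProdRelation:
    """Hypothetical stand-in for what we know about the prod relation."""

    name: str
    is_transient: bool


def clone_table_sql(dev_name: str, prod: ProdRelation) -> str:
    # Use the transience of the relation being cloned, not the dev config,
    # so the clone always matches its source.
    transient = "transient " if prod.is_transient else ""
    return f"create or replace {transient}table {dev_name} clone {prod.name}"
```

With this shape, a user changing `transient` in development has no effect on the generated clone statement.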
@@ -1362,6 +1362,20 @@ def this(self) -> Optional[RelationProxy]:
            return None
        return self.db_wrapper.Relation.create_from(self.config, self.model)

    @contextproperty
    def state_relation(self) -> Optional[RelationProxy]:
Open to naming suggestions! This will only be available in the context for the `clone` command currently.
Is "relation" being used as a byword for table/view? "Relation" can mean so many things, and I feel like we want an additional term, or a better term, that is more specific about what you're trying to achieve. Perhaps "stateful_db_relation". I really don't know the subtleties here though, so you might have picked the best one.
core/dbt/contracts/graph/manifest.py
Outdated
state_relation = RelationalNode(
    other_node.database, other_node.schema, other_node.alias
)
self.nodes[unique_id] = current.replace(state_relation=state_relation)
We're storing information about each node's production-state counterpart, right on the node entry in the manifest. I'm open to discussing whether this is the right approach. It feels better than passing the entire other manifest around into other methods/runners.
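A minimal, self-contained sketch of that approach, for discussion. `RelationalNode` mirrors the name in the diff, but `Node` and `add_state_relations` are simplified, hypothetical stand-ins for the real manifest classes and method.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class RelationalNode:
    database: str
    schema: str
    alias: str


@dataclass(frozen=True)
class Node:
    """Hypothetical, pared-down node; real manifest nodes carry far more."""

    unique_id: str
    database: str
    schema: str
    alias: str
    state_relation: Optional[RelationalNode] = None


def add_state_relations(
    current: Dict[str, Node], other: Dict[str, Node]
) -> Dict[str, Node]:
    merged: Dict[str, Node] = {}
    for unique_id, node in current.items():
        other_node = other.get(unique_id)
        if other_node is None:
            # No production counterpart: leave the node untouched.
            merged[unique_id] = node
            continue
        state_relation = RelationalNode(
            other_node.database, other_node.schema, other_node.alias
        )
        # Record where the production version lives, right on the node.
        merged[unique_id] = replace(node, state_relation=state_relation)
    return merged
```

After this step, downstream code only needs the node itself to find its production counterpart; the other manifest does not have to be passed around.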
core/dbt/contracts/graph/nodes.py
Outdated
@@ -567,6 +571,7 @@ class HookNode(CompiledNode):
class ModelNode(CompiledNode):
    resource_type: NodeType = field(metadata={"restrict": [NodeType.Model]})
    access: AccessType = AccessType.Protected
    state_relation: Optional[RelationalNode] = None
Given the approach of storing stateful information on each node about its prod counterpart, we do need a place on the node object to do that. For now, I'm adding a new attribute (default `None`) to models, seeds, and snapshots: the three `ref`able nodes that are eligible for deferral & cloning.
{% macro can_clone_tables() %}
    {{ return(adapter.dispatch('can_clone_tables', 'dbt')()) }}
{% endmacro %}

{% macro default__can_clone_tables() %}
    {{ return(False) }}
{% endmacro %}

{% macro snowflake__can_clone_tables() %}
    {{ return(True) }}
{% endmacro %}
This kind of True/False conditional behavior might be better as an adapter property/method (Python), since it's really a property of the adapter / data platform, rather than something a specific user wants to reimplement. Comparable to the "boolean macros" we defined for logic around `grants`. Open to thoughts!
Just to state the obvious: any `snowflake__` macros would want to move to `dbt-snowflake` as part of implementing & testing this on our adapters! I'm just defining them here for now for the sake of comparison & convenience (= laziness).
I concur with both comments
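For comparison, a rough sketch of what the Python-side alternative could look like. The class names here are hypothetical and deliberately not dbt's real adapter classes; the only idea carried over from the thread is "default to False, let the platform adapter override".

```python
class SketchAdapter:
    """Hypothetical base adapter: platforms cannot clone unless they say so."""

    def can_clone_tables(self) -> bool:
        return False


class SketchSnowflakeAdapter(SketchAdapter):
    """Hypothetical Snowflake adapter: supports zero-copy table clones."""

    def can_clone_tables(self) -> bool:
        return True
```

This keeps the capability next to the rest of the platform-specific logic, and a project could not accidentally change it by overriding a macro.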
if get_flags().CACHE_SELECTED_ONLY is True:
    required_schemas = self.get_model_schemas(adapter, selected_uids)
    self.populate_adapter_cache(adapter, required_schemas)
else:
    self.populate_adapter_cache(adapter)
As in `RunTask.before_run` (comment above)
# TODO: need an "adapter zone" version of this test that checks to see
# how many of the cloned objects are "pointers" (views) versus "true clones" (tables)
# e.g. on Postgres we expect to see 4 views
# whereas on Snowflake we'd expect to see 3 cloned tables + 1 view
Either this test, or some extension of it, should be in the "adapter zone" so that we can verify this functionality:
- On data platforms that support `create table <dev> clone <prod>`, let's use it
- Otherwise, create views that are simple pointers (`create view <dev> as select * from <prod>`)
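The two branches above can be sketched as a single decision. This is an illustration only: `clone_or_pointer_sql` and its parameters are hypothetical, and the real logic lives in the materialization macros.

```python
def clone_or_pointer_sql(
    dev: str, prod: str, can_clone_tables: bool, prod_is_table: bool
) -> str:
    # A true clone is only possible when the platform supports it
    # and the production relation is actually a table.
    if can_clone_tables and prod_is_table:
        return f"create table {dev} clone {prod}"
    # Everything else becomes a view that simply points at production.
    return f"create view {dev} as select * from {prod}"
```

An adapter-zone test could then assert on which of the two statement shapes each relation received.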
-- If this is a database that can do zero-copy cloning of tables, and the other relation is a table, then this will be a table
-- Otherwise, this will be a view
{% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
{% do apply_grants(target_relation, grant_config, should_revoke=should_revoke) %}
{% do persist_docs(target_relation, model) %}
I'm thinking we should still apply grants & table/column-level comments. I have a suspicion that whether these things are copied over, during cloning, varies by data platform; I should really look to confirm/reject that suspicion. It's also possible that the user has defined conditional logic for these that differs between dev & prod, especially `grants`.
if "state_relation" in node:
    del node["state_relation"]
It doesn't feel like `state_relation` is a thing we should need/want to include in serialized `manifest.json`. It's really just for our internal use.
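A small sketch of the same idea as a pure function, assuming nodes have already been converted to plain dicts before serialization. `strip_internal_fields` is a hypothetical helper, not an existing dbt function.

```python
from typing import Any, Dict


def strip_internal_fields(node: Dict[str, Any]) -> Dict[str, Any]:
    # Work on a shallow copy so the in-memory node keeps its state_relation.
    cleaned = dict(node)
    # pop() with a default is a no-op when the key is absent.
    cleaned.pop("state_relation", None)
    return cleaned
```

Copying first means the running process can keep using `state_relation` after the manifest has been written.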
Holy crap. Jerco and team, how are you shipping soooo much value right now?!?!
A decent amount of time is spent checking/updating the cache, which locks across all the concurrent threads. This could be even faster (2 min for 1k models) with the change proposed in #6844. I think we'd want to modify the behavior of
Candidate for favorite PR of the year 👏🏻
@@ -444,7 +445,10 @@ def before_run(self, adapter, selected_uids: AbstractSet[str]):
        with adapter.connection_named("master"):
Probably should not be `master` if we're trying to avoid that language elsewhere.
""" | ||
For commands which add information about this node's corresponding
production version (via a --state artifact), access the Relation
object for that stateful other
"other" feels vague (well, really, just Latinate/French in phrasing :) ). Maybe "source node", etc.?
{% macro default__get_pointer_sql(to_relation) %}
    {% set pointer_sql %}
I feel uncomfortable with this use of "pointer". I'm not exactly sure why we're drawing that comparison, when perhaps we should use something closer to "reference", or allude to the fact that there's a shadow copy. In my mind, a pointer is a very specific thing and somewhat anachronistic in a SQL context.
737c7c4 to 0b25b9c
Closing in favor of #7881 :)
resolves #7256
Description
Introduce `clone` command, which also makes use of a `clone` materialization.
- On data platforms that support `create table <dev> clone <prod>`, let's use it
- Otherwise, create views that are simple pointers (`create view <dev> as select * from <prod>`)
- `--full-refresh` is passed
Some interesting behaviors around:
- `--state` manifest in a way that's similar to, but not exactly the same as, `--defer`
TODOs
- `dbt clone --threads 50 --full-refresh` (~3.5 minutes) versus `create schema <dev> clone <prod>` (~10 min) using our real Snowflake project with ~1k models in it. That's ~65% faster!
Example
I have a seed, a view model, and a table model.
From `logs/dbt.log`:
Checklist
- `changie new` to create a changelog entry