Updated constraints doc following slack (#5995)
mirnawong1 authored Oct 4, 2024
2 parents ec8fa54 + 06384c6 commit cda5488
Showing 2 changed files with 79 additions and 4 deletions.
@@ -102,12 +102,14 @@ We’ve focused heavily thus far on the primary area of action in our dbt projec

### Project splitting

One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Our present stance on this for most projects, particularly for teams starting out, is straightforward: you should avoid it unless you have no other option or it saves you from an even more complex workaround. If you do have the need to split up your project, it’s completely possible through the use of private packages, but the added complexity and separation is, for most organizations, a hindrance, not a help, at present. That said, this is very likely subject to change! [We want to create a world where it’s easy to bring lots of dbt projects together into a cohesive lineage](https://github.com/dbt-labs/dbt-core/discussions/5244). In a world where it’s simple to break up monolithic dbt projects into multiple connected projects, perhaps inside of a modern mono repo, the calculus will be different, and the below situations we recommend against may become totally viable. So watch this space!
One important, growing consideration in the analytics engineering ecosystem is how and when to split a codebase into multiple dbt projects. Currently, our advice for most teams, especially those just starting, is fairly simple: we recommend doing so with [dbt Mesh](/best-practices/how-we-mesh/mesh-1-intro)! dbt Mesh allows organizations to handle complexity by connecting several dbt projects rather than relying on one big, monolithic project. This approach is designed to speed up development while maintaining governance.
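
As a rough illustration of how two projects connect under dbt Mesh, the downstream project can declare its upstream project in a `dependencies.yml` file. This is a minimal sketch: the project name `finance_platform` and the model name `fct_orders` are hypothetical, and it assumes the upstream model is configured with public access.

```yaml
# dependencies.yml in the downstream project
# (project name is hypothetical, for illustration only)
projects:
  - name: finance_platform

# A downstream model can then reference an upstream public model
# with a two-argument ref, for example:
#   select * from {{ ref('finance_platform', 'fct_orders') }}
```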

- ❌ **Business groups or departments.** Conceptual separations within the project are not a good reason to split up your project. Splitting up, for instance, marketing and finance modeling into separate projects will not only add unnecessary complexity but destroy the unifying effect of collaborating across your organization on cohesive definitions and business logic.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
As breaking up monolithic dbt projects into smaller, connected projects, potentially within a modern mono repo, becomes easier, the scenarios we currently advise against may soon become feasible. So watch this space!

- ✅ **Business groups or departments.** Conceptual separations within the project are the primary reason to split up your project. This allows your business domains to own their own data products and still collaborate using dbt Mesh. For more information about dbt Mesh, please refer to our [dbt Mesh FAQs](/best-practices/how-we-mesh/mesh-5-faqs).
- ✅ **Data governance.** Structural, organizational needs — such as data governance and security — are one of the few worthwhile reasons to split up a project. If, for instance, you work at a healthcare company with only a small team cleared to access raw data with PII in it, you may need to split out your staging models into their own projects to preserve those policies. In that case, you would import your staging project into the project that builds on those staging models as a [private package](https://docs.getdbt.com/docs/build/packages/#private-packages).
- ✅ **Project size.** At a certain point, your project may grow to have simply too many models to present a viable development experience. If you have 1000s of models, it absolutely makes sense to find a way to split up your project.
- ❌ **ML vs Reporting use cases.** Similarly to the point above, splitting a project up based on different use cases, particularly more standard BI versus ML features, is a common idea. We tend to discourage it for the time being. As with the previous point, a foundational goal of implementing dbt is to create a single source of truth in your organization. The features you’re providing to your data science teams should be coming from the same marts and metrics that serve reports on executive dashboards.
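
For the data governance case above, importing a staging project as a private package might look roughly like the following `packages.yml`. This is a sketch under assumptions: the repository URL and the pinned revision are hypothetical, and authentication to the private repository (for example, an SSH key or an access token) has to be set up separately.

```yaml
# packages.yml in the project that builds on the staging models
# (repository URL and revision are hypothetical)
packages:
  - git: "https://github.com/my-org/staging-project.git"
    revision: 1.0.0 # a tag, branch, or commit to pin to
```

Running `dbt deps` afterwards installs the package so its models can be referenced from the importing project.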

## Final considerations

75 changes: 74 additions & 1 deletion website/docs/reference/resource-properties/constraints.md
@@ -15,7 +15,7 @@ Constraints require the declaration and enforcement of a model [contract](/refer

Constraints may be defined for a single column, or at the model level for one or more columns. As a general rule, we recommend defining single-column constraints directly on those columns.

If you are defining multiple `primary_key` constraints for a single model, those _must_ be defined at the model level. Defining multiple `primary_key` constraints at the column level is not supported.
If you define multiple `primary_key` constraints for a single model, those _must_ be defined at the model level. Defining multiple `primary_key` constraints at the column level is not supported.
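
A model-level `primary_key` constraint spanning more than one column could be sketched as follows. The model and column names here are hypothetical and chosen only for illustration.

```yaml
models:
  - name: my_model # hypothetical model name
    config:
      contract:
        enforced: true
    # Model-level constraint covering two columns
    constraints:
      - type: primary_key
        columns: [order_id, line_number]
    columns:
      - name: order_id
        data_type: int
      - name: line_number
        data_type: int
```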

The structure of a constraint is:
- `type` (required): one of `not_null`, `unique`, `primary_key`, `foreign_key`, `check`, `custom`
@@ -572,3 +572,76 @@ alter table schema_name.my_model add constraint 472394792387497234 check (id > 0
</div>

</WHCode>

## Custom constraints

In dbt Cloud and dbt Core, you can use custom constraints on models for the advanced configuration of tables. Different data warehouses support different syntax and capabilities.

Custom constraints allow you to add configuration to specific columns. For example:

- Set [masking policies](https://docs.snowflake.com/en/user-guide/security-column-intro#what-are-masking-policies) in Snowflake when using a Create Table As Select (CTAS).

- Other data warehouses (such as [Databricks](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html) and [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#column_name_and_column_schema)) have their own set of parameters that can be set for columns in their CTAS statements.


You can implement custom constraints in two ways:

- [Custom constraints with tags](#custom-constraints-with-tags)
- [Custom constraints without tags](#custom-constraints-without-tags)

<Expandable alt_header="Custom constraints with tags">

Here's an example of how to implement tag-based masking policies with contracts and constraints:

<File name='models/constraints_example.yml'>

```yaml

models:
  - name: my_model
    config:
      contract:
        enforced: true
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "tag (my_tag = 'my_value')" # A custom SQL expression used to enforce a specific constraint on a column.

```

</File>

Using this syntax requires configuring all the columns and their types, because that is the only way to send a create or replace `<cols_info_with_masking> mytable as ...` statement. A partial list of columns won't work, so make sure the `columns` and `constraints` fields are fully defined.
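
To make "fully defined" concrete, here is a sketch of a model with three columns where only one carries a custom constraint. The extra column names and types are hypothetical; the point is that every column in the model is listed with its `data_type`, even those without constraints.

```yaml
models:
  - name: my_model
    config:
      contract:
        enforced: true
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "tag (my_tag = 'my_value')"
      # Hypothetical additional columns: no constraints,
      # but they still need to be declared with their types.
      - name: customer_name
        data_type: string
      - name: created_at
        data_type: timestamp
```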

To generate a YAML with all the columns, you can use `generate_model_yaml` from [dbt-codegen](https://github.com/dbt-labs/dbt-codegen/tree/0.12.1/?tab=readme-ov-file#generate_model_yaml-source).
</Expandable>

<Expandable alt_header="Custom constraints without tags">

Alternatively, you can add a masking policy without tags:

<File name='models/constraints_example.yml'>

```yaml

models:
  - name: my_model
    config:
      contract:
        enforced: true
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "masking policy my_policy"

```

</File>
</Expandable>
