
Feature Request: Sharded Tables #1637

Closed

jthandy opened this issue Jul 30, 2019 · 7 comments

@jthandy (Member) commented Jul 30, 2019

Feature description

Currently in dbt, there is a hard-and-fast rule that one model file results in at most one object (table or view) in the database. This is a core part of the design of the product and has been true since the very first commit. In most situations, this works OK. There is one specific case where it does not, however: sharded tables.

Sharding isn't a term that's often used outside of the BigQuery world, but it's a pattern that in practice is used semi-frequently on Snowflake and Redshift when organizations are dealing with large enough datasets. Essentially, sharding is simply creating a series of physical tables, split ("sharded") on a key (the most common I've seen are customer_id and created_date), that, when taken all together, represent a complete view of the entire table. These tables are typically named in the format table_name_[shard], e.g. fct_orders_190401.
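
To make that concrete, here's a minimal sketch of what a date-sharded layout might look like on Snowflake or Redshift; the schema, table, and column names are hypothetical:

```sql
-- Hypothetical date-sharded layout: one physical table per day.
-- Taken together, the shards represent the complete fct_orders table.
CREATE TABLE analytics.fct_orders_190401 AS
SELECT * FROM analytics.stg_orders WHERE created_date = '2019-04-01';

CREATE TABLE analytics.fct_orders_190402 AS
SELECT * FROM analytics.stg_orders WHERE created_date = '2019-04-02';

-- Readers reassemble the logical table (or a slice of it) with UNION ALL.
SELECT * FROM analytics.fct_orders_190401
UNION ALL
SELECT * FROM analytics.fct_orders_190402;
```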

BigQuery provides a wildcard operator that allows all shards in a logical table to be selected from at the same time, and building tables using this paradigm was well supported even in early public versions of BigQuery. Redshift and Snowflake don't have the same level of native support for this style, but the Redshift docs specifically discuss this strategy, and I've heard from the Snowflake internal analytics team that they use this pattern as well.
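
For reference, BigQuery's wildcard syntax looks roughly like this (project and dataset names are made up); the _TABLE_SUFFIX pseudo-column restricts which shards are read:

```sql
-- Query every shard of fct_orders at once via a wildcard table.
SELECT order_id, amount
FROM `my_project.analytics.fct_orders_*`
-- Only scan the April 2019 shards.
WHERE _TABLE_SUFFIX BETWEEN '190401' AND '190430';
```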

I can imagine multiple ways that dbt could theoretically be modified in order to output this type of data structure in a more idiomatic way, but this is far enough from dbt's standard paradigm today that I don't want to be prescriptive here: I legitimately don't know what the ideal answer is, from either a dbt user's perspective or a technology perspective. Instead, I just want to flag this as a real need: one that I have personally felt on recent projects, and one that several teams I've spoken with would get value out of. Currently, those teams are employing some fascinating hacks to work around dbt's inability to handle this type of data structure by escaping to Python and Airflow.

Why would anyone want to use this?

  1. Sharded tables essentially create one additional level of abstraction for storing very large datasets. While modern optimizers are often pretty good at minimizing the cost and runtime of queries on very large tables, manually sharding those tables lets analytics engineers be explicit about what data a given query scans, and thus gives them tighter control over performance and cost.
  2. Sharded tables create more fine-grained levels at which database permissions can be applied. For instance, it is possible to shard a large table on a region id and then grant employees in a particular region permission to select only from the shard associated with that region, while employees in HQ can select from all regions (see the sketch after this list).
  3. In practice, many source datasets are already sharded, and it is often optimal to deal with them in this format rather than forcing these datasets to be unioned together prior to being operated on within dbt.
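
As a sketch of the permissions point above (point 2), using hypothetical region shards and Snowflake-style grants:

```sql
-- Each region's rows live in their own shard; regional analysts
-- are granted access only to their shard.
GRANT SELECT ON TABLE analytics.fct_orders_emea TO ROLE emea_analysts;
GRANT SELECT ON TABLE analytics.fct_orders_apac TO ROLE apac_analysts;

-- HQ can read every shard in the schema.
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE hq_analysts;
```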

Who will this benefit?

This will benefit dbt users who are using dbt to process large tables, typically 50GB+ and often 1TB+, and who want to apply the fairly common data engineering design pattern of sharding data into multiple physical tables.

@PradKumarGC commented Jul 2, 2020

+1 for including sharding as a dbt feature. To highlight another use case: when we hit BigQuery's maximum partition limit of 4,000 per table (as of today) on large tables, an alternative would be to shard them (by transaction year, for example) and then partition each shard on the originally intended partition column (e.g. txn date).
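
Roughly, that workaround might look like this in BigQuery DDL (names are illustrative): each yearly shard needs only ~365 daily partitions, staying well under the 4,000 limit.

```sql
-- One shard per transaction year, each partitioned by day on txn_date.
CREATE TABLE analytics.fct_txns_2020
PARTITION BY txn_date AS
SELECT *
FROM analytics.stg_txns
WHERE EXTRACT(YEAR FROM txn_date) = 2020;
```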

@elliottohara

This would be super useful for us as well; we'd love to physically separate some of the data in our tables based on the customer's id. It would dramatically simplify security.

@mitalauskas

+1

@OussGhan

+1
Would definitely be a game changer for all BigQuery dbt users. A frequent use case for this is GA360 transformations using dbt. I managed to write incremental models using dbt, but it was very painful, and they will become obsolete once we reach the 4,000-partition milestone :/
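
For context, a rough sketch of the kind of incremental model this involves, over GA360's date-sharded ga_sessions_YYYYMMDD export (the project and dataset names here are made up):

```sql
-- dbt incremental model over the date-sharded GA360 export.
{{ config(materialized='incremental') }}

SELECT
    PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS session_date,
    fullVisitorId,
    totals.visits AS visits
FROM `my_project.ga360.ga_sessions_*`
-- Skip the ga_sessions_intraday_* shards the wildcard also matches.
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^20\d{6}$')
{% if is_incremental() %}
-- Only pick up shards newer than what's already loaded. Note that a
-- dynamic _TABLE_SUFFIX filter may not always prune the scan.
  AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >
      (SELECT MAX(session_date) FROM {{ this }})
{% endif %}
```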

@thibault-lengyel-carrefour

+1 It would be very useful for us as well. We usually need to shard our orders data (10TB) by country and other fields.

@minhnhat992

+1

@dbt-labs dbt-labs locked and limited conversation to collaborators Dec 8, 2021
@jtcohen6 jtcohen6 converted this issue into discussion #4457 Dec 8, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
