
[Feature] event time config #9490

Closed
3 tasks done
Tracked by #10624
graciegoheen opened this issue Jan 30, 2024 · 4 comments · Fixed by #10594
Labels: enhancement (New feature or request)

graciegoheen commented Jan 30, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

There are many use cases where it would be helpful to know the event time field of a given model:

  • filtering your project to run on only the last X days of data (sampling)
  • incremental extensions (lookback window, generating the "where" clause for you)
  • intelligently setting your clustering key
  • etc.

This is similar to a handful of our current configs.

We believe this is distinct from your partition key, because you likely want to filter on something different (event_time) than what you're partitioning on (processing_time): event time vs. processing time.
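To make the distinction concrete, here is a hypothetical query (table and column names are illustrative, using Snowflake-style date functions): the table might be partitioned on a processing_time column, while the analyst filters on event_time.

```sql
-- Hypothetical: events table partitioned on processing_time (when the
-- pipeline loaded the row), while analysis filters on event_time (when the
-- event actually occurred). The two can differ by hours or days for
-- late-arriving data.
select *
from raw_app_data.events
where event_time >= dateadd('day', -3, current_date)
```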

We should add a new config to allow folks to specify the event time field for a given model.

Notion doc

Acceptance Criteria

  • Create new config event_time that accepts SQL (name of a column, or SQL like to_date(my_time_col); similar to loaded_at_field)
models:
  - name: my_model
    config:
      event_time: my_time_column
    columns:
      - name: my_time_column
        data_type: timestamp
        ...
  • config can be set in a config block, in dbt_project.yml, or in schema YAML
  • config can be set for both models and sources
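As a sketch of the "set in dbt_project.yml" case (the project and folder names below are hypothetical), the config would presumably follow the usual `+config` convention for folder-level settings:

```yaml
# dbt_project.yml -- hypothetical project layout
models:
  my_project:
    events:
      # applies event_time to every model under models/events/
      +event_time: my_time_column
```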

YML design options

Option 1: field name string

models:
  - name: my_model
    config:
      event_time_field: my_time_field
    columns:
      - name: my_time_field
        data_type: timestamp
        ...
        granularity: day
| Pros | Cons |
| --- | --- |
| can set in dbt_project.yml / config block | duplicate column name in YML |
| reuse same data_type | |
| could use any SQL, like loaded_at_field | |

Option 2: dictionary config

models:
  - name: my_model
    config:
      event_time:
        field_name: my_time_field
        granularity: day
        ...
    columns:
      - name: my_time_field
        data_type: timestamp

Open considerations

  • What should the config be called? event_time, event_time_field, something else?
  • Should we also support this configuration on other resource types (sources, snapshots, etc.)?

Additional context here

Additional requirement

While working on this ticket, add documentation (at the code level) about how config set on a node vs. config in node.config works.
Related issue: #7157
https://docs.getdbt.com/reference/configs-and-properties

@graciegoheen graciegoheen added enhancement New feature or request triage and removed triage labels Jan 30, 2024
@dbeatty10 (Contributor)

I keep a collection of terminology here. I should add those bi-temporal definitions for "event time" and "processing time" from Spark: The Definitive Guide!

I have a hard time sorting out the differences between all these temporal dimensions and their implications, so that's why I created that repo. It's still hard for me to reason about 😅.

No matter which terminology we land on ("event_time", etc), we'll want to do a good job of explaining what it means and how it fits in.

@graciegoheen (Contributor, Author)

dummy example of what a future incremental model could look like where we automatically generate the where clause for you:

{{
    config(
        materialized='incremental',
        event_time={'field_name': 'event_time', 'granularity': 'day'},
        look_back_window=3
    )
}}

select
    *,
    my_slow_function(my_column)

from raw_app_data.events
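A sketch of what the auto-generated filter might compile to, given an `event_time` with day granularity and a 3-day lookback. The generated predicate and the Snowflake-style date functions here are assumptions for illustration, not the actual implementation:

```sql
select
    *,
    my_slow_function(my_column)
from raw_app_data.events
-- hypothetical auto-generated predicate from event_time + look_back_window
where date_trunc('day', event_time) >= dateadd('day', -3, current_date)
```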

@dbeatty10 (Contributor)

The dummy example of event_time as a dictionary looks similar to the partition_by config for incremental models in dbt-bigquery:

{
  "field": "<field name>",
  "data_type": "<timestamp | date | datetime | int64>",
  "granularity": "<hour | day | month | year>",

  # Only required if data_type is "int64"
  "range": {
    "start": <int>,
    "end": <int>,
    "interval": <int>
  }
}

The range portion looks similar to the look_back_window in the dummy example above.

Within the config, partition_by looks like this:

{{ config(
    materialized='table',
    partition_by={
      "field": "created_at",
      "data_type": "timestamp",
      "granularity": "day"
    }
)}}


graciegoheen commented Apr 19, 2024

@jtcohen6 and I discussed and believe Option 1 is the best path forward here.

(Notion doc link and acceptance criteria restated verbatim from the issue description above.)
