Commit
Merge branch 'current' into feat/dbt-databend-cloud
mirnawong1 authored Jan 16, 2023
2 parents d193c58 + 68a42e9 commit 98eb3af
Showing 34 changed files with 1,577 additions and 305 deletions.
10 changes: 10 additions & 0 deletions .github/config.yml
@@ -0,0 +1,10 @@
# Comment to be posted on PRs from first-time contributors in your repository

newPRWelcomeComment: >
  Hello!👋 Thanks for contributing to the dbt product documentation and opening this pull request! ✨

  We use Markdown and some HTML to write the dbt product documentation. When writing content, you can use our [style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) and [content types](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-types.md) to understand our writing standards and how we organize information in the dbt product docs.

  We'll review your contribution and respond as soon as we can. 😄


38 changes: 38 additions & 0 deletions .github/workflows/label.yml
@@ -0,0 +1,38 @@
name: Add/Remove Labels

on:
  pull_request_target:
    types: [ opened, closed ]

jobs:
  add_new_contributor_label:
    if: github.event.action == 'opened'
    permissions:
      contents: read
      pull-requests: write
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v6
        with:
          script: |
            const creator = context.payload.sender.login
            // list every issue and PR this user has ever opened in the repo
            const opts = github.rest.issues.listForRepo.endpoint.merge({
              ...context.issue,
              creator,
              state: 'all'
            })
            const issues = await github.paginate(opts)
            for (const issue of issues) {
              // skip the PR that triggered this workflow
              if (issue.number === context.issue.number) {
                continue
              }
              if (issue.pull_request) {
                return // creator is already a contributor
              }
            }
            // no earlier PRs found, so mark this as a first contribution
            await github.rest.issues.addLabels({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              labels: ['new contributor']
            })
1 change: 1 addition & 0 deletions _redirects
@@ -1,3 +1,4 @@
/dbt-cloud/api-v4 /docs/dbt-cloud-apis/admin-cloud-api 301
/docs/building-a-dbt-project/building-models/python-models /docs/build/python-models 301
/docs/deploy/regions /docs/deploy/regions-ip-addresses 301

59 changes: 59 additions & 0 deletions website/docs/docs/build/incremental-models.md
Expand Up @@ -303,5 +303,64 @@ select ...

</File>

<VersionBlock firstVersion="1.4">

### About `incremental_predicates`

`incremental_predicates` is an advanced use of incremental models, where data volume is large enough to justify additional investments in performance. This config accepts a list of any valid SQL expressions; however, dbt does not check the syntax.

For example, this is a pattern we might expect to see on Snowflake:

```yml
models:
  - name: my_incremental_model
    config:
      materialized: incremental
      unique_key: id
      # this will affect how the data is stored on disk, and indexed to limit scans
      cluster_by: ['session_start']
      incremental_strategy: merge
      # this limits the scan of the existing table to the last 7 days of data
      incremental_predicates: ["DBT_INTERNAL_DEST.session_start > dateadd(day, -7, current_date)"]
      # `DBT_INTERNAL_DEST` and `DBT_INTERNAL_SOURCE` are the standard aliases for the target table and temporary table, respectively, during an incremental run using the merge strategy.
```

This will template (in the `dbt.log` file) a `merge` statement like:
```sql
merge into <existing_table> DBT_INTERNAL_DEST
    using <temp_table_with_new_records> DBT_INTERNAL_SOURCE
    on
        -- unique key
        DBT_INTERNAL_DEST.id = DBT_INTERNAL_SOURCE.id
        and
        -- custom predicate: limits data scan in the "old" data / existing table
        DBT_INTERNAL_DEST.session_start > dateadd(day, -7, current_date)
when matched then update ...
when not matched then insert ...
```

You can also limit the data scan of _upstream_ tables within the body of the incremental model SQL, which will limit the amount of "new" data processed/transformed.

```sql
with large_source_table as (

    select * from {{ ref('large_source_table') }}
    {% if is_incremental() %}
    where session_start > dateadd(day, -3, current_date)
    {% endif %}

),

...
```

:::info
The syntax depends on how you configure your `incremental_strategy`:
- If using the `merge` strategy, you may need to explicitly alias any columns with either `DBT_INTERNAL_DEST` ("old" data) or `DBT_INTERNAL_SOURCE` ("new" data).
- There's a decent amount of conceptual overlap with the `insert_overwrite` incremental strategy.
:::
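
As a minimal sketch, the same configuration can also be set inline in the model file itself (assuming the `merge` strategy on Snowflake; `events` is a hypothetical upstream model):

```sql
{{
    config(
        materialized = 'incremental',
        unique_key = 'id',
        incremental_strategy = 'merge',
        incremental_predicates = ["DBT_INTERNAL_DEST.session_start > dateadd(day, -7, current_date)"]
    )
}}

-- the is_incremental() block limits the upstream scan on incremental runs
select * from {{ ref('events') }}
{% if is_incremental() %}
where session_start > dateadd(day, -3, current_date)
{% endif %}
```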

</VersionBlock>

<Snippet src="discourse-help-feed-header" />
<DiscourseHelpFeed tags="incremental"/>
212 changes: 207 additions & 5 deletions website/docs/docs/build/metrics.md
Expand Up @@ -15,7 +15,7 @@ keywords:

## About Metrics

A metric is a timeseries aggregation over a <Term id="table" /> that supports zero or more dimensions. Some examples of metrics include:
A metric is an aggregation over a <Term id="table" /> that supports zero or more dimensions. Some examples of metrics include:
- active users
- monthly recurring revenue (mrr)

@@ -69,7 +69,7 @@ metrics:
expression: user_id

timestamp: signup_date
time_grains: [day, week, month, quarter, year, all_time]
time_grains: [day, week, month, quarter, year]

dimensions:
- plan
@@ -168,8 +168,8 @@ Metrics can have many declared **properties**, which define aspects of your metr
| label | A short name / label for the metric | New Customers | yes |
| description | Long form, human-readable description for the metric | The number of customers who.... | no |
| calculation_method | The method of calculation (aggregation or derived) that is applied to the expression | count_distinct | yes |
| expression | The expression to aggregate/calculate over | user_id, cast(user_id as int) | yes |
| timestamp | The time-based component of the metric | signup_date | yes |
| expression | The expression to aggregate/calculate over | user_id, cast(user_id as int) | <VersionBlock firstVersion="1.4"> no </VersionBlock> <VersionBlock lastVersion="1.3"> yes </VersionBlock> |
| timestamp | The time-based component of the metric | signup_date | <VersionBlock firstVersion="1.4"> no </VersionBlock> <VersionBlock lastVersion="1.3"> yes </VersionBlock> |
| time_grains | One or more "grains" at which the metric can be evaluated. For more information, see the "Calendar" section. | [day, week, month, quarter, year] | yes |
| dimensions | A list of dimensions to group or filter the metric by | [plan, country] | no |
| window | A dictionary for aggregating over a window of time. Used for rolling metrics such as 14 day rolling average. Acceptable periods are: [`day`,`week`,`month`, `year`, `all_time`] | {count: 14, period: day} | no |
@@ -216,6 +216,7 @@ The type of calculation (aggregation or expression) that is applied to the sql p
| average | This metric type will apply the `average` aggregation to the specified field |
| min | This metric type will apply the `min` aggregation to the specified field |
| max | This metric type will apply the `max` aggregation to the specified field |
| median | This metric type will apply the `median` aggregation to the specified field, or an alternative `percentile_cont` aggregation if `median` is not available |
|<VersionBlock firstVersion="1.3">derived </VersionBlock> <VersionBlock lastVersion="1.2">expression </VersionBlock> | <VersionBlock firstVersion="1.2"> This metric type is defined as any _non-aggregating_ calculation of 1 or more metrics </VersionBlock> |

<VersionBlock firstVersion="1.3">
@@ -324,6 +325,91 @@ Note that `value` must be defined as a string in YAML, because it will be compiled
value: "'2020-01-01'"
```

### Calendar
The dbt_metrics package contains a [basic calendar table](https://github.com/dbt-labs/dbt_metrics/blob/main/models/dbt_metrics_default_calendar.sql) that is created as part of your `dbt run`. It contains dates between 2010-01-01 and 2029-12-31.

If you want to use a custom calendar, you can replace the default with any table which meets the following requirements:
- Contains a `date_day` column.
- Contains the following columns: `date_week`, `date_month`, `date_quarter`, `date_year`, or equivalents.
- Additional date columns need to be prefixed with `date_`, e.g. `date_4_5_4_month` for a 4-5-4 retail calendar date set. Dimensions can have any name (see following section).

To do this, set the value of the `dbt_metrics_calendar_model` variable in your `dbt_project.yml` file:
```yaml
#dbt_project.yml
config-version: 2
[...]
vars:
  dbt_metrics_calendar_model: my_custom_calendar
```
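
As an illustrative sketch, a custom calendar model meeting the requirements above could be built with a date spine. The model name `my_custom_calendar` and the use of the `dbt_utils` package are assumptions here; the syntax shown is Snowflake's:

```sql
-- models/my_custom_calendar.sql (hypothetical)
with days as (

    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2010-01-01' as date)",
        end_date="cast('2030-01-01' as date)"
    ) }}

)

select
    cast(date_day as date)            as date_day,
    date_trunc('week', date_day)      as date_week,
    date_trunc('month', date_day)     as date_month,
    date_trunc('quarter', date_day)   as date_quarter,
    date_trunc('year', date_day)      as date_year,
    -- dimension columns can have any name (see the next section)
    dayname(date_day) in ('Sat', 'Sun') as is_weekend
from days
```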

#### Dimensions from calendar tables
You may want to aggregate metrics by a dimension in your custom calendar table, for example, `is_weekend`. You can include this within the list of dimensions in the macro call without it needing to be defined in the metric definition.

To do so, set a list variable at the project level called `custom_calendar_dimension_list`, as shown in the example below.

```yaml
#dbt_project.yml
vars:
  custom_calendar_dimension_list: ["is_weekend"]
```
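
With that variable set, the calendar dimension can be passed straight into the macro call at query time. A sketch, assuming the dbt_metrics `calculate` macro (covered under "Querying Your Metric" below) and a hypothetical `new_customers` metric:

```sql
select *
from {{ metrics.calculate(
    metric('new_customers'),
    grain='week',
    dimensions=['plan', 'is_weekend']
) }}
```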

<VersionBlock firstVersion="1.3">

### Configuration

Metric nodes now accept `config` dictionaries like other dbt resources. Specify metric configs in the metric yml itself, or for groups of metrics in the `dbt_project.yml` file.

<!--tabs for config and project.yml -->

<Tabs>
<TabItem value="config" label="Metric yml">

<File name="models/metrics.yml">

```yml
version: 2
metrics:
  - name: config_metric
    label: Example Metric with Config
    model: ref('my_model')
    calculation_method: count
    timestamp: date_field
    time_grains: [day, week, month]
    config:
      enabled: true
```

</File>
</TabItem>

<TabItem value="project" label="dbt_project.yml">

<File name="dbt_project.yml">

```yml
metrics:
  your_project_name:
    +enabled: true
```

</File>
</TabItem>
</Tabs>

<!--End of tabs for config and project.yml -->


#### Accepted Metric Configurations

The following is the list of currently accepted metric configs:

| Config | Type | Accepted Values | Default Value | Description |
|--------|------|-----------------|---------------|-------------|
| `enabled` | boolean | True/False | True | Enables or disables a metric node. When disabled, dbt will not consider it as part of your project. |
| `treat_null_values_as_zero` | boolean | True/False | True | Controls the `coalesce` behavior for metrics. By default, when there are no observations for a metric, the output of the metric as well as [Period over Period](#secondary-calculations) secondary calculations will include a `coalesce({{ field }}, 0)` to return 0's rather than nulls. Setting this config to False instead returns `NULL` values. |

</VersionBlock>

## Querying Your Metric
You can dynamically query metrics directly in dbt and verify them before running a job in the deployment environment. To query your defined metric, you must have the [dbt_metrics package](https://github.com/dbt-labs/dbt_metrics) installed. Information on how to [install packages can be found here](https://docs.getdbt.com/docs/build/packages#how-do-i-add-a-package-to-my-project).

@@ -406,11 +492,72 @@ You may find some pieces of functionality, like secondary calculations, complica
| end_date | `'2022-12-31'` | Limits the date range of data used in the metric calculation by not querying data after this date | Optional |
| where | `plan='paying_customer'` | A SQL statement, or series of SQL statements, that alter the **final** CTE in the generated SQL. Most often used to limit the data to specific values of dimensions provided | Optional |
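
A sketch combining these inputs (the `new_customers` metric name is an assumption; syntax assumes the dbt v1.3+ `calculate` macro):

```sql
select *
from {{ metrics.calculate(
    metric('new_customers'),
    grain='month',
    dimensions=['plan'],
    start_date='2022-01-01',
    end_date='2022-12-31',
    where="plan='paying_customer'"
) }}
```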

#### Secondary Calculations
### Secondary Calculations
Secondary calculations are window functions you can add to the metric calculation and perform on the primary metric or metrics.

You can use them to compare values to an earlier period, calculate year-to-date sums, and return rolling averages. You can add custom secondary calculations into dbt projects - for more information on this, reference the [package README](https://github.com/dbt-labs/dbt_metrics#secondary-calculations).

The supported Secondary Calculations are:

#### Period over Period:

The period over period secondary calculation compares the metric(s) in question across two points in time, returning either the difference or the ratio between them. The second point in time is determined by the `interval` input, measured in the grain selected in the macro.

| Input | Example | Description | Required |
| -------------------------- | ----------- | ----------- | -----------|
| `comparison_strategy` | `ratio` or `difference` | How to calculate the delta between the two periods | Yes |
| `interval` | 1 | Integer - the number of time grains to look back | Yes |
| `alias` | `week_over_week` | The column alias for the resulting calculation | No |
| `metric_list` | `base_sum_metric` | List of metrics that the secondary calculation should be applied to. Default is all metrics selected | No |
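
For example, a week-over-week difference might be requested like this (a sketch, assuming a hypothetical `base_sum_metric` and dbt v1.3+ syntax):

```sql
select *
from {{ metrics.calculate(
    metric('base_sum_metric'),
    grain='week',
    secondary_calculations=[
        metrics.period_over_period(comparison_strategy="difference", interval=1, alias="week_over_week")
    ]
) }}
```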

#### Period to Date:

The period to date secondary calculation performs an aggregation on a defined period of time that is equal to or coarser than the grain selected. For example, you can display a `month_to_date` value alongside your weekly-grained metric.

| Input | Example | Description | Required |
| -------------------------- | ----------- | ----------- | -----------|
| `aggregate` | `max`, `average` | The aggregation to use in the window function. Options vary based on the primary aggregation and are enforced in [validate_aggregate_coherence()](https://github.com/dbt-labs/dbt_metrics/blob/main/macros/validation/validate_aggregate_coherence.sql). | Yes |
| `period` | `"day"`, `"week"` | The time grain to aggregate to. One of [`"day"`, `"week"`, `"month"`, `"quarter"`, `"year"`]. Must be at equal or coarser (higher, more aggregated) granularity than the metric's grain (see [Time Grains](#time-grains) below). With a grain of `month`, for example, the acceptable periods would be `month`, `quarter`, or `year`. | Yes |
| `alias` | `month_to_date` | The column alias for the resulting calculation | No |
| `metric_list` | `base_sum_metric` | List of metrics that the secondary calculation should be applied to. Default is all metrics selected | No |
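
A sketch showing a month-to-date maximum alongside a weekly-grained metric (again assuming the hypothetical `base_sum_metric`):

```sql
select *
from {{ metrics.calculate(
    metric('base_sum_metric'),
    grain='week',
    secondary_calculations=[
        metrics.period_to_date(aggregate="max", period="month", alias="month_to_date")
    ]
) }}
```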

#### Rolling:

<VersionBlock firstVersion="1.3" >
The rolling secondary calculation performs an aggregation on a number of rows in the metric dataset. For example, if the user selects the `week` grain and sets a rolling secondary calculation to `4`, then the value returned will be a rolling 4-week calculation of whatever aggregation type was selected. If the `interval` input is not provided, then the rolling calculation will be unbounded, running over all preceding rows.

| Input | Example | Description | Required |
| -------------------------- | ----------- | ----------- | -----------|
| `aggregate` | `max`, `average` | The aggregation to use in the window function. Options vary based on the primary aggregation and are enforced in [validate_aggregate_coherence()](https://github.com/dbt-labs/dbt_metrics/blob/main/macros/validation/validate_aggregate_coherence.sql). | Yes |
| `interval` | 1 | Integer - the number of time grains to look back | No |
| `alias` | `month_to_date` | The column alias for the resulting calculation | No |
| `metric_list` | `base_sum_metric` | List of metrics that the secondary calculation should be applied to. Default is all metrics selected | No |
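
A sketch of a rolling 4-week average, following the inputs above (the `base_sum_metric` name is an assumption):

```sql
select *
from {{ metrics.calculate(
    metric('base_sum_metric'),
    grain='week',
    secondary_calculations=[
        metrics.rolling(aggregate="average", interval=4, alias="rolling_avg_4wk")
    ]
) }}
```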
</VersionBlock>

<VersionBlock lastVersion="1.2" >
The rolling secondary calculation performs an aggregation on a number of rows in the metric dataset. For example, if the user selects the `week` grain and sets a rolling secondary calculation to `4`, then the value returned will be a rolling 4-week calculation of whatever aggregation type was selected.

| Input | Example | Description | Required |
| -------------------------- | ----------- | ----------- | -----------|
| `aggregate` | `max`, `average` | The aggregation to use in the window function. Options vary based on the primary aggregation and are enforced in [validate_aggregate_coherence()](https://github.com/dbt-labs/dbt_metrics/blob/main/macros/validation/validate_aggregate_coherence.sql). | Yes |
| `interval` | 1 | Integer - the number of time grains to look back | Yes |
| `alias` | `month_to_date` | The column alias for the resulting calculation | No |
| `metric_list` | `base_sum_metric` | List of metrics that the secondary calculation should be applied to. Default is all metrics selected | No |
</VersionBlock>

<VersionBlock firstVersion="1.3" >

#### Prior:
The prior secondary calculation returns the value from a specified number of intervals before the row.

| Input | Example | Description | Required |
| -------------------------- | ----------- | ----------- | -----------|
| `interval` | 1 | Integer - the number of time grains to look back | Yes |
| `alias` | `2_weeks_prior` | The column alias for the resulting calculation | No |
| `metric_list` | `base_sum_metric` | List of metrics that the secondary calculation should be applied to. Default is all metrics selected | No |
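
A sketch requesting the value from two weeks prior alongside each row, following the inputs above (metric name is an assumption):

```sql
select *
from {{ metrics.calculate(
    metric('base_sum_metric'),
    grain='week',
    secondary_calculations=[
        metrics.prior(interval=2, alias="2_weeks_prior")
    ]
) }}
```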

</VersionBlock>

### Developing metrics with `metrics.develop`

<VersionBlock firstVersion="1.3" >
@@ -493,6 +640,61 @@ Functionality for `develop` is only supported in v1.2 and higher. Please navigat

</VersionBlock>

#### Multiple/Derived Metrics with `metrics.develop`
If you have a more complicated use case that you are interested in testing, the `develop` macro also supports this behavior. The only caveat is that you must wrap any provided metric yml containing a derived metric in `{% raw %}` tags. Example below:

```
{% set my_metric_yml -%}
{% raw %}

metrics:
  - name: develop_metric
    model: ref('fact_orders')
    label: Total Discount ($)
    timestamp: order_date
    time_grains: [day, week, month]
    calculation_method: average
    expression: discount_total
    dimensions:
      - had_discount
      - order_country

  - name: derived_metric
    label: Total Discount ($)
    timestamp: order_date
    time_grains: [day, week, month]
    calculation_method: derived
    expression: "{{ metric('develop_metric') }} - 1 "
    dimensions:
      - had_discount
      - order_country

  - name: some_other_metric_not_using
    label: Total Discount ($)
    timestamp: order_date
    time_grains: [day, week, month]
    calculation_method: derived
    expression: "{{ metric('derived_metric') }} - 1 "
    dimensions:
      - had_discount
      - order_country

{% endraw %}
{%- endset %}

select *
from {{ metrics.develop(
        develop_yml=my_metric_yml,
        metric_list=['derived_metric'],
        grain='month'
        )
    }}
```

The above example will return a dataset that contains the metric provided in the metric list (`derived_metric`) and the parent metric (`develop_metric`). It will not contain `some_other_metric_not_using` as it is not designated in the metric list or a parent of the metrics included.

**Important caveat** - You _must_ wrap the `expression` property for `derived` metrics in double quotes to render it. For example, `expression: "{{ metric('develop_metric') }} - 1 "`.


<Snippet src="discourse-help-feed-header" />
<DiscourseHelpFeed tags="metrics"/>
