new schema.yml syntax #790

drewbanin · 2018-06-12T02:14:14Z

Related: #375

schema.yml files currently exist solely to specify schema tests for models. The schema.yml syntax should be extended to account for:

- model metadata
    - markdown descriptions
- column definitions
    - markdown description
    - type
    - extras (column encodings for redshift / bq partition keys?)
- tests, as before

Bonus:

schemas should be able to "extend" other schemas, eg. snowplow_sessions <- snowplow_sessions_tmp

Proposed syntax:

snowplow_sessions:
    comment: "A table of sessions sourced from snowplow events"

    options:
        strict: True
        extends: snowplow_sessions_tmp

    columns:
        - name: session_id
          comment: "The unique id for the session"
          tests:
              - unique
              - not_null
              - relationships:
                    to: snowplow_page_views
                    field: session_id

options

strict

options: True | False
default: False

If true, the columns specified in the columns section must match the actual columns in the model in the database. If there is a mismatch (either too many, or not enough columns), then an error will be raised. If false, then the check will not occur.
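
A minimal sketch of how the strict check could work, assuming the documented column names and the actual column names are already available as lists. The function name `check_strict` and the error wording are hypothetical, not part of dbt:

```python
def check_strict(documented, actual):
    """Return a list of error strings; an empty list means the columns match."""
    documented_set = set(documented)
    actual_set = set(actual)
    errors = []
    # Columns listed in schema.yml that are missing from the database relation.
    for name in sorted(documented_set - actual_set):
        errors.append(f"documented column '{name}' does not exist in the model")
    # Columns in the database relation that schema.yml does not mention.
    for name in sorted(actual_set - documented_set):
        errors.append(f"column '{name}' exists in the model but is not documented")
    return errors


if __name__ == "__main__":
    for problem in check_strict(["session_id", "ghost"], ["session_id", "user_id"]):
        print(problem)
```

Comparing sets of names rather than counts reports each offending column individually, in both directions.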

extends

options: null | model name | list<model name>
default: null

If a model name is provided, then this model will "inherit" the schema from the parent model. This will entail copying over descriptions, column definitions, strictness, etc. This will be exceedingly useful for "chains" of models which share a similar schema, as duplicating the documentation would be both time consuming and error prone.
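
One way the inheritance could be resolved, sketched over already-parsed dictionaries shaped like the proposed syntax above. The helper names `resolve_extends` and `_overlay` are hypothetical, and the merge rule (child values win, columns merged by name) is an assumption, since the proposal does not spell out precedence:

```python
import copy


def _overlay(base, top):
    """Merge `top` into `base` in place; values from `top` take precedence."""
    for key, value in top.items():
        if key == "columns":
            # Merge column definitions by name so a child can add tests
            # to a column whose comment it inherits from the parent.
            by_name = {col["name"]: col for col in base["columns"]}
            for col in value:
                by_name[col["name"]] = {**by_name.get(col["name"], {}), **col}
            base["columns"] = list(by_name.values())
        elif key == "options":
            base["options"].update(value)
        else:
            base[key] = value


def resolve_extends(name, schemas, _seen=()):
    """Return the schema for `name` with inherited fields filled in."""
    if name in _seen:
        raise ValueError("circular extends: " + " -> ".join(_seen + (name,)))
    schema = schemas[name]
    parents = schema.get("options", {}).get("extends")
    if parents is None:
        return copy.deepcopy(schema)
    if isinstance(parents, str):
        parents = [parents]  # accept a single model name or a list of names

    merged = {"options": {}, "columns": []}
    for parent in parents:
        _overlay(merged, resolve_extends(parent, schemas, _seen + (name,)))
    _overlay(merged, copy.deepcopy(schema))
    return merged
```

With this rule, `snowplow_sessions` would pick up the comment, strictness, and column definitions of `snowplow_sessions_tmp` while keeping anything it defines itself.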

comments

Comments can either be long-form, unstructured Markdown, or they can contain a ref to a documentation node. These documentation nodes will live in markdown files inside of markdown blocks, eg:

{% docs model.snowplow_sessions %}

### Lorem ipsum
- dolor sit amet
- consectetur adipiscing elit
- sed do eiusmod

{% enddocs %}

This block will serve a few purposes:

- typing markdown inside of yaml is terrible
- putting these blocks in .md files will make text editors behave sanely
- the docs definitions can be referenced in multiple places, eg. for a column that appears in many models

This is a super natural use case for jinja. I can totally imagine writing macros to render tables, enforce docs guidelines, render links, etc etc etc.
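
To make the docs block idea concrete, here is a rough sketch that pulls named blocks out of a markdown file. It uses a plain regular expression rather than a real Jinja parser, which is a simplification, and the function name `extract_docs_blocks` is hypothetical:

```python
import re

# Matches {% docs some.name %} ... {% enddocs %}, capturing the name and body.
DOCS_BLOCK = re.compile(
    r"\{%\s*docs\s+(?P<name>[\w.]+)\s*%\}(?P<body>.*?)\{%\s*enddocs\s*%\}",
    re.DOTALL,
)


def extract_docs_blocks(text):
    """Return a dict mapping each docs block name to its markdown body."""
    return {
        match.group("name"): match.group("body").strip()
        for match in DOCS_BLOCK.finditer(text)
    }


if __name__ == "__main__":
    sample = (
        "{% docs model.snowplow_sessions %}\n\n"
        "### Lorem ipsum\n- dolor sit amet\n\n"
        "{% enddocs %}\n"
    )
    print(extract_docs_blocks(sample))
```

A comment in schema.yml could then refer to a block by name, so the same text can be reused for a column that appears in several models. A regex like this would not handle Jinja expressions inside the body; that would need the actual template engine.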

Implementation

Each entry in the schema.yml files should be munged into the same JSON schema used for catalog entries. The two are very similar: they have comments, a list of columns, and those columns have names / types / etc. If we keep the data structures similar, then it should be easy to overlay the schema and catalog data on top of the manifest data for dbt docs purposes.

We should preserve backwards compatibility for schema tests either by 1) adding a version number header or 2) just continuing to parse the constraints section of the old schema.yml files.
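
The two compatibility options could be combined roughly as follows, assuming the file has already been parsed into a dictionary. The function name `detect_format` and the returned labels are hypothetical:

```python
def detect_format(parsed):
    """Classify a parsed schema.yml dictionary as versioned, legacy, or unknown."""
    # Option 1: an explicit version header at the top of the file.
    if "version" in parsed:
        return f"v{parsed['version']}"
    # Option 2: no header, but at least one top-level entry has a
    # `constraints` section, as in the old schema.yml layout.
    if any(isinstance(entry, dict) and "constraints" in entry for entry in parsed.values()):
        return "legacy"
    return "unknown"
```

One caveat of this sketch: an old-style file that happens to contain a model named `version` would be misclassified, so a real implementation would need to check the type of that value as well.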


drewbanin · 2018-06-27T13:57:38Z

cc @jthandy @cmcarthur

jthandy · 2018-06-27T14:20:22Z

strict

the mechanism you're proposing to test (count of columns) doesn't feel right--seems like each individual column should be validated for existence if we're going to have this option at all. i also don't feel like this is something that must be prioritized for the initial release.

extends

are we planning on implementing this in the near-term? i love the idea but am just worried that it adds near-term complexity.

comment

i would propose not calling this comment, but rather docs. i'd like to be consistent from the beginning with how we're referring to documentation throughout dbt.

drewbanin · 2018-06-27T15:09:10Z

@jthandy sure, for strict, I meant more that both cases will be checked: documenting a column that does not exist, or failing to document a column that does exist, will result in an error.

I spoke with @cmcarthur about the implementation for extends, and we probably need to use networkx to actually traverse the whole extends graph. I want to make sure the decisions / implementation we choose now makes it feasible to add extends in the future, but I agree, might not be suitable for v1.
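
As an illustration of the graph traversal, the sketch below orders models so that every parent is processed before the models that extend it. It swaps in the standard library's `graphlib` (Python 3.9+) instead of networkx to stay dependency-free; the function name `extends_order` and the input shape are assumptions:

```python
from graphlib import CycleError, TopologicalSorter


def extends_order(extends):
    """Return model names ordered so parents come before their children.

    `extends` maps each model name to the list of models it extends.
    Raises graphlib.CycleError if the extends relationships form a loop.
    """
    return list(TopologicalSorter(extends).static_order())


if __name__ == "__main__":
    graph = {
        "snowplow_sessions": ["snowplow_sessions_tmp"],
        "snowplow_sessions_tmp": [],
    }
    print(extends_order(graph))
    try:
        extends_order({"a": ["b"], "b": ["a"]})
    except CycleError as exc:
        print("cycle detected:", exc.args[1])
```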

I feel the same way about comment. I think docs might not be exactly right though... maybe description? To me, docs are comprised of a description, tests, lineage, sample data, etc.

cmcarthur · 2018-06-27T15:16:27Z

👍 for description

@drewbanin when we discussed this yesterday we came up with a very different looking schema.yml format, can you post the updated structure here (with models and sources?)

jthandy · 2018-06-27T15:28:02Z

👍 for description

drewbanin · 2018-06-28T19:45:09Z

After speaking with @cmcarthur, we're going to scope these schema definitions under a models: key, and we're also going to require a version indicator. That will look like:

version: 2

models:
  - name: events
    description: "a description..."

    columns:
        - name: event_time
          description: "def"
          tests:
              - primary_key
              - unique

sources:
  - name: snowplow
    description: "Snowplow dataset"
    tables:
        - name: snowplow_event_2
          description: An immutable log of events collected by Snowplow
          sql_table_name: snowplow.web_page
          columns:
            - name: collector_tstamp
              description: Timestamp for the event recorded by the collector

See #814 for more information on the sources section. This change will allow sources and schemas to conceivably live in the same file. @jthandy and I also discussed potentially renaming tests to properties, as this section could eventually include things that are not strictly tests.
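
For illustration, a sketch of walking the `models:` section of a version 2 file to list the tests attached to each column. It assumes the YAML is already parsed into dictionaries and that a test is either a bare name or a one-key mapping of name to arguments, as in the `relationships` example in the original proposal. The function name `collect_column_tests` is hypothetical:

```python
def collect_column_tests(parsed):
    """Return (model, column, test_name, arguments) tuples for every column test."""
    results = []
    for model in parsed.get("models", []):
        for column in model.get("columns", []):
            for test in column.get("tests", []):
                if isinstance(test, dict):
                    # One-key mapping, e.g. {"relationships": {"to": ..., "field": ...}}
                    ((test_name, args),) = test.items()
                else:
                    # Bare test name, e.g. "unique"
                    test_name, args = test, {}
                results.append((model["name"], column["name"], test_name, args))
    return results
```

The same loop could be pointed at `sources` and their `tables` if tests are later allowed there.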


drewbanin added this to the Betsy Ross (unreleased) milestone Jun 12, 2018

drewbanin added the dbt-docs [dbt feature] documentation site, powered by metadata artifacts label Jun 12, 2018

drewbanin mentioned this issue Jun 26, 2018

Docs blocks #810

Closed

drewbanin mentioned this issue Jun 28, 2018

Define source tables #814

Closed

drewbanin modified the milestones: 0.10.2 - Betsy Ross (unreleased), 0.11.0 - Isaac Asimov (unreleased) Jun 28, 2018

cmcarthur added the estimate: 16 label Jul 18, 2018

beckjake self-assigned this Jul 20, 2018

beckjake added a commit that referenced this issue Jul 31, 2018

Merge pull request #880 from fishtown-analytics/new-schema-yaml-syntax

3b3a486

Support new schema.yml syntax (#790)

beckjake mentioned this issue Jul 31, 2018

New schema yaml syntax (#790) #880

Merged

beckjake closed this as completed Jul 31, 2018

drewbanin mentioned this issue Sep 10, 2018

test coverage #589

Closed

emilieschario mentioned this issue Nov 28, 2018

dbt doc blocks #1158

Closed
