Commit: Merge pull request #82 from analyst-collective/readthedocs (Readthedocs)
Showing 14 changed files with 340 additions and 135 deletions.
@@ -1,148 +1,27 @@
### dbt

[![Join the chat at https://gitter.im/analyst-collective/dbt](https://badges.gitter.im/analyst-collective/dbt.svg)](https://gitter.im/analyst-collective/dbt?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

A data build tool

---

#### installation

```bash
› pip install dbt
```
#### configuration

To create your first dbt project, run:

```bash
› dbt init [project]
```

This will create a sample `dbt_project.yml` file in the `[project]` directory with everything you need to get started.

Next, create a `profiles.yml` file in the `~/.dbt` directory. If this directory doesn't exist, you should create it. The `dbt_project.yml` file should be checked in to your models repository, so be sure that it does *not* contain any database credentials! Make sure that all of your private information is stored in the `~/.dbt/profiles.yml` configuration file.
##### example dbt_project.yml

```yml
# configure dbt file paths (relative to dbt_project.yml)

# the package config is _required_. If other packages import this package,
# the name given below is used to reference this package
package:
  name: 'package_name'
  version: '1.0'

source-paths: ["models"]   # paths with source code to compile
target-path: "target"      # path for compiled code
clean-targets: ["target"]  # directories removed by the clean task
test-paths: ["test"]       # where to store test results

# default parameters that apply to _all_ models (unless overridden below)
model-defaults:
  enabled: true        # enable all models by default
  materialized: false  # if true, create tables; if false, create views

# custom configurations for each model. Unspecified models will use the
# model-defaults information above.
models:
  pardot:             # assuming pardot is listed in the models/ directory
    enabled: true     # enable all pardot models except where overridden (same as default)
    pardot_emails:            # override the configs for the pardot_emails model
      enabled: true           # enable this specific model (false to disable)
      materialized: true      # create a table instead of a view

      # You can choose sort keys, a dist key, or both to improve query efficiency.
      # By default, materialized tables are created with no sort or dist keys.
      sort: ['@timestamp', '@userid']  # optionally set one or more sort keys on the materialized table
      dist: '@userid'                  # optionally set a distribution key on the materialized table

    pardot_visitoractivity:
      materialized: false
      sort: ['@timestamp']    # this has no effect, as sort and dist keys only apply to materialized tables

# add dependencies. these will get pulled during the `dbt deps` process.
repositories:
  - "git@github.com:analyst-collective/analytics"
```
##### example ~/.dbt/profiles.yml

```yml
user:              # you can have multiple profiles for different projects
  outputs:
    my-redshift:   # uniquely named; you can have different targets in a profile
      type: redshift           # only type supported
      host: localhost          # any IP or fqdn
      port: 5439
      user: my_user
      pass: password
      dbname: dev
      schema: my_model_schema  # the schema to create models in (e.g. analyst_collective)
  run-target: my-redshift      # which target to run sql against
```
#### use

- `dbt deps` to pull the most recent version of dependencies
- `dbt test` to check the validity of your SQL model files (this runs against the DB)
- `dbt compile` to generate runnable SQL from model files
- `dbt run` to run model files on the current `run-target` database
- `dbt clean` to clear compiled files
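Strung together, a typical working cycle looks something like this (a sketch; which steps you need depends on your project):

```bash
dbt deps      # pull packages listed under `repositories` in dbt_project.yml
dbt compile   # generate runnable SQL from the model files
dbt run       # deploy the compiled models to the run-target database
dbt test      # check the validity of the models against the database
dbt clean     # remove compiled artifacts when you want a fresh start
```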
#### docker

An alternate means of using dbt is the docker image `jthandy/dbt`. If you already have docker installed on your system, run:

```bash
docker run -v ${PWD}:/dbt -v ~/.dbt:/root/.dbt jthandy/dbt /bin/bash -c "[type your command here]"
```

This can be run from any dbt project directory; it relies on the same configuration setup outlined above. On linux and osx hosts, running this can be streamlined by including the following function in your `bash_profile`:

```bash
function dbt() {
  docker run -v ${PWD}:/dbt -v ~/.dbt:/root/.dbt jthandy/dbt /bin/bash -c "dbt ${1}"
}
```

At that point, typing any dbt command (e.g. `dbt run`) at the command line will execute it within the docker container.
#### troubleshooting

If you see an error that looks like

> Error: pg_config executable not found

while installing dbt, make sure that the postgres development libraries are installed:

```bash
# linux
sudo apt-get install libpq-dev python-dev

# osx
brew install postgresql
```
#### contributing

To install a development version of `dbt`, run the following from the root directory of this repository:

```bash
› python setup.py develop
```
#### design principles

dbt supports an [opinionated analytics workflow](https://github.com/analyst-collective/wiki/wiki/Building-a-Mature-Analytics-Workflow:-The-Analyst-Collective-Viewpoint). Currently, dbt supports a data modeling workflow; future versions will also support a testing workflow.

##### modeling data with dbt

- A model is a table or view built either on top of raw data or other models. Models are not transient; they are materialized in the database.
- Models are composed of a single SQL `select` statement. Any valid SQL can be used, so models can provide functionality such as data cleansing, data transformation, etc.
- Model files should be saved with a `.sql` extension.
- Each model should be stored in its own `.sql` file. The file name will become the name of the table or view in the database.
- Other models should be referenced with the `ref` function, which resolves dependencies during the `compile` stage. The only tables referenced without this function should be raw source data tables.
- Models should be minimally coupled to the underlying schema so that they are robust to changes in it. Examples of how to implement this practice: a) provide aliases when specifying table and field names in models that select directly from raw data, and b) minimize the number of models that select directly from raw data.
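As a quick illustration of the `ref` convention, a downstream model selecting from a hypothetical `sessions` model (both names are invented for the example) would look like:

```sql
-- models/session_counts.sql
-- `ref` lets dbt resolve the dependency on the sessions model at compile time
select
    user_id,
    count(*) as session_count
from {{ref('sessions')}}
group by user_id
```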
# dbt

dbt [data build tool] helps you write reliable, modular analytic code as an individual or in teams, using a workflow that closely mirrors software development.

- View the [documentation][documentation-url].
- Project [release notes][release-notes-url].
- Join the [chat][gitter-url] on Gitter.

## Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the [PyPA Code of Conduct].

## Project background

For more information on the thinking that led to dbt, see [this article](https://medium.com/analyst-collective/building-a-mature-analytics-workflow-the-analyst-collective-viewpoint-7653473ef05b).

[PyPA Code of Conduct]: https://www.pypa.io/en/latest/code-of-conduct/
[gitter-url]: https://gitter.im/analyst-collective/dbt
[documentation-url]: http://dbt.readthedocs.io/en/readthedocs/
[release-notes-url]: http://dbt.readthedocs.io/en/readthedocs/about/release-notes/
@@ -0,0 +1,14 @@
# Contributing

## Code

We welcome PRs! We recommend that you log any feature requests as issues and discuss the implementation approach with the team before getting to work. To get set up for development, run the following from the root project directory to install a development version of dbt:

```bash
› python setup.py develop
```
## Docs

We welcome PRs with updated documentation! All documentation for dbt is written in markdown using [mkdocs](http://www.mkdocs.org/). Please follow the installation instructions there to set up mkdocs in your local environment.
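If you haven't used mkdocs before, the basic loop is roughly the following (a sketch; see the mkdocs site for authoritative instructions):

```bash
pip install mkdocs   # install mkdocs into your local environment
mkdocs serve         # preview the docs locally with live reload
mkdocs build         # render the static site into the site/ directory
```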
@@ -0,0 +1,45 @@
# Overview #

## What is dbt?

dbt [data build tool] is a tool for creating analytical data models. dbt facilitates an analytical workflow that closely mirrors software development, including source control, testing, and deployment. dbt makes it possible to produce reliable, modular analytic code as an individual or in teams.

For more information on the thinking that led to dbt, see [this article](https://medium.com/analyst-collective/building-a-mature-analytics-workflow-the-analyst-collective-viewpoint-7653473ef05b).
## Who should use dbt?

dbt is built for data consumers who want to model data in SQL to support production analytics use cases. Familiarity with tools like text editors, git, and the command line is helpful: you do not need to be an expert with any of these tools, but some basic familiarity is important.
## Why do I need to model my data?

With the advent of MPP analytic databases like Amazon Redshift and Google BigQuery, it is now common for companies to load and analyze large amounts of raw data in SQL-based environments. Raw data is often not suited for direct analysis and needs to be restructured first. Some common use cases include:

- sessionizing raw web clickstream data
- amortizing multi-month financial transactions

Modeling data transforms raw data into data that can be more easily consumed by business users and BI platforms. It also encodes business rules that can then be relied on by all subsequent analysis, establishing a "single source of truth".
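As a flavor of what such restructuring involves, here is a highly simplified sessionization query (the table, columns, and 30-minute timeout are all invented for the example):

```sql
-- mark a new session after 30 minutes of inactivity, then number sessions per user
select
    user_id,
    event_time,
    sum(is_new_session) over (
        partition by user_id
        order by event_time
        rows unbounded preceding
    ) as session_number
from (
    select
        user_id,
        event_time,
        case
            when event_time - lag(event_time) over (
                     partition by user_id order by event_time
                 ) > interval '30 minutes'
                 or lag(event_time) over (
                     partition by user_id order by event_time
                 ) is null
            then 1
            else 0
        end as is_new_session
    from raw_clickstream.pageviews
) flagged
```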
## What exactly is a "data model" in this context?

A dbt data model is a SQL `SELECT` statement with templating and dbt-specific extensions.
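Concretely, a complete model file can be as small as this (a contrived example; `orders` is a hypothetical upstream model):

```sql
-- models/recent_orders.sql
select *
from {{ref('orders')}}
where ordered_at > current_date - 30
```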
## How does dbt work?

dbt has a small number of core functions. It:

- takes a set of data models and compiles them into raw SQL,
- materializes them into your database as views and tables, and
- runs automated tests on top of them to ensure their integrity.

Once your data models have been materialized into your database, you can write analytic queries on top of them in any SQL-enabled tool.
Conceptually, this is very simple. Practically, dbt solves some big headaches in exactly *how* it accomplishes these tasks:

- dbt interpolates schema and table names in your data models. This allows you to do things like deploy models to test and production environments seamlessly.
- dbt automatically infers a directed acyclic graph of the dependencies between your data models and uses this graph to manage the deployment to your schema. This graph is powerful, and allows for features like partial deployment and safe multi-threading.
- dbt's opinionated design lets you focus on writing your business logic instead of writing configuration and boilerplate code.
## Why model data in SQL?

Historically, most analytical data modeling has been done prior to loading data into a SQL-based analytic database. Today, however, it's often preferable to model data within an analytic database using SQL. There are two primary reasons for this:

1. SQL is a very widely known language for working with data. Providing SQL-based modeling tools gives the largest possible group of users access.
1. Modern analytic databases are extremely performant and have sophisticated optimizers. Writing data transformations in SQL allows users to describe transformations on their data but leave the execution plan to the underlying technology. In practice, this provides excellent results with far less work on the part of the author.

Of course, SQL is not a Turing-complete language (to say the least!) and so will inevitably not be suitable for 100% of potential use cases. dbt may be extended in the future to take advantage of support for non-SQL languages in platforms like Redshift and BigQuery. We have found, though, that modern SQL has a higher degree of coverage than we had originally expected. For users of languages like Python, solving a challenging problem in SQL often requires a different type of thinking, but the advantages of staying "in-database" and letting the optimizer work for you are very significant.
## What databases does dbt currently support?

Currently, dbt supports PostgreSQL and Amazon Redshift. We anticipate building support for additional databases in the future.
@@ -0,0 +1,40 @@
# Best practices #

We use dbt extensively in our own analytics work and have developed some guidelines that we believe will make you more successful in your own usage.
## Limit dependencies on raw data

It's straightforward to maintain dependencies within a dbt project using the `ref()` function, but your project will inevitably depend on raw data stored elsewhere in your database. We recommend creating what we call "base models" to minimize the dependencies on external tables. In this convention, base models have the following responsibilities:

- Select only the fields that are relevant for current analytics, to limit complexity. More fields can always be added later.
- Perform any needed type conversion.
- Perform field renaming to rationalize field names into a standard format used within the project.
- **Act as the sole access point to a given raw data table.**
All subsequent data models are then built on top of base models rather than on top of raw data—only base models are allowed to select from raw data tables. This ensures both that all of the transformations within the base model are applied to every use of the data, and that if the source data table moves (or lives in a different schema or table in a different environment) it only needs to be renamed in a single place.

For a simple example of a base model, check out this (link to a snowplow model).
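In the meantime, here is a minimal sketch of what a base model over a hypothetical raw Snowplow events table could look like (all names are illustrative):

```sql
-- models/base/snowplow_events.sql
-- the sole access point to the raw snowplow.events table
select
    event_id,
    collector_tstamp::timestamp as occurred_at,  -- type conversion
    domain_userid               as user_id,      -- rename to project conventions
    page_urlpath                as page_path     -- keep only relevant fields
from snowplow.events
```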
## Creating trustworthy analytics

Software developers often use sophisticated tools for source control, environment management, and deployment. Analytics, to date, has not had the same tooling. Frequently, all analytics is conducted in "production", and ad-hoc mechanisms are used within a given analytics product to know what is trustworthy and what is not. The question "Is this data trustworthy?" can make or break an analytics project, and managing environments and source control are the keys to making sure the answer to that question is always "yes".
## Managing multiple environments

Currently, dbt supports multiple `run-target`s within a given project within `~/.dbt/profiles.yml`. Users can configure a default `run-target` and can override this setting with the `--target` flag passed to `dbt run`. We recommend setting your default `run-target` to your development environment, and switching to your production `run-target` on a case-by-case basis.

Using `run-target` to manage multiple environments gives you the flexibility to set up your environments how you choose. Commonly, environments are managed by schemas within the same database: all test models are deployed to a schema called `dbt_[username]` and production models are deployed to a schema called `analytics`. An ideal setup would have production and test databases completely separate. Either way, we highly recommend maintaining multiple environments and managing deployments with `run-target`.
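Following that convention, a `~/.dbt/profiles.yml` might look roughly like this (hosts, credentials, and schema names are placeholders):

```yml
user:
  outputs:
    dev:                  # per-user development schema
      type: redshift
      host: warehouse.example.com
      port: 5439
      user: my_user
      pass: password
      dbname: analytics
      schema: dbt_my_user
    prod:                 # production schema, used deliberately
      type: redshift
      host: warehouse.example.com
      port: 5439
      user: my_user
      pass: password
      dbname: analytics
      schema: analytics
  run-target: dev         # default to development; override with --target
```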
## Source control workflows

We believe that all dbt projects should be managed via source control. We use git for all of our source control, and use branching and pull requests to keep the master branch the sole source of organizational truth. Future versions of dbt will include hooks that automatically deploy to production upon pushing to master.
## Using dbt interactively

The best development tools allow for very small units of work to be developed and tested quickly. One of the major advantages of dbt is getting analytics out of clunky tools and into text files that can be edited in your editor of choice—we have folks using vim, emacs, and Atom.
When your project gets large enough, `dbt run` can begin to take a while. This stage in your development can become a bottleneck and slow you down. dbt provides three primary ways to address this:

1. Use views instead of tables to the greatest extent possible in development. Views typically deploy much faster than tables, and in development it's often not critical that subsequent analytic queries run as fast as possible. It's easy to change this setting later, and it will have no impact on your business logic.
1. Use `dbt_project.yml` to disable portions of your project that you're not currently working on. If you have multiple modules within a given project, turn off the ones you're not working on so that those models don't deploy with every `dbt run`.
1. Pass the `--model` flag to `dbt run`. This flag asks dbt to only `run` the models you specify and their dependents, as shown below. If you're working on a particular model, this can make a very significant difference in your workflow.
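For example (the model name is hypothetical):

```bash
# build only snowplow_sessions and the models that depend on it
dbt run --model snowplow_sessions
```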
@@ -0,0 +1,37 @@
# Building models #

Building data models is the core of using dbt. This section provides guidance on how to think about data models in dbt and how to go about building them.
## Everything is a `SELECT`

The core concept of dbt data models is that everything is a `SELECT` statement. Using this approach, the SQL code within a given model defines the dataset, while dbt configuration defines what to do with it.
The advantages of this approach may not be immediately clear, but here are some of the things that become possible when data models are specified this way:

- With a single config change, one data model or an entire hierarchy of models can be flipped from views to materialized tables. dbt takes care of wrapping a model's `SELECT` statement in the appropriate `CREATE TABLE` or `CREATE VIEW` syntax.
- With two configuration changes, a model can be flipped from a materialized table that is rebuilt with every `dbt run` to a table that is built incrementally, inserting only the rows added since the most recent `dbt run`. dbt will wrap the select in an `INSERT` statement and automatically generate the appropriate `WHERE` clause.
- With one config change, a model can be made ephemeral. Instead of being deployed into the database, ephemeral models are pulled into dependent models as common table expressions.
Because every model is a `SELECT`, these behaviors can all be configured very simply, allowing for flexibility in development workflow and production deployment.
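Using the model configuration keys from the example `dbt_project.yml` in the README, the view-to-table flip described above is a single line (a sketch; `web` and `sessions` are invented names):

```yml
models:
  web:
    sessions:
      enabled: true
      materialized: true   # flipped from false: deploy as a table instead of a view
```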
## Using `ref()`

dbt models support interpolation via the Jinja2 templating language. This presents many powerful options for building data models, many of which are only now beginning to be explored! The most important function in dbt is `ref()`; it's impossible to build even moderately complex models without it.

`ref()` is how you reference one model within another. This is a very common pattern, as models are typically "stacked" on top of one another to create increasing analytical sophistication. Here is how this looks in practice:
```sql
--filename: model_a.sql

select *
from public.raw_data
```

```sql
--filename: model_b.sql

select *
from {{ref('model_a')}}
```
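Under the hood, if the current run-target deploys to a schema named `analytics`, the compiled output for `model_b` would look roughly like this (illustrative only):

```sql
-- target/model_b.sql (compiled)
select *
from "analytics"."model_a"
```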
`ref()` is actually doing two important things. First, it interpolates the schema into your model file, allowing you to change your deployment schema via configuration. Second, it uses these references between models to automatically build the dependency graph, which enables dbt to deploy models in the correct order when using `dbt run`.

When calling Jinja2, functions are wrapped in double curly braces (`{{ }}`), so writing `ref('model_name')` must actually be done as `{{ref('model_name')}}`.