Merge pull request #82 from analyst-collective/readthedocs
Readthedocs
jthandy authored Jul 30, 2016
2 parents c422536 + bdc069b commit cc0caec
Showing 14 changed files with 340 additions and 135 deletions.
149 changes: 14 additions & 135 deletions README.md
@@ -1,148 +1,27 @@
### dbt

[![Join the chat at https://gitter.im/analyst-collective/dbt](https://badges.gitter.im/analyst-collective/dbt.svg)](https://gitter.im/analyst-collective/dbt?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

A data build tool
---

#### installation

```bash
› pip install dbt
```

#### configuration
To create your first dbt project, run:
```bash
› dbt init [project]
```
This will create a sample dbt_project.yml file in the [project] directory with everything you need to get started.

Next, create a `profiles.yml` file in the `~/.dbt` directory. If this directory doesn't exist, you should create it. The
`dbt_project.yml` file should be checked in to your models repository, so be sure that it does *not* contain any database
credentials! Make sure that all of your private information is stored in the `~/.dbt/profiles.yml` configuration file.

##### example dbt_project.yml
```yml

# configure dbt file paths (relative to dbt_project.yml)

# the package config is _required_. If other packages import this package,
# the name given below is used to reference this package
package:
  name: 'package_name'
  version: '1.0'

source-paths: ["models"] # paths with source code to compile
target-path: "target" # path for compiled code
clean-targets: ["target"] # directories removed by the clean task
test-paths: ["test"] # where to store test results

# default parameters that apply to _all_ models (unless overridden below)

model-defaults:
  enabled: true # enable all models by default
  materialized: false # If true, create tables. If false, create views

# custom configurations for each model. Unspecified models will use the model-defaults information above.

models:
  pardot: # assuming pardot is listed in the models/ directory
    enabled: true # enable all pardot models except where overridden (same as default)
    pardot_emails: # override the configs for the pardot_emails model
      enabled: true # enable this specific model (false to disable)
      materialized: true # create a table instead of a view

      # You can choose sort keys, a dist key, or both to improve query efficiency. By default, materialized
      # tables are created with no sort or dist keys.
      #
      sort: ['@timestamp', '@userid'] # optionally set one or more sort keys on the materialized table
      dist: '@userid' # optionally set a distribution key on the materialized table

    pardot_visitoractivity:
      materialized: false
      sort: ['@timestamp'] # this has no effect, as sort and dist keys only apply to materialized tables

# add dependencies. these will get pulled during the `dbt deps` process.

repositories:
- "[email protected]:analyst-collective/analytics"

```

##### example ~/.dbt/profiles.yml
```yml
user: # you can have multiple profiles for different projects
  outputs:
    my-redshift: # uniquely named, you can have different targets in a profile
      type: redshift # only type supported
      host: localhost # any IP or fqdn
      port: 5439
      user: my_user
      pass: password
      dbname: dev
      schema: my_model_schema # the schema to create models in (eg. analyst_collective)
  run-target: my-redshift # which target to run sql against
```
#### use
`dbt deps` to pull the most recent version of dependencies

`dbt test` to check the validity of your SQL model files (this runs against the DB)

`dbt compile` to generate runnable SQL from model files

`dbt run` to run model files on the current `run-target` database

`dbt clean` to clear compiled files

#### docker

An alternate means of using dbt is with the docker image jthandy/dbt. If you already have docker installed on your system, run dbt with:

```bash
docker run -v ${PWD}:/dbt -v ~/.dbt:/root/.dbt jthandy/dbt /bin/bash -c "[type your command here]"
```

This can be run from any dbt project directory. It relies on the same configuration setup outlined above.

On linux and osx hosts, running this can be streamlined by including the following function in your bash_profile:

```bash
function dbt() {
  docker run -v ${PWD}:/dbt -v ~/.dbt:/root/.dbt jthandy/dbt /bin/bash -c "dbt ${1}"
}
```

At that point, typing any dbt command (e.g. `dbt run`) into a command line will execute it within the docker container.

#### troubleshooting

If you see an error that looks like

> Error: pg_config executable not found

while installing dbt, make sure that you have the postgres development libraries installed:

```bash
# linux
sudo apt-get install libpq-dev python-dev

# osx
brew install postgresql
```

#### contributing

From the root directory of this repository, run:
```bash
› python setup.py develop
```

to install a development version of `dbt`.

#### design principles

dbt supports an [opinionated analytics workflow](https://github.com/analyst-collective/wiki/wiki/Building-a-Mature-Analytics-Workflow:-The-Analyst-Collective-Viewpoint). Currently, dbt supports the data modeling workflow; future versions will also support workflows for testing.

##### modeling data with dbt
- A model is a table or view built either on top of raw data or other models. Models are not transient; they are materialized in the database.
- Models are composed of a single SQL `select` statement. Any valid SQL can be used. As such, models can provide functionality such as data cleansing, data transformation, etc.
- Model files should be saved with a `.sql` extension.
- Each model should be stored in its own `.sql` file. The file name will become the name of the table or view in the database.
- Other models should be referenced with the `ref` function. This function will resolve dependencies during the `compile` stage. The only tables referenced without this function should be source raw data tables.
- Models should be minimally coupled to the underlying schema to make them robust to changes therein. Examples of how to implement this practice: a) provide aliases when specifying table and field names in models that select directly from raw data, b) minimize the number of models that select directly from raw data.

# dbt

dbt [data build tool] helps you write reliable, modular analytic code as an individual or in teams using a workflow that closely mirrors software development.

- View the [documentation][documentation-url].
- Project [release notes][release-notes-url].
- Join the [chat][gittr-url] on Gitter.

## Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the [PyPA Code of Conduct].

## Project background

For more information on the thinking that led to dbt, see [this article](https://medium.com/analyst-collective/building-a-mature-analytics-workflow-the-analyst-collective-viewpoint-7653473ef05b).

[PyPA Code of Conduct]: https://www.pypa.io/en/latest/code-of-conduct/
[gittr-url]: https://gitter.im/analyst-collective/dbt
[documentation-url]: http://dbt.readthedocs.io/en/readthedocs/
[release-notes-url]: http://dbt.readthedocs.io/en/readthedocs/about/release-notes/
14 changes: 14 additions & 0 deletions docs/about/contributing.md
@@ -0,0 +1,14 @@
# Contributing

## Code

We welcome PRs! We recommend that you log any feature requests as issues and discuss the implementation approach with the team before getting to work. To get set up to develop, run the following from the root project directory to install a development version of dbt:

```bash
› python setup.py develop
```


## Docs

We welcome PRs with updated documentation! All documentation for dbt is written in markdown using [mkdocs](http://www.mkdocs.org/). Please follow the installation instructions there to set up mkdocs in your local environment.
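As a sketch of what that local setup might look like (assuming pip is available; consult the mkdocs site for authoritative instructions):

```bash
# install mkdocs, then preview the documentation locally with live reload
pip install mkdocs
mkdocs serve
```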
Empty file added docs/about/license.md
45 changes: 45 additions & 0 deletions docs/about/overview.md
@@ -0,0 +1,45 @@
# Overview #

## What is dbt?
dbt [data build tool] is a tool for creating analytical data models. dbt facilitates an analytical workflow that closely mirrors software development, including source control, testing, and deployment. dbt makes it possible to produce reliable, modular analytic code as an individual or in teams.

For more information on the thinking that led to dbt, see [this article](https://medium.com/analyst-collective/building-a-mature-analytics-workflow-the-analyst-collective-viewpoint-7653473ef05b).

## Who should use dbt?
dbt is built for data consumers who want to model data in SQL to support production analytics use cases. Familiarity with tools like text editors, git, and the command line is helpful—while you do not need to be an expert with any of these tools, some basic familiarity is important.

## Why do I need to model my data?
With the advent of MPP analytic databases like Amazon Redshift and Google BigQuery, it is now common for companies to load and analyze large amounts of raw data in SQL-based environments. Raw data is often not suited for direct analysis and needs to be restructured first. Some common use cases include:
- sessionizing raw web clickstream data
- amortizing multi-month financial transactions

Modeling data transforms raw data into data that can be more easily consumed by business users and BI platforms. It also encodes business rules that can then be relied on by all subsequent analysis, establishing a "single source of truth".

## What exactly is a "data model" in this context?
A dbt data model is a SQL `SELECT` statement with templating and dbt-specific extensions.
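
As a purely illustrative sketch (the table and column names below are hypothetical, not part of this project), a minimal model is just a `SELECT` that cleans and reshapes raw data:

```sql
-- orders.sql: a hypothetical model file; the file name becomes the table/view name
select
    id as order_id,                  -- rename fields into a consistent format
    user_id,
    created_at::date as order_date,  -- convert types as needed
    amount_cents / 100.0 as amount_usd
from public.raw_orders               -- a raw source table
```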

## How does dbt work?

dbt has a small number of core functions. It:
- takes a set of data models and compiles them into raw SQL,
- materializes them into your database as views and tables, and
- runs automated tests on top of them to ensure their integrity.
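
These functions correspond to dbt's command-line interface. As a rough sketch, using the commands documented in this commit's README changes:

```bash
dbt compile   # compile data models into raw SQL in the target path
dbt run       # materialize the compiled models as views/tables in the run-target database
dbt test      # run automated tests against the database to check model integrity
```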

Once your data models have been materialized into your database, you can write analytic queries on top of them in any SQL-enabled tool.

Conceptually, this is very simple. Practically, dbt solves some big headaches in exactly *how* it accomplishes these tasks:
- dbt interpolates schema and table names in your data models. This allows you to do things like deploy models to test and production environments seamlessly.
- dbt automatically infers a directed acyclic graph of the dependencies between your data models and uses this graph to manage the deployment to your schema. This graph is powerful, and allows for features like partial deployment and safe multi-threading.
- dbt's opinionated design lets you focus on writing your business logic instead of writing configuration and boilerplate code.

## Why model data in SQL?

Historically, most analytical data modeling has been done prior to loading data into a SQL-based analytic database. Today, however, it's often preferable to model data within an analytic database using SQL. There are two primary reasons for this:

1. SQL is a very widely-known language for working with data. Providing SQL-based modeling tools gives the largest-possible group of users access.
1. Modern analytic databases are extremely performant, and have sophisticated optimizers. Writing data transformations in SQL allows users to describe transformations on their data but leave the execution plan to the underlying technology. In practice, this provides excellent results with far less work on the part of the author.

Of course, SQL is not a Turing-complete language (to say the least!) and so, will inevitably not be suitable for 100% of potential use cases. dbt may be extended in the future to take advantage of support for non-SQL languages in platforms like Redshift and BigQuery. We have found, though, that modern SQL has a higher degree of coverage than we had originally expected. To users of languages like Python, solving a challenging problem in SQL often requires a different type of thinking, but the advantages of staying "in-database" and allowing the optimizer to work for you are very significant.

## What databases does dbt currently support?
Currently, dbt supports PostgreSQL and Amazon Redshift. We anticipate building support for additional databases in the future.
Empty file added docs/about/release-notes.md
40 changes: 40 additions & 0 deletions docs/guide/best-practices.md
@@ -0,0 +1,40 @@
# Best practices #

We use dbt extensively in our own analytics work and have developed some guidelines that we believe will make you more successful in your own usage.

## Limit dependencies on raw data

It's straightforward to maintain dependencies within a dbt project using the `ref()` function, but your project will inevitably depend on raw data stored elsewhere in your database. We recommend making what we call "base models" to minimize the dependencies on external tables. In this convention, base models have the following responsibilities:

- Select only the fields that are relevant for current analytics to limit complexity. More fields can always be added later.
- Perform any needed type conversion.
- Perform field renaming to rationalize field names into a standard format used within the project.
- **Act as the sole access point to a given raw data table.**

In this convention, all subsequent data models are built on top of base models rather than on top of raw data—only base models are allowed to select from raw data tables. This ensures both that all of the transformations within the base model will be applied to all uses of this data and that if the source data table moves (or is located in a different schema or table in a different environment) it can be renamed in a single place.

For a simple example of a base model, check out this (link to a snowplow model).
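
The linked snowplow example isn't included in this commit, so here is a purely hypothetical sketch of a base model that follows the responsibilities above (table and column names are illustrative):

```sql
-- models/base_events.sql: the sole access point to the raw events table
select
    event_id,                                    -- select only the fields current analytics need
    user_id,
    collector_tstamp::timestamp as occurred_at,  -- type conversion and renaming into the
    se_action as event_action                    -- project's standard field format
from public.raw_events
```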

## Creating trustworthy analytics

Software developers often use sophisticated tools for source control, environment management, and deployment. Analytics, to-date, has not had the same tooling. Frequently, all analytics is conducted in "production", and ad-hoc mechanisms are used within a given analytics product to know what is trustworthy and what is not. The question "Is this data trustworthy?" can make or break an analytics project, and managing environments and source control are the keys to making sure the answer to that question is always "Yes."

## Managing multiple environments

Currently, dbt supports multiple `run-target`s within a given project within `~/.dbt/profiles.yml`. Users can configure a default `run-target` and can override this setting with the `--target` flag passed to `dbt run`. We recommend setting your default `run-target` to your development environment, and then switching to your production `run-target` on a case-by-case basis.

Using `run-target` to manage multiple environments gives you the flexibility to set up your environments however you choose. Commonly, environments are managed by schemas within the same database: all test models are deployed to a schema called `dbt_[username]` and production models are deployed to a schema called `analytics`. An ideal setup would have production and test databases completely separate. Either way, we highly recommend maintaining multiple environments and managing deployments with `run-target`.
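
As a sketch of what this can look like (following the `profiles.yml` format shown in the project README; names and connection details are hypothetical), a profile might define a development and a production target and default to development:

```yml
user:
  outputs:
    dev:                    # default target: deploy to a per-user schema
      type: redshift
      host: localhost
      port: 5439
      user: my_user
      pass: password
      dbname: dev
      schema: dbt_my_user
    prod:                   # production target: deploy to the shared analytics schema
      type: redshift
      host: localhost
      port: 5439
      user: my_user
      pass: password
      dbname: dev
      schema: analytics
  run-target: dev           # development by default; override per-run with --target
```

With a setup like this, day-to-day work runs against `dev`, and a production deployment is an explicit choice such as `dbt run --target prod`.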

## Source control workflows

We believe that all dbt projects should be managed via source control. We use git for all of our source control, and use branching and pull requests to keep the master branch the sole source of organizational truth. Future versions of dbt will include hooks that will automatically deploy to production upon pushing to master.

## Using dbt interactively

The best development tools allow for very small units of work to be developed and tested quickly. One of the major advantages of dbt is getting analytics out of clunky tools and into text files that can be edited in whatever your editor of choice is—we have folks using vim, emacs, and Atom.

When your project gets large enough, `dbt run` can begin to take a while. This stage in your development could be a bottleneck and slow you down. dbt provides three primary ways to address this:

1. Use views instead of tables to the greatest extent possible in development. Views typically deploy much faster than tables, and in development it's often not critical that subsequent analytic queries run as fast as possible. It's easy to change this setting later and it will have no impact on your business logic.
1. Use `dbt_project.yml` to disable portions of your project that you're not currently working on. If you have multiple modules within a given project, turn off the ones that you're not currently working on so that those models don't deploy with every `dbt run`.
1. Pass the `--model` flag to `dbt run`. This flag asks dbt to only `run` the models you specify and their dependents. If you're working on a particular model, this can make a very significant difference in your workflow.
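
As an illustrative sketch of the first two options (module and model names here are hypothetical), both are small `dbt_project.yml` changes:

```yml
models:
  pardot:
    materialized: false   # 1. build views instead of tables while iterating
  adwords:
    enabled: false        # 2. temporarily disable a module you aren't currently working on
```

For the third option, a run can be restricted to a single model and its dependents with something like `dbt run --model pardot_emails` (see the dbt documentation for the exact flag usage).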
37 changes: 37 additions & 0 deletions docs/guide/building-models.md
@@ -0,0 +1,37 @@
# Building models #

Building data models is the core of using dbt. This section provides guidance on how to think about data models in dbt and how to go about building them.

## Everything is a `SELECT`

The core concept of dbt data models is that everything is a `SELECT` statement. Using this approach, the SQL code within a given model defines the dataset, while dbt configuration defines what to do with it.

The advantages of this may not be immediately clear, but here are some things that become possible when data models are specified this way:
- With a single config change, one data model or an entire hierarchy of models can be flipped from views to materialized tables. dbt takes care of wrapping a model's `SELECT` statement in the appropriate `CREATE TABLE` or `CREATE VIEW` syntax.
- With two configuration changes, a model can be flipped from a materialized table that is rebuilt with every `dbt run` to a table that is built incrementally, inserting the most recent rows since the most recent `dbt run`. dbt will wrap the select into an `INSERT` statement and automatically generate the appropriate `WHERE` clause.
- With one config change, a model can be made ephemeral. Instead of being deployed into the database, ephemeral models are pulled into dependent models as common table expressions.

Because every model is a `SELECT`, these behaviors can all be configured very simply, allowing for flexibility in development workflow and production deployment.
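
As a sketch using the configuration format from the README's `dbt_project.yml` example (model names hypothetical; the incremental and ephemeral behaviors use additional settings not shown here), flipping models between views and tables is just a config change:

```yml
models:
  pardot:
    materialized: false    # deploy every pardot model as a view...
    pardot_emails:
      materialized: true   # ...except this one, which is rebuilt as a table on each `dbt run`
```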

## Using `ref()`

dbt models support interpolation via the Jinja2 templating language. This presents many powerful options for building data models, many of which are only now beginning to be explored! The most important function in dbt is `ref()`; it's impossible to build even moderately complex models without it.

`ref()` is how you reference one model within another. This is a very common behavior, as typically models are built to be "stacked" on top of one another to create increasing analytical sophistication. Here is how this looks in practice:

```sql
--filename: model_a.sql

select *
from public.raw_data
```
```sql
--filename: model_b.sql

select *
from {{ref('model_a')}}
```

`ref()` is, under the hood, actually doing two important things. First, it is interpolating the schema into your model file to allow you to change your deployment schema via configuration. Second, it is using these references between models to automatically build the dependency graph. This will enable dbt to deploy models in the correct order when using `dbt run`.
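
For example, assuming the run-target's schema is `my_model_schema` (as in the README's example profile), the compiled output of `model_b.sql` above would look roughly like:

```sql
-- target/model_b.sql (illustrative; exact quoting and layout may differ)
select *
from my_model_schema.model_a
```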

Jinja2 function calls are wrapped in double curly braces (`{{ }}`), so `ref('model_name')` must actually be written as `{{ref('model_name')}}`.