get-started: update Get Started according to most recent example-get-started repo #555

Merged
merged 13 commits into from
Aug 23, 2019
2 changes: 1 addition & 1 deletion static/docs/changelog/0.18.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ really excited to share the progress with you:

- Commands startup latency reduced 3x

- 📙 **Documentation got better** - a whole new [get started](/doc/get-started)
- 📙 **Documentation got better** - a whole new [Get Started](/doc/get-started)
guide, new [use cases](/doc/use-cases), DVC internals, and a lot of other great
stuff you can find here.

Expand Down
4 changes: 2 additions & 2 deletions static/docs/changelog/0.35.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ improvements) we have done in the last few months:
tags** (Git commits are coming). The new option `-T` or `--all-tags` is
supported by all DVC commands that support `-a` or `--all-branches`.

- 📖 [Get started guide](/doc/get-started/agenda) has been simplified (e.g. to
use tags instead of branches) and extended. We have also prepared a
- 📖 The [Get Started](/doc/get-started/agenda) section has been simplified
(e.g. to use tags instead of branches) and extended. We have also prepared a
[GitHub DVC project](https://github.com/iterative/example-get-started) that
reflects the sequence of steps in the “get started” guide. You can now
download the whole project and reproduce all the models.
Expand Down
9 changes: 5 additions & 4 deletions static/docs/commands-reference/checkout.md
Original file line number Diff line number Diff line change
Expand Up @@ -144,15 +144,16 @@ solving the problem:

```dvc
$ git tag
baseline <- first simple version of the model
bigram <- use bigrams to improve the model
baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
```

This project comes with a predefined HTTP
[remote storage](/doc/commands-reference/remote). We can now just run `dvc pull`
that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other
files that are under DVC control. The model file checksum
`3863d0e317dee0a55c4e59d2ec0eef33` is specified in the `train.dvc` file:
`3863d0e317dee0a55c4e59d2ec0eef33` will be used in the `train.dvc`
[stage file](/doc/commands-reference/run):

```dvc
$ dvc pull
Expand All @@ -177,7 +178,7 @@ Note: checking out 'baseline'.
HEAD is now at 40cc182...
```

Let's check the `model.pkl` entry in `train.dvc` again:
Let's check the `model.pkl` entry in `train.dvc` now:

```yaml
outs:
Expand Down
22 changes: 11 additions & 11 deletions static/docs/commands-reference/commit.md
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,7 @@ recommend creating a virtual environment with a tool such as
```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
$ pip install -r src/requirements.txt
```

Download the precomputed data using:
Expand Down Expand Up @@ -176,17 +176,17 @@ indeed _not in cache_ as claimed. Look at `train.dvc` first:
```yaml
cmd: python src/train.py data/features model.pkl
deps:
- md5: d05e0201a3fb47c878defea65bd85e4d
path: src/train.py
- md5: b7a357ba7fa6b726e615dd62b34190b4.dir
path: data/features
md5: b91b22bfd8d9e5af13e8f48523e80250
- md5: d05e0201a3fb47c878defea65bd85e4d
path: src/train.py
- md5: b7a357ba7fa6b726e615dd62b34190b4.dir
path: data/features
md5: b91b22bfd8d9e5af13e8f48523e80250
outs:
- cache: true
md5: 70599f166c2098d7ffca91a369a78b0d
metric: false
path: model.pkl
persist: false
- cache: true
md5: 70599f166c2098d7ffca91a369a78b0d
metric: false
path: model.pkl
persist: false
wdir: .
```
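
As a rough illustration of what "in cache" means for the `model.pkl` output listed above, the sketch below maps a checksum to a cache location and checks whether that entry exists. It assumes the cache lives in `.dvc/cache` and stores each entry under a directory named after the first two characters of the checksum; that layout is an assumption not stated in this diff, and `cache_path` and `is_cached` are hypothetical helper names, not DVC APIs.

```python
from pathlib import Path


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> Path:
    """Return where a file with this checksum would sit in the cache.

    Assumed layout: <cache_dir>/<first two hex chars>/<remaining chars>.
    """
    return Path(cache_dir) / md5[:2] / md5[2:]


def is_cached(md5: str, cache_dir: str = ".dvc/cache") -> bool:
    """True if a cache entry for this checksum exists on disk."""
    return cache_path(md5, cache_dir).exists()


# Checksum of model.pkl taken from the train.dvc listing above.
model_md5 = "70599f166c2098d7ffca91a369a78b0d"
print(cache_path(model_md5))
print("in cache:", is_cached(model_md5))
```

Run from the project root, this prints the expected cache location and whether a file is present there, which is one way to confirm by hand that an output has not been committed to the cache yet.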

Expand Down
11 changes: 6 additions & 5 deletions static/docs/commands-reference/diff.md
Original file line number Diff line number Diff line change
Expand Up @@ -53,8 +53,9 @@ by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.

## Examples

For these examples we can use the steps in our [Get Started](/doc/get-started)
guide, up to the [Add Files](/doc/get-started/add-files) step.
For these examples we can use the chapters in our
[Get Started](/doc/get-started) guide, up to
[Add Files](/doc/get-started/add-files).

<details>

Expand All @@ -63,7 +64,7 @@ guide, up to the [Add Files](/doc/get-started/add-files) step.
Start by cloning our sample repo if you don't already have it. Then move into
the repo and checkout the
[version](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
corresponding to the _Add Files_ step:
corresponding to the _Add Files_ chapter:

```dvc
$ git clone https://github.com/iterative/example-get-started
Expand Down Expand Up @@ -103,7 +104,7 @@ added file with size 37.9 MB

We can base this example on the [Experiment Metrics](/doc/get-started/metrics)
and [Compare Experiments](/doc/get-started/compare-experiments) sections of our
Get Started guide, which describe different experiments to produce the
_Get Started_ section, which describe different experiments to produce the
`model.pkl` file. Our sample repository has the `bigrams-experiment` and
`baseline-experiment`
[tags](https://github.com/iterative/example-get-started/tags) respectively to
Expand Down Expand Up @@ -174,7 +175,7 @@ Let's use our sample repo once again, which has several
[available tags](https://github.com/iterative/example-get-started/tags) for
convenience. The `5-preparation` tag corresponds to the
[Connect Code and Data](/doc/get-started/connect-code-and-data) section of our
Get Started guide, in which the `dvc run` command is used to create the
_Get Started_ section, in which the `dvc run` command is used to create the
`prepare.dvc` stage file. The output defined in this DVC-file is the
`data/prepared` directory.

Expand Down
4 changes: 2 additions & 2 deletions static/docs/commands-reference/fetch.md
Original file line number Diff line number Diff line change
Expand Up @@ -155,8 +155,8 @@ solving the problem:
```dvc
$ git tag

baseline <- first simple version of the model
bigram <- use bigrams to improve the model
baseline-experiment <- first simple version of the model
bigrams-experiment <- use bigrams to improve the model
```

## Example: Default behavior
Expand Down
12 changes: 6 additions & 6 deletions static/docs/commands-reference/import-url.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,7 @@ them on your system.
Start by cloning our sample repo if you don't already have it. Then move into
the repo and checkout the
[version](https://github.com/iterative/example-get-started/releases/tag/2-remote)
corresponding to the [Configure](/doc/get-started/configure) step:
corresponding to the [Configure](/doc/get-started/configure) chapter:

```dvc
$ git clone https://github.com/iterative/example-get-started
Expand All @@ -140,13 +140,13 @@ $ mkdir data
```

You should now have a blank workspace, just before the
[Add Files](/doc/get-started/add-files) step.
[Add Files](/doc/get-started/add-files) chapter.

</details>

## Example: Tracking a remote file

An advanced alternate to [Add Files](/doc/get-started/add-files) step of the
An advanced alternate to [Add Files](/doc/get-started/add-files) chapter of the
_Get Started_ section is to use `dvc import-url`:

```dvc
Expand Down Expand Up @@ -248,8 +248,8 @@ instead of an `etag` we have an `md5` checksum. We did this so it's easy to edit
the data file.

Let's now manually reproduce
[one of the processing steps](/doc/get-started/connect-code-and-data) from the
_Get Started_ project. Download the sample source code archive and unzip it:
[one of the processing chapters](/doc/get-started/connect-code-and-data) from
the _Get Started_ project. Download the sample source code archive and unzip it:

```dvc
$ wget https://code.dvc.org/get-started/code.zip
Expand All @@ -268,7 +268,7 @@ creating a virtual environment with a tool such as
```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
$ pip install -r src/requirements.txt
```

</details>
Expand Down
5 changes: 3 additions & 2 deletions static/docs/commands-reference/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ recommend creating a virtual environment with a tool such as
```dvc
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
$ pip install -r src/requirements.txt
```

Download the precomputed data using:
Expand Down Expand Up @@ -142,7 +142,8 @@ $ git tag
6-featurization
7-train
8-evaluation
9-bigrams
9-bigrams-model
10-bigrams-experiment
baseline-experiment
bigrams-experiment
```
Expand Down
2 changes: 1 addition & 1 deletion static/docs/get-started/add-files.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ Git:

```dvc
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "add raw data to DVC"
$ git commit -m "Add raw data to project"
```

Committing these special files to Git allows us to track different versions of
Expand Down
16 changes: 8 additions & 8 deletions static/docs/get-started/compare-experiments.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,10 @@ DVC makes it easy to iterate on your project using Git commits with tags or Git
branches. It provides a way to try different ideas, keep track of them, switch
back and forth. To find the best performing experiment or track the progress, a
special _metric_ output type is supported in DVC (described in one of the
previous steps).
previous chapters).

Let's run evaluate for the latest `bigram` experiment we created in one of the
previous steps. It mostly takes just running the `dvc repro`:
Let's run evaluate for the latest `bigrams` experiment we created in previous
chapters. It mostly takes just running the `dvc repro`:

```dvc
$ git checkout master
Expand All @@ -17,12 +17,12 @@ $ dvc repro evaluate.dvc

`git checkout master` and `dvc checkout` commands ensure that we have the latest
experiment code and data respectively. And `dvc repro`, as we discussed in the
[reproduce](/doc/get-started/reproduce) step, is a way to run all the necessary
commands to build the model and measure its performance.
[Reproduce](/doc/get-started/reproduce) chapter, is a way to run all the
necessary commands to build the model and measure its performance.

```dvc
$ git commit -a -m "evaluate bigram model"
$ git tag -a "bigram-experiment" -m "bigrams"
$ git commit -am "Evaluate bigrams model"
$ git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation"
```

Now, we can use the `-T` option of the `dvc metrics show` command to see the
Expand All @@ -33,7 +33,7 @@ $ dvc metrics show -T

baseline-experiment:
auc.metric: 0.588426
bigram-experiment:
bigrams-experiment:
auc.metric: 0.602818
```
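
To pick the best performing experiment programmatically, the listing above could be parsed with a few lines of Python. This is only a sketch: it assumes the text layout shown here (a line ending in `:` names a tag, followed by `name: value` metric lines), and `parse_metrics` is a hypothetical helper, not part of DVC.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each tag to the metric value listed under it.

    Assumes lines ending in ':' name a tag and 'name: value' lines carry
    a single numeric metric for the most recent tag.
    """
    metrics: Dict[str, float] = {}
    tag = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            tag = line[:-1]
        elif tag is not None and ":" in line:
            _name, value = line.rsplit(":", 1)
            metrics[tag] = float(value)
    return metrics


# Sample text copied from the `dvc metrics show -T` output above.
sample = """
baseline-experiment:
    auc.metric: 0.588426
bigrams-experiment:
    auc.metric: 0.602818
"""

metrics = parse_metrics(sample)
best = max(metrics, key=metrics.get)  # a higher AUC is better
print(best, metrics[best])
```

With the sample values, the tag with the higher AUC is selected, matching the comparison a reader would make by eye.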

Expand Down
45 changes: 21 additions & 24 deletions static/docs/get-started/connect-code-and-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,37 +6,34 @@ basic useful framework to track, save and share models and large data files. To
achieve full reproducibility though, we'll have to connect code and
configuration with the data it processes to produce the result.

If you've followed this get started guide from the beginning, run these commands
to get the sample code:
<details>

> On Windows just use your browser to download the archive instead.
### Expand to prepare sample code ...

If you've followed this _Get Started_ section from the beginning, run these
commands to get the sample code:

```dvc
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip
$ rm -f code.zip
```

You'll also need to install its dependencies: Python packages like `pandas` and
`scikit-learn` that are required to run this example.

<details>

### Expand to prepare sample code ...
> On Windows just use your browser to download the archive instead.

After downloading the sample code, your project structure should look like this:
The workspace should now look like this:

```dvc
$ tree
.
├── data
│   ├── data.xml
│   └── data.xml.dvc
├── requirements.txt
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py
```

Expand All @@ -48,14 +45,14 @@ recommend creating a virtual environment with a tool such as
$ virtualenv -p python3 .env
$ source .env/bin/activate
$ echo ".env/" >> .gitignore
$ pip install -r requirements.txt
$ pip install -r src/requirements.txt
```

Save the progress to Git:
Optionally, save the progress to Git:

```dvc
$ git add .
$ git commit -m "add code"
$ git commit -m "Add source code files to repo"
```

</details>
Expand Down Expand Up @@ -94,11 +91,11 @@ This is how the result should look like now:
+ │ ├── test.tsv
+ │ └── train.tsv
+ ├── prepare.dvc
├── requirements.txt
└── src
├── evaluate.py
├── featurization.py
├── prepare.py
├── requirements.txt
└── train.py
```

Expand All @@ -107,16 +104,16 @@ This is how `prepare.dvc` looks like internally:
```yaml
cmd: python src/prepare.py data/data.xml
deps:
- md5: b4801c88a83f3bf5024c19a942993a48
path: src/prepare.py
- md5: a304afb96060aad90176268345e10355
path: data/data.xml
- md5: b4801c88a83f3bf5024c19a942993a48
path: src/prepare.py
- md5: a304afb96060aad90176268345e10355
path: data/data.xml
md5: c3a73109be6c186b9d72e714bcedaddb
outs:
- cache: true
md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
metric: false
path: data/prepared
- cache: true
md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
metric: false
path: data/prepared
wdir: .
```
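
The `md5` fields in the stage file above are what allow a changed dependency to be detected. The sketch below shows the general idea under a simplifying assumption: that the recorded value is a plain MD5 digest of the file's bytes. DVC's actual hashing may differ (for example for directories, which carry a `.dir` suffix), and `file_md5` and `has_changed` are hypothetical helper names.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, recorded_md5: str) -> bool:
    """True if the file's current checksum differs from the recorded one."""
    return file_md5(path) != recorded_md5


# Hypothetical usage with the values listed in prepare.dvc:
# has_changed("src/prepare.py", "b4801c88a83f3bf5024c19a942993a48")
```

Reading in chunks keeps memory use flat even for large data files such as `data/data.xml`.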

Expand Down Expand Up @@ -160,6 +157,6 @@ Let's commit the changes to save the stage we built:

```dvc
$ git add data/.gitignore prepare.dvc
$ git commit -m "add data preparation stage"
$ git commit -m "Create data preparation stage"
$ dvc push
```
4 changes: 2 additions & 2 deletions static/docs/get-started/example-pipeline.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ it `python`. This is a short version of the [Tutorial](/doc/tutorial).

In this example, we will focus on building a simple ML pipeline that takes an
archive with StackOverflow posts and trains the prediction model and saves it as
an output. See [get started](/doc/get-started) to see links to other examples,
an output. See [Get Started](/doc/get-started) to see links to other examples,
tutorials, use cases if you want to cover other aspects of DVC. The pipeline
itself is a sequence of transformations we apply to the data file:

Expand Down Expand Up @@ -57,7 +57,7 @@ $ pip install -r requirements.txt
```

Next, we will create a pipeline step-by-step, utilizing the same set of commands
that are described in earlier [get started](/doc/get-started) chapters.
that are described in earlier [Get Started](/doc/get-started) chapters.

> Note that it's possible to define more than one pipeline in each <abbr>DVC
> project</abbr>. This will be determined by the interdependencies between
Expand Down
9 changes: 7 additions & 2 deletions static/docs/get-started/experiments.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,10 +25,15 @@ bag_of_words = CountVectorizer(stop_words='english',

```dvc
$ vi src/featurization.py # edit to use bigrams (see above)
$ dvc repro train.dvc # get and save the new model.pkl
$ git commit -a -m "bigram model"
$ dvc repro train.dvc # regenerate the new model.pkl
$ git commit -am "Reproduce model using bigrams"
```

> Notice that `git commit -a` stages all the changes produced by `dvc repro`
> before committing them to Git. Refer to the
> [command reference](https://git-scm.com/docs/git-commit#Documentation/git-commit.txt--a)
> for more details.

Now, we have a new `model.pkl` captured and saved. To get back to the initial
version we run `git checkout` along with the `dvc checkout` command:

Expand Down