Significant refactoring of this repo structure and changes in README files.

per iterative/dvc.org#487 (comment)
jorgeorpinel committed Aug 14, 2019
1 parent 5bfa3ac commit 903feb7
Showing 9 changed files with 104 additions and 83 deletions.
37 changes: 26 additions & 11 deletions README.md
@@ -1,17 +1,32 @@
# Get Started Tutorial (sources)

Contains source code, deployment and generation scripts for example DVC
repositories used in the [Get Started](https://dvc.org/doc/get-started) and
other sections of the docs.
Contains source code and [Shell](https://www.shellscript.sh/) scripts to
generate and deploy example DVC repositories used in the [Get
Started](https://dvc.org/doc/get-started) and other sections of the DVC docs.

- `get-started.sh` - generates the `example-get-started` DVC project from
scratch. Code bundle is downloaded from S3 the same way as in the _Get
Started_ -> [Connect Code and
Data](https://dvc.org/doc/get-started/connect-code-and-data) chapter.
## Requirements

If you change [source code](code/src/) files, run `deploy.sh` first to make
sure that the code.zip archive is up to date.
Please make sure the following are available in the environment where these
scripts will run:

- Git
- Python (with `pip`)
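
The requirements above can be checked before running any script. The following is a minimal sketch, not part of this repo: `require_cmd` is a hypothetical helper, and the command names `python` and `pip` are assumptions (some systems only provide `python3` and `pip3`).

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): report whether a command
# can be found on PATH. Returns non-zero when it is missing.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check each tool listed under Requirements. The names are assumptions and
# may need adjusting (for example python3 instead of python).
missing=0
for cmd in git python pip; do
  require_cmd "$cmd" || missing=1
done
echo "missing=$missing"
```

A wrapper could stop early when `missing` is `1`, rather than letting `generate.sh` fail partway through.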

## Scripts

Each example DVC project lives in its own root folder:

<!-- ### dataset-registry -->

### example-get-started

- `generate.sh` - generates the `example-get-started` DVC project from
scratch. A source code archive is downloaded from S3 the same way as in
[Connect Code and Data](https://dvc.org/doc/get-started/connect-code-and-data).

> If you change the [source code](code/src/) files in this repo, run
> `deploy.sh` first, to make sure that the `code.zip` archive is up to date.
- `deploy.sh` - deploys code archive that is downloaded as part of the
`get-started.sh` to S3.
> Requires AWS CLI and write access to `dvc-share` S3 bucket.
`generate.sh` to S3.
> Requires AWS CLI and write access to `s3://dvc-share/get-started/`.
130 changes: 69 additions & 61 deletions code/README.md → example-get-started/code/README.md
@@ -6,6 +6,9 @@ Please report any issues in

![](https://dvc.org/static/img/example-flow-2x.png)

Get Started is a step-by-step introduction to basic DVC concepts. It doesn't
go into much detail, but provides links and expandable sections to learn more.

The idea of the project is a simplified version of the
[tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
@@ -14,61 +17,67 @@ Python language by tagging it `python`.

## Installation

First, you need to download the project:

```shell
$ git clone https://github.com/iterative/example-get-started
```

Second, let's install the requirements. But before we do that, we **strongly**
recommend creating a virtual environment with `virtualenv` or a similar tool:
Start by cloning the project:

```shell
$ cd example-get-started
$ virtualenv -p python3 .env
$ source .env/bin/activate
```dvc
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
```

Now, we can install requirements for the project:
Now let's install the requirements. But before we do that, we **strongly**
recommend creating a virtual environment with a tool such as
[virtualenv](https://virtualenv.pypa.io/en/stable/):

```shell
$ pip install -r requirements.txt
```dvc

Review comment (resolved) from shcheklein (Member), Aug 14, 2019:

@jorgeorpinel this time dvc does not work. Github is rendering this and it's not aware about dvc. bash or shell is the default option.

Reply (resolved) from jorgeorpinel (Author, Contributor), Aug 14, 2019:

Oh yeah my bad. I just copied these lines from the docs repo, forgot to replace dvc! Fixed in d66ec76

$ virtualenv -p python3 .env
$ source .env/bin/activate
$ pip install -r src/requirements.txt
```

## Running in Your Environment

This project comes with a predefined remote DVC storage that contains all input,
intermediate and final results that were produced.
This DVC project comes with a preconfigured remote DVC storage that has raw data
(input), intermediate, and final results that are produced.

```shell
$ dvc remote list
storage https://remote.dvc.org/get-started
```console
$ dvc remote list
storage https://remote.dvc.org/get-started
```

You can run [`dvc pull`](https://man.dvc.org/pull) to download the data:

```shell
$ dvc pull -r storage
```console
$ dvc pull -r storage
```

and [`dvc repro`](https://man.dvc.org/repro) to reproduce the pipeline:
## Running in Your Environment

Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
[pipeline](https://dvc.org/doc/commands-reference/pipeline):

```shell
$ dvc repro evaluate.dvc
```console
$ dvc repro evaluate.dvc
```

> `dvc repro` requires a target [stage file](https://man.dvc.org/run)
> ([DVC-file](https://dvc.org/doc/user-guide/dvc-file-format)) to reconstruct
> and regenerate a pipeline. In this case we use `evaluate.dvc`, the last stage
> in this project's pipeline.
If you'd like to test commands like [`dvc push`](https://man.dvc.org/push),
that require write access to the remote storage, the easiest way would be to set
up the local remote on your file system:
up a "local remote" on your file system:

```shell
$ dvc remote add local /tmp/dvc-storage
> This kind of remote is located in the local file system, but is external to
> the DVC project.
```console
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
```

You should be able to run:
You should now be able to run:

```shell
$ dvc push -r local
```console
$ dvc push -r local
```
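
The local remote steps above can be combined into one guarded script. This is only a sketch under stated assumptions: the `STORAGE_DIR` variable is introduced here for illustration, and the DVC commands run only when `dvc` is installed and the current directory is a DVC project.

```shell
#!/bin/sh
# Sketch only: the path is an example; any directory outside the project works.
STORAGE_DIR="${STORAGE_DIR:-/tmp/dvc-storage}"

# -p creates missing parent directories and succeeds if the directory exists.
mkdir -p "$STORAGE_DIR"

# Only call DVC when it is installed and we are inside a DVC project.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
  dvc remote add local "$STORAGE_DIR"
  dvc push -r local
else
  echo "dvc not available here; created $STORAGE_DIR only"
fi
```

Note that `dvc remote add` is expected to fail if a remote named `local` is already configured, so this sketch is meant for a fresh clone.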

## Existing Stages
@@ -83,7 +92,7 @@ playground ready.
created.
- `2-remote` - remote HTTP storage initialized. It is a shared read-only storage
that contains all data artifacts produced during the next steps.
- `3-add-file` - input data file `data.xml` downloaded and put under DVC
- `3-add-file` - raw data file `data.xml` downloaded and put under DVC
control with [`dvc add`](https://man.dvc.org/add). First `.dvc` meta-file
created.
- `4-sources` - source code downloaded and put under Git control.
@@ -108,38 +117,37 @@ There are two additional tags:
for.
- `bigrams-experiment` - second version of the experiment.

Both these tags could be used to illustrate `-a` or `-T` DVC options across
different commands.
Both these tags could be used to illustrate `-a` or `-T` options across
different [DVC commands](https://man.dvc.org/).

## Project Structure

The project files, DVC files, data files changes as you apply stages one by one,
The data files, DVC-files, and results change as stages are created one by one,
but right after you `git clone` the repo and run [`dvc pull`](https://man.dvc.org/pull) to
download the files that are under DVC control, the structure of the project should
look like this:

```shell
.

Review comment (resolved) from shcheklein (Member), Aug 14, 2019:

also, we can't force Github to indent like we do in docs automatically. Might be a good idea to keep spaces for the sake of readability.

Reply (resolved) from jorgeorpinel (Author, Contributor), Aug 14, 2019:

Sure. But code blocks without indentation look OK to me? E.g.:

[image] [image]

What do you think?

Reply (resolved) from shcheklein (Member), Aug 14, 2019:

agreed! it looks absolutely fine

├── auc.metric           <-- DVC metric file to compare baseline and bigrams
├── data                 <-- directory with input and intermediate data
│   ├── features         <-- extracted feature matrices
│   │   ├── test.pkl
│   │   └── train.pkl
│   ├── prepared         <-- pre-processed dataset, split and TSV formatted
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── data.xml         <-- initial XML StackOverflow dataset
│   └── data.xml.dvc
├── evaluate.dvc         <-- DVC files in the project root describe pipeline
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── requirements.txt     <-- Python dependencies you need to run the project
├── src                  <-- sources to run the pipeline
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   └── train.py
└── train.dvc
```sh
.
├── auc.metric           # <-- DVC metric compares baseline and bigrams
├── data                 # <-- Directory with raw and intermediate data
│   ├── features         # <-- Extracted feature matrices
│   │   ├── test.pkl
│   │   └── train.pkl
│   ├── prepared         # <-- Processed dataset (split and TSV formatted)
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── data.xml         # <-- Initial XML StackOverflow dataset (raw data)
│   └── data.xml.dvc
├── evaluate.dvc         # <-- DVC-files in the project root describe pipeline
├── featurize.dvc
├── model.pkl
├── prepare.dvc
├── src                  # <-- Source code to run the pipeline stages
│   ├── evaluate.py
│   ├── featurization.py
│   ├── prepare.py
│   ├── train.py
│   └── requirements.txt # <-- Python dependencies needed in the project
└── train.dvc
```

5 files renamed without changes.
11 changes: 5 additions & 6 deletions deploy.sh → example-get-started/deploy.sh
@@ -1,10 +1,9 @@
#!/bin/bash
#!/bin/sh

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# v Print shell input lines as they are read.
# x Print commands and their arguments as they are executed.
set -euvx
set -eux

PACKAGE_DIR=code
PACKAGE=code.zip
@@ -16,17 +15,17 @@ rm -rf $TEST_DIR
mkdir $TEST_DIR

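# pushd/popd are bash built-ins and may be missing under #!/bin/sh;
# a subshell keeps the directory change local instead.
(
cd $PACKAGE_DIR
zip -r $PACKAGE src/* requirements.txt
zip -r $PACKAGE src/*
)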

# Requires AWS CLI and write access to `dvc-share` S3 bucket.
# Requires AWS CLI and write access to `s3://dvc-share/get-started/`.
mv $PACKAGE_DIR/$PACKAGE .
aws s3 cp --acl public-read $PACKAGE s3://dvc-share/get-started/$PACKAGE

# Testing
wget https://dvc.org/s3/get-started/$PACKAGE -O $TEST_PACKAGE
unzip $TEST_PACKAGE -d $TEST_DIR
# TODO: Print some info. on what to look for here.
cmp $PACKAGE $TEST_PACKAGE
rm -f $TEST_PACKAGE
diff -r $PACKAGE_DIR $TEST_DIR
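
The testing step above relies on `cmp` to confirm that the downloaded archive matches the uploaded one byte for byte. As a rough sketch of what to look for, the hypothetical `verify_copy` helper below (not part of `deploy.sh`) wraps `cmp -s` and prints an explicit result. The demo uses throwaway files rather than the real `code.zip`.

```shell
#!/bin/sh
# Hypothetical helper: compare two files byte for byte and report the result.
# cmp -s stays silent and signals the outcome through its exit status.
verify_copy() {
  if cmp -s "$1" "$2"; then
    echo "OK: $2 is identical to $1"
  else
    echo "MISMATCH: $2 differs from $1" >&2
    return 1
  fi
}

# Self-contained demo with throwaway files instead of the real code.zip.
tmp_dir="$(mktemp -d)"
printf 'same bytes\n' > "$tmp_dir/a"
cp "$tmp_dir/a" "$tmp_dir/b"
verify_copy "$tmp_dir/a" "$tmp_dir/b"
rm -rf "$tmp_dir"
```

In the deploy script, a call such as `verify_copy $PACKAGE $TEST_PACKAGE` could stand in for the bare `cmp`, so a failed upload produces a readable message before `set -e` stops the run.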

9 changes: 4 additions & 5 deletions get-started.sh → example-get-started/generate.sh
@@ -1,10 +1,9 @@
#!/bin/bash
#!/bin/sh

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# v Print shell input lines as they are read.
# x Print commands and their arguments as they are executed.
set -euvx
set -eux

THIS="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="example-get-started"
@@ -52,13 +51,13 @@ mkdir src
wget https://dvc.org/s3/get-started/code.zip
unzip code.zip
rm -f code.zip
echo "dvc[s3]" >> requirements.txt
echo "dvc[s3]" >> src/requirements.txt
cp $THIS/code/README.md $REPO_PATH
git add .
git commit -m 'add source code'
git tag -a "4-sources" -m "source code added"

pip install -r requirements.txt
pip install -r src/requirements.txt

dvc run -f prepare.dvc \
-d src/prepare.py -d data/data.xml \