Add dataset registry project generator
jorgeorpinel committed Aug 24, 2019
1 parent 2bd28e9 commit b1bd569
Showing 5 changed files with 202 additions and 16 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -23,7 +23,11 @@ $ cd example-get-started
$ ./deploy.sh
```

<!-- ### dataset-registry -->
### dataset-registry

- `generate.sh`: Generates the `dataset-registry` DVC project from scratch. This
project is used by **example-get-started** below, so it should be generated
first.
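
The ordering requirement above can be sketched as a small wrapper. This is only an illustration: `generate_all` is a hypothetical helper, not part of this repository, and it assumes the two generators live at `dataset-registry/generate.sh` and `example-get-started/generate.sh` under the given root. In practice the generated registry may also need to be published before the second generator can import from it; that step is not shown.

```shell
# Hypothetical helper: run both generators in dependency order.
# $1 is the root of a checkout that contains the two generator directories.
generate_all() {
    root="$1"
    for project in dataset-registry example-get-started; do
        script="$root/$project/generate.sh"
        if [ ! -f "$script" ]; then
            echo "missing generator: $script" >&2
            return 1
        fi
        echo "Generating $project"
        # The generators use pushd and source, so run them with bash.
        # Stop at the first failure so example-get-started never runs
        # after a failed registry build.
        bash "$script" || return 1
    done
}
```

Calling `generate_all .` from the repository root would run the registry generator first and stop if it fails.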

### example-get-started

73 changes: 73 additions & 0 deletions dataset-registry/code/README.md
@@ -0,0 +1,73 @@
# DVC Dataset Registry

This is an auto-generated repository for use in https://dvc.org/doc/. Please
report any issues in its source project,
[example-repos-dev](https://github.com/iterative/example-repos-dev).

_Dataset Registry_ is a centralized place to manage raw data files for use in
other example DVC projects, such as
https://github.com/iterative/example-get-started.

## Installation

Start by cloning the project:

```console
$ git clone https://github.com/iterative/dataset-registry
$ cd dataset-registry
```

This DVC project comes with a preconfigured DVC
[remote storage](https://man.dvc.org/remote) to hold all of the datasets. This
is a read-only HTTP remote.

```console
$ dvc remote list
storage https://remote.dvc.org/dataset-registry
```

You can run [`dvc pull`](https://man.dvc.org/pull) to download specific datasets
locally:

```console
$ dvc pull -r storage get-started/data.xml
```

## Testing data synchronization locally

If you'd like to test commands such as [`dvc push`](https://man.dvc.org/push)
that require write access to the remote storage, the easiest way is to set up a
"local remote" on your file system:

> This kind of remote is located in the local file system, but is external to
> the DVC project.
```console
$ mkdir -p /tmp/dvc-storage
$ dvc remote add local /tmp/dvc-storage
```

You should now be able to run:

```console
$ dvc push -r local
```
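
The local-remote setup above can also be wrapped in a small guarded script. This is a minimal sketch under stated assumptions: the `STORAGE` path is hypothetical, and the DVC commands run only when `dvc` is installed and the current directory is a DVC project.

```shell
# Hypothetical scratch location; any writable directory outside the project works.
STORAGE="${TMPDIR:-/tmp}/dvc-storage-demo"

# -p creates missing parent directories and does not fail if the path exists.
mkdir -p "$STORAGE"
mkdir -p "$STORAGE"   # second call is a harmless no-op

# Only call DVC when it is installed and we are inside a DVC project.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
    dvc remote add local "$STORAGE"
    dvc push -r local
else
    echo "Skipping DVC steps: run this from a DVC project with dvc installed."
fi
```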

## Datasets

The folder structure of this project groups datasets by the external project
that uses them. After cloning and using [`dvc pull`](https://man.dvc.org/pull)
to download data under DVC control, the workspace should look like this:

```console
$ tree
.
├── README.md
└── get-started
    ├── data.xml # Dataset used in iterative/example-get-started
    └── data.xml.dvc

1 directory, 3 files
```
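
To confirm that a pulled workspace matches the tree above, a short check along these lines may help. `check_workspace` is a hypothetical helper written for illustration; it only tests that the three listed files exist and says nothing about their contents.

```shell
# Hypothetical helper: verify that the files shown in the tree above exist.
# $1 is the path to the cloned dataset-registry workspace.
check_workspace() {
    workspace="$1"
    missing=0
    for path in README.md get-started/data.xml get-started/data.xml.dvc; do
        if [ ! -e "$workspace/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Running `check_workspace .` before `dvc pull` would be expected to report `get-started/data.xml` as missing, since only its `.dvc` file is tracked by Git.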
80 changes: 80 additions & 0 deletions dataset-registry/generate.sh
@@ -0,0 +1,80 @@
#!/bin/bash

# Setup script env

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# x Print commands and their arguments as they are executed.
set -eux

HERE="$( cd "$(dirname "$0")" ; pwd -P )"
REPO_NAME="dataset-registry"
REPO_PATH="$HERE/build/$REPO_NAME"

if [ -d "$REPO_PATH" ]; then
  echo "Repo $REPO_PATH already exists, remove it first."
  exit 1
fi

mkdir -p "$REPO_PATH"
pushd "$REPO_PATH"

# Create virtualenv, install `dvc`, initialize/config DVC project

virtualenv -p python3 .env
export VIRTUAL_ENV_DISABLE_PROMPT=true
source .env/bin/activate
echo '.env/' >> .gitignore

pip install dvc[s3]

git init
dvc init

# Remote active on this environment only for writing to HTTP redirect below.
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry

# Actual remote for generated project (read-only). Redirect of S3 bucket below.
dvc remote add -d storage https://remote.dvc.org/dataset-registry

cp "$HERE/code/README.md" "$REPO_PATH"

git add .
git commit -m "Init & config DVC project, add README"

# Get Started

mkdir get-started
wget https://data.dvc.org/get-started/data.xml -O get-started/data.xml
dvc add get-started/data.xml
git add get-started/.gitignore get-started/data.xml.dvc
git commit -m "Add Get Started dataset"
dvc push

# TODO: Gather more datasets!

popd

echo "`cat <<EOF-
The Git repo generated by this script is intended to be published on
https://github.com/iterative/dataset-registry. Make sure the GitHub repo
exists first.
To create it with https://hub.github.com/ for example, run:
hub create iterative/dataset-registry -d "DVC Dataset Registry" \
-h "https://dvc.org/doc/get-started"
If the GitHub repo already exists, run these commands to rewrite it:
cd build/dataset-registry
git remote add origin git@github.com:iterative/dataset-registry.git
git push --force origin master
cd ../..
You may remove the generated repo with:
rm -fR build
`"
17 changes: 10 additions & 7 deletions example-get-started/code/README.md
@@ -9,6 +9,10 @@ Please report any issues in its source project,
_Get Started_ is a step-by-step introduction to basic DVC concepts. It doesn't
go into much detail, but provides links and expandable sections to learn more.

> Note that this project
> [imports](https://dvc.org/doc/commands-reference/import) a dataset from
> https://github.com/iterative/dataset-registry.

The idea of the project is a simplified version of the
[Tutorial](https://dvc.org/doc/tutorial). It explores the natural language
processing (NLP) problem of predicting tags for a given StackOverflow question.
@@ -50,7 +54,7 @@ You can run [`dvc pull`](https://man.dvc.org/pull) to download the data:
$ dvc pull -r storage
```

## Running in Your Environment
## Running in your environment

Run [`dvc repro`](https://man.dvc.org/repro) to reproduce the
[pipeline](https://dvc.org/doc/commands-reference/pipeline):
@@ -82,7 +86,7 @@ You should now be able to run:
$ dvc push -r local
```

## Existing Stages
## Existing stages

This project with the help of the Git tags reflects the sequence of actions that
are run in the DVC [get started](https://dvc.org/doc/get-started) guide. Feel
@@ -125,12 +129,11 @@ There are two additional tags:
These tags can be used to illustrate `-a` or `-T` options across different
[DVC commands](https://man.dvc.org/).

## Project Structure
## Project structure

The data files, DVC-files, and results change as stages are created one by one,
but right after you for Git clone and [`dvc pull`](https://man.dvc.org/pull) to
download files that are under DVC control, the structure of the project should
look like this:
The data files, DVC-files, and results change as stages are created one by one.
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data
under DVC control, the workspace should look like this:

```console
$ tree
42 changes: 34 additions & 8 deletions example-get-started/generate.sh
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
#!/bin/bash

# Setup script env

# e Exit immediately if a command exits with a non-zero exit status.
# u Treat unset variables as an error when substituting.
# x Print commands and their arguments as they are executed.
@@ -17,51 +19,66 @@ fi
mkdir -p "$REPO_PATH"
pushd "$REPO_PATH"

git init
# Initialize Git repo

virtualenv -p python3 .env
export VIRTUAL_ENV_DISABLE_PROMPT=true
source .env/bin/activate
echo '.env/' >> .gitignore

git init
git add .
git commit -m "Initialize Git repository"
git tag -a "0-empty" -m "Git initialized"

# https://dvc.org/doc/get-started/install

pip install dvc[s3]

# https://dvc.org/doc/get-started/initialize

dvc init
git commit -m "Initialize DVC project"
git tag -a "1-initialize" -m "DVC initialized."

# https://dvc.org/doc/get-started/configure

# Remote active on this environment only for writing to HTTP redirect below.
dvc remote add -d --local storage s3://dvc-public/remote/get-started

# Actual remote for generated project (read-only). Redirect of S3 bucket below.
dvc remote add -d storage https://remote.dvc.org/get-started

cp "$HERE/code/README.md" "$REPO_PATH"

git add .
git commit -m "Configure default HTTP remote (read-only)"
git commit -m "Configure default HTTP remote (read-only), add README"
git tag -a "2-remote" -m "Read-only remote storage configured."

mkdir data
wget https://data.dvc.org/get-started/data.xml -O data/data.xml
dvc add data/data.xml
# https://dvc.org/doc/get-started/add-files

mkdir data && cd data
dvc import https://github.com/iterative/dataset-registry \
get-started/data.xml
cd ..
git add data/.gitignore data/data.xml.dvc
git commit -m "Add raw data to project"
git tag -a "3-add-file" -m "Data file added."
dvc push
dvc push # https://dvc.org/doc/get-started/share-data

# https://dvc.org/doc/get-started/connect-code-and-data

wget https://code.dvc.org/get-started/code.zip
unzip code.zip
rm -f code.zip
cp $HERE/code/README.md $REPO_PATH
git add .
git commit -m "Add source code files to repo"
git tag -a "4-sources" -m "Source code added."

pip install -r src/requirements.txt

# https://dvc.org/doc/get-started/connect-code-and-data#create-a-first-data-transformation-stage

dvc run -f prepare.dvc \
-d src/prepare.py -d data/data.xml \
-o data/prepared \
@@ -71,6 +88,8 @@ git commit -m "Create data preparation stage"
git tag -a "5-preparation" -m "First pipeline stage (data preparation) created."
dvc push

# https://dvc.org/doc/get-started/pipeline

dvc run -f featurize.dvc \
-d src/featurization.py -d data/prepared \
-o data/features \
@@ -90,6 +109,8 @@ git commit -m "Create training stage"
git tag -a "7-train" -m "Training stage created."
dvc push

# https://dvc.org/doc/get-started/metrics

dvc run -f evaluate.dvc \
-d src/evaluate.py -d model.pkl -d data/features \
-M auc.metric \
@@ -100,13 +121,17 @@ git tag -a "baseline-experiment" -m "Baseline experiment evaluation"
git tag -a "8-evaluation" -m "Baseline evaluation stage created."
dvc push

# https://dvc.org/doc/get-started/experiments

sed -e s/max_features=5000\)/max_features=6000\,\ ngram_range=\(1\,\ 2\)\)/ -i "" \
src/featurization.py

dvc repro train.dvc
git commit -am "Reproduce model using bigrams"
git tag -a "9-bigrams-model" -m "Model retrained using bigrams."

# https://dvc.org/doc/get-started/compare-experiments

dvc repro evaluate.dvc
git commit -am "Evaluate bigrams model"
git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation"
@@ -132,9 +157,10 @@ cd build/example-get-started
git remote add origin git@github.com:iterative/example-get-started.git
git push --force origin master
git push --force origin --tags
cd ../..
You may remove the generated repo with:
rm -fR build/
rm -fR build
`"
