Add dataset registry project generator
1 parent 2bd28e9, commit b1bd569
Showing 5 changed files with 202 additions and 16 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
# DVC Dataset Registry | ||
|
||
This is an auto-generated repository for use in https://dvc.org/doc/. Please | ||
report any issues in its source project, | ||
[example-repos-dev](https://github.com/iterative/example-repos-dev). | ||
|
||
_Dataset Registry_ is a centralized place to manage raw data files for use in | ||
other example DVC projects, such as | ||
https://github.com/iterative/example-get-started. | ||
|
||
## Installation | ||
|
||
Start by cloning the project: | ||
|
||
```console | ||
$ git clone https://github.com/iterative/dataset-registry | ||
$ cd dataset-registry | ||
``` | ||
|
||
This DVC project comes with a preconfigured DVC | ||
[remote storage](https://man.dvc.org/remote) to hold all of the datasets. This | ||
is a read-only HTTP remote. | ||
|
||
```console | ||
$ dvc remote list | ||
storage https://remote.dvc.org/dataset-registry | ||
``` | ||
|
||
You can run [`dvc pull`](https://man.dvc.org/pull) to download specific datasets | ||
locally: | ||
|
||
```console | ||
$ dvc pull -r storage get-started/data.xml | ||
``` | ||
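As a minimal sketch (not part of the committed README), the pull step above could be wrapped so that a mistyped dataset path fails early instead of reaching the remote. The function name `pull_dataset` is hypothetical, and the check assumes each tracked dataset has a sibling `.dvc` file next to it.

```shell
# Hypothetical helper: pull one dataset from the "storage" remote, but only
# if a matching DVC-file exists in the workspace.
pull_dataset() {
  target="$1"
  if [ ! -f "$target.dvc" ]; then
    echo "No DVC-file found for $target" >&2
    return 1
  fi
  # Same command as in the README, with the target passed through.
  dvc pull -r storage "$target"
}
```

It would be called as `pull_dataset get-started/data.xml` from the root of the cloned project.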
|
||
## Testing data synchronization locally | ||
|
||
If you'd like to test commands such as [`dvc push`](https://man.dvc.org/push) | ||
that require write access to the remote storage, the easiest way is to set up a | ||
"local remote" on your file system: | ||
|
||
> This kind of remote is located in the local file system, but is external to | ||
> the DVC project. | ||
```console | ||
$ mkdir -p /tmp/dvc-storage | ||
$ dvc remote add local /tmp/dvc-storage | ||
``` | ||
|
||
You should now be able to run: | ||
|
||
```console | ||
$ dvc push -r local | ||
``` | ||
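The local-remote steps above can be sketched as one re-runnable snippet. This is an illustration under assumptions, not part of the commit: the `STORAGE_DIR` variable is hypothetical, and the DVC commands are skipped unless `dvc` is installed and the current directory is a DVC project.

```shell
# Assumption: a temp location equivalent to the README's /tmp/dvc-storage.
STORAGE_DIR="${TMPDIR:-/tmp}/dvc-storage"

# Lowercase -p creates missing parents and succeeds if the directory exists.
mkdir -p "$STORAGE_DIR"

if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
  # Register the remote only once, so the snippet can be run repeatedly.
  if ! dvc remote list | grep -q '^local[[:space:]]'; then
    dvc remote add local "$STORAGE_DIR"
  fi
  dvc push -r local
else
  echo "Skipping DVC steps: dvc is missing or this is not a DVC project." >&2
fi
```

Because the remote lives outside the project directory, deleting `$STORAGE_DIR` afterwards leaves the workspace untouched.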
|
||
## Datasets | ||
|
||
The folder structure of this project groups datasets by the external project | ||
they belong to. | ||
After cloning and using [`dvc pull`](https://man.dvc.org/pull) to download data | ||
under DVC control, the workspace should look like this: | ||
|
||
|
||
```console | ||
$ tree | ||
. | ||
├── README.md | ||
└── get-started | ||
├── data.xml # Dataset used in iterative/example-get-started | ||
└── data.xml.dvc | ||
|
||
1 directory, 3 files | ||
``` |
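Since every tracked dataset in this layout has a sibling `.dvc` file, the datasets in the registry can be listed without DVC itself. The sketch below is an illustration only: it builds a hypothetical scratch copy of the layout (the `REGISTRY` directory) so that it runs anywhere, and in a real clone the `find` would be run from the project root instead.

```shell
# Hypothetical scratch copy of the layout shown above.
REGISTRY="$(mktemp -d)"
mkdir -p "$REGISTRY/get-started"
touch "$REGISTRY/README.md" "$REGISTRY/get-started/data.xml.dvc"

# Find DVC-files (ignoring DVC's internal .dvc/ directory), then strip the
# leading "./" and the ".dvc" suffix to recover each dataset path.
DATASETS="$(cd "$REGISTRY" && find . -name '*.dvc' -type f ! -path './.dvc/*' \
  | sed -e 's|^\./||' -e 's|\.dvc$||' | sort)"

echo "$DATASETS"
```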
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
#!/bin/bash | ||
|
||
# Setup script env | ||
|
||
# e Exit immediately if a command exits with a non-zero exit status. | ||
# u Treat unset variables as an error when substituting. | ||
# x Print commands and their arguments as they are executed. | ||
set -eux | ||
|
||
HERE="$( cd "$(dirname "$0")" ; pwd -P )" | ||
REPO_NAME="dataset-registry" | ||
REPO_PATH="$HERE/build/$REPO_NAME" | ||
|
||
if [ -d "$REPO_PATH" ]; then | ||
echo "Repo $REPO_PATH already exists, remove it first." | ||
exit 1 | ||
fi | ||
|
||
mkdir -p "$REPO_PATH" | ||
pushd "$REPO_PATH" | ||
|
||
# Create virtualenv, install `dvc`, initialize/config DVC project | ||
|
||
virtualenv -p python3 .env | ||
export VIRTUAL_ENV_DISABLE_PROMPT=true | ||
source .env/bin/activate | ||
echo '.env/' >> .gitignore | ||
|
||
pip install dvc[s3] | ||
|
||
git init | ||
dvc init | ||
|
||
# Remote active on this environment only for writing to HTTP redirect below. | ||
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry | ||
|
||
# Actual remote for generated project (read-only). Redirect of S3 bucket below. | ||
dvc remote add -d storage https://remote.dvc.org/dataset-registry | ||
|
||
cp "$HERE/code/README.md" "$REPO_PATH" | ||
|
||
git add . | ||
git commit -m "Init & config DVC project, add README" | ||
|
||
# Get Started | ||
|
||
mkdir get-started | ||
wget https://data.dvc.org/get-started/data.xml -O get-started/data.xml | ||
dvc add get-started/data.xml | ||
git add get-started/.gitignore get-started/data.xml.dvc | ||
git commit -m "Add Get Started dataset" | ||
dvc push | ||
|
||
# TODO: Gather more datasets! | ||
|
||
popd | ||
|
||
echo "`cat <<EOF- | ||
The Git repo generated by this script is intended to be published on | ||
https://github.com/iterative/dataset-registry. Make sure the GitHub repo | ||
exists first. | ||
To create it with https://hub.github.com/ for example, run: | ||
hub create iterative/dataset-registry -d "Get Started DVC project" \ | ||
-h "https://dvc.org/doc/get-started" | ||
If the GitHub repo already exists, run these commands to rewrite it: | ||
cd build/dataset-registry | ||
git remote add origin git@github.com:iterative/dataset-registry.git | ||
git push --force origin master | ||
cd ../.. | ||
You may remove the generated repo with: | ||
rm -fR build | ||
`" |
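The closing message of this script is printed with `echo` wrapped around a backquoted `cat <<EOF-`, which appears to rely on the here-document being cut off by the end of the command substitution. A more conventional way to get the same output, sketched here as an assumption rather than as the commit's code, is a plain here-document with an explicit terminator. The function name `print_next_steps` is hypothetical and the message is abridged.

```shell
REPO_NAME="dataset-registry"   # same name the generator script uses

# Hypothetical helper: emit the closing instructions.
print_next_steps() {
  # Unquoted delimiter, so $REPO_NAME is expanded inside the message.
  cat <<EOF
The Git repo generated by this script is intended to be published on
https://github.com/iterative/$REPO_NAME. Make sure the GitHub repo
exists first.

You may remove the generated repo with:

  rm -fR build
EOF
}

NEXT_STEPS="$(print_next_steps)"
echo "$NEXT_STEPS"
```

Quoting the delimiter (`<<'EOF'`) would instead print the text verbatim, which is the safer choice when the message contains literal `$` or backquotes.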
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,7 @@ | ||
#!/bin/sh | ||
|
||
# Setup script env | ||
|
||
# e Exit immediately if a command exits with a non-zero exit status. | ||
# u Treat unset variables as an error when substituting. | ||
# x Print commands and their arguments as they are executed. | ||
|
@@ -17,51 +19,66 @@ fi | |
mkdir -p $REPO_PATH | ||
pushd $REPO_PATH | ||
|
||
git init | ||
# Initialize Git repo | ||
|
||
virtualenv -p python3 .env | ||
export VIRTUAL_ENV_DISABLE_PROMPT=true | ||
source .env/bin/activate | ||
echo '.env/' >> .gitignore | ||
|
||
git init | ||
git add . | ||
git commit -m "Initialize Git repository" | ||
git tag -a "0-empty" -m "Git initialized" | ||
|
||
# https://dvc.org/doc/get-started/install | ||
|
||
pip install dvc[s3] | ||
|
||
# https://dvc.org/doc/get-started/initialize | ||
|
||
dvc init | ||
git commit -m "Initialize DVC project" | ||
git tag -a "1-initialize" -m "DVC initialized." | ||
|
||
# https://dvc.org/doc/get-started/configure | ||
|
||
# Remote active on this environment only for writing to HTTP redirect above. | ||
dvc remote add -d --local storage s3://dvc-public/remote/get-started | ||
|
||
# Actual remote for generated project (read-only). Redirect of S3 bucket below. | ||
dvc remote add -d storage https://remote.dvc.org/get-started | ||
|
||
cp $HERE/code/README.md $REPO_PATH | ||
|
||
git add . | ||
git commit -m "Configure default HTTP remote (read-only)" | ||
git commit -m "Configure default HTTP remote (read-only), add README" | ||
git tag -a "2-remote" -m "Read-only remote storage configured." | ||
|
||
mkdir data | ||
wget https://data.dvc.org/get-started/data.xml -O data/data.xml | ||
dvc add data/data.xml | ||
# https://dvc.org/doc/get-started/add-files | ||
|
||
mkdir data && cd data | ||
dvc import https://github.com/iterative/dataset-registry \ | ||
get-started/data.xml | ||
cd .. | ||
git add data/.gitignore data/data.xml.dvc | ||
git commit -m "Add raw data to project" | ||
git tag -a "3-add-file" -m "Data file added." | ||
dvc push | ||
dvc push # https://dvc.org/doc/get-started/share-data | ||
|
||
# https://dvc.org/doc/get-started/connect-code-and-data | ||
|
||
wget https://code.dvc.org/get-started/code.zip | ||
unzip code.zip | ||
rm -f code.zip | ||
cp $HERE/code/README.md $REPO_PATH | ||
git add . | ||
git commit -m "Add source code files to repo" | ||
git tag -a "4-sources" -m "Source code added." | ||
|
||
pip install -r src/requirements.txt | ||
|
||
# https://dvc.org/doc/get-started/connect-code-and-data#create-a-first-data-transformation-stage | ||
|
||
dvc run -f prepare.dvc \ | ||
-d src/prepare.py -d data/data.xml \ | ||
-o data/prepared \ | ||
|
@@ -71,6 +88,8 @@ git commit -m "Create data preparation stage" | |
git tag -a "5-preparation" -m "First pipeline stage (data preparation) created." | ||
dvc push | ||
|
||
# https://dvc.org/doc/get-started/pipeline | ||
|
||
dvc run -f featurize.dvc \ | ||
-d src/featurization.py -d data/prepared \ | ||
-o data/features \ | ||
|
@@ -90,6 +109,8 @@ git commit -m "Create training stage" | |
git tag -a "7-train" -m "Training stage created." | ||
dvc push | ||
|
||
# https://dvc.org/doc/get-started/metrics | ||
|
||
dvc run -f evaluate.dvc \ | ||
-d src/evaluate.py -d model.pkl -d data/features \ | ||
-M auc.metric \ | ||
|
@@ -100,13 +121,17 @@ git tag -a "baseline-experiment" -m "Baseline experiment evaluation" | |
git tag -a "8-evaluation" -m "Baseline evaluation stage created." | ||
dvc push | ||
|
||
# https://dvc.org/doc/get-started/experiments | ||
|
||
sed -e s/max_features=5000\)/max_features=6000\,\ ngram_range=\(1\,\ 2\)\)/ -i "" \ | ||
src/featurization.py | ||
|
||
dvc repro train.dvc | ||
git commit -am "Reproduce model using bigrams" | ||
git tag -a "9-bigrams-model" -m "Model retrained using bigrams." | ||
|
||
# https://dvc.org/doc/get-started/compare-experiments | ||
|
||
dvc repro evaluate.dvc | ||
git commit -am "Evaluate bigrams model" | ||
git tag -a "bigrams-experiment" -m "Bigrams experiment evaluation" | ||
|
@@ -132,9 +157,10 @@ cd build/example-get-started | |
git remote add origin git@github.com:iterative/example-get-started.git | ||
git push --force origin master | ||
git push --force origin --tags | ||
cd ../.. | ||
You may remove the generated repo with: | ||
rm -fR build/ | ||
rm -fR build | ||
`" |
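The experiment step in this script edits `src/featurization.py` with `sed ... -i ""`, which is the BSD/macOS form of in-place editing. GNU sed expects `-i` without a separate argument. A portable alternative, sketched here under assumptions, writes to a temporary file and moves it into place. The sample line standing in for the real source file is hypothetical; only the `max_features=5000)` fragment is taken from the script.

```shell
# Hypothetical stand-in for src/featurization.py, so the sketch is self-contained.
WORKDIR="$(mktemp -d)"
TARGET="$WORKDIR/featurization.py"
printf '%s\n' 'bag_of_words = CountVectorizer(max_features=5000)' > "$TARGET"

# Same substitution as in the script, written to a temporary file and then
# moved over the original instead of using `sed -i`.
sed 's/max_features=5000)/max_features=6000, ngram_range=(1, 2))/' \
  "$TARGET" > "$TARGET.tmp"
mv "$TARGET.tmp" "$TARGET"

cat "$TARGET"
```

In basic regular expressions the parentheses are literal, so the quoted pattern needs none of the backslash escaping used in the unquoted original.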