cmd ref: add data registry example to import cmd
for #487
jorgeorpinel committed Oct 30, 2019
1 parent 131af1e commit f01f860
Showing 2 changed files with 60 additions and 4 deletions.
51 changes: 51 additions & 0 deletions static/docs/command-reference/import.md
@@ -154,3 +154,54 @@ $ dvc import --rev master \
This will overwrite the import stage (DVC-file) either removing or replacing the
`rev` field. This can produce an import stage that is able to be updated
normally with `dvc update` going forward.

## Example: Data registry

If you take a look at our
[dataset-registry](https://github.com/iterative/dataset-registry)
<abbr>project</abbr>, you'll see that it's organized into different directories
such as `tutorial/ver` and `use-cases/`, and these contain
[DVC-files](/doc/user-guide/dvc-file-format) that track different datasets.
Given this simple structure, these files can be easily shared among several
other projects, using `dvc get` and `dvc import`. For example:

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
          tutorial/ver/data.zip
```

> Used in our [versioning tutorial](/doc/tutorials/versioning)

Or

```dvc
$ dvc import git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```

`dvc import` is the better way to incorporate data files tracked in external
projects because, unlike `dvc get`, it saves the connection between the current
project and the source project. The import stage (DVC-file) records enough
information to [reproduce](/doc/command-reference/repro) the download of this
same data version later, wherever and whenever it is needed. This is achieved
with the `repo` field, for example (matching the import command above):

```yaml
md5: 96fd8e791b0ee4824fc1ceffd13b1b49
locked: true
deps:
  - path: use-cases/cats-dogs
    repo:
      url: git@github.com:iterative/dataset-registry.git
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
outs:
  - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
    path: cats-dogs
    cache: true
    metric: false
    persist: false
```
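
To see what the `repo` field actually stores, the sketch below writes a trimmed copy of an import stage to a scratch directory and reads back the source URL and the locked revision. It is only an illustration: the scratch file, the variable names, and the use of `awk` are assumptions made here, the URL is the HTTPS form of the registry address, and DVC itself is not invoked.

```shell
# Recreate a trimmed import stage in a scratch directory (illustration only;
# in a real project `dvc import` writes this file for you).
workdir=$(mktemp -d)
cat > "$workdir/cats-dogs.dvc" <<'EOF'
locked: true
deps:
  - path: use-cases/cats-dogs
    repo:
      url: https://github.com/iterative/dataset-registry
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
outs:
  - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
    path: cats-dogs
EOF

# Read the two values kept under `repo`: where the data came from, and the
# exact commit it was imported at.
url=$(awk '$1 == "url:" { print $2 }' "$workdir/cats-dogs.dvc")
rev_lock=$(awk '$1 == "rev_lock:" { print $2 }' "$workdir/cats-dogs.dvc")

echo "source:   $url"
echo "revision: $rev_lock"
```

The `url` and `rev_lock` pair is the saved connection: it identifies both the source project and the specific version of the data that was imported.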

See a full explanation in our [Data Registry](/doc/use-cases/data-registry) use
case.
13 changes: 9 additions & 4 deletions static/docs/use-cases/data-registry.md
@@ -113,11 +113,13 @@ See the `dvc import` command reference for more details on the `--rev`

Importing keeps the connection between the local project and data registry where
we are downloading the dataset from. This is achieved by creating a special
-DVC-file (a.k.a. _import stage_) – that can be used for versioning the import
-with Git. This connection will come in handy when the source data changes, and
-we want to obtain these updates...
+[DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import stage_) that uses
+the `repo` field. (This file can be used for versioning the import with Git.)

-Back in our **dataset-registry** repository, a
+> For a sample DVC-file resulting from `dvc import`, refer to
+> [this example](/doc/command-reference/import#example-data-registry).
+
+Back in our **dataset-registry** project, a
[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
of our dataset was created by extracting the second part, with 1000 additional
images (500 cats, 500 dogs), into the same directory structure. Then, we simply
@@ -130,6 +132,9 @@ of the dataset is to run:
$ dvc update cats-dogs.dvc
```

This is possible because of the connection between the local and source
projects that the import stage saved, as explained earlier.

<details>

### Expand for actionable command (optional)