import,update: explain rev field and update vs re-importing for #735, but also for

use-case: add expandable sections to new data registry case
per #679 (comment)
and other misc. copy edits.

Also standardizes term "external" (repo) vs. "source" data/project in this context
and introduces the term "revision fixing".
jorgeorpinel committed Oct 30, 2019
1 parent 008e358 commit 131af1e
Showing 3 changed files with 129 additions and 47 deletions.
70 changes: 52 additions & 18 deletions static/docs/command-reference/import.md
@@ -3,7 +3,7 @@
Download or copy file or directory from any <abbr>DVC project</abbr> in a Git
repository (e.g. hosted on GitHub) into the <abbr>workspace</abbr>, and track
changes in this [external dependency](/doc/user-guide/external-dependencies).
Creates a DVC-file.
Creates a special DVC-file, a.k.a. an _import stage_.

> See also `dvc get`, which corresponds to the first step this command performs
> (just download the data).
@@ -23,43 +23,43 @@ positional arguments:
DVC provides an easy way to reuse datasets, intermediate results, ML models, or
other files and directories tracked in another DVC repository into the
workspace. The `dvc import` command downloads such a <abbr>data artifact</abbr>
in a way that it is tracked with DVC, so it can be updated when the external
data source changes.
in a way that it is tracked with DVC, so it can be updated when the data source
changes.

The `url` argument specifies the address of the Git repository containing the
external <abbr>project</abbr>. Both HTTP and SSH protocols are supported for
source <abbr>project</abbr>. Both HTTP and SSH protocols are supported for
online repositories (e.g. `[user@]server:project.git`). `url` can also be a
local file system path to an "offline" repository.

The `path` argument of this command is used to specify the location of the data
to be downloaded within the source project. It should point to a data file or
directory tracked by that project – specified in one of the
[DVC-files](/doc/user-guide/dvc-file-format) of the repository at `url`. (You
will not find these files directly in the source Git repository.) The source
will not find these files directly in the external Git repository.) The source
project should have a default [DVC remote](/doc/command-reference/remote)
configured, containing them.

> See `dvc import-url` to download and track data from other supported URLs.
After running this command successfully, the imported data is placed in the
current working directory with its original file name e.g. `data.txt`. An import
stage (DVC-file) is then created extending the full file or directory name of
the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to
generate the same output.
current working directory with its original file name e.g. `data.txt`. An
_import stage_ (DVC-file) is then created, extending the full file or directory
name of the imported data e.g. `data.txt.dvc` – similar to having used `dvc run`
to generate the same output.

DVC supports DVC-files that refer to data in an external DVC repository (hosted
on a Git server). In such a DVC-file, the `deps` section specifies the `repo`
URL and data `path`, and the `outs` section contains the corresponding local
path in the workspace. It records enough data from the external file or
directory to enable DVC to efficiently check it to determine whether the local
copy is out of date.
on a Git server), a.k.a. _import stages_. In such a DVC-file, the `deps` section
specifies the `repo` URL and data `path`, and the `outs` section contains the
corresponding local path in the workspace. It records enough data from the
external file or directory to enable DVC to efficiently check it to determine
whether the local copy is out of date.

To actually [track the data](https://dvc.org/doc/get-started/add-files),
`git add` (and `git commit`) the import stage (DVC-file).
`git add` (and `git commit`) the import stage.

Note that import stages are considered always "locked", meaning that if you run
`dvc repro`, they won't be updated. Use `dvc update` on them to update the
downloaded data artifact from the external DVC repository.
downloaded data artifact from the source DVC repository.

## Options

@@ -72,8 +72,10 @@ downloaded data artifact from the external DVC repository.
- `--rev` - specific
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
(such as a branch name, a tag, or a commit hash) of the DVC repository to
import the data from. The tip of the default branch is used by default when
this option is not specified.
import the data from. The tip of the repository's default branch is used by
default when this option is not specified. Note that this adds a `rev` field
in the import stage that fixes it to this revision. This can impact the
behavior of `dvc update`.

- `-h`, `--help` - prints the usage/help message and exits.

@@ -120,3 +122,35 @@ outs:
Several of the values above are pulled from the original stage file
`model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used
to specify the origin and version of the dependency.

## Example: fixed revisions & re-importing

When the `--rev` option is used, the import stage
([DVC-file](/doc/user-guide/dvc-file-format)) will include a `rev` field under
`repo` like this:

```yaml
deps:
- path: data/data.xml
  repo:
    url: git@github.com:iterative/dataset-registry.git
    rev: cats-dogs-v1
    rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
```
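
To check quickly whether an existing import stage is fixed to a revision, the
`rev` field shown above can be looked up directly in the DVC-file. This is only
a minimal sketch: `fixed_rev` is a hypothetical helper (not a DVC command), and
it assumes the `rev` field sits on its own line, as in the sample above.

```shell
# Hypothetical helper (not part of DVC): print the revision an import stage
# is fixed to, or nothing if the DVC-file has no `rev` field.
# Assumes `rev: <value>` is on its own line; `rev_lock:` lines do not match.
fixed_rev() {
  sed -n 's/^[[:space:]]*rev:[[:space:]]*//p' "$1"
}
```

For the sample above, `fixed_rev data.xml.dvc` would print `cats-dogs-v1`
(assuming the import stage is named `data.xml.dvc`). Empty output suggests the
stage follows the default branch of the source repository.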

If the Git revision is one that moves, such as a branch, this doesn't have much
of an effect on the import/update workflow. However, for static refs such as
tags (unless manually updated) or commit SHAs, `dvc update` will not have any
effect on the import. In these cases, in order to actually "update" an import,
it's necessary to **re-import the data** instead, by running `dvc import` again
either without `--rev` or with a different one. For example:

```dvc
$ dvc import --rev master \
             git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```

This will overwrite the import stage (DVC-file), either removing or replacing
the `rev` field. This can produce an import stage that `dvc update` is able to
update normally going forward.
27 changes: 19 additions & 8 deletions static/docs/command-reference/update.md
@@ -1,6 +1,6 @@
# update

Update <abbr>data artifacts</abbr> imported from other DVC repositories.
Update <abbr>data artifacts</abbr> imported from external DVC repositories.

## Synopsis

@@ -15,16 +15,24 @@ positional arguments:

After creating <abbr>import stages</abbr>
([DVC-files](/doc/user-guide/dvc-file-format)) with `dvc import` or
`dvc import-url`, the external data source can change. Use `dvc update` to bring
these imported files, directories, or <abbr>data artifacts</abbr> up to date.
`dvc import-url`, the data source can change. Use `dvc update` to bring these
imported files, directories, or <abbr>data artifacts</abbr> up to date.

To indicate which import stages to update, we must specify the corresponding
DVC-file `targets` as command arguments.

Note that import stages are considered always "locked", meaning that if you run
`dvc repro`, they won't be updated. `dvc update` is the only command that can
update them. Also, for `dvc import` DVC-files, the `rev_lock` field is updated
by `dvc update`.
update them. Also, for `dvc import` import stages, the `rev_lock` field is
updated by `dvc update`.

To indicate which import stages to update, we must specify the corresponding
DVC-file `targets` as command arguments.
Another detail to note is that when the `--rev` (revision) option of
`dvc import` has been used to create an import stage, DVC is not aware of what
kind of
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this
is, for example a branch or a tag. For static refs such as tags (unless manually
updated), or for SHA commits, `dvc update` will not have any effect on the
import.
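
Since DVC does not make this distinction, it can help to check what kind of
revision a name refers to before relying on `dvc update`. A minimal sketch,
assuming `git` is installed: `rev_kind` is a hypothetical helper (not part of
DVC) that asks the source repository with `git ls-remote`.

```shell
# Hypothetical helper (not part of DVC): classify a revision name in a Git
# repository as "branch", "tag", or "other" (e.g. a commit SHA).
# $1 is the repository URL or local path, $2 is the revision name.
rev_kind() {
  if [ -n "$(git ls-remote "$1" "refs/heads/$2")" ]; then
    echo branch   # moving ref: `dvc update` can pick up new commits
  elif [ -n "$(git ls-remote "$1" "refs/tags/$2")" ]; then
    echo tag      # static ref: `dvc update` will normally have no effect
  else
    echo other    # not a branch or tag name, possibly a commit SHA
  fi
}
```

For example, `rev_kind <url> master` would be expected to print `branch` for a
repository whose default branch is `master`. A result of `tag` or `other`
suggests re-importing with `dvc import` rather than running `dvc update`.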

## Options

@@ -60,4 +68,7 @@ Output 'model.pkl' didn't change. Skipping saving.
Saving information to 'model.pkl.dvc'.
```

This time nothing has changed, since the source repository is rather stable.
This time nothing has changed, since the source <abbr>project</abbr> is rather
stable.

> Refer to this [re-importing example]() for
79 changes: 58 additions & 21 deletions static/docs/use-cases/data-registry.md
@@ -47,22 +47,24 @@ containing 2800 images of cats and dogs. We partitioned the dataset in two for
our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on
a storage server, downloading them with `wget` in our examples. This setup was
then revised to download the dataset with `dvc get` instead, so we created the
[dataset-registry](https://github.com/iterative/dataset-registry) project, a
[dataset-registry](https://github.com/iterative/dataset-registry) repository, a
<abbr>DVC project</abbr> hosted on GitHub, to version the dataset (see its
[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver)
directory).

However, there are a few problems with the way this dataset is structured (in 2
parts). Most importantly, this single dataset is tracked by 2 different
However, there are a few problems with the way this dataset is structured. Most
importantly, this single dataset is tracked by 2 different
[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same
one, which would better reflect the intentions of this dataset... Fortunately,
we have also prepared an improved alternative in the
[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases)
directory of the same repository.

As step one, we extracted the first part of the dataset into the
`use-cases/cats-dogs` directory (illustrated below), and ran <code>dvc add
use-cases/cats-dogs</code> to
To create a
[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
of our dataset, we extracted the first part into the `use-cases/cats-dogs`
directory (illustrated below), and ran <code>dvc add use-cases/cats-dogs</code>
to
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory).

```dvc
@@ -77,33 +79,49 @@ use-cases/cats-dogs
└── dogs [400 image files]
```

This first version uses the
[`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases)
Git tag. In a local DVC project, we can obtain this dataset with the following
command (note the usage of `--rev`):
In a local DVC project, we could have obtained this dataset at this point with
the following command:

```dvc
$ dvc import --rev cats-dogs-v1 \
             git@github.com:iterative/dataset-registry.git \
$ dvc import git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```

> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
> always needs to run from an [initialized](/doc/command-reference/init) DVC
> project.
<details>

### Expand for actionable command (optional)

The command above is meant for informational purposes only. If you actually run
it in a DVC project, although it should work, it will import the latest version
of `use-cases/cats-dogs` from `dataset-registry`. The following command would
actually bring in the version in question:

```dvc
$ dvc import --rev cats-dogs-v1 \
             git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```

See the `dvc import` command reference for more details on the `--rev`
(revision) option.

</details>

Importing keeps the connection between the local project and data registry where
we are downloading the dataset from. This is achieved by creating a special
DVC-file (a.k.a. an _import stage_) – which can be used for versioning the
import with Git in the local project. This connection will come in handy when
the source data changes, and we want to obtain these updates...
DVC-file (a.k.a. _import stage_) – that can be used for versioning the import
with Git. This connection will come in handy when the source data changes, and
we want to obtain these updates...

Back in our **dataset-registry** repository, the second (and last) version of
our dataset exists under the
[`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
tag. It was created by extracting the second part of the dataset, with 1000
additional images (500 cats, 500 dogs) in the same directory structure, and
simply running <code>dvc add use-cases/cats-dogs</code> again.
Back in our **dataset-registry** repository, a
[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases)
of our dataset was created by extracting the second part, with 1000 additional
images (500 cats, 500 dogs), into the same directory structure. Then, we simply
ran <code>dvc add use-cases/cats-dogs</code> again.

In our local project, all we have to do in order to obtain this latest version
of the dataset is to run:
@@ -112,6 +130,25 @@ of the dataset is to run:
$ dvc update cats-dogs.dvc
```

<details>

### Expand for actionable command (optional)

As with the previous hidden note, actually trying the commands above should
produce the expected results, but not for the obvious reasons. Specifically, the
initial `dvc import` command would have already obtained the latest version of
the dataset (as noted before), so this `dvc update` is unnecessary and won't
have an effect.

If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import
stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to
update it, do not use `dvc update`. Instead, re-import the data by using the
original import command (without `--rev`). Refer to
[this example](/doc/command-reference/import#example-fixed-revisions-re-importing)
for more information.

</details>

This downloads new and changed files in `cats-dogs/` from the source project,
and updates the metadata in the import stage DVC-file.
