Commit
import,update: explain rev field and update vs re-importing for #735,…
… but also for use-case: add expandable sections to new data registry case per #679 (comment) and other misc. copy edits. Also standardizes term "external" (repo) vs. "source" data/project in this context and introduces the term "revision fixing".
1 parent 008e358, commit 131af1e
Showing 3 changed files with 129 additions and 47 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,7 @@ | |
Download or copy file or directory from any <abbr>DVC project</abbr> in a Git | ||
repository (e.g. hosted on GitHub) into the <abbr>workspace</abbr>, and track | ||
changes in this [external dependency](/doc/user-guide/external-dependencies). | ||
Creates a DVC-file. | ||
Creates a special DVC-file, a.k.a. an _import stage_. | ||
|
||
> See also `dvc get`, that corresponds to the first step this command performs | ||
> (just download the data). | ||
|
@@ -23,43 +23,43 @@ positional arguments: | |
DVC provides an easy way to reuse datasets, intermediate results, ML models, or | ||
other files and directories tracked in another DVC repository into the | ||
workspace. The `dvc import` command downloads such a <abbr>data artifact</abbr> | ||
in a way that it is tracked with DVC, so it can be updated when the external | ||
data source changes. | ||
in a way that it is tracked with DVC, so it can be updated when the data source | ||
changes. | ||
|
||
The `url` argument specifies the address of the Git repository containing the | ||
external <abbr>project</abbr>. Both HTTP and SSH protocols are supported for | ||
source <abbr>project</abbr>. Both HTTP and SSH protocols are supported for | ||
online repositories (e.g. `[user@]server:project.git`). `url` can also be a | ||
local file system path to an "offline" repository. | ||
|
||
The `path` argument of this command is used to specify the location of the data | ||
to be downloaded within the source project. It should point to a data file or | ||
directory tracked by that project – specified in one of the | ||
[DVC-files](/doc/user-guide/dvc-file-format) of the repository at `url`. (You | ||
will not find these files directly in the source Git repository.) The source | ||
will not find these files directly in the external Git repository.) The source | ||
project should have a default [DVC remote](/doc/command-reference/remote) | ||
configured, containing them. | ||
|
||
> See `dvc import-url` to download and track data from other supported URLs. | ||
After running this command successfully, the imported data is placed in the | ||
current working directory with its original file name e.g. `data.txt`. An import | ||
stage (DVC-file) is then created extending the full file or directory name of | ||
the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` to | ||
generate the same output. | ||
current working directory with its original file name e.g. `data.txt`. An | ||
_import stage_ (DVC-file) is then created, extending the full file or directory | ||
name of the imported data e.g. `data.txt.dvc` – similar to having used `dvc run` | ||
to generate the same output. | ||
|
||
DVC supports DVC-files that refer to data in an external DVC repository (hosted | ||
on a Git server). In such a DVC-file, the `deps` section specifies the `repo` | ||
URL and data `path`, and the `outs` section contains the corresponding local | ||
path in the workspace. It records enough data from the external file or | ||
directory to enable DVC to efficiently check it to determine whether the local | ||
copy is out of date. | ||
on a Git server), a.k.a. _import stages_. In such a DVC-file, the `deps` section | ||
specifies the `repo` URL and data `path`, and the `outs` section contains the | ||
corresponding local path in the workspace. It records enough data from the | ||
external file or directory to enable DVC to efficiently check it to determine | ||
whether the local copy is out of date. | ||
|
||
To actually [track the data](https://dvc.org/doc/get-started/add-files), | ||
`git add` (and `git commit`) the import stage (DVC-file). | ||
`git add` (and `git commit`) the import stage. | ||
|
||
Note that import stages are considered always "locked", meaning that if you run | ||
`dvc repro`, they won't be updated. Use `dvc update` on them to update the | ||
downloaded data artifact from the external DVC repository. | ||
downloaded data artifact from the source DVC repository. | ||
|
||
## Options | ||
|
||
|
@@ -72,8 +72,10 @@ downloaded data artifact from the external DVC repository. | |
- `--rev` - specific | ||
[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) | ||
(such as a branch name, a tag, or a commit hash) of the DVC repository to | ||
import the data from. The tip of the default branch is used by default when | ||
this option is not specified. | ||
import the data from. The tip of the repository's default branch is used by | ||
default when this option is not specified. Note that this adds a `rev` field | ||
in the import stage that fixes it to this revision. This can impact the | ||
behavior of `dvc update`. | ||
|
||
- `-h`, `--help` - prints the usage/help message, and exit. | ||
|
||
|
@@ -120,3 +122,35 @@ outs: | |
Several of the values above are pulled from the original stage file | ||
`model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used | ||
to specify the origin and version of the dependency. | ||
|
||
## Example: fixed revisions & re-importing | ||
|
||
When the `--rev` option is used, the import stage | ||
([DVC-file](/doc/user-guide/dvc-file-format)) will include a `rev` field under | ||
`repo` like this: | ||
|
||
```yaml | ||
deps: | ||
  - path: data/data.xml | ||
    repo: | ||
      url: git@github.com:iterative/dataset-registry.git | ||
      rev: cats-dogs-v1 | ||
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c | ||
``` | ||
|
||
If the Git revision moves, such as a branch, this doesn't have much of an effect | ||
on the import/update workflow. However, for static refs such as tags (unless | ||
manually updated), or for SHA commits, `dvc update` will not have any effect on | ||
the import. In these cases, in order to actually "update" an import, it's | ||
necessary to **re-import the data** instead, by using `dvc import` again, either | ||
without `--rev` or with a different one. For example: | ||
|
||
```dvc | ||
$ dvc import --rev master \ | ||
git@github.com:iterative/dataset-registry.git \ | ||
use-cases/cats-dogs | ||
``` | ||
|
||
This will overwrite the import stage (DVC-file) either removing or replacing the | ||
`rev` field. This can produce an import stage that is able to be updated | ||
normally with `dvc update` going forward. |
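
To make the `rev` field described above more concrete, here is a minimal Python sketch that checks whether an import stage is fixed to a revision. It assumes the DVC-file has already been loaded into a plain dictionary with a YAML parser of your choice. The helper names `fixed_revisions` and `looks_like_commit_sha` are hypothetical and not part of DVC. Note that a revision name alone cannot tell a tag from a branch, so the sketch only recognizes full commit hashes as certainly static.

```python
import re

# A full Git commit hash: 40 hexadecimal characters.
SHA_RE = re.compile(r"[0-9a-f]{40}")


def fixed_revisions(stage):
    """Return the `rev` values found under `repo` in the stage's deps."""
    revs = []
    for dep in stage.get("deps", []):
        repo = dep.get("repo") or {}
        if "rev" in repo:
            revs.append(repo["rev"])
    return revs


def looks_like_commit_sha(rev):
    """True if `rev` is a full 40-character hexadecimal commit hash."""
    return SHA_RE.fullmatch(rev) is not None


# Mirrors the `deps` snippet from the example above (illustrative data).
stage = {
    "deps": [
        {
            "path": "data/data.xml",
            "repo": {
                "url": "git@github.com:iterative/dataset-registry.git",
                "rev": "cats-dogs-v1",
                "rev_lock": "0547f5883fb18e523e35578e2f0d19648c8f2d5c",
            },
        }
    ]
}

for rev in fixed_revisions(stage):
    kind = "a commit hash" if looks_like_commit_sha(rev) else "a branch or tag name"
    print(f"import is fixed to {rev!r} ({kind})")
```

If the list returned by `fixed_revisions` is empty, the import follows the default branch of the source repository and `dvc update` should behave as usual.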
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,22 +47,24 @@ containing 2800 images of cats and dogs. We partitioned the dataset in two for | |
our [Versioning Tutorial](/doc/tutorials/versioning), and backed up the parts on | ||
a storage server, downloading them with `wget` in our examples. This setup was | ||
then revised to download the dataset with `dvc get` instead, so we created the | ||
[dataset-registry](https://github.com/iterative/dataset-registry) project, a | ||
[dataset-registry](https://github.com/iterative/dataset-registry) repository, a | ||
<abbr>DVC project</abbr> hosted on GitHub, to version the dataset (see its | ||
[`tutorial/ver`](https://github.com/iterative/dataset-registry/tree/master/tutorial/ver) | ||
directory). | ||
|
||
However, there are a few problems with the way this dataset is structured (in 2 | ||
parts). Most importantly, this single dataset is tracked by 2 different | ||
However, there are a few problems with the way this dataset is structured. Most | ||
importantly, this single dataset is tracked by 2 different | ||
[DVC-files](/doc/user-guide/dvc-file-format), instead of 2 versions of the same | ||
one, which would better reflect the intentions of this dataset... Fortunately, | ||
we have also prepared an improved alternative in the | ||
[`use-cases/`](https://github.com/iterative/dataset-registry/tree/master/use-cases) | ||
directory of the same repository. | ||
|
||
As step one, we extracted the first part of the dataset into the | ||
`use-cases/cats-dogs` directory (illustrated below), and ran <code>dvc add | ||
use-cases/cats-dogs</code> to | ||
To create a | ||
[first version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) | ||
of our dataset, we extracted the first part into the `use-cases/cats-dogs` | ||
directory (illustrated below), and ran <code>dvc add use-cases/cats-dogs</code> | ||
to | ||
[track the entire directory](https://dvc.org/doc/command-reference/add#example-directory). | ||
|
||
```dvc | ||
|
@@ -77,33 +79,49 @@ use-cases/cats-dogs | |
└── dogs [400 image files] | ||
``` | ||
|
||
This first version uses the | ||
[`cats-dogs-v1`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v1/use-cases) | ||
Git tag. In a local DVC project, we can obtain this dataset with the following | ||
command (note the usage of `--rev`): | ||
In a local DVC project, we could have obtained this dataset at this point with | ||
the following command: | ||
|
||
```dvc | ||
$ dvc import --rev cats-dogs-v1 \ | ||
git@github.com:iterative/dataset-registry.git \ | ||
$ dvc import git@github.com:iterative/dataset-registry.git \ | ||
use-cases/cats-dogs | ||
``` | ||
|
||
> Note that unlike `dvc get`, which can be used from any directory, `dvc import` | ||
> always needs to run from an [initialized](/doc/command-reference/init) DVC | ||
> project. | ||
<details> | ||
|
||
### Expand for actionable command (optional) | ||
|
||
The command above is meant for informational purposes only. If you actually run | ||
it in a DVC project, although it should work, it will import the latest version | ||
of `use-cases/cats-dogs` from `dataset-registry`. The following command would | ||
actually bring in the version in question: | ||
|
||
```dvc | ||
$ dvc import --rev cats-dogs-v1 \ | ||
git@github.com:iterative/dataset-registry.git \ | ||
use-cases/cats-dogs | ||
``` | ||
|
||
See the `dvc import` command reference for more details on the `--rev` | ||
(revision) option. | ||
|
||
</details> | ||
|
||
Importing keeps the connection between the local project and data registry where | ||
we are downloading the dataset from. This is achieved by creating a special | ||
DVC-file (a.k.a. an _import stage_) – which can be used for versioning the | ||
import with Git in the local project. This connection will come in handy when | ||
the source data changes, and we want to obtain these updates... | ||
DVC-file (a.k.a. _import stage_) – that can be used for versioning the import | ||
with Git. This connection will come in handy when the source data changes, and | ||
we want to obtain these updates... | ||
|
||
Back in our **dataset-registry** repository, the second (and last) version of | ||
our dataset exists under the | ||
[`cats-dogs-v2`](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) | ||
tag. It was created by extracting the second part of the dataset, with 1000 | ||
additional images (500 cats, 500 dogs) in the same directory structure, and | ||
simply running <code>dvc add use-cases/cats-dogs</code> again. | ||
Back in our **dataset-registry** repository, a | ||
[second version](https://github.com/iterative/dataset-registry/tree/cats-dogs-v2/use-cases) | ||
of our dataset was created by extracting the second part, with 1000 additional | ||
images (500 cats, 500 dogs), into the same directory structure. Then, we simply | ||
ran <code>dvc add use-cases/cats-dogs</code> again. | ||
|
||
In our local project, all we have to do in order to obtain this latest version | ||
of the dataset is to run: | ||
|
@@ -112,6 +130,25 @@ of the dataset is to run: | |
$ dvc update cats-dogs.dvc | ||
``` | ||
|
||
<details> | ||
|
||
### Expand for actionable command (optional) | ||
|
||
As with the previous hidden note, actually trying the commands above should | ||
produce the expected results, but not for obvious reasons. Specifically, the | ||
initial `dvc import` command would have already obtained the latest version of | ||
the dataset (as noted before), so this `dvc update` is unnecessary and won't | ||
have an effect. | ||
|
||
If you ran the `dvc import --rev cats-dogs-v1 ...` command instead, its import | ||
stage (DVC-file) would be fixed to that Git tag (`cats-dogs-v1`). In order to | ||
update it, do not use `dvc update`. Instead, re-import the data by using the | ||
original import command (without `--rev`). Refer to | ||
[this example](/doc/command-reference/import#example-fixed-revisions-re-importing) | ||
for more information. | ||
|
||
</details> | ||
|
||
This downloads new and changed files in `cats-dogs/` from the source project, | ||
and updates the metadata in the import stage DVC-file. | ||
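
The choice between updating and re-importing can be summarized in a small Python sketch. It only builds the command line as a list of arguments and does not run DVC. The function name `refresh_command` and its `static_rev` flag are hypothetical: the caller is assumed to know whether the import stage is fixed to a static revision (a tag or commit hash), since that cannot be inferred from this snippet alone.

```python
URL = "git@github.com:iterative/dataset-registry.git"
PATH = "use-cases/cats-dogs"


def refresh_command(stage_file, url, path, static_rev=False, new_rev=None):
    """Build (but do not run) the DVC command that refreshes an import.

    static_rev: True if the import stage is fixed to a tag or commit hash.
    new_rev: revision to re-import at, or None for the default branch.
    """
    if not static_rev:
        # Not fixed to a static revision: a plain update is enough.
        return ["dvc", "update", stage_file]
    # Fixed to a static revision: re-import, dropping or replacing --rev.
    cmd = ["dvc", "import"]
    if new_rev is not None:
        cmd += ["--rev", new_rev]
    return cmd + [url, path]


print(" ".join(refresh_command("cats-dogs.dvc", URL, PATH)))
print(" ".join(refresh_command("cats-dogs.dvc", URL, PATH, static_rev=True)))
```

The resulting list could be passed to `subprocess.run` from inside an initialized DVC project, though that step is left out here on purpose.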
|
||
|