Merge pull request #490 from jorgeorpinel/master
doc: regular updates; New `update` cmd ref
shcheklein authored Jul 29, 2019
2 parents 5649968 + 00d663b commit bc444e0
Showing 51 changed files with 750 additions and 548 deletions.
18 changes: 13 additions & 5 deletions src/Documentation/glossary.js
@@ -14,12 +14,13 @@ export default {
},
{
name: 'DVC cache',
match: ['cache'],
match: ['DVC cache', 'cache', 'cache directory'],
desc:
'DVC cache is a hidden storage which is by default found at ' +
'`.dvc/cache`. This storage is used to manage different versions of ' +
'files which are under DVC control. For more information on cache, ' +
'please refer to this [guide](/doc/commands-reference/config#cache).'
'The DVC cache is a hidden storage (by default located in the ' +
'`.dvc/cache` directory) for files that are under DVC control, and ' +
'their different versions. For more details, please refer to this ' +
'[document](/doc/user-guide/dvc-files-and-directories' +
'#structure-of-cache-directory).'
},
{
name: 'Data artifact',
@@ -29,6 +30,13 @@ export default {
'result (such as extracted features or a ML model file) that is ' +
'under DVC control. Refer to [Data and Model Files Versioning]' +
'(/doc/use-cases/data-and-model-files-versioning) for more details.'
},
{
name: 'Import stage',
match: ['import stage', 'import stages'],
desc:
'Stage (DVC-file) created with the `dvc import` or `dvc import-url` ' +
'commands. It represents files or directories from external sources.'
}
]
}
2 changes: 2 additions & 0 deletions src/Documentation/sidebar.json
@@ -126,6 +126,7 @@
"status.md",
"unlock.md",
"unprotect.md",
"update.md",
"version.md"
],
"labels": {
@@ -170,6 +171,7 @@
"status.md": "status",
"unlock.md": "unlock",
"unprotect.md": "unprotect",
"update.md": "update",
"version.md": "version"
}
},
3 changes: 2 additions & 1 deletion static/docs/commands-reference/checkout.md
@@ -39,6 +39,7 @@ The execution of `dvc checkout` does:
on the command line. And if the `--with-deps` option is specified, it scans
backward from the given `targets` in the corresponding
[pipeline](/doc/get-started/pipeline).

- For any data files where the checksum doesn't match their DVC-file entry, the
data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
@@ -82,7 +83,7 @@ be pulled from a remote cache using `dvc pull`.
DVC will not checkout files referenced in later stage(s) than `targets`.

- `-R`, `--recursive` - `targets` is expected to contain at least one directory
path for this option to have effect. Determines the files to checout by
path for this option to have effect. Determines the files to checkout by
searching each target directory and its subdirectories for DVC-files to
inspect.

22 changes: 12 additions & 10 deletions static/docs/commands-reference/commit.md
@@ -26,12 +26,14 @@ tying stages or a pipeline.
code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and
even `dvc add`) using the `--no-commit` option to avoid caching unnecessary
data over and over again. Use `dvc commit` when the files are finalized.

- One can always execute the code used in a stage without using DVC (keep in
mind that output files or directories in certain cases must first be
unprotected or removed, see `dvc unprotect`). Or one could be developing code
or data, repeatedly manually executing the code until it is working. Once it
is finished, use `dvc add`, `dvc commit`, or `dvc run` when appropriate to
update DVC-files and to store data to the cache.

- Sometimes we want to clean up a code or configuration file in a way that
doesn't cause a change in its results. We might write in-line documentation
with comments, change indentation, remove some debugging printouts, or any
@@ -47,12 +49,12 @@ want. Let's take a look at what is happening in the first scenario closely:
Normally DVC commands like `dvc add`, `dvc repro` or `dvc run`, commit the data
to the DVC cache as the last step. What _commit_ means is that DVC:

- Computes a checksum for the file/directory.
- Enters the checksum and file name into the DVC-file.
- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`).
Note that if the workspace was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.
- Adds the file/directory or to the DVC cache.
- Computes a checksum for the file/directory
- Enters the checksum and file name into the DVC-file
- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`)
(Note that if the workspace was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.)
- Adds the file/directory to the DVC cache

There are many cases where the last step is not desirable (usually, rapid
iteration on some experiment). For the DVC commands where available, the
@@ -211,7 +213,7 @@ execute this set of commands:
```dvc
$ dvc commit
$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
$ ls .dvc/cache/70
599f166c2098d7ffca91a369a78b0d
```
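
The `ls .dvc/cache/70` listing above suggests how a checksum maps to a cache location: the first two characters name a subdirectory and the remaining characters name the file. The following Python sketch illustrates that mapping. It assumes MD5 checksums (the 2 + 30 character split is consistent with a 32-character MD5 hex digest); the helper names `file_md5` and `cache_path` are hypothetical and are not part of DVC's API.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Map a checksum to <cache_dir>/<first 2 chars>/<remaining chars>.

    Assumption: this mirrors the layout seen in the listing above.
    """
    return Path(cache_dir) / checksum[:2] / checksum[2:]


# Checksum reassembled from the directory and file name in the listing above.
print(cache_path("70599f166c2098d7ffca91a369a78b0d").as_posix())
# prints: .dvc/cache/70/599f166c2098d7ffca91a369a78b0d
```

Reading the file in chunks keeps memory use flat for large data files, which is the typical case for cached data.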
@@ -252,8 +254,8 @@ train.dvc:
modified: src/train.py
```

Let's edit one of the source files. It doesn't matter which one. You'll see that
both Git and DVC recognize a change was made.
Let's edit one of the source code files. It doesn't matter which one. You'll see
that both Git and DVC recognize a change was made.

If we ran `dvc repro` at this point, this pipeline would be reproduced. But
since the change was inconsequential, that would be a waste of time and CPU.
@@ -273,7 +275,7 @@ Are you sure you commit it? [y/n] y
$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Nothing special is required, we simply `commit` to both the SCM and DVC. Since
11 changes: 8 additions & 3 deletions static/docs/commands-reference/diff.md
@@ -27,6 +27,11 @@ were deleted/changed, and the file size differences.
Note that `dvc diff` does not show the line-to-line comparison among the target
files in each revision, like `git diff` does.

> For an example of how to create a line-to-line text file comparison, refer to
> issue
> [#770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256) in
> our code repository.
If the `-t` option is used, the diff is limited to the `TARGET` file or
directory specified.

@@ -35,9 +40,9 @@ by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.

## Options

- `-t TARGET`, `--target TARGET` - Source path to a data file or directory. If
not specified, compares all files and directories that are under DVC control
in the workspace.
- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not
specified, compares all files and directories that are under DVC control in
the workspace.

- `-h`, `--help` - prints the usage/help message, and exits.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/fetch.md
@@ -309,4 +309,4 @@ the workspace (with `dvc repro train.dvc`).
> Note that in this sample project, the last stage file `evaluate.dvc` doesn't
> add any more data files than those from previous stages so at this point all
> of the files for this pipeline are in local cache and `dvc status -c` would
> output "Pipeline is up to date. Nothing to reproduce."
> output `Pipelines are up to date.`
41 changes: 39 additions & 2 deletions static/docs/commands-reference/get.md
@@ -33,7 +33,7 @@ single-purpose command that can be used out of the box after installing DVC.
> See `dvc get-url` to download data from other supported URLs.
After running this command successfully, the data found in the `url` `path` is
created in the current working directory with its original file name.
created in the current working directory, with its original file name.

## Options

@@ -44,4 +44,41 @@ created in the current working directory with its original file name.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->
## Example

> Note that `dvc get` can be used from anywhere in the file system, as long as
> DVC is [installed](/doc/get-started/install).
We can use `dvc get` to download the resulting model file from our
[get started example](https://github.com/iterative/example-get-started), which
is a DVC project external to the current working directory. The desired file is
located in the root of the external repo, and named `model.pkl`.

```dvc
$ dvc get https://github.com/iterative/example-get-started model.pkl
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ ls
model.pkl
```

Note that the `model.pkl` file doesn't actually exist in the
[root directory](https://github.com/iterative/example-get-started/tree/master/)
of the external Git repo. Instead, the corresponding DVC-file
[train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc)
is found, which specifies `model.pkl` in its outputs (`outs`). DVC then
[pulls](/doc/commands-reference/pull) the file from the default
[remote](/doc/commands-reference/remote) of the external DVC project (found in
its
[config file](https://github.com/iterative/example-get-started/blob/master/.dvc/config)).
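
The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the DVC-files are represented as already-parsed dictionaries, the `featurize.dvc` entry and the helper name `find_stage_for_output` are hypothetical, and DVC's real implementation may differ.

```python
# Hypothetical, already-parsed DVC-files from the external repo.
# Only the fields needed for the lookup are shown.
dvc_files = {
    "featurize.dvc": {"outs": [{"path": "data/features"}]},  # illustrative entry
    "train.dvc": {"outs": [{"path": "model.pkl"}]},
}


def find_stage_for_output(dvc_files, target):
    """Return the name of the DVC-file that lists `target` in its outs."""
    for name, stage in dvc_files.items():
        for out in stage.get("outs", []):
            if out.get("path") == target:
                return name
    return None  # no DVC-file tracks this path


print(find_stage_for_output(dvc_files, "model.pkl"))  # prints: train.dvc
```

Once the matching DVC-file is known, the file itself would be fetched from the external project's default remote, as the paragraph above explains.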

A common use for downloading binary files from DVC repos, as done in this
example, is to place an ML model inside a wrapper application that serves as an
[ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as an
HTTP/RESTful API (web service) that provides predictions upon request. This can
be automated leveraging DVC with [CI/CD](https://en.wikipedia.org/wiki/CI/CD)
tools.

The same example applies to raw or intermediate data files as well, of course,
for cases where we want to download those files and perform some analysis on
them.
49 changes: 25 additions & 24 deletions static/docs/commands-reference/import-url.md
@@ -2,7 +2,7 @@

Download or copy file or directory from any supported URL (for example `s3://`,
`ssh://`, and other protocols) or local directory to the <abbr>workspace</abbr>,
and track changes in the remote source with DVC. Creates a DVC-file.
and track changes in the remote data source with DVC. Creates a DVC-file.

> See also `dvc get-url` which corresponds to the first step this command
> performs (just download the data).
@@ -20,13 +20,13 @@ positional arguments:
## Description

In some cases it's convenient to add a data file or directory from a remote
location into the workspace, such that it will be automatically updated when the
external data source changes. Examples:
location into the workspace, such that it will be automatically updated (by
`dvc repro`) when the external data source changes. Examples:

- A remote system may produce occasional data files that are used in other
projects.
- A batch process running regularly updates a data file to import.
- A shared dataset on a remote storage that is managed and updated outside DVC.
- a remote system may produce occasional data files that are used in other
projects;
- a batch process running regularly updates a data file to import; and
- a shared dataset on a remote storage that is managed and updated outside DVC.

The `dvc import-url` command helps the user create such an external data
dependency. The `url` argument specifies the external location of the data to be
@@ -95,11 +95,12 @@ dependency. The `dvc import-url` command saves the user from having to manually
copy files from each of the remote storage schemes, and from having to install
CLI tools for each service.

When DVC inspects a DVC-file, its dependencies will be checked to see if any
have changed. A changed dependency will appear in the `dvc status` report,
indicating the need to reproduce this import stage. When DVC inspects an
external dependency, it uses a method appropriate to that dependency to test its
current status.
Note that by default, import stages are locked in their DVC-files (with
`locked: true`). Use `dvc update` manually on them to force updating the
downloaded file or directory from the external data source.

> If `dvc unlock` is used on locked stages, they will start to be checked by
> `dvc status`, and updated by `dvc repro`.
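
As a rough illustration of the locking behavior described above, the following Python sketch models a stage as a dictionary with a `locked` field. The helper names `is_checked` and `unlock` are assumptions made for illustration only; this is not DVC's actual code.

```python
def is_checked(stage):
    """Return True if status/repro-style checks would consider this stage."""
    # Assumption: a missing `locked` field means the stage is not locked.
    return not stage.get("locked", False)


def unlock(stage):
    """Return a copy of the stage with the lock removed (hypothetical helper)."""
    unlocked = dict(stage)
    unlocked["locked"] = False
    return unlocked


# Import stages are locked by default, per the text above.
import_stage = {
    "locked": True,
    "deps": [{"path": "/path/to/data-store/data.xml"}],
}

print(is_checked(import_stage))          # False: skipped while locked
print(is_checked(unlock(import_stage)))  # True: checked once unlocked
```

In this model, a locked import stage changes only when it is explicitly refreshed, which corresponds to running `dvc update` on it.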
## Options

@@ -164,8 +165,8 @@ using `dvc import-url`:

### Click and expand to prepare the workspace

This is needed to actually run the command below in case you are reproducing
this example:
This is needed to actually run the command below in case you are trying this
example:

```dvc
$ git checkout 2-remote
@@ -223,8 +224,8 @@ file has changed.

What if that remote file is one which will be updated regularly? The project
goal might include regenerating a <abbr>data artifact</abbr> based on the
updated source. A pipeline can be triggered to re-execute based on a changed
external dependency.
updated data source. A pipeline can be triggered to re-execute based on a
changed external dependency.

Let us again use the [Getting Started](/doc/get-started) example, in a way which
will mimic an updated external data source.
@@ -348,7 +349,7 @@ $ tree
3 directories, 10 files
$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Then in the data store directory, edit `data.xml`. It doesn't matter what you
@@ -362,8 +363,8 @@ data.xml.dvc:
modified: /path/to/data-store/data.xml
```

DVC has noticed the external dependency has changed. It is telling us that it is
necessary to now run `dvc repro`.
DVC has noticed the external dependency (import stage) has changed. It is
telling us that it is necessary to now run `dvc repro`.

```dvc
$ dvc repro prepare.dvc
@@ -396,11 +397,11 @@ $ git commit -a -m "updated data"
2 files changed, 6 insertions(+), 6 deletions(-)
$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Because the external source for the data file changed, the change was noticed by
the `dvc status` command. Running `dvc repro` then ran both stages of this
pipeline, and if we had set up the other stages they also would have been run.
It first downloaded the updated data file. And then noticing that
Because the external data source for the data file changed, the change was
noticed by the `dvc status` command. Running `dvc repro` then ran both stages of
this pipeline, and if we had set up the other stages they also would have been
run. It first downloaded the updated data file. Then, noticing that
`data/data.xml` had changed, it triggered the `prepare.dvc` stage to execute.
15 changes: 10 additions & 5 deletions static/docs/commands-reference/import.md
@@ -5,7 +5,7 @@
Download or copy a file or directory from another DVC repository (on a Git server
such as GitHub) into the <abbr>workspace</abbr>, and track changes in the remote
source with DVC. Creates a DVC-file.
data source with DVC. Creates a DVC-file.

> See also `dvc get` which corresponds to the first step this command performs
> (just download the data).
@@ -25,8 +25,8 @@ positional arguments:
DVC provides an easy way to reuse datasets, intermediate results, ML models, or
other files and directories tracked in another DVC repository into the present
<abbr>workspace</abbr>. The `dvc import` command downloads such a <abbr>data
artifact</abbr> in a way that it can be tracked with DVC, resulting in automatic
updates when the external data source changes.
artifact</abbr> in a way that it is tracked with DVC, so it can be updated when
the external data source changes.

The `url` argument specifies the external DVC project's Git repository URL (both
HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while
@@ -50,6 +50,13 @@ determine whether the local copy is out of date.
To actually [track the data](https://dvc.org/doc/get-started/add-files),
`git add` (and `git commit`) the import stage (DVC-file).

Note that by default, these import stages are locked in their DVC-files (with
fields `locked: true` and `rev_lock`). Use `dvc update` manually on them to
force updating the downloaded data artifact from the external DVC repo.

> If `dvc unlock` is used on locked stages, they will start to be checked by
> `dvc status`, and updated by `dvc repro`.
## Options

- `-o`, `--out` - specify a location in the workspace to place the imported data
@@ -65,5 +72,3 @@ To actually [track the data](https://dvc.org/doc/get-started/add-files),
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->
14 changes: 7 additions & 7 deletions static/docs/commands-reference/index.md
@@ -2,16 +2,16 @@

DVC is a command-line tool. The typical use case for DVC goes as follows:

- In an existing Git repository, initialize a DVC repository with `dvc init`.
- Copy source files for modeling into the repository and convert the files into
DVC data files with `dvc add` command.
- Process source data files through your data processing and modeling code using
the `dvc run` command.
- In an existing Git repository, initialize a DVC repository with `dvc init`;
- Copy source code files for modeling into the repository and convert the files
into DVC data files with the `dvc add` command;
- Process raw data files through your data processing and modeling code using
the `dvc run` command;
- Use the `--outs` option to specify `dvc run` command outputs which will be
converted to DVC data files after the code runs.
converted to DVC data files after the code runs;
- Clone a git repo with the code of your ML application pipeline. However, this
will not copy your DVC cache. Use
[data remotes](/doc/commands-reference/remote) and `dvc push` to share the
cache (data).
cache (data);
- Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after
your data item files or source code of your ML application are modified.