doc: regular updates; New update cmd ref #490

Merged
merged 25 commits into from
Jul 29, 2019
Changes from 24 commits
5529725
lists: standardize all bullet lists throughout docs!
jorgeorpinel Jul 15, 2019
d4a8f16
term: review usage of "source"; differentiate "source code" and "data…
jorgeorpinel Jul 15, 2019
7e23c98
term: pluralize "the pipeline" per...
jorgeorpinel Jul 16, 2019
22cca95
win: add recs for POSIX terminal on Windows
jorgeorpinel Jul 16, 2019
5d016e9
diff: add note with link to issue explaining how to diff text file co…
jorgeorpinel Jul 18, 2019
3ffa64e
cmd ref: remove "(s)" and similar from command sample outputs...
jorgeorpinel Jul 19, 2019
6e8f58e
term: "initial data" -> "raw data"
jorgeorpinel Jul 19, 2019
61c4996
cmd ref: add `update` ref and changes in related commands
jorgeorpinel Jul 20, 2019
bbc53d8
syntax: remove `;` from bullet points that aren't full sentences
jorgeorpinel Jul 20, 2019
773dfbe
update: rewrite command short desc
jorgeorpinel Jul 20, 2019
4f52401
update: remove "(s)"s
jorgeorpinel Jul 20, 2019
af1ce92
update: add cmd ref link to side menu!
jorgeorpinel Jul 20, 2019
7498687
cmd ref: add `get` example, and related docs; and update "cache" glos…
jorgeorpinel Jul 21, 2019
d7d10fc
get: update example...
jorgeorpinel Jul 22, 2019
6f76e46
user-guide: update external-outputs.md wording and terminology
jorgeorpinel Jul 22, 2019
6d8b854
win: improve notes about POSIX options and
jorgeorpinel Jul 24, 2019
6727383
user-guide: update contributing guides with bullet styles, and
jorgeorpinel Jul 24, 2019
7bf102e
win: improve intro to Windows issue workarounds
jorgeorpinel Jul 25, 2019
1a064b1
lock: mention that imported stages are locked by default
jorgeorpinel Jul 25, 2019
3a9051e
lock: mention that `dvc unlock` is safe for import stages
jorgeorpinel Jul 25, 2019
4f2d35e
get: update example to mention deployment of binary files
jorgeorpinel Jul 27, 2019
d476807
user-guide: contributing-documentation updates per...
jorgeorpinel Jul 28, 2019
90c871a
updte Windows performance guide
jorgeorpinel Jul 28, 2019
4ffe63a
lock: update info about import stages, update, and repro
jorgeorpinel Jul 29, 2019
00d663b
get: Remove `:` from md file
jorgeorpinel Jul 29, 2019
18 changes: 13 additions & 5 deletions src/Documentation/glossary.js
@@ -14,12 +14,13 @@ export default {
},
{
name: 'DVC cache',
match: ['cache'],
match: ['DVC cache', 'cache', 'cache directory'],
desc:
'DVC cache is a hidden storage which is by default found at ' +
'`.dvc/cache`. This storage is used to manage different versions of ' +
'files which are under DVC control. For more information on cache, ' +
'please refer to this [guide](/doc/commands-reference/config#cache).'
'The DVC cache is a hidden storage (by default located in the ' +
'`.dvc/cache` directory) for files that are under DVC control, and ' +
'their different versions. For more details, please refer to this ' +
'[document](/doc/user-guide/dvc-files-and-directories' +
'#structure-of-cache-directory).'
},
{
name: 'Data artifact',
@@ -29,6 +30,13 @@ export default {
'result (such as extracted features or a ML model file) that is ' +
'under DVC control. Refer to [Data and Model Files Versioning]' +
'(/doc/use-cases/data-and-model-files-versioning) for more details.'
},
{
name: 'Import stage',
match: ['import stage', 'import stages'],
desc:
'Stage (DVC-file) created with the `dvc import` or `dvc import-url` ' +
'commands. They represent files or directories from external sources.'
}
]
}
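
Review note: the widened `match` list only helps if term lookup treats every alias the same way. The sketch below is a hypothetical illustration of that, not the site's actual matching code. `findGlossaryEntry` and the case-insensitive comparison are assumptions; only the `name` and `match` values are taken from the diff above.

```javascript
// Entries shaped like those in glossary.js (descriptions omitted for brevity).
const glossary = [
  { name: 'DVC cache', match: ['DVC cache', 'cache', 'cache directory'] },
  { name: 'Import stage', match: ['import stage', 'import stages'] },
];

// Hypothetical helper: return the first entry whose `match` list contains
// the term, ignoring case and surrounding whitespace; undefined if none does.
function findGlossaryEntry(entries, term) {
  const needle = term.trim().toLowerCase();
  return entries.find((entry) =>
    entry.match.some((alias) => alias.toLowerCase() === needle)
  );
}

console.log(findGlossaryEntry(glossary, 'cache directory').name); // prints: DVC cache
console.log(findGlossaryEntry(glossary, 'Import Stages').name); // prints: Import stage
```

Under this assumption, adding `'cache directory'` to `match` lets that phrase resolve to the same tooltip as `'cache'`.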
2 changes: 2 additions & 0 deletions src/Documentation/sidebar.json
@@ -126,6 +126,7 @@
"status.md",
"unlock.md",
"unprotect.md",
"update.md",
"version.md"
],
"labels": {
@@ -170,6 +171,7 @@
"status.md": "status",
"unlock.md": "unlock",
"unprotect.md": "unprotect",
"update.md": "update",
"version.md": "version"
}
},
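
Review note: `update.md` has to be added in two places (the file list and the `labels` map), so a small consistency check could catch a missed entry. This is a sketch under assumptions: the key name `files` and the helper `filesWithoutLabel` are hypothetical, and only the file names and label values come from the diff above.

```javascript
// Assumed shape of one sidebar section; the `files` key name is a guess.
const section = {
  files: ['status.md', 'unlock.md', 'unprotect.md', 'update.md', 'version.md'],
  labels: {
    'status.md': 'status',
    'unlock.md': 'unlock',
    'unprotect.md': 'unprotect',
    'update.md': 'update',
    'version.md': 'version',
  },
};

// Return every listed file that has no entry in the labels map.
function filesWithoutLabel({ files, labels }) {
  return files.filter(
    (file) => !Object.prototype.hasOwnProperty.call(labels, file)
  );
}

console.log(filesWithoutLabel(section)); // prints: []
```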
3 changes: 2 additions & 1 deletion static/docs/commands-reference/checkout.md
@@ -39,6 +39,7 @@ The execution of `dvc checkout` does:
on the command line. And if the `--with-deps` option is specified, it scans
backward from the given `targets` in the corresponding
[pipeline](/doc/get-started/pipeline).

- For any data files where the checksum doesn't match their DVC-file entry, the
data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
@@ -82,7 +83,7 @@ be pulled from a remote cache using `dvc pull`.
DVC will not checkout files referenced in later stage(s) than `targets`.

- `-R`, `--recursive` - `targets` is expected to contain at least one directory
path for this option to have effect. Determines the files to checout by
path for this option to have effect. Determines the files to checkout by
searching each target directory and its subdirectories for DVC-files to
inspect.

22 changes: 12 additions & 10 deletions static/docs/commands-reference/commit.md
@@ -26,12 +26,14 @@ tying stages or a pipeline.
code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and
even `dvc add`) using the `--no-commit` option to avoid caching unnecessary
data over and over again. Use `dvc commit` when the files are finalized.

- One can always execute the code used in a stage without using DVC (keep in
mind that output files or directories in certain cases must first be
unprotected or removed, see `dvc unprotect`). Or one could be developing code
or data, repeatedly manually executing the code until it is working. Once it
is finished, use `dvc add`, `dvc commit`, or `dvc run` when appropriate to
update DVC-files and to store data to the cache.

- Sometimes we want to clean up a code or configuration file in a way that
doesn't cause a change in its results. We might write in-line documentation
with comments, change indentation, remove some debugging printouts, or any
@@ -47,12 +49,12 @@ want. Let's take a look at what is happening in the fist scenario closely:
Normally DVC commands like `dvc add`, `dvc repro` or `dvc run`, commit the data
to the DVC cache as the last step. What _commit_ means is that DVC:

- Computes a checksum for the file/directory.
- Enters the checksum and file name into the DVC-file.
- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`).
Note that if the workspace was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.
- Adds the file/directory or to the DVC cache.
- Computes a checksum for the file/directory
- Enters the checksum and file name into the DVC-file
- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`)
(Note that if the workspace was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.)
- Adds the file/directory or to the DVC cache

There are many cases where the last step is not desirable (usually, rapid
iteration on some experiment). For the DVC commands where available, the
@@ -211,7 +213,7 @@ execute this set of commands:
```dvc
$ dvc commit
$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
$ ls .dvc/cache/70
599f166c2098d7ffca91a369a78b0d
```
@@ -252,8 +254,8 @@ train.dvc:
modified: src/train.py
```

Let's edit one of the source files. It doesn't matter which one. You'll see that
both Git and DVC recognize a change was made.
Let's edit one of the source code files. It doesn't matter which one. You'll see
that both Git and DVC recognize a change was made.

If we ran `dvc repro` at this point, this pipeline would be reproduced. But
since the change was inconsequential, that would be a waste of time and CPU.
@@ -273,7 +275,7 @@ Are you sure you commit it? [y/n] y

$ dvc status

Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Nothing special is required, we simply `commit` to both the SCM and DVC. Since
11 changes: 8 additions & 3 deletions static/docs/commands-reference/diff.md
@@ -27,6 +27,11 @@ were deleted/changed, and the file size differences.
Note that `dvc diff` does not show the line-to-line comparison among the target
files in each revision, like `git diff` does.

> For an example on how to create line-to-line text file comparison refer to
> issue
> [#770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256) in
> our code repository.

If the `-t` option is used, the diff is limited to the `TARGET` file or
directory specified.

@@ -35,9 +40,9 @@ by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.

## Options

- `-t TARGET`, `--target TARGET` - Source path to a data file or directory. If
not specified, compares all files and directories that are under DVC control
in the workspace.
- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not
specified, compares all files and directories that are under DVC control in
the workspace.

- `-h`, `--help` - prints the usage/help message, and exit.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/fetch.md
@@ -309,4 +309,4 @@ the workspace (with `dvc repro train.dvc`).
> Note that in this sample project, the last stage file `evaluate.dvc` doesn't
> add any more data files than those from previous stages so at this point all
> of the files for this pipeline are in local cache and `dvc status -c` would
> output "Pipeline is up to date. Nothing to reproduce."
> output `Pipelines are up to date.`
41 changes: 39 additions & 2 deletions static/docs/commands-reference/get.md
@@ -33,7 +33,7 @@ single-purpose command that can be used out of the box after installing DVC.
> See `dvc get-url` to download data from other supported URLs.

After running this command successfully, the data found in the `url` `path` is
created in the current working directory with its original file name.
created in the current working directory, with its original file name.

## Options

@@ -44,4 +44,41 @@ created in the current working directory with its original file name.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->
## Example:

> Note that `dvc get` can be used from anywhere in the file system, as long as
> DVC is [installed](/doc/get-started/install).

We can use `dvc get` to download the resulting model file from our
[get started example](https://github.com/iterative/example-get-started), which
is a DVC project external to the current working directory. The desired file is
located in the root of the external repo, and named `model.pkl`.

```dvc
$ dvc get https://github.com/iterative/example-get-started model.pkl
Preparing to download data from 'https://remote.dvc.org/get-started'
...
$ ls
model.pkl
```

Note that the `model.pkl` file doesn't actually exist in the
[data directory](https://github.com/iterative/example-get-started/tree/master/)
of the external Git repo. Instead, the corresponding DVC-file
[train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc)
is found, which specifies `model.pkl` in its outputs (`outs`). DVC then
[pulls](/doc/commands-reference/pull) the file from the default
[remote](/doc/commands-reference/remote) of the external DVC project (found in
its
[config file](https://github.com/iterative/example-get-started/blob/master/.dvc/config)).

A common use for downloading binary files from DVC repos, as done in this
example, is to place a ML model inside a wrapper application that serves as an
[ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as an
HTTP/RESTful API (web service) that provides predictions upon request. This can
be automated leveraging DVC with [CI/CD](https://en.wikipedia.org/wiki/CI/CD)
tools.

The same example applies to raw or intermediate data files as well, of course,
for cases where we want to download those files and perform some analysis on
them.
49 changes: 25 additions & 24 deletions static/docs/commands-reference/import-url.md
@@ -2,7 +2,7 @@

Download or copy file or directory from any supported URL (for example `s3://`,
`ssh://`, and other protocols) or local directory to the <abbr>workspace</abbr>,
and track changes in the remote source with DVC. Creates a DVC-file.
and track changes in the remote data source with DVC. Creates a DVC-file.

> See also `dvc get-url` which corresponds to the first step this command
> performs (just download the data).
@@ -20,13 +20,13 @@ positional arguments:
## Description

In some cases it's convenient to add a data file or directory from a remote
location into the workspace, such that it will be automatically updated when the
external data source changes. Examples:
location into the workspace, such that it will be automatically updated (by
`dvc repro`) when the external data source changes. Examples:

- A remote system may produce occasional data files that are used in other
projects.
- A batch process running regularly updates a data file to import.
- A shared dataset on a remote storage that is managed and updated outside DVC.
- a remote system may produce occasional data files that are used in other
projects;
- a batch process running regularly updates a data file to import; and
- a shared dataset on a remote storage that is managed and updated outside DVC.

The `dvc import-url` command helps the user create such an external data
dependency. The `url` argument specifies the external location of the data to be
@@ -95,11 +95,12 @@ dependency. The `dvc import-url` command saves the user from having to manually
copy files from each of the remote storage schemes, and from having to install
CLI tools for each service.

When DVC inspects a DVC-file, its dependencies will be checked to see if any
have changed. A changed dependency will appear in the `dvc status` report,
indicating the need to reproduce this import stage. When DVC inspects an
external dependency, it uses a method appropriate to that dependency to test its
current status.
Note that by default, import stages are locked in their DVC-files (with
`locked: true`). Use `dvc update` manually on them to force updating the
downloaded file or directory from the external data source.

> If `dvc unlock` is used on locked stages, they will start to be checked by
> `dvc status`, and updated by `dvc repro`.

## Options

@@ -164,8 +165,8 @@ using `dvc import-url`:

### Click and expand to prepare the workspace

This is needed to actually run the command below in case you are reproducing
this example:
This is needed to actually run the command below in case you are trying this
example:

```dvc
$ git checkout 2-remote
@@ -223,8 +224,8 @@ file has changed.

What if that remote file is one which will be updated regularly? The project
goal might include regenerating a <abbr>data artifact</abbr> based on the
updated source. A pipeline can be triggered to re-execute based on a changed
external dependency.
updated data source. A pipeline can be triggered to re-execute based on a
changed external dependency.

Let us again use the [Getting Started](/doc/get-started) example, in a way which
will mimic an updated external data source.
@@ -348,7 +349,7 @@ $ tree
3 directories, 10 files

$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Then in the data store directory, edit `data.xml`. It doesn't matter what you
@@ -362,8 +363,8 @@ data.xml.dvc:
modified: /path/to/data-store/data.xml
```

DVC has noticed the external dependency has changed. It is telling us that it is
necessary to now run `dvc repro`.
DVC has noticed the external dependency (import stage) has changed. It is
telling us that it is necessary to now run `dvc repro`.

```dvc
$ dvc repro prepare.dvc
@@ -396,11 +397,11 @@ $ git commit -a -m "updated data"
2 files changed, 6 insertions(+), 6 deletions(-)

$ dvc status
Pipeline is up to date. Nothing to reproduce.
Pipelines are up to date. Nothing to reproduce.
```

Because the external source for the data file changed, the change was noticed by
the `dvc status` command. Running `dvc repro` then ran both stages of this
pipeline, and if we had set up the other stages they also would have been run.
It first downloaded the updated data file. And then noticing that
Because the external data source for the data file changed, the change was
noticed by the `dvc status` command. Running `dvc repro` then ran both stages of
this pipeline, and if we had set up the other stages they also would have been
run. It first downloaded the updated data file. And then noticing that
`data/data.xml` had changed, that triggered the `prepare.dvc` stage to execute.
15 changes: 10 additions & 5 deletions static/docs/commands-reference/import.md
@@ -5,7 +5,7 @@

Download or copy file or directory from another DVC repository (on a git server
such as Github) into the <abbr>workspace</abbr>, and track changes in the remote
source with DVC. Creates a DVC-file.
data source with DVC. Creates a DVC-file.

> See also `dvc get` which corresponds to the first step this command performs
> (just download the data).
@@ -25,8 +25,8 @@ positional arguments:
DVC provides an easy way to reuse datasets, intermediate results, ML models, or
other files and directories tracked in another DVC repository into the present
<abbr>workspace</abbr>. The `dvc import` command downloads such a <abbr>data
artifact</abbr> in a way that it can be tracked with DVC, resulting in automatic
updates when the external data source changes.
artifact</abbr> in a way that it is tracked with DVC, so it can be updated when
the external data source changes.

The `url` argument specifies the external DVC project's Git repository URL (both
HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while
@@ -50,6 +50,13 @@ determine whether the local copy is out of date.
To actually [track the data](https://dvc.org/doc/get-started/add-files),
`git add` (and `git commit`) the import stage (DVC-file).

Note that by default, these import stages are locked in their DVC-files (with
fields `locked: true` and `rev_lock`). Use `dvc update` manually on them to
force updating the downloaded data artifact from the external DVC repo.

> If `dvc unlock` is used on locked stages, they will start to be checked by
> `dvc status`, and updated by `dvc repro`.

## Options

- `-o`, `--out` - specify a location in the workspace to place the imported data
@@ -65,5 +72,3 @@ To actually [track the data](https://dvc.org/doc/get-started/add-files),
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->
14 changes: 7 additions & 7 deletions static/docs/commands-reference/index.md
@@ -2,16 +2,16 @@

DVC is a command-line tool. The typical use case for DVC goes as follows

- In an existing Git repository, initialize a DVC repository with `dvc init`.
- Copy source files for modeling into the repository and convert the files into
DVC data files with `dvc add` command.
- Process source data files through your data processing and modeling code using
the `dvc run` command.
- In an existing Git repository, initialize a DVC repository with `dvc init`,
- Copy source code files for modeling into the repository and convert the files
into DVC data files with `dvc add` command;
- Process raw data files through your data processing and modeling code using
the `dvc run` command;
- Use `--outs` option to specify `dvc run` command outputs which will be
converted to DVC data files after the code runs.
converted to DVC data files after the code runs;
- Clone a git repo with the code of your ML application pipeline. However, this
will not copy your DVC cache. Use
[data remotes](/doc/commands-reference/remote) and `dvc push` to share the
cache (data).
cache (data);
- Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after
your data item files or source code of your ML application are modified.