Write import, get, and get-url refs; std. existing refs; other doc format/lang. updates #464

Merged
merged 45 commits into from
Jul 15, 2019
5575a43
import-url: finish changing old references to `import`, now `import-url`
jorgeorpinel Jun 29, 2019
c8b3739
term: review and increment usage of "protocol"
jorgeorpinel Jun 29, 2019
a25620d
cmd ref: full name for subcommands
jorgeorpinel Jun 29, 2019
618584a
term: S3 buckets have "keys" (not paths)
jorgeorpinel Jul 1, 2019
48f9121
get-url: adds first cmd ref doc
jorgeorpinel Jul 1, 2019
2be096a
remote: small update to remote small duplicity
jorgeorpinel Jul 1, 2019
661705d
term: std way to use "current" and "present" (working directory)
jorgeorpinel Jul 1, 2019
f887c50
term: fix links to "stage"
jorgeorpinel Jul 1, 2019
900b8fc
term: review usage of "setting", favoring "config option" or
jorgeorpinel Jul 1, 2019
1583de8
term: revert usage of "key" vs "path" for S3 remote URLS
jorgeorpinel Jul 3, 2019
fa4ed4f
cmd ref: update import-url/get-rul and related commands with
jorgeorpinel Jul 3, 2019
92c62b5
term: revert "present" for "current" (working directory)
jorgeorpinel Jul 3, 2019
d865ff8
status: update `-a` option desc.
jorgeorpinel Jul 3, 2019
2b0ef40
revert a couple recent errors
jorgeorpinel Jul 5, 2019
37eae0f
term: review "data artifact" related terms and add glossary <abbr> tag
jorgeorpinel Jul 5, 2019
2ff746b
add: revert shortenned command list in desc.
jorgeorpinel Jul 5, 2019
6e467dd
term: review usage of "check" and "checkout"
jorgeorpinel Jul 8, 2019
e635432
download: reduce summary notes in import-url and get-url
jorgeorpinel Jul 8, 2019
836fa17
guides: add `get-url` to comment spec in DVC-file format doc
jorgeorpinel Jul 8, 2019
44e9c95
get-url: remove S3 write ops and permisions from ref.
jorgeorpinel Jul 8, 2019
10a52f3
term: revive "import stage"
jorgeorpinel Jul 8, 2019
23ca5db
guide: remove unnecessary sentence from share-data
jorgeorpinel Jul 9, 2019
880736d
status: rewrap usage code block
jorgeorpinel Jul 9, 2019
3fb8da1
cases: mention directories and `dvc run` in data-and-model-files-vers…
jorgeorpinel Jul 9, 2019
b61b26d
cmd ref: First version of `import` and `get`, with updated `-url` cou…
jorgeorpinel Jul 9, 2019
b94c799
Simplify notes about single-use commands.
jorgeorpinel Jul 11, 2019
88dddaf
cmd ref: Add "Git server e.g. Github" note to `import` and `get` summ…
jorgeorpinel Jul 11, 2019
ae0ed5c
cmd ref: updated `url` arg desc in `import` and `get`
jorgeorpinel Jul 11, 2019
65980d9
cmd ref: add note abot http and ssh protocols to `get` and `import`
jorgeorpinel Jul 11, 2019
aa4067b
init: clarify "local" (repo) term
jorgeorpinel Jul 11, 2019
6780754
remote: fix grammar in `--local` option of `modify` and `remove`
jorgeorpinel Jul 11, 2019
7f33a3a
term: review use of "config(uration) file" and link to /doc/commands-…
jorgeorpinel Jul 11, 2019
d5b38f4
remote: std `--local` opt desc
jorgeorpinel Jul 11, 2019
ce2dc0c
cmd ref: remove outdated comment from `get` and `import`
jorgeorpinel Jul 11, 2019
dab6a36
cmd ref: make note about single-use commands into regular paragraphs
jorgeorpinel Jul 11, 2019
4f9308f
cmd ref: udpate `url` arg desc in `get` and `import` (again)
jorgeorpinel Jul 11, 2019
92a5238
version: remove unnecessary note
jorgeorpinel Jul 11, 2019
fd7f428
s3: update info on boto3 methods and permissions required...
jorgeorpinel Jul 12, 2019
13bde23
init: update with details about using or nto a Git repo for the DVC p…
jorgeorpinel Jul 12, 2019
c38cc8e
cmd ref: improve desc of `import` and `get` commands, et al
jorgeorpinel Jul 14, 2019
377410c
cmd ref: fix command to install DVC with pip inc all remotes
jorgeorpinel Jul 14, 2019
6d0d0e0
install: add [oss] to list of optional deps when installing via pip
jorgeorpinel Jul 15, 2019
2b1008a
cmd ref: be more specific about what import and get are for...
jorgeorpinel Jul 15, 2019
d53434d
cmd ref: clarify that get and get-url download files anywhere...
jorgeorpinel Jul 15, 2019
1bdd04f
import: add note that the original release is now import-url in cmd ref
jorgeorpinel Jul 15, 2019
6 changes: 6 additions & 0 deletions src/Documentation/sidebar.json
@@ -92,8 +92,11 @@
"destroy.md",
"diff.md",
"fetch.md",
"get-url.md",
"get.md",
"gc.md",
"import-url.md",
"import.md",
"init.md",
"install.md",
"lock.md",
@@ -135,8 +138,11 @@
"destroy.md": "destroy",
"diff.md": "diff",
"fetch.md": "fetch",
"get-url.md": "get-url",
"get.md": "get",
"gc.md": "gc",
"import-url.md": "import-url",
"import.md": "import",
"init.md": "init",
"install.md": "install",
"lock.md": "lock",
12 changes: 6 additions & 6 deletions static/docs/commands-reference/add.md
@@ -69,12 +69,12 @@ to work with directory hierarchies with `dvc add`.
the single DVC-file points to a file in the DVC cache that contains
references to the files in the added hierarchy.

In a DVC project `dvc add` can be used to version control any data artifacts -
input, intermediate, output files and directories, as well as model files. It is
useful by itself to go back and forth between different versions of datasets or
models. Usually though, it is recommended to use `dvc run` and `dvc repro`
mechanism to version control intermediate and output artifacts (like models).
This way you bring data provenance and make your project reproducible.
In a DVC project `dvc add` can be used to version control any <abbr>data
artifact</abbr> (input, intermediate, or output files and directories, and model
files). It is useful by itself to go back and forth between different versions
of datasets or models. Usually though, it is recommended to use `dvc run` and
`dvc repro` mechanism to version control intermediate and final results (like
models). This way you bring data provenance and make your project reproducible.

## Options

5 changes: 2 additions & 3 deletions static/docs/commands-reference/cache.md
@@ -21,9 +21,8 @@ default `cache` directory.

The DVC cache is where your data files, models, etc (anything you want to
version with DVC) are actually stored. The corresponding files you see in the
working directory or "workspace" simply link to the ones in cache. (See
`dvc config cache` `type` setting for more information on file links on
different platforms.)
workspace simply link to the ones in cache. (See `dvc config cache`, `type`
config option, for more information on file links on different platforms.)

> For more cache-related configuration options refer to `dvc config cache`.

15 changes: 7 additions & 8 deletions static/docs/commands-reference/cache_dir.md
@@ -1,4 +1,4 @@
# dir
# cache dir

Set/unset the cache directory location intuitively (compared to using
`dvc config cache`).
@@ -18,7 +18,7 @@ positional arguments:

Helper to set the `cache.dir` configuration option. Unlike doing so with
`dvc config cache`, this command transforms paths (`value`) that are provided
relative to the present working directory into paths **relative to the config
relative to the current working directory into paths **relative to the config
file location**. They are required in the latter form for the config file.
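
The path transformation described above can be sketched in Python roughly as follows. This is only an illustration, not DVC's actual implementation: the helper name `to_config_relative` and the example paths are hypothetical, and leaving absolute paths unchanged is an assumption of the sketch.

```python
import os


def to_config_relative(value, cwd, config_dir):
    """Rewrite a path given relative to `cwd` so it is relative to `config_dir`.

    Hypothetical helper for illustration only. Absolute paths are returned
    unchanged (an assumption of this sketch).
    """
    if os.path.isabs(value):
        return value
    # Resolve the value against the directory the user typed it in...
    absolute = os.path.normpath(os.path.join(cwd, value))
    # ...then express it relative to the directory holding the config file.
    return os.path.relpath(absolute, config_dir)


# A cache location typed from the project root, re-expressed relative to
# the `.dvc` directory where the config file lives.
print(to_config_relative("../shared-cache",
                         "/home/user/project",
                         "/home/user/project/.dvc"))
```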

## Options
@@ -29,12 +29,11 @@ file location**. They are required in the latter form for the config file.
- `--system` - modify a system config file (e.g. `/etc/dvc.config`) instead of
`.dvc/config`.

- `--local` - modify a local
[config file](/doc/user-guide/dvc-files-and-directories) instead of
`.dvc/config`. It is located in `.dvc/config.local` and is Git-ignored. This
is useful when you need to specify private config options in your config that
you don't want to track and share through Git (credentials, private locations,
etc).
- `--local` - modify a local [config file](/doc/commands-reference/config)
instead of `.dvc/config`. It is located in `.dvc/config.local` and is
Git-ignored. This is useful when you need to specify private config options in
your config that you don't want to track and share through Git (credentials,
private locations, etc).

- `-u`, `--unset` - remove the `cache.dir` config option from the config file.
Don't provide a `value` when using this flag.
2 changes: 1 addition & 1 deletion static/docs/commands-reference/checkout.md
@@ -179,7 +179,7 @@ MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
```

What if we want to rewind history, so to speak? The `git checkout` command lets
us checkout at any point in the commit history, or even check out other tags. It
us checkout at any point in the commit history, or even checkout other tags. It
automatically adjusts the files, by replacing file content and adding or
deleting files as necessary.

12 changes: 6 additions & 6 deletions static/docs/commands-reference/commit.md
@@ -55,12 +55,12 @@ to the DVC cache as the last step. What _commit_ means is that DVC:
- Adds the file or directory to the DVC cache.

There are many cases where the last step is not desirable (usually, rapid
iteration on some experiment). For the DVC commands where it is appropriate the
`--no-commit` option prevents the last step from occurring - thus, we are saving
some time and space, by not storing all the data artifacts for all the attempts
we do. The checksum is still computed and added to the DVC-file, but the file is
not added to the cache. That's where the `dvc commit` command comes into play.
It handles that last step of adding the file to the DVC cache.
iteration on some experiment). For the DVC commands where available, the
`--no-commit` option prevents the last step from occurring, thus we are saving
time and space by not storing all the <abbr>data artifacts</abbr> for every
command attempt. The checksum is still computed and added to the DVC-file, but
the file is not added to the cache. That's where the `dvc commit` command comes
into play. It handles that last step of adding the file to the DVC cache.
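
To make the "checksum is computed but nothing is cached" step more concrete, here is a minimal Python sketch of hashing a file in chunks. It assumes an MD5 digest over the raw file bytes; the function name `file_md5` is hypothetical, and DVC's own hashing may differ in details.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Illustrative only: assumes the digest is taken over the raw bytes.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Computing such a digest only reads the file, which is why it can be recorded without copying or linking the data into the cache.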

## Options

10 changes: 5 additions & 5 deletions static/docs/commands-reference/config.md
@@ -19,7 +19,7 @@ You can query/set/replace/unset DVC configuration options with this command. It
takes a config option `name` (a section and a key, separated by a dot) and its
`value` (any valid alpha-numeric string generally).

This command reads and overwrites the DVC config file `.dvc/config`. If
This command reads and overwrites the DVC configuration file `.dvc/config`. If
`--local` option is specified, `.dvc/config.local` is modified instead.

If the config option `value` is not provided and `--unset` option is not used,
@@ -95,16 +95,16 @@ details.)
config location results in `.dvc/cache`.

> See also helper command `dvc cache dir` to intuitively set this config
> option, properly transforming paths relative to the present working
> option, properly transforming paths relative to the current working
> directory into paths relative to the config file location.

- `cache.protected` - makes files in the workspace read-only. Possible values
are `true` or `false` (default). Run `dvc checkout` for the change to go into
effect. (It affects only files that are under DVC control.)

Due to the way DVC handles linking between the data files in the cache and
their counterparts in the working directory, it's easy to accidentally corrupt
the cached version of a file by editing or overwriting it. Turning this config
their counterparts in the workspace, it's easy to accidentally corrupt the
cached version of a file by editing or overwriting it. Turning this config
option on forces you to run `dvc unprotect` before updating a file, providing
an additional layer of security to your data.

@@ -158,7 +158,7 @@ details.)

### state

State config options. Check the
State config options. See
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to learn
more about the state file that is used for optimization.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/diff.md
@@ -37,7 +37,7 @@ by the Git SCM, for example when `dvc init` was used with the `--no-scm` option.

- `-t TARGET`, `--target TARGET` - Source path to a data file or directory. If
not specified, compares all files and directories that are under DVC control
in the current workspace.
in the workspace.

- `-h`, `--help` - prints the usage/help message, and exit.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/gc.md
@@ -69,7 +69,7 @@ $ du -sh .dvc/cache/
```

When you run `dvc gc` it removes all objects from cache that are not referenced
in the current workspace (by collecting hash sums from the DVC-files):
in the workspace (by collecting hash sums from the DVC-files):

```dvc
$ dvc gc
158 changes: 158 additions & 0 deletions static/docs/commands-reference/get-url.md
@@ -0,0 +1,158 @@
# get-url

Download or copy a file or directory from any supported URL (for example `s3://`,
`ssh://`, and other protocols) or local directory to the local file system.

> Unlike `dvc import-url`, this command does not track the downloaded data
> file(s) (does not create a DVC-file).

## Synopsis

```usage
usage: dvc get-url [-h] [-q | -v] url [out]

positional arguments:
url (See supported URLs in the description.)
out Destination path to put data to.
```

## Description

In some cases it's convenient to get a data file or directory from a remote
location into the current working directory, regardless of whether it's a DVC
project. The `dvc get-url` command helps the user do just that.

The `url` argument should provide the location of the data to be downloaded,
while `out` can be used to specify the (path and) file name desired for the
downloaded data file or directory.
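
When `out` is omitted, the downloaded data keeps its original name. A rough Python sketch of that default is shown below. It assumes the name is simply the last path component of `url`; the helper `default_out_name` is hypothetical and not part of DVC.

```python
import posixpath
from urllib.parse import urlparse


def default_out_name(url):
    """Guess the name a download gets when no `out` argument is given.

    Assumption for illustration: the last path component of the URL.
    """
    path = urlparse(url).path
    # Ignore a trailing slash so directory URLs still yield a name.
    return posixpath.basename(path.rstrip("/"))


for example in ("s3://mybucket/data.csv",
                "https://example.com/path/to/data.csv",
                "/path/to/local/file"):
    print(example, "->", default_out_name(example))
```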

Note that this command doesn't require an existing DVC project to run in. It's a
single-purpose command that can be used out of the box after installing DVC.

> See `dvc get` to download data or model files or directories from other DVC
> repositories (e.g. GitHub URLs).

DVC supports several types of (local or) remote locations (protocols):

| Type | Discussion | URL format |
| ------- | ------------------------------------------------------- | ------------------------------------------ |
| `local` | Local path | `/path/to/local/file` |
| `s3` | Amazon S3 | `s3://mybucket/data.csv` |
| `gs` | Google Storage | `gs://mybucket/data.csv` |
| `ssh` | SSH server | `ssh://[email protected]:/path/to/data.csv` |
| `hdfs` | HDFS | `hdfs://[email protected]/path/to/data.csv` |
| `http` | HTTP to file with _strong ETag_ (see explanation below) | `https://example.com/path/to/data.csv` |
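
The mapping in the table above can be illustrated with a short Python sketch that derives the location type from a URL's scheme. The names `SCHEME_TO_TYPE` and `location_type` are hypothetical, and this is not how DVC itself resolves URLs. Treating anything without a scheme as a local path is an assumption of the sketch.

```python
from urllib.parse import urlparse

# Scheme -> type, following the table above (HTTPS is grouped with HTTP).
SCHEME_TO_TYPE = {
    "s3": "s3",
    "gs": "gs",
    "ssh": "ssh",
    "hdfs": "hdfs",
    "http": "http",
    "https": "http",
}


def location_type(url):
    """Return the table's type name for a URL, or raise ValueError."""
    scheme = urlparse(url).scheme
    if not scheme:
        # No scheme at all: assume a plain local path.
        return "local"
    try:
        return SCHEME_TO_TYPE[scheme]
    except KeyError:
        raise ValueError("unsupported URL scheme: " + scheme) from None
```

For instance, `location_type("gs://mybucket/data.csv")` gives `"gs"`, while a scheme missing from the table raises `ValueError`.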

> Depending on the type of remote location you plan to download data from, you
> might need to specify one of the optional dependencies: `[s3]`, `[ssh]`,
> `[gs]`, `[azure]`, and `[oss]` (or `[all]` to include them all) when
> [installing DVC](/doc/get-started/install) with `pip`.

Another way to understand the `dvc get-url` command is as a tool for downloading
data files.

On GNU/Linux systems, for example, `wget` can be used instead of `dvc get-url`
for HTTP(S) downloads:

```dvc
$ wget https://example.com/path/to/data.csv
```

## Options

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

<details>

### Click and expand for a local example

```dvc
$ dvc get-url /local/path/to/data
```

The above command will copy the `/local/path/to/data` file or directory into the
current working directory, keeping its name.

</details>

<details>

### Click for AWS S3 example

This command will copy an S3 object into the current working directory with the
same file name:

```dvc
$ dvc get-url s3://bucket/path
```

By default, DVC expects your AWS CLI to be already
[configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).
DVC will use the default AWS credentials file to access S3. To override some of
these settings, you could use the options described in `dvc remote modify`.

> We use the `boto3` library to communicate with AWS S3. The following API
> methods may be performed:
>
> - `head_object`
> - `download_file`
>
> So make sure you have the `s3:GetObject` permission enabled.
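
As a rough illustration of the two API methods listed above, the following Python sketch checks that an object exists and then downloads it. It is not DVC's code: the function name `fetch_s3_object` is hypothetical, and the `client` argument is assumed to behave like the object returned by `boto3.client("s3")`.

```python
def fetch_s3_object(client, bucket, key, dest):
    """Download `key` from `bucket` to the local path `dest`.

    `client` is assumed to behave like `boto3.client("s3")`.
    Returns the object's size in bytes as reported by S3.
    """
    # HEAD request: fails early if the object is missing or not readable
    # (this needs the `s3:GetObject` permission).
    meta = client.head_object(Bucket=bucket, Key=key)
    # Stream the object to the local file system.
    client.download_file(bucket, key, dest)
    return meta.get("ContentLength")


# Possible usage, assuming boto3 is installed and AWS credentials are set up:
#   import boto3
#   fetch_s3_object(boto3.client("s3"), "bucket", "path", "path")
```

Passing the client in as an argument keeps the sketch independent of how credentials are configured.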

</details>

<details>

### Click for Google Cloud Storage example

```dvc
$ dvc get-url gs://bucket/path file
```

The above command downloads the `/path` file (or directory) into `./file`.

</details>

<details>

### Click for SSH example

```dvc
$ dvc get-url ssh://[email protected]/path/to/data
```

Using default SSH credentials, the above command gets the `data` file (or
directory).

</details>

<details>

### Click for HDFS example

```dvc
$ dvc get-url hdfs://[email protected]/path/to/data
```

</details>

<details>

### Click for HTTP example

> Both HTTP and HTTPS protocols are supported.

```dvc
$ dvc get-url https://example.com/path/to/data
```

</details>

<details>
47 changes: 47 additions & 0 deletions static/docs/commands-reference/get.md
@@ -0,0 +1,47 @@
# get

Download or copy a file or directory from another DVC repository (on a Git server
such as GitHub) into the local file system.

> Unlike `dvc import`, this command does not track the downloaded data file(s)
> (does not create a DVC-file).

## Synopsis

```usage
usage: dvc get [-h] [-q | -v] [-o [OUT]] [--rev [REV]] url path

positional arguments:
url URL of Git repository with DVC project to download from.
path Path to data within DVC repository.
```

## Description

DVC provides an easy way to download datasets, intermediate results, ML models,
or other files and directories tracked in another DVC repository into the
current working directory, regardless of whether it's a DVC project. The
`dvc get` command downloads such a <abbr>data artifact</abbr>.

The `url` argument specifies the external DVC project's Git repository URL (both
HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while
`path` is used to specify the path to the data to be downloaded within the repo.

Note that this command doesn't require an existing DVC project to run in. It's a
single-purpose command that can be used out of the box after installing DVC.

> See `dvc get-url` to download data from other supported URLs.

After running this command successfully, the data found in the `url` `path` is
created in the current working directory with its original file name.

## Options

- `-h`, `--help` - prints the usage/help message, and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

<!-- ## Example -->