diff --git a/content/authors/peter_rowlands.md b/content/authors/peter_rowlands.md new file mode 100644 index 0000000000..003b7fd040 --- /dev/null +++ b/content/authors/peter_rowlands.md @@ -0,0 +1,8 @@ +--- +name: Peter Rowlands +avatar: peter_rowlands.jpg +links: + - https://github.com/pmrowla +--- + +Software engineer at [http://dvc.org](http://dvc.org) diff --git a/content/blog/2020-06-29-optimizing-dvc-remotes.md b/content/blog/2020-06-29-optimizing-dvc-remotes.md new file mode 100644 index 0000000000..33a4a29218 --- /dev/null +++ b/content/blog/2020-06-29-optimizing-dvc-remotes.md @@ -0,0 +1,169 @@ +--- +title: Optimization Improvements in DVC 1.0 +date: 2020-06-29 +description: | + An overview of how data synchronization to and from remote storage is optimized in DVC 1.0. +author: peter_rowlands +--- + +One of the key features provided by DVC is the ability to efficiently sync +versioned datasets between a user's local machine and +[remote storage](https://dvc.org/doc/command-reference/remote), and version 1.0 +includes several performance optimizations related to synchronizing data with +remotes. This post provides an overview of some of the ways in which remote +performance has been optimized in DVC. + +In order to sync data between a local machine and remote storage, we must first +do the following (equivalent to the `dvc status` command with option `-c` +(`--cloud`)), before data can actually be uploaded or downloaded to or from the +remote: + +1. Determine which files are present locally (i.e. in the DVC cache). +2. Determine which files are present in the remote. +3. Determine the difference between the two sets of files. + +Commonly used cloud sync utilities, such as [rclone](https://rclone.org/), must +be generalized to support any file structure, which can come at the cost of +performance, especially when dealing with large datasets. In DVC, since we have +control over the structure of both our local cache and our remote storage, we +leverage this structure to optimize this "status" step in all remote storage +operations (i.e. `dvc status -c`, `dvc push`, `dvc pull`, `dvc fetch`). In DVC +version 1.0, these optimizations have significantly improved performance +compared to older versions of DVC and tools like rclone. + +![benchmarks](https://raw.githubusercontent.com/gist/pmrowla/338d9645bd05df966f8aba8366cab308/raw/37f64e28e8c2e963aff0d320f6e0cea0724342a5/remote-benchmarks.svg) +_Note: benchmark details can be found +[here](https://gist.github.com/pmrowla/338d9645bd05df966f8aba8366cab308#benchmarks), +rclone run using default configuration_ + +## DVC cache/remote structure + +Before continuing, it will be helpful for the reader to keep in mind a few key +points about the DVC cache and remote storage structure. + +``` +.dvc/cache +├── 00 +│ ├── 411460f7c92d2124a67ea0f4cb5f85 +│ ├── 6f52e9102a8d3be2fe5614f42ba989 +│ └── ... +├── 01 +├── 02 +├── 03 +├── ... +└── ff +``` + +_Example DVC cache_ + +- Files versioned by DVC are identified and stored according to their MD5 hash. + The cache is organized into 256 subdirectories (from `00` to `ff`), where each + subdirectory is the first 2 characters in an MD5 hash string. +- MD5 is an evenly distributed hash function, so the DVC cache will be evenly + distributed (i.e. given a large enough dataset, each cache subdirectory will + contain an approximately equal number of files) +- DVC remote storage structure mirrors the DVC local cache structure, so a DVC + remote will also be evenly distributed + +## Optimizing remote status queries + +From the previously mentioned data sync "status" steps, #2 ("Determine which +files are present in the remote") typically presents the largest bottleneck in +DVC, since we are limited by cloud storage APIs. In general, cloud storage APIs +provide two possible ways to determine what files are present in a remote: + +1. Directly query whether or not a specific file exists in the remote +2. Request the complete list of all files contained in the remote (and then + compare that list with the list of files we are searching for) + +_(For example, the S3 API provides the `HeadObject` and `ListObjects` methods, +respectively)_ + +When using the first method, performance depends on the number of files being +queried - for a single file, it would take a single API request, for 1 million +files, it would take 1 million API requests. With the second method, it is +important to note that cloud APIs will only return a certain number of files at +a time (the amount returned varies depending on the API), so performance depends +on the total number of files - for an API that returns 1000 files at a time +(such as S3), listing a remote containing 1000 files or less would would take a +single API request, listing a remote which contains 1 million files would take +1000 API requests. + +The end result is that as a general rule, when querying for a small number of +files, the first method will be faster. When querying for a large number of +files, the second method will be faster (with the added caveat that given a +sufficiently large remote, this no longer applies - if a remote contains +billions of files, it will likely be faster to check 1 million files +individually when compared to retrieving the full remote listing). It should +also be noted that cloud storage APIs generally do not provide a way to retrieve +the number of files stored in the remote, without fetching the entire list of +files. + +Choosing the ideal method from these two options will have a significant impact +on performance in a data sync operation. Generic data sync tools typically rely +on the user to provide some hint regarding their use case, since the sync tool +has no way of determining the size of the remote storage relative to the size of +the local dataset. In rclone, the `--no-traverse` option specifies which +behavior should be used. In older versions of DVC, we provided a (now +deprecated) `no_traverse` remote configuration option. However, it is possible +to automatically determine which method would be optimal, as long as the sync +tool can estimate the approximate total size of the remote. + +Consider a case where a dataset being synced contains 100 files. Checking if +each file exists in the remote individually would take 100 API calls. If we know +that the API returns 1000 files per page, we can say that if the remote contains +less than `1000 * 100 = 100,000` files, fetching the full remote listing will be +faster than checking each file individually, since it will take less than 100 +API calls to complete. If the remote contains more files than this `100,000` +threshold, it will be faster to check our 100 files individually. _(In terms of +real world performance, there are other considerations that apply, such as +different API calls taking different amounts of time to complete, and the amount +of time it takes to run list comparison operations.)_ + +In DVC 1.0, we take advantage of the DVC remote structure to estimate the remote +size, and automatically determine which method to use when running data sync +related commands. The over/under remote size threshold is only dependent on the +number of files being queried (i.e. the number of files in our DVC versioned +dataset). And as we established earlier, a DVC remote will be evenly +distributed. Therefore, if we know the number of files in a subset of the +remote, we can estimate the number of files contained in the entire remote. + +For example, if we know that the remote subdirectory `00/` contains 10 files, we +can estimate that the remote contains roughly `256 * 10 = 2,560` files in total. +So, by requesting a list of one subdirectory at a time (rather than the full +remote) via the cloud storage API, we can calculate a running estimate of the +total remote size. If the running estimated total size goes over the threshold +value, DVC will stop fetching the contains of the remote subdirectory, and +switch to querying each file in our dataset individually. If DVC reaches the end +of the subdirectory without the estimated size going over the threshold, it will +continue to fetch the full listing for the rest of the remote. + +## Indexing versioned directories + +A common use case in DVC is in which a user has a large directory versioned with +`dvc add`. After `dvc push`ing one version of the dataset, smaller portions of +that dataset will be updated, whether through adding some new files, or +modifying a subset of existing files. In older versions of DVC, running +`dvc status -c` with the new version of the dataset (or by extension trying to +push the updated dataset) could potentially be a very slow operation. Prior to +version 1.0, DVC would check for the existence of every file in a dataset +against the remote, regardless of whether or not a file had changed since the +last pushed version of the dataset. So for a directory containing 1 million +files, even if only 1 file was added or modified between versions, all 1 million +files would be checked when querying for the remote status. + +In DVC 1.0, we now keep an index of versioned directories which have been stored +in a remote, as well as the contents of those directories. Using the previous +example, when the user runs `dvc status -c` (or `dvc push`) in version 1.0, DVC +will now only have to check whether or not the one newly modified file exists in +the remote, since the previous version of the directory will have already been +indexed. The same will still apply regardless of the number of modified files - +in DVC 1.0 for versioned directories, DVC will only need to check for the +modified files in a remote. And by reducing the number of files which have to be +queried for a given data sync operation, the amount of time it takes to run the +steps outlined in the previous section of this post will be signficantly +reduced. + +_Note that this optimization only applies to DVC versioned directories. +Individually versioned files (including those added with `dvc add -R`) are not +indexed in DVC 1.0)_ diff --git a/static/uploads/avatars/peter_rowlands.jpg b/static/uploads/avatars/peter_rowlands.jpg new file mode 100644 index 0000000000..2e0771c577 Binary files /dev/null and b/static/uploads/avatars/peter_rowlands.jpg differ