[QUESTION & DOCS]: Comparison with DVC #168
Comments
Hi Selva,
Hey @vanangamudi, thanks for waiting on a reply here! I'll expand a bit upon @hhsecond's excellent summary above.
Executive Summary
At first glance, Hangar, DVC, and DAT might appear to solve similar(ish) problems: versioning, making use of, and distributing/collaborating on "data". However, the implementation/design and world-view of each tool are drastically different, which in turn drastically impacts end-user workflows and performance.
Direct Comparisons
Hangar vs DVC
Philosophy
The simplest way to understand why/how Hangar and DVC differ might be:
A really important point in the above statements is the difference between what Hangar and DVC consider "data"
This massively affects every aspect of usage, performance, and scaling abilities (as explained below).
Workflow
Hangar
Because Hangar thinks of data only as numeric arrays, there is no need for Hangar to consider domain-specific storage conventions or formats! With a small set of schemas and technologies, As a Hangar user, all you do is say:
>>> # write data
>>> checkout.arrayset['some_sample'] = np.array([0, 1, 2]) # some data
>>> # read data
>>> res = checkout.arrayset['some_sample']
>>> res
array([0, 1, 2])
In a Hangar workflow, there is
Most importantly in Hangar: the numeric data you put in is exactly the numeric data you get out. While explaining how data is indexed/stored in the
DVC
At its core, DVC is dependent on the Essentially, all DVC does is create a snapshot of some set of files (which the user marks as being "data" files, identified by either a filename suffix or via manually adding the file path to DVC). Because DVC operates in a In the DVC model, regardless of how the needs / processing pipeline / usage of some piece of data changes in the future, if you want to see data from some previous point in time, you get files written for the processes that existed at that point in time. In DVC, you must always retain:
This is a fundamental limitation of DVC because Git was written to handle text files representing pieces of code. Thinking of "data" and "text" as analogous entities is a fallacy disproved by the following argument:
What you really want is the data itself, the directly computable set of numbers representing some piece of information in the real world, not the container in which it is stored (i.e. what you want is a system like Hangar).
Performance
Hangar
Hangar's backend storage methods are highly optimized. Storing numeric data is our specialty, and the team has spent countless hours (and relied on many years of experience) writing backends which are highly performant for reads while balancing compression, shuffling, integrity checking, and multi-threading/processing considerations. Performance is a main consideration, and much work has gone into making sure that Hangar has some of the highest-performance reads and compression levels around. I would suggest trying this on a sample set of data you deal with in the real world.
Further, most Hangar book-keeping operations (checkout, commit, diff, merge, fetch, clone, etc.) do not actually require reading the numeric data files (which can be very large) from disk in order to execute. The vast majority of operations occur only on book-keeping "metadata" (very small structures, ~40 bytes each, which describe a commit / sample / data location). Combined with highly performant algorithms (similar to those used in the
Disk space is also further preserved by automatically deduplicating data. If you add some sample which has identical contents to any sample existing in the entirety of the
DVC
DVC stores what it is given. Read speed and compression ratios are only as good as the files added to it. Without dedicated engineering efforts this commonly results in sub-par usability and increased costs through disk usage and CPU requirements during the reading / decoding phases. Also, for many operations DVC scales with
Direct Comparisons
Feature Set
Hangar vs DAT
Many of the same points in relation to performance / workflow are analogous in the comparison of
More Info
For further reading on the details above, I would encourage you to read up on the following section of the Hangar ReadTheDocs Site:
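To make the deduplication point from the Performance section a bit more concrete, here is a minimal sketch of content-addressed storage. It is only an illustration of the general idea and is not Hangar's actual implementation: the `DedupStore` class, its method names, the choice of BLAKE2b, and the use of the standard-library `array` module as a stand-in for NumPy arrays are all assumptions made for this example.

```python
import hashlib
from array import array


class DedupStore:
    """Toy content-addressed store (hypothetical, not Hangar's API)."""

    def __init__(self):
        self.blobs = {}    # digest -> raw bytes (the potentially large data)
        self.samples = {}  # sample name -> digest (small book-keeping record)

    def put(self, name, values):
        # Serialize the numbers to bytes; typecode "d" means 64-bit float.
        payload = array("d", values).tobytes()
        digest = hashlib.blake2b(payload, digest_size=20).hexdigest()
        # Store the payload only if this exact content was never seen before.
        self.blobs.setdefault(digest, payload)
        self.samples[name] = digest
        return digest

    def get(self, name):
        # Look up the digest for this sample, then decode the stored bytes.
        out = array("d")
        out.frombytes(self.blobs[self.samples[name]])
        return list(out)


store = DedupStore()
store.put("sample_a", [0.0, 1.0, 2.0])
store.put("sample_b", [0.0, 1.0, 2.0])  # same contents, different name
store.put("sample_c", [3.0, 4.0])

print(len(store.samples), "samples,", len(store.blobs), "stored payloads")
# prints: 3 samples, 2 stored payloads
```

Note that the per-sample record is just a short digest, so in a design like this, book-keeping operations can work on those small records without touching the stored payloads.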
Also, since I never addressed your comment on DVC
Hangar is a much more focused project than DVC. Rather than trying to handle execution, results tracking, and pipelining of specific workflows (ML graphs / training) in the same tool which is responsible for versioning and accessing your data, we limit our scope to putting/retrieving data on/from disk, versioning it, and enabling distribution/collaboration. I liken adding pipeline/run features directly to the
solution
While there isn't any built-in support for "metrics" like DVC, the general nature of
>>> co.arraysets['metrics']['AUC'] = np.array([2.414])
>>> co.arraysets['metrics']['ROC'] = np.array([0.4522])
>>> # continue as needed
>>> co.arraysets['metrics']['AUC']
array([2.414])
>>> co.metadata['model1-AUC'] = str(2.414)
>>> res = co.metadata['model1-AUC']
>>> res
'2.414'
>>> float(res)
2.414
Other Questions?
Hope this helps. Let me know if you have any questions, comments, rebuttals, or concerns!
-Rick
Hi @rlizzo, thanks for the comprehensive explanation.
@rlizzo @hhsecond I've been looking for a version control system that can handle image pixels, and am really impressed by the comparison graphs you've shown against DVC. Thanks for all the effort the team put into building this tool from the ground up! Unfortunately, it looks like this project has gone stale, since the last commit was Sept 2, 2020. What happened in that regard? Is there anything that could be done to revive the project?
Executive Summary
How does this compare with DVC?
Additional Context / Explanation
I am not trying to start a flame war, but we spent quite a lot of time investigating DVC[1] for our purposes, and one of my friends suggested taking a look at Hangar. One key thing we really like about DVC is the metrics feature. I read through the Hangar docs; it looks quite different from DVC but rather similar to the dat project[2]. I may be wrong. I need some help understanding the difference.
External Links
[1] https://github.com/iterative/dvc
[2] https://datproject.org/