Deal with duplicated MD5 in storages #994
Codecov Report
@@ Coverage Diff @@
## master #994 +/- ##
==========================================
- Coverage 92.23% 92.23% -0.01%
==========================================
Files 72 72
Lines 5368 5393 +25
==========================================
+ Hits 4951 4974 +23
- Misses 417 419 +2
I implemented the
Fixed
Ready for review and merge @ctb. Possibly release this as 3.3.1 too, since it is annoying when it happens (it triggers after the SBT is built, during saving)?
#648 unveiled a pre-existing bug: since `index` saves data to a location based on `md5sum`, what should happen with files that have the same MD5 but different content?

This PR adds two files that match this case: both have empty `k=21` scaled minhash sketches, and so have the same MD5 (despite coming from different datasets).

Note that
- `FSStorage` didn't check for duplicates, so it just overwrites the previous entry. This actually turns the tree into a DAG =]
- `ZipStorage` triggers an error if entries are duplicated.

Solutions?
- Change `md5sum` to take into account more data (for now it hashes `ksize` + each individual `hash`). This breaks all previous signatures.
- `Storage.save` takes a `path` argument and returns the `path` where the content was actually written. For `FSStorage` and `ZipStorage` it's the same, but for `IPFSStorage` it returns the IPFS hash, so... we can just change the path for the other storages too (MD5 + "_1", and so on), and this avoids creating duplicated entries. Since `Storage.load` takes what is returned from `Storage.save` when reading from a `Storage`, this is backwards-compatible, and is also correct (since it avoids the Tree-to-DAG problem).

Checklist
- `make test` Did it pass the tests?
- `make coverage` Is the new code covered?
- without a major version increment. Changing file formats also requires a major version number increment.
- changes were made?
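
For reviewers, here is a minimal sketch of the second solution from the description: when an MD5-based path is already taken by different content, `save` picks a suffixed path (MD5 + "_1", and so on) and returns it, so that `load` can use whatever `save` returned. `sketch_md5` and `InMemoryStorage` are hypothetical names used only for illustration; they are not the classes touched by this PR, and the hashing is assumed to follow the "`ksize` + each individual `hash`" description above.

```python
import hashlib


def sketch_md5(ksize, hashes):
    """Assumed md5sum scheme: hash ksize, then each individual hash value."""
    m = hashlib.md5()
    m.update(str(ksize).encode("ascii"))
    for h in hashes:
        m.update(str(h).encode("ascii"))
    return m.hexdigest()


class InMemoryStorage:
    """Hypothetical stand-in for a storage backend (a dict instead of files)."""

    def __init__(self):
        self.entries = {}

    def save(self, path, content):
        """Store content and return the path where it was actually written."""
        candidate = path
        n = 0
        while candidate in self.entries:
            if self.entries[candidate] == content:
                # Same path, same content: reuse the existing entry.
                return candidate
            # Same path, different content: try path_1, path_2, ...
            n += 1
            candidate = f"{path}_{n}"
        self.entries[candidate] = content
        return candidate

    def load(self, path):
        """Load using the path that save() returned."""
        return self.entries[path]


# Two empty k=21 sketches from different datasets share one MD5.
md5_a = sketch_md5(21, [])
md5_b = sketch_md5(21, [])

storage = InMemoryStorage()
path_a = storage.save(md5_a, b"signature from dataset A")
path_b = storage.save(md5_b, b"signature from dataset B")
```

Here `path_a` is the plain MD5 and `path_b` is the MD5 with `_1` appended, so neither entry overwrites the other and each node keeps pointing at its own data.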