Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: update zip crate to 2.0 #385

Merged
merged 9 commits into from
Jul 11, 2024
Merged

MRG: update zip crate to 2.0 #385

merged 9 commits into from
Jul 11, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Jul 10, 2024

Required changes for zip crate:

  • update zip from 0.6 to 2.0 (NOT 2.3.1, as 2.1+ have a bug for large files)
  • change FileOptions to reflect altered structure in 2.x

Simplifications:

  • simplify zip writing to remove Arc and ZipMessage

previously, we needed to keep track of additional information for all signatures in order to be able to write the manifest. Upon switching to sourmash core's manifest writing (#217), we no longer need to keep track of all signature information and remove this overly complicated implementation.

@bluegenes
Copy link
Contributor Author

bluegenes commented Jul 11, 2024

Zip writing worked, but zip writing is not happening correctly. Decided to try simplifying now that we have manifest writing in sourmash core, and don't need to pass extra manifest info in separately.

  • simplified zip writing to remove Arc and ZipMessage

...still not writing zips properly. Might be a file option issue? Same error after simplification:

Checking zip outside of python (to avoid questions of sourmash compatibility):

sourmash scripts manysketch -p dna,k=21,k=31,k=51,scaled=5,abund -o test.zip ms.csv
unzip -t test.zip                                                      ✔  10176  07:33:28
Archive:  test.zip
    testing: signatures/87ed81f6b2f01f1eb25a2877bf6ef752.sig.gz   bad CRC 79625916  (should be ea733a03)
    testing: signatures/6c4a69002b80091ca1ebe93f22911158.sig.gz   bad CRC fc04f52a  (should be 65936b04)
    testing: signatures/d66d11f8dd1d232598dded50cbc5bf8b.sig.gz   bad CRC e54c1637  (should be d0cbb876)
    testing: signatures/92d3dbfcb5a696ff43130e2f38ec5fce.sig.gz   bad CRC 45e2a402  (should be 643a2877)
    testing: signatures/855a38e5af50ac09abdc518d0e688202.sig.gz   bad CRC de42d26a  (should be 23c99bb1)
    testing: signatures/ff1c85607642ca4700b9b65f948ddcc2.sig.gz   bad CRC f7e8b022  (should be 9b6c8d2d)
    testing: signatures/9f6925238ccb6cdc0ffdf3ee822ba57f.sig.gz   bad CRC eba90895  (should be 2bb445fd)
    testing: signatures/1652b7cfff15bd62fc07fa836ead49c0.sig.gz   bad CRC dc420b3e  (should be 1f89d45d)
    testing: signatures/71525a0a715d08751806c90d35f862b8.sig.gz   bad CRC 9af4e593  (should be 02e8184a)
    testing: SOURMASH-MANIFEST.csv    bad CRC a74e984f  (should be ea13249c)
At least one error was detected in test.zip.

Signatures look good upon viewing with zless, but can't be loaded directly from files after unzip -t test.zip (e.g. signatures/87ed81f6b2f01f1eb25a2877bf6ef752.sig.gz file). However, they can be loaded if viewed with zless and then copied into a separate file.

because of this, maybe something is going wrong with compression. Perhaps a FileOptions issue.

Manifest looks ok at first, then has some extra data that is not right. Also seems like a compression issue ...

internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/87ed81f6b2f01f1eb25a2877bf6ef752.sig.gz,87ed81f6b2f01f1eb25a2877bf6ef752,87ed81f6,21,DNA,0,5,198,1,short3,short3.fa
signatures/d66d11f8dd1d232598dded50cbc5bf8b.sig.gz,d66d11f8dd1d232598dded50cbc5bf8b,d66d11f8,31,DNA,0,5,186,1,short3,short3.fa
signatures/6c4a69002b80091ca1ebe93f22911158.sig.gz,6c4a69002b80091ca1ebe93f22911158,6c4a6900,51,DNA,0,5,179,1,short3,short3.fa
signatures/9f6925238ccb6cdc0ffdf3ee822ba57f.sig.gz,9f6925238ccb6cdc0ffdf3ee822ba57f,9f692523,21,DNA,0,5,204,1,short2,short2.fa
signatures/71525a0a715d08751806c90d35f862b8.sig.gz,71525a0a715d08751806c90d35f862b8,71525a0a,31,DNA,0,5,191,1,short2,short2.fa
signatures/1652b7cfff15bd62fc07fa836ead49c0.sig.gz,1652b7cfff15bd62fc07fa836ead49c0,1652b7cf,51,DNA,0,5,190,1,short2,short2.fa
signatures/92d3dbfcb5a696ff43130e2f38ec5fce.sig.gz,92d3dbfcb5a696ff43130e2f38ec5fce,92d3dbfc,21,DNA,0,5,206,1,short,short.fa
signatures/ff1c85607642ca4700b9b65f948ddcc2.sig.gz,ff1c85607642ca4700b9b65f948ddcc2,ff1c8560,31,DNA,0,5,191,1,short,short.fa
signatures/855a38e5af50ac09abdc518d0e688202.sig.gz,855a38e5af50ac09abdc518d0e688202,855a38e5,51,DNA,0,5,201,1,short,short.fa
PK^A^B-^C-^@^@^@^@^@^@^@!^@^C:s<EA>^C   ^@^@^C  ^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81>^@^@^@^@signatures/87ed81f6b2f01f1eb25a2877bf6ef752.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^@^@!^@v<B8><CB><D0>^@^@^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81>g  ^@^@signatures/d66d11f8dd1d232598dded50cbc5bf8b.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^@^@!^@^Dk<93>e^@^@^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81>G^R^@^@signatures/6c4a69002b80091ca1ebe93f22911158.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^^@^@!^@<FD>E<B4>+D      ^@^@D   ^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81><DB>^Z^@^@signatures/9f6925238ccb6cdc0ffdf3ee822ba57f.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^@^@!^@J^X<E8>^B<AB>^H^@^@<AB>^H^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81><83>$^@^@signatures/71525a0a715d08751806c90d35f862b8.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^^@^@^@^@^@^@!^@]ԉ^_<A6>^H^@^@<A6>^H^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81><92>-^@^@signatures/1652b7cfff15bd62fc07fa836ead49c0.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^@^@!^@w(:dW      ^@^@W   ^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81><9C>6^@^@signatures/92d3dbfcb5a696ff43130e2f38ec5fce.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^^@^@^@^@!^@-<8D>l<9B><A8>^H^@^@<A8>^H^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81>W@^@^@signatures/ff1c85607642ca4700b9b65f948ddcc2.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^^@^@^@^@^@!^@<B1><9B><C9>#^_    ^@^@^_  ^@^@2^@^T^@^@^@^@^@^@^@^@^@<A4><81>cI^@^@signatures/855a38e5af50ac09abdc518d0e688202.sig.gz^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^A^B-^C-^@^@^@^@^@^@^@!^@<BD>/<FC><9D><F0>^D^@^@<F0>^D^@^@^U^@^T^@^@^@^@^@^@^@^@^@<A4><81><E6>R^@^@SOURMASH-MANIFEST.csv^A^@^P^@<FF><FF><FF><FF>^@^@^@^@<FF><FF><FF><FF>^@^@^@^@PK^E^F^@^@^@^@

@bluegenes
Copy link
Contributor Author

bluegenes commented Jul 11, 2024

Switchin Compression Method from "Stored" to "Deflated" allows us to sourmash sig describe the sig files directly after extracting from zip archive AND removes the extra lines from the manifest.

unzip -t test.zip                                                    ✔  10304  09:14:35
Archive:  test.zip
    testing: signatures/6c4a69002b80091ca1ebe93f22911158.sig.gz   OK
    testing: signatures/d66d11f8dd1d232598dded50cbc5bf8b.sig.gz   OK
    testing: signatures/87ed81f6b2f01f1eb25a2877bf6ef752.sig.gz   OK
    testing: signatures/855a38e5af50ac09abdc518d0e688202.sig.gz   OK
    testing: signatures/ff1c85607642ca4700b9b65f948ddcc2.sig.gz   OK
    testing: signatures/92d3dbfcb5a696ff43130e2f38ec5fce.sig.gz   OK
    testing: signatures/1652b7cfff15bd62fc07fa836ead49c0.sig.gz   OK
    testing: signatures/71525a0a715d08751806c90d35f862b8.sig.gz   OK
    testing: signatures/9f6925238ccb6cdc0ffdf3ee822ba57f.sig.gz   OK
    testing: SOURMASH-MANIFEST.csv    OK
No errors detected in compressed data of test.zip.

Still having the same error when looking at the zipfile with sourmash, though

sourmash sig describe test.zip                                       ✔  10289  09:11:19

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Traceback (most recent call last):
  File "/Users/ntward/miniforge3/envs/branchwater-dev/bin/sourmash", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/cli/sig/describe.py", line 61, in main
    return sourmash.sig.__main__.describe(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/sig/__main__.py", line 302, in describe
    for sig, location in loader:
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/sourmash_args.py", line 695, in load_many_signatures
    idx = load_file_as_index(loc, yield_all_files=yield_all_files)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/save_load.py", line 65, in load_file_as_index
    return _load_database(filename, yield_all_files)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/save_load.py", line 116, in _load_database
    db = load_fn(
         ^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/save_load.py", line 197, in _load_sbt
    db = load_sbt_index(filename, cache_size=cache_size)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/sbtmh.py", line 10, in load_sbt_index
    return SBT.load(
           ^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/sbt.py", line 841, in load
    storage = ZipStorage(location)
              ^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/sbt_storage.py", line 105, in __init__
    self._objptr = rustcall(lib.zipstorage_new, to_bytes(path), len(path))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ntward/miniforge3/envs/branchwater-dev/lib/python3.12/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Panic: sourmash panicked: thread 'unnamed' panicked with 'called `Result::unwrap()` on an `Err` value: InvalidArchive("Extra data field contains disk number")' at src/core/src/storage.rs:367

I tried compiling with luiz's rc-zip sourmash branch, (replaces piz for zip reading), but the error remained the same.

@bluegenes
Copy link
Contributor Author

bluegenes commented Jul 11, 2024

I can read the file just fine with python zipfile

looking at manifest:

import zipfile
zipf = zipfile.ZipFile('test.zip')
with zipf.open('SOURMASH-MANIFEST.csv') as sm:
    print(sm.read())

looking at a sigfile:

import gzip
with gzip.GzipFile(fileobj=zipf.open("signatures/6c4a69002b80091ca1ebe93f22911158.sig.gz", 'r'), mode='r') as f:
    print(f.read())

@bluegenes
Copy link
Contributor Author

bluegenes commented Jul 11, 2024

If i remove the large_file(True) from the zip options, sourmash can read the files again.

large file requires zip64 support,

pub const fn large_file(self, large: bool) -> Self
Set whether the new file’s compressed and uncompressed size is less than 4 GiB.
If set to false and the file exceeds the limit, an I/O error is thrown and the file is aborted. If set to true, readers will require ZIP64 support and if the file does not exceed the limit, 20 B are wasted. The default is false.
https://docs.rs/zip/2.1.3/zip/write/struct.FileOptions.html#method.large_file

rc_zip has a function for reading extra fields in zip64. This seems promising, given the sourmash error "Extra data field contains disk number"

image

@bluegenes
Copy link
Contributor Author

Ok, we're actually facing two separate issues here, only one of which is a real issue:

  1. When using Compression::Stored with zip v2.1.3 and large_file(True), we find the bug described here: Other programs report incorrect CRCs when the large_file(true) is used to write zip members zip-rs/zip2#195.

  2. I created a secondary issue when I switched to Compression::Deflate, as our current sourmash zip reading using piz (orrc_zip in a PR) does not seem to support the resulting file. Regular zip (non-large file) seems fine. This is ok -- we will just stick with Stored.

I've rolled back to zip v2.0 as was suggested in the issue. We should use that until the bug is fixed!

@bluegenes bluegenes changed the title WIP: try updating zip crate MRG: update zip crate to 2.0 Jul 11, 2024
@bluegenes
Copy link
Contributor Author

ok @ctb ready for review

@bluegenes bluegenes merged commit fb378be into main Jul 11, 2024
2 checks passed
@bluegenes bluegenes deleted the upd-zip branch July 11, 2024 18:51
@ctb
Copy link
Collaborator

ctb commented Jul 12, 2024

quick question - will tests fail properly if the zip 2.1+ bug is introduced in a version upgrade?

@bluegenes
Copy link
Contributor Author

quick question - will tests fail properly if the zip 2.1+ bug is introduced in a version upgrade?

Yep- tests were failing with the bug. I set dependabot to ignore zip for now though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants