Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: write full gather result from fastgather and non-rocksdb fastmultigather #298

Merged
merged 20 commits into from
May 10, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Apr 2, 2024

This PR adds utilities for building full gather results file for fastgather and non-rocksdb fastmultigather, and makes full output default.

Benchmarking

software/version command details time max RAM
branchwater v0.9.3 fastgather minimal result 1m 47s 14 GB
branchwater v0.9.3-dev fastgather full result 1m 57s 14 GB
branchwater v0.9.3 fastmultigather minimal result 8m 3s 25 GB
branchwater v0.9.3-dev fastmultigather full result 8m 9s 25 GB
branchwater v0.9.3 fastmultigather rocksdb full result 24s 600 MB

progress/separate PRs:

@bluegenes
Copy link
Contributor Author

bluegenes commented Apr 19, 2024

benchmarking with this branch:

session:

srun -p bmh -J bench -t 24:00:00  --nodes=1 -c 64 --mem=50G --pty bash

location: /home/ntpierce/SRR606249-feb13-2024
query: SRR606249.trim.k31.sig.zip
db: /group/ctbrowngrp/sourmash-db/gtdb-rs214/gtdb-rs214-k31.zip

hackmd version: https://hackmd.io/IgakZLceRyeTCXuCgH0iDA?edit

software/version command details time max RAM
branchwater v0.9.3 fastgather minimal result 1m 47s 14 GB
branchwater v0.9.3 fastgather full result 1m 57s 14 GB
branchwater v0.9.3 fastmultigather minimal result 8m 3s 25 GB
branchwater v0.9.3 fastmultigather full result 8m 9s 25 GB
branchwater v0.9.3 fastmultigather rocksdb full result 24s 600 MB

I ran non-rocksdb fastgather and fastmultigather a couple times each -- sometimes the full result was faster than the minimal result, depending on multithreading efficiency (Percent of CPU). I kept the faster times here.

Just looking at the number of lines in the gather result:

     85 SRR606249.gather.csv
     85 SRR606249.x.gtdb-rs214.gather.csv
     85 srr.fg.csv
     85 srr.fg.full.csv
     79 srr.fmg-rdb.csv

... so rocksdb fastmultigather produces 79 result lines, while the rest had 85 🤔

@bluegenes bluegenes changed the title EXP: explore outputting full gather result from fastgather/fastmultigather MRG: explore outputting full gather result from fastgather/fastmultigather Apr 20, 2024
@bluegenes
Copy link
Contributor Author

@ctb ready for review.

Note: When I did column-by-column comparisons with python-based sourmash gather, the remaining_bp entries do not match. Can you see anything I'm doing wrong? I think the calculation here may actually be correct?

@bluegenes bluegenes changed the title MRG: explore outputting full gather result from fastgather/fastmultigather MRG: optionally write full gather result from fastgather and non-rocksdb fastmultigather Apr 20, 2024
@ctb
Copy link
Collaborator

ctb commented May 1, 2024

A few notes and questions -

  • if I infer correctly, the results of sourmash gather are in SRR606249.x.gtdb-rs214.gather.csv, yes?
  • this PR sets the output columns to be {'match_name', 'match_filename', 'match_md5'} rather than {'name', 'filename', 'md5'}, right?

@ctb
Copy link
Collaborator

ctb commented May 1, 2024

when I look at the differences in the matching names between srr.fg.full.csv and SRR606249.x.gtdb-rs214.gather.csv, I see:

Screenshot 2024-05-01 at 4 43 04 PM

To me this looks like it's simply the difference between different implementations of the underlying gather algorithm: there are different matches, but they're all the same species, so I bet there is a tie at some point and sourmash gather decides differently from fastgather.

See https://github.com/ctb/2024-debug-gather-difference for notebook for exploring the differences.

@ctb
Copy link
Collaborator

ctb commented May 1, 2024

I think the column name update is good, per sourmash-bio/sourmash#1555, since we want to use the prefetch-like column names. Just, y'know, we should put them in the PR description and also document it.

I'll look at the set of column names here and see if there's something else we should change here in prep for sourmash v5.

@ctb
Copy link
Collaborator

ctb commented May 2, 2024

Just looking at the number of lines in the gather result:

     85 SRR606249.gather.csv
     85 SRR606249.x.gtdb-rs214.gather.csv
     85 srr.fg.csv
     85 srr.fg.full.csv
     79 srr.fmg-rdb.csv

... so rocksdb fastmultigather produces 79 result lines, while the rest had 85 🤔

Actually, I'm finding (on a stripped-down subset of 90 sketches that contains the union of all matches across the various CSVs) that my newly generated srr.fmg.csv only has 79 matches. You missed including it in the table above, too, but your CSV srr.fmg.csv also only has 79 matches.

That suggests that maybe fastmultigather is the problem, not rocksdb.

@ctb
Copy link
Collaborator

ctb commented May 2, 2024

I see the same 79-line output from fastmultigather against my combined-matches-k31.sig.zip whether I use rocksdb or the zipfile, so I have a small, reproducible example of the differences. 🎉 . Will commence the digging!

@ctb
Copy link
Collaborator

ctb commented May 2, 2024

That suggests that maybe fastmultigather is the problem, not rocksdb.

ok, I'm wrong, because of the way fastmultigather -o works. Digging in more.

@ctb
Copy link
Collaborator

ctb commented May 4, 2024

Debugged one mismatch here: #318 - this was causing f_unique_to_query to be incorrectly calculated in the Rust code.

Fixed by #319, which can be merged into this PR.

There's another set of mismatches occurring from the same error in the RocksDB code, which uses sourmash-rs core code to calculate f_unique_to_query - see sourmash-bio/sourmash#3137, and a fix in sourmash-bio/sourmash#3138.

And there's an additional discrepancy in the RocksDB-based fastmultigather, as noted above. I've zeroed in on one specific problem that shows up when looking at gather results (from Python, or fastgather, for fastmultigather against a sig list ;). In brief, the 12th match in RocksDB-fastmultigather is returning too small an overlap. I can spend more time debugging this in the future, but it suggests to me that something is rotten in RevIndex::gather.

Screenshot 2024-05-04 at 1 23 56 PM

@ctb
Copy link
Collaborator

ctb commented May 4, 2024

Random additional thought - is there any reason not to make --full-results the default, given that there seems to be no significant extra processing cost?

That change would also fix #254

@bluegenes
Copy link
Contributor Author

Random additional thought - is there any reason not to make --full-results the default, given that there seems to be no significant extra processing cost?

That change would also fix #254

Yes, we can do that. I set it up this way b/c you were originally concerned about slowdown :)

@ctb
Copy link
Collaborator

ctb commented May 9, 2024

Yes, we can do that. I set it up this way b/c you were originally concerned about slowdown :)

and I appreciate it, but I think you've done a great job of showing that that's not a concern!

NOTE: PR into #298

Removes `--full-results` and updates tests for switch to full results by default.
@ctb
Copy link
Collaborator

ctb commented May 10, 2024

some things to do before merge -

@bluegenes
Copy link
Contributor Author

some things to do before merge -

done! ready for re-review, etc

@bluegenes bluegenes changed the title MRG: optionally write full gather result from fastgather and non-rocksdb fastmultigather MRG: write full gather result from fastgather and non-rocksdb fastmultigather May 10, 2024
Copy link
Collaborator

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some separate validation scripts and it all looks great! Thank you!! It's a go for merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants