use new load_file_as_signatures function more broadly #1059
Codecov Report
@@ Coverage Diff @@
## master #1059 +/- ##
==========================================
+ Coverage 83.51% 92.51% +8.99%
==========================================
Files 99 74 -25
Lines 8885 5662 -3223
==========================================
- Hits 7420 5238 -2182
+ Misses 1465 424 -1041
ok, this looks ready for a full review. ...any takers? it's not really THAT many changes, apart from the new tests 😁
Let's try...
sourmash/sourmash_args.py (Outdated)
    sig_md5 = sig.md5sum()
    if sig_md5.startswith(select_md5.lower()):
        sl = [sig]
        break
for large collections (databases?), should we check if there is more than one sig starting with the same MD5?
yes, thanks for catching this!
I'm not sure what to do about this... for really large collections it could be quite slow. But then again, for large collections it's not clear you want to use this! So, yeah. I'll change the behavior and add a test.
done in 11e519a
I'm not sure what to do about this... for really large collections it could be quite slow. But then again, for large collections it's not clear you want to use this! So, yeah. I'll change the behavior and add a test.
Yeah... at this point there is no access to the databases anymore, where this info can be checked while loading or it's less expensive (at least for SBTs? I think it is for LCA too).
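A minimal sketch of the stricter behavior discussed in this thread, assuming the fix collects every signature whose md5sum starts with the selector and refuses to pick one when several match. `DemoSig` and `select_by_md5_prefix` are hypothetical names used for illustration only, not sourmash code; only the `md5sum()` and `startswith(select_md5.lower())` pattern comes from the snippet under review.

```python
import hashlib


class DemoSig:
    """Hypothetical stand-in for a signature; only md5sum() is modeled."""

    def __init__(self, name):
        self.name = name

    def md5sum(self):
        # Assumption: any stable hex digest is enough for this illustration.
        return hashlib.md5(self.name.encode("utf-8")).hexdigest()


def select_by_md5_prefix(sigs, select_md5):
    """Return the single signature whose md5sum starts with select_md5.

    Raises ValueError when no signature matches or when the prefix is
    ambiguous, instead of silently taking the first match.
    """
    prefix = select_md5.lower()
    # Scan the whole collection rather than breaking on the first hit.
    matches = [sig for sig in sigs if sig.md5sum().startswith(prefix)]
    if not matches:
        raise ValueError(f"no signature matches md5 prefix {select_md5!r}")
    if len(matches) > 1:
        raise ValueError(
            f"md5 prefix {select_md5!r} matches {len(matches)} signatures"
        )
    return matches[0]
```

The trade-off is the one raised above: checking for ambiguity means reading every signature, which can be slow on a large collection.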
Minor comment about MD5 selectors, but otherwise LGTM!
#1044 added a new function load_file_as_signatures that permitted sig split to load signatures from databases as well as signature files. This PR deploys that function a bit more widely, and adds selector functionality.
This PR permits many (most?) CLI functions to load signatures from databases without any special sauce.
So, for example, sourmash sig describe can now be run on SBTs and LCAs. It also addresses many generic loading issues, including stdin loading (#1049), loading from directories via subdir traversal, and the ability to provide lists of signature files in a text file as queries or database index builds.
Examples
This will update old.sig and add the newly computed signature(s) from input.fa --
This will concatenate all signatures under tests/test-data --
This will list all signatures in that database --
This will use the signature with that md5sum in that database as a query against all the signatures under that directory --
Testing thoughts
While the actual changes are relatively minor (no significant changes to current tests!) the implications are pretty big, so I'm planning on adding quite a few tests. We're also hitting on some untested behavior and/or grey areas in current behavior, so I'll add some lockdown tests to do my best to make sure that our current behavior for 3.x isn't too changed.
Affected issues
Fixes #1050 - load_signature now fails properly on non-existent file
Fixes #1048 - traverse-directory behavior now on by default in signature submodule commands
Fixes #1049 - stdin now accepted for signature input
Fixes #672 - compare now runs on SBTs/LCAs as well
Fixes #594 - sourmash sig extract now works on databases
Fixes #662 - added --from-file and --query-from-file in appropriate places
Closes #1039 - can convert b/t databases now, on command line
Fixes #978 - new function load_file_as_index now loads databases as Index subclasses.
Detailed list of new functionality
General functionality
Almost all commands that take a list of signatures can now load signatures from LCA and SBT databases. Commands that need a single query signature also support loading from databases, as well as signature files, and can usually use an --md5 selector to pick out the right signature.
Expansion of support for reading signatures from stdin.
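As a rough illustration of the subdirectory traversal mentioned in this PR, the sketch below expands a path into a list of signature files. `find_signature_files` is a hypothetical helper, and the `.sig` suffix filter is an assumption made for the example; it is not taken from the sourmash code.

```python
import os


def find_signature_files(path, suffix=".sig"):
    """Expand a path into signature file paths.

    A plain file is returned as-is; a directory is walked recursively and
    every file ending in `suffix` is collected. The suffix filter is an
    assumption for this sketch.
    """
    if not os.path.isdir(path):
        return [path]

    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith(suffix):
                found.append(os.path.join(root, name))
    # Sort so the result does not depend on filesystem ordering.
    return sorted(found)
```

Called on a directory such as tests/test-data, this would return every matching file beneath it, including those in nested subdirectories.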
Specific additions to CLI
--from-file to sourmash compare CLI
--md5 selector to sourmash gather and sourmash search to pick a query signature.
--from-file to sourmash index
--query-from-file to sourmash lca classify and sourmash lca summarize
--from-file to sourmash lca index
--query-from-file to sourmash multigather
--md5 selector to sourmash sig export to pick a single signature
--from-file to sourmash compare
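One way a --from-file option can be combined with positional signature arguments is sketched below using argparse. The parser, the `compare-demo` program name, and the `collect_inputs` helper are hypothetical; the assumption that the file holds one path per line follows the "lists of signature files in a text file" wording above, not the actual sourmash argparse setup.

```python
import argparse


def build_parser():
    """Build a small demo parser; this is not the sourmash CLI definition."""
    parser = argparse.ArgumentParser(prog="compare-demo")
    parser.add_argument(
        "signatures", nargs="*",
        help="signature files given directly on the command line",
    )
    parser.add_argument(
        "--from-file",
        help="text file listing additional signature files, one per line",
    )
    return parser


def collect_inputs(args):
    """Merge positional paths with the paths listed in --from-file."""
    paths = list(args.signatures)
    if args.from_file:
        with open(args.from_file) as fp:
            # Skip blank lines and strip surrounding whitespace.
            paths.extend(line.strip() for line in fp if line.strip())
    return paths
```

Making the positional argument optional (nargs="*") lets a command run with only --from-file, which is the kind of argparse adjustment the testing notes in this PR defer to 4.0.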
Specific additions to API
load_file_as_index is new top-level API for loading Index objects from files.
load_file_as_signatures is new top-level API for loading of collections of signatures, including from SBT dbs, LCA dbs, and directories.
Bug fixes
sourmash sig rename now correctly fails on files that it could not open (+ tested)
signature.load_signatures behavior where do_raise is now respected (+ added test)
TODO:
sourmash sig split and sourmash sig cat #1048, do we need to provide an explicit --traverse-directory flag? maybe just traverse by default?
sig rename that did NOT use load_signatures(..., do_raise=True)
load_one_signature
use load_file_as_signatures API #1062
sourmash_args.load_query_signature
load_query_signature
load_dbs_and_sigs
load_database function to replace load_sbt etc. ref provide a unified loading API for databases #978
--from-file to compare?
sourmash_args.load_file_as_signatures
error etc calls in signature.load_signatures
Testing TODO
--traverse-directory and --force behavior
--md5 behavior with query
--md5 behavior with query
index --from-file; make issue re argparse modification stuff for 4.0 (adjust sourmash index argparse to accomodate --from-file #1066)
lca classify --query-from-file, with and without other --query; traverse tests
lca summarize --query-from-file, with and without other --query; traverse tests
lca index --from-file; make issue re argparse modification for 4.0; traverse tests
multigather --query-from-file
compare new behavior; traverse stuff, too
make test Did it pass the tests?
make coverage Is the new code covered?
without a major version increment. Changing file formats also requires a major version number increment.
changes were made?