[MRG] plot a random subsample with 'sourmash plot --subsample'. #343
@taylorreiter comments & review welcome!
Note: I had to fix the version of the random number seed for cross-version compatibility; see https://stackoverflow.com/questions/11929701/why-is-seeding-the-random-generator-not-stable-between-versions-of-python
Codecov Report
@@            Coverage Diff             @@
##           master     #343     +/-   ##
==========================================
+ Coverage   86.96%   86.99%   +0.02%
==========================================
  Files          13       13
  Lines        2018     2037     +19
  Branches       36       36
==========================================
+ Hits         1755     1772     +17
- Misses        262      264      +2
  Partials        1        1
@ctb My use case was wanting to visualize divergent samples in a group of 11,000 samples that should have had similar tetranucleotide frequency throughout. I had no hypothesis as to which samples would not have similar tetranucleotide frequency, but suspected there would be some. This provides the first step: if a random subsample is selected, I would expect that some of the time a non-similar sample would be plotted. The next interesting step, I think, would be to select the N most similar samples to sample X and plot these from a sourmash compare matrix. I could do this in R pretty easily. Ran the code and liked the output!
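A rough sketch of the "N most similar samples to sample X" idea mentioned above, in Python rather than R. It assumes the comparison matrix is a square list of lists of similarity values (higher means more similar) with a parallel list of labels; the function name `most_similar` and its parameters are hypothetical and not part of sourmash.

```python
def most_similar(matrix, labels, query, n):
    """Return the n labels most similar to `query`, most similar first.

    `matrix` is assumed to be a square similarity matrix (list of lists)
    whose rows and columns follow the order of `labels`.
    """
    q = labels.index(query)  # raises ValueError if the query label is unknown
    # Rank every other sample by its similarity to the query, highest first.
    ranked = sorted(
        (i for i in range(len(labels)) if i != q),
        key=lambda i: matrix[q][i],
        reverse=True,
    )
    return [labels[i] for i in ranked[:n]]


if __name__ == "__main__":
    # Toy 3x3 matrix, made up for illustration only.
    toy = [
        [1.0, 0.9, 0.2],
        [0.9, 1.0, 0.4],
        [0.2, 0.4, 1.0],
    ]
    print(most_similar(toy, ["a", "b", "c"], "a", 2))
```

The returned labels could then be used to pull the matching rows and columns out of the matrix before plotting.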
Very quick! Nice tool!
Ready for review & merge, @betatim @luizirber!
& thanks for trying it out, @taylorreiter :)
whups! I see @taylorreiter has already approved it, so I'll merge when the tests pass :)
This adds `--subsample <N>` and `--subsample-seed <R>` to `sourmash plot`, which will plot a randomly chosen subset of size N, chosen using Python's `random.shuffle`, seeded with `--subsample-seed`. Note that the seed defaults to 1, which intentionally gives stable results when used with the same inputs.

Fixes #221.
Also fixes #334, detecting multiple ksizes/moltypes earlier.
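
The seeded-shuffle subsampling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `subsample_indices` and `subsample_matrix` are hypothetical, the matrix is assumed to be a square list of lists with a parallel list of labels, and sorting the kept indices (to preserve the original sample order) is a choice made here for readability.

```python
import random


def subsample_indices(n_total, n_keep, seed=1):
    """Pick n_keep distinct indices from range(n_total) via a seeded shuffle."""
    if not 0 < n_keep <= n_total:
        raise ValueError("n_keep must be between 1 and n_total")
    rng = random.Random(seed)  # private generator; global random state untouched
    idx = list(range(n_total))
    rng.shuffle(idx)
    # Keep the first n_keep shuffled indices, restored to their original order.
    return sorted(idx[:n_keep])


def subsample_matrix(matrix, labels, n_keep, seed=1):
    """Return the sub-matrix and labels for a seeded random subset of samples."""
    keep = subsample_indices(len(labels), n_keep, seed)
    sub_labels = [labels[i] for i in keep]
    # Select the same indices along both rows and columns.
    sub_matrix = [[matrix[i][j] for j in keep] for i in keep]
    return sub_matrix, sub_labels


if __name__ == "__main__":
    labels = ["s%d" % i for i in range(6)]
    matrix = [[1.0 if i == j else 0.5 for j in range(6)] for i in range(6)]
    sub, names = subsample_matrix(matrix, labels, 3)
    print(names)
```

With the same inputs and the same seed, repeated calls within one Python version return the same subset, which is the stability the default seed is meant to provide.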
- `make test` Did it pass the tests?
- `make coverage` Is the new code covered?
- [...] without a major version increment. Changing file formats also requires a major version number increment.
- [...] changes were made?