[MRG] add abundance weights in to 'sourmash gather' #347
Codecov Report
Coverage diff, master vs. #347:
              master    #347     +/-
Coverage      89.9%     89.92%   +0.01%
Files         31        31
Lines         4478      4486     +8
Branches      36        36
Hits          4026      4034     +8
Misses        451       451
Partials      1         1
|
Is this still WIP, or can it be merged? |
I would like some external checks by other people! Do numbers make sense? Also, docs. |
@ctb - @taylorreiter @luizirber and I are thinking that we can put together a mock community with 5-10 species at varying abundances from the SRA reads. Something like 10,000 reads from x, 100,000 reads from y, and 100,000,000 reads from z. Then we can see if the relative abundances match up. Sound reasonable? |
Sure! Maybe start with known genomes, then fake reads, and then real reads?
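For the mock community idea above, the expected read fractions can be worked out before running anything. This is a minimal sketch: the species names x, y, z and the read counts are the placeholders from the comment, and treating read share as the expected relative abundance assumes comparable genome sizes and read lengths, which may not hold for real SRA data.

```python
# Hypothetical read counts for a three-species mock community
# (x, y, z are placeholder names taken from the comment above).
read_counts = {"x": 10_000, "y": 100_000, "z": 100_000_000}


def expected_fractions(counts):
    """Return each species' share of the total number of reads."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


fractions = expected_fractions(read_counts)
for name, frac in fractions.items():
    # Print with enough digits to see the rarest species.
    print(f"{name}: {frac:.6f}")
```

With counts this skewed, z accounts for almost all reads, so the comparison would mostly test whether the rare species x and y are recovered at all.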
|
I've merged in the latest master and updated the tutorial to match the new output. Turns out we don't have any docs on gather so shrug. I'm going to merge this once the tests pass. |
Actually, I won't; I think we need to stick with the two-person review. So: ready for review @taylorreiter @luizirber @brooksph |
I can't get this branch to run.
My session looks like this:
appnope (0.1.0)
bleach (2.1.2)
bz2file (0.98)
cycler (0.10.0)
Cython (0.27.3)
decorator (4.2.1)
entrypoints (0.2.3)
html5lib (1.0.1)
ijson (2.3)
ipykernel (4.8.0)
ipython (6.2.1)
ipython-genutils (0.2.0)
ipywidgets (7.1.1)
jedi (0.11.1)
Jinja2 (2.10)
jsonschema (2.6.0)
jupyter (1.0.0)
jupyter-client (5.2.2)
jupyter-console (5.2.0)
jupyter-core (4.4.0)
khmer (2.1.1)
MarkupSafe (1.0)
matplotlib (2.1.2)
mistune (0.8.3)
nbconvert (5.3.1)
nbformat (4.4.0)
notebook (5.4.0)
numpy (1.14.0)
pandas (0.22.0)
pandocfilters (1.4.2)
parso (0.1.1)
pexpect (4.3.1)
pickleshare (0.7.4)
pip (9.0.1)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
pyparsing (2.2.0)
python-dateutil (2.6.1)
pytz (2017.3)
pyzmq (16.0.4)
qtconsole (4.3.1)
scikit-learn (0.19.1)
scipy (1.0.0)
screed (1.0)
Send2Trash (1.4.2)
setuptools (38.5.1)
simplegeneric (0.8.1)
six (1.11.0)
sourmash (2.0.0a2)
terminado (0.8.1)
testpath (0.3.1)
tornado (4.5.3)
traitlets (4.3.2)
wcwidth (0.1.7)
webencodings (0.5.1)
wheel (0.24.0)
widgetsnbextension (3.1.3)
I installed sourmash with
pip install -U https://github.com/dib-lab/sourmash/archive/gather/abund.zip
I made a signature from ecoli k12, and a signature with a duplicated k12 signature with
for infile in ecolik12*; do sourmash compute -k 31 --scaled 10000 -o ${infile}.sig --track-abundance $infile; done
I then run:
sourmash gather -o gather_abund_out_k12_1.csv ecolik12.fa.sig ~/github/polyurethane/clusters_analyses/2017-08-18-clusters/sourmash_gather_db/genbank-k31.sbt.json
And I get the following error:
select query k=31 automatically.
loaded query: ecolik12.fa... (k=31, DNA)
Error in parsing signature; quitting.
Exception:
ses/2017-08-18-clusters/sourmash_gather_db/genbank-k31.sbt.json
found 0 matches total;
Traceback (most recent call last):
File "/Users/taylorreiter/Envs/sourmash-gather-abund/bin/sourmash", line 11, in <module>
load_entry_point('sourmash==2.0.0a2', 'console_scripts', 'sourmash')()
File "/Users/taylorreiter/Envs/sourmash-gather-abund/lib/python3.6/site-packages/sourmash_lib/__main__.py", line 63, in main
cmd(sys.argv[2:])
File "/Users/taylorreiter/Envs/sourmash-gather-abund/lib/python3.6/site-packages/sourmash_lib/commands.py", line 967, in gather
(1 - weighted_missed) * 100)
UnboundLocalError: local variable 'weighted_missed' referenced before assignment
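The traceback above is the error Python raises when a local variable is read on a path where it was never assigned. A minimal sketch of that failure mode follows. It is not the code in sourmash_lib/commands.py: the function names and the `missed` key are hypothetical, and the idea that the variable is only set inside a per-match loop is an assumption suggested by the "found 0 matches total" line.

```python
def report_buggy(matches):
    """Hypothetical reporter: breaks when `matches` is empty."""
    for m in matches:
        # Assigned only if the loop body runs at least once.
        weighted_missed = m["missed"]
    return (1 - weighted_missed) * 100


def report_guarded(matches):
    """Same logic, with a default set before the loop."""
    # Assumption: with no matches, the whole query counts as missed.
    weighted_missed = 1.0
    for m in matches:
        weighted_missed = m["missed"]
    return (1 - weighted_missed) * 100


try:
    report_buggy([])
except UnboundLocalError as exc:
    print("buggy version failed:", exc)

print(report_guarded([]))
```

Initializing the variable before the loop is one possible guard; returning early when there are no matches would work as well.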
loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31, DNA)
loaded SBT genbank-k31.sbt.json
loaded 0 signatures and 1 databases total.
I've always been confused by why this says 0 signatures instead of 1. Is it 0 index because python?
That's been fixed in #372, which was recently merged. It's because no signatures were loaded, only a database, but it always bugged me too (even though I wrote it).
I'm uploading my signatures here, with .txt appended so that I can upload them. |
I can't replicate the parsing problem - that indicates that the ecolik12.fa.sig is a bad signature!?, but it works for me! - but that is indeed a separate problem with this branch. |
@ctb LGTM. Built a temp gather/abund container until this is merged. Documented here: https://github.com/brooksph/2017_sourmash_gather_comparison/tree/master/sourmash_abund. Summary: I built a mock metagenome with five copies of the same genome. Using sourmash, I computed the signatures with --scaled 10000 and ran gather against the RefSeq and GenBank SBTs. The average abundance was 5. (https://github.com/brooksph/2017_sourmash_gather_comparison/blob/master/sourmash_abund/outputs/classification/sourmash/mock.genome.fa.scaled10k.k51.gather.matches.csv) |
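Why five copies should give an average abundance of 5 can be shown with plain k-mer counting. This is a toy sketch, not what sourmash does internally: it skips hashing, `--scaled` subsampling and canonical (reverse-complement) k-mers, and the sequence and k value are made up for illustration.

```python
from collections import Counter


def kmer_abundances(sequences, k):
    """Count how many times each k-mer occurs across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


genome = "ACGTTGCATGGCCAATCGGA"  # hypothetical toy genome
single = kmer_abundances([genome], k=5)
five = kmer_abundances([genome] * 5, k=5)  # mock metagenome: five copies

# Every k-mer is seen five times as often in the mock metagenome.
ratio_ok = all(five[kmer] == 5 * single[kmer] for kmer in single)
average = sum(five.values()) / len(five)
print(ratio_ok, average)
```

Each k-mer's count scales with the number of copies, so the mean abundance of the duplicated sample is five times that of the single genome.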
thanks @brooksph - appreciate it. Is this something that we should turn into a test (although probably not against a large SBT)? Seems like so. If so could you start a new issue or a new PR that has your signature in it? |
Also @brooksph if you could click the "approved" button for the review that would be nice. I'll fix the problem @taylorreiter identified before merging. |
Ok...I checked and `gather` also fails on master branch, so I guess the signature is bad...although I'm not sure why. Seems like maybe a separate problem? |
If this is the signature you uploaded, it works for me. |
LGTM
It is the signature I uploaded. I wonder if I have a DB problem then...in any case, sounds like a personal problem. |
ahh yes @taylorreiter that looks like the problem, your genbank sbt must be wonky. |
ahh - the problem @taylorreiter is actually fixed in #380, yay, I am going to merge this 'un. |
This addresses #180 for `sourmash gather` when the query signature has been created using `--track-abundance`.
- `gather` now reports fraction of unique instead of fraction of original query;
- with `--track-abundance`, fraction is weighted by abundance in query;
- also fixes #266 #280 by changing the text output to be the unique fraction of the match.
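The difference between a plain fraction and an abundance-weighted one can be sketched as follows. This is an illustration under stated assumptions, not sourmash's implementation: the function names are hypothetical, the query is modeled as a dict mapping hash values to abundances, and the match is modeled as a set of hash values.

```python
def unweighted_fraction(query_abund, match_hashes):
    """Fraction of distinct query hashes that occur in the match."""
    found = set(query_abund) & set(match_hashes)
    return len(found) / len(query_abund)


def weighted_fraction(query_abund, match_hashes):
    """Same fraction, with each hash counted by its abundance in the query."""
    total = sum(query_abund.values())
    found = sum(abund for h, abund in query_abund.items() if h in match_hashes)
    return found / total


# Hypothetical query: hash 1 is ten times as abundant as the others.
query = {1: 10, 2: 1, 3: 1, 4: 1}
match = {1, 2}

print(unweighted_fraction(query, match))  # 2 of 4 distinct hashes
print(weighted_fraction(query, match))    # 11 of 13 abundance units
```

In this toy case the match covers half of the distinct hashes but most of the abundance, which is the kind of shift the weighted output is meant to expose.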
- `make test` Did it pass the tests?
- `make coverage` Is the new code covered?
- … without a major version increment. Changing file formats also requires a major version number increment.
- … changes were made?