Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add abundance weights in to 'sourmash gather' #347

Merged
merged 18 commits into from
Feb 7, 2018
Merged

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Sep 29, 2017

This addresses #180 for sourmash gather when query signature has been created using --track-abundance.

  • gather now reports fraction of unique instead of fraction of original query;
  • if query signature was computed with --track-abundance, fraction is weighted by abundance in query;
  • fix tests for updated behavior
  • add tests for new behavior!
  • explain new behavior in docs and tutorial
  • come up with good example data sets

also fixes #266 #280 by changing the text output to be the unique fraction of the match.

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov-io
Copy link

codecov-io commented Sep 29, 2017

Codecov Report

Merging #347 into master will increase coverage by 0.01%.
The diff coverage is 86.2%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #347      +/-   ##
==========================================
+ Coverage    89.9%   89.92%   +0.01%     
==========================================
  Files          31       31              
  Lines        4478     4486       +8     
  Branches       36       36              
==========================================
+ Hits         4026     4034       +8     
  Misses        451      451              
  Partials        1        1
Impacted Files Coverage Δ
sourmash_lib/commands.py 90.38% <100%> (+0.38%) ⬆️
sourmash_lib/search.py 92.43% <84%> (-2.47%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae9a32f...753f086. Read the comment docs.

@taylorreiter
Copy link
Contributor

Is this still WIP, or can it be merged?

@ctb
Copy link
Contributor Author

ctb commented Dec 11, 2017

I would like some external checks by other people! Do numbers make sense?

Also, docs.

@brooksph
Copy link
Contributor

brooksph commented Dec 11, 2017

@ctb - @taylorreiter @luizirber and I are thinking that we can put together a mock community with 5-10 species at varying abundances from the SRA reads. Something like 10,000 reads from x, 100,000 reads from y, and 100,000,000 reads from z. Then we can see if the relative abundances match up. Sound reasonable?

@ctb
Copy link
Contributor Author

ctb commented Dec 11, 2017 via email

@ctb
Copy link
Contributor Author

ctb commented Feb 3, 2018

I've merged in the latest master and updated the tutorial to match the new output. Turns out we don't have any docs on gather so shrug.

I'm going to merge this once the tests pass.

@ctb ctb changed the title [WIP] add abundance weights in to 'sourmash gather' [MRG] add abundance weights in to 'sourmash gather' Feb 3, 2018
@ctb
Copy link
Contributor Author

ctb commented Feb 3, 2018

Actually, I won't; I think we need to stick with the two-person review.

So: ready for review @taylorreiter @luizirber @brooksph

Copy link
Contributor

@taylorreiter taylorreiter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't get this branch to run.

My session looks like this:

appnope (0.1.0)
bleach (2.1.2)
bz2file (0.98)
cycler (0.10.0)
Cython (0.27.3)
decorator (4.2.1)
entrypoints (0.2.3)
html5lib (1.0.1)
ijson (2.3)
ipykernel (4.8.0)
ipython (6.2.1)
ipython-genutils (0.2.0)
ipywidgets (7.1.1)
jedi (0.11.1)
Jinja2 (2.10)
jsonschema (2.6.0)
jupyter (1.0.0)
jupyter-client (5.2.2)
jupyter-console (5.2.0)
jupyter-core (4.4.0)
khmer (2.1.1)
MarkupSafe (1.0)
matplotlib (2.1.2)
mistune (0.8.3)
nbconvert (5.3.1)
nbformat (4.4.0)
notebook (5.4.0)
numpy (1.14.0)
pandas (0.22.0)
pandocfilters (1.4.2)
parso (0.1.1)
pexpect (4.3.1)
pickleshare (0.7.4)
pip (9.0.1)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
pyparsing (2.2.0)
python-dateutil (2.6.1)
pytz (2017.3)
pyzmq (16.0.4)
qtconsole (4.3.1)
scikit-learn (0.19.1)
scipy (1.0.0)
screed (1.0)
Send2Trash (1.4.2)
setuptools (38.5.1)
simplegeneric (0.8.1)
six (1.11.0)
sourmash (2.0.0a2)
terminado (0.8.1)
testpath (0.3.1)
tornado (4.5.3)
traitlets (4.3.2)
wcwidth (0.1.7)
webencodings (0.5.1)
wheel (0.24.0)
widgetsnbextension (3.1.3)

I installed sourmash with

 pip install -U https://github.com/dib-lab/sourmash/archive/gather/abund.zip

I made a signature from ecoli k12, and a signature with a duplicated k12 signature with

for infile in ecolik12*; do sourmash compute -k 31 --scaled 10000 -o ${infile}.sig --track-abundance $infile; done

I then run:

sourmash gather -o gather_abund_out_k12_1.csv ecolik12.fa.sig ~/github/polyurethane/clusters_analyses/2017-08-18-clusters/sourmash_gather_db/genbank-k31.sbt.json 

And I get the following error:

select query k=31 automatically.
loaded query: ecolik12.fa... (k=31, DNA)
Error in parsing signature; quitting.
Exception: 
                                                                               ses/2017-08-18-clusters/sourmash_gather_db/genbank-k31.sbt.json

found 0 matches total;
Traceback (most recent call last):
  File "/Users/taylorreiter/Envs/sourmash-gather-abund/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash==2.0.0a2', 'console_scripts', 'sourmash')()
  File "/Users/taylorreiter/Envs/sourmash-gather-abund/lib/python3.6/site-packages/sourmash_lib/__main__.py", line 63, in main
    cmd(sys.argv[2:])
  File "/Users/taylorreiter/Envs/sourmash-gather-abund/lib/python3.6/site-packages/sourmash_lib/commands.py", line 967, in gather
    (1 - weighted_missed) * 100)
UnboundLocalError: local variable 'weighted_missed' referenced before assignment

loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31, DNA)
loaded SBT genbank-k31.sbt.json
loaded 0 signatures and 1 databases total.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always been confused by why this says 0 signatures instead of 1. Is it 0 index because python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's been fixed in #372, which was recently merged. It's because no signatures were loaded, only a database, but it always bugged me too (even though I wrote it).

@taylorreiter
Copy link
Contributor

I'm uploading my signatures here, with .txt appended so that I can upload them.
ecolik12.fa.sig.txt
ecolik12x2.txt.sig.txt

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018

I can't replicate the parsing problem - that indicates that the ecolik12.fa.sig is a bad signature!?, but it works for me! - but that is indeed a separate problem with this branch.

@brooksph
Copy link
Contributor

brooksph commented Feb 7, 2018

@ctb LGTM. Built temp gather/abund container until merged. Documented here: https://github.com/brooksph/2017_sourmash_gather_comparison/tree/master/sourmash_abund.

Summary: I built a mock metagenome with 'five' copies of the same genome. Using sourmash I computed the signatures at with --scaled 10000 and ran gather against the ref seq and genbank sbts. The average abundance was 5. (https://github.com/brooksph/2017_sourmash_gather_comparison/blob/master/sourmash_abund/outputs/classification/sourmash/mock.genome.fa.scaled10k.k51.gather.matches.csv)

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018

thanks @brooksph - appreciate it. Is this something that we sheould turn into a test (although probably not against a large SBT)? Seems like so. If so could you start a new issue or a new PR that has your signature in it?

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018

Also @brooksph if you could click the "approved" button for the review that would be nice. I'll fix the problem @taylorreiter identified before merging.

@taylorreiter
Copy link
Contributor

Ok...I checked and gather also fails on master branch, so I guess the signature is bad...although I'm not sure why. Seems like maybe a separate problem?

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018 via email

Copy link
Contributor

@brooksph brooksph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@taylorreiter
Copy link
Contributor

It is the signature I uploaded. I wonder if I have a DB problem then...in any case, sounds like a personal problem.

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018

ahh yes @taylorreiter that looks like the problem, your genbank sbt must be wonky.

@ctb
Copy link
Contributor Author

ctb commented Feb 7, 2018

ahh - the problem @taylorreiter is actually fixed in #380, yay, I am going to merge this 'un.

@ctb ctb merged commit bf91ee2 into master Feb 7, 2018
@ctb ctb deleted the gather/abund branch February 7, 2018 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpectedly finding more results
4 participants