provide complete documentation on gather and prefetch output columns #2812

Closed
ctb opened this issue Oct 15, 2023 · 3 comments · Fixed by #2954
Labels
doc documentation content or issues

Comments

@ctb
Contributor

ctb commented Oct 15, 2023

we still don't have anything comprehensive. lots of issues on different aspects of it though 😆

@ctb ctb added the doc documentation content or issues label Oct 15, 2023
@ctb
Contributor Author

ctb commented Nov 2, 2023

came up in #2833 too - more docs needed 😭

@ctb
Contributor Author

ctb commented Jan 25, 2024

this has been sitting on my laptop for far too long so I'm going to post it here and then hopefully finish it ...soon:

https://hackmd.io/HP1iRzkHRE-2NYJqN5tSsQ?view

current content:

Prefetch CSV output columns

  • intersect_bp - integer: size of overlap between match and original query, estimated by multiplying the number of overlapping hashes by scaled.
  • jaccard - float: Jaccard similarity of the two sketches.
  • max_containment - float: max of f_query_match and f_match_query.
  • f_query_match - float: the fraction of the query contained by the match.
  • f_match_query - float: the fraction of the match contained by the query.
  • match_filename - string: filename the match sketch was loaded from.
  • match_name - string: full name of match sketch.
  • match_md5 - string: truncated md5sum of match sketch (8 char).
  • match_bp - integer: size of match, estimated by multiplying the sketch size by scaled.
  • query_filename - string: filename the query sketch was loaded from.
  • query_name - string: full name of query sketch.
  • query_md5 - string: truncated md5sum of query sketch (8 char).
  • query_bp - integer: size of query, estimated by multiplying the sketch size by scaled.
  • ksize - integer: k-mer size for the sketches used in the comparison.
  • moltype - string: molecule type of the sketches.
  • scaled - integer: scaled value at which the comparison was done.
  • query_n_hashes - integer: number of hashes in the query.
  • query_abundance - integer: median hash abundance in the sketch, if available. (CTB check: is it "median, if available", or is it just true/false?)
  • query_containment_ani - float: ANI estimated from the query containment in the match.
  • match_containment_ani - float: ANI estimated from the match containment in the query.
  • average_containment_ani - float: ANI estimated from the average of the query and match containment.
  • max_containment_ani - float: ANI estimated from the max containment between query/match.
  • potential_false_negative - boolean: True if the sketch size(s) were too small to give a reliable ANI estimate. False if ANI estimate is reliable.
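
To make the set arithmetic behind the size and containment columns concrete, here is a minimal sketch in plain Python. It does not use the sourmash API: `prefetch_row` is a hypothetical helper, the hash values are made up, and sketches are modeled as plain sets of integers. It only covers the columns whose definitions above are pure set operations, not the ANI columns.

```python
def prefetch_row(query_hashes, match_hashes, scaled):
    """Compute the set-based prefetch columns for one query/match pair.

    Hypothetical helper for illustration only; sketches are modeled as
    plain sets of hash values, and `scaled` is the scaled value shared
    by both sketches.
    """
    query = set(query_hashes)
    match = set(match_hashes)
    overlap = query & match

    f_query_match = len(overlap) / len(query)  # fraction of query in match
    f_match_query = len(overlap) / len(match)  # fraction of match in query

    return {
        # sizes in bp are hash counts multiplied by scaled
        "intersect_bp": len(overlap) * scaled,
        "match_bp": len(match) * scaled,
        "query_bp": len(query) * scaled,
        "query_n_hashes": len(query),
        "jaccard": len(overlap) / len(query | match),
        "f_query_match": f_query_match,
        "f_match_query": f_match_query,
        "max_containment": max(f_query_match, f_match_query),
        "scaled": scaled,
    }


# Made-up hashes: 8 in the query, 6 in the match, 4 shared.
row = prefetch_row({1, 2, 3, 4, 5, 6, 7, 8}, {5, 6, 7, 8, 9, 10}, scaled=1000)
for column, value in row.items():
    print(f"{column}: {value}")
```

With these toy inputs the query contains 8 hashes and shares 4 of them with the match, so `intersect_bp` is 4 x 1000 and `f_query_match` is 4/8.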

Gather CSV output columns

Here the query is typically a metagenome, and the matches are one or more genomes that collectively cover the query.

  • unique_intersect_bp - integer: size of overlap between match and remaining query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes.
  • intersect_bp - integer: size of overlap between match and query, estimated by multiplying the number of overlapping hashes by scaled. Independent of rank order and will often double-count hashes.
  • f_orig_query - float: the fraction of the original query represented by this match. Approximates the fraction of metagenomic reads that will map to this genome.
  • f_match - float: the containment of the match in the query.
  • f_unique_to_query - float: the fraction of matching hashes (unweighted) that are unique to this query; rank dependent. Will sum to the fraction of total k-mers (unweighted) that were identified.
  • f_unique_weighted - float: the fraction of matching hashes (weighted by multiplicity) that are unique to this query. This will sum to the fraction of total weighted k-mers that were identified. Approximates the fraction of metagenomic reads that will map to this genome after all previous matches at lower (earlier) ranks are mapped.
  • average_abund - float: mean abundance of the weighted hashes unique to the intersection. Empty if query does not have abundance. Rank dependent, does not double count.
  • median_abund - integer: median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.
  • std_abund - float: std deviation of the abundance of the hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.
  • filename - string: filename/location of database from which the match was loaded.
  • name - string: full sketch name of the match.
  • md5 - string: full md5sum of the match sketch.
  • f_match_orig - float: the fraction of the match in the full query. Rank independent.
  • gather_result_rank - integer: rank of this match in the results.
  • remaining_bp - integer: how many bp remain in the query after subtracting this match, estimated by multiplying remaining hashes by scaled.
  • query_filename - string: the filename from which the query was loaded.
  • query_name - string: the query sketch name.
  • query_md5 - string: truncated md5sum of the query sketch.
  • query_bp - integer: estimated number of bp in the query, estimated by multiplying the sketch size by scaled.
  • ksize - integer: k-mer size for the sketches used in the comparison.
  • moltype - string: molecule type of the comparison.
  • scaled - integer: scaled value of the comparison.
  • query_n_hashes - integer: number of hashes in the query sketch.
  • query_abundance - boolean: True if the query has abundance information; False otherwise.
  • query_containment_ani - float: ANI estimated from the query containment in the match.
  • match_containment_ani - float: ANI estimated from the match containment in the query.
  • average_containment_ani - float: ANI estimated from the average of the query and match containment.
  • max_containment_ani - float: ANI estimated from the max of the query and match containment.
  • potential_false_negative - boolean: True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise.
  • n_unique_weighted_found - integer: sum of (abundance-weighted) hashes found in this rank.
  • sum_weighted_found - integer: sum of the hashes x abundance found thus far, i.e. running total of n_unique_weighted_found. The last value divided by total_weighted_hashes will equal the total fraction of (weighted) k-mers identified.
  • total_weighted_hashes - integer: sum of hashes x abundance for the entire dataset. Constant value.
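
The rank-dependent vs. rank-independent distinction is easier to see on a toy example. The following is a rough sketch of a greedy decomposition in plain Python, not sourmash's implementation: `toy_gather` is a hypothetical function, the hashes and abundances are invented, thresholds and downsampling are ignored, and `f_match` is assumed to be computed against the remaining query (in contrast to `f_match_orig`, which uses the full query).

```python
from statistics import mean, median


def toy_gather(query_abund, matches, scaled):
    """Greedy decomposition of a query into matches (illustration only).

    query_abund: dict mapping hash -> abundance in the query.
    matches: dict mapping match name -> set of hashes.
    """
    orig_query = set(query_abund)
    remaining = set(orig_query)
    total_weighted_hashes = sum(query_abund.values())
    sum_weighted_found = 0
    candidates = {name: set(hashes) for name, hashes in matches.items()}
    rows = []

    while candidates:
        # Pick the match with the largest overlap with what is left.
        best = max(candidates, key=lambda n: len(candidates[n] & remaining))
        match = candidates.pop(best)
        unique = match & remaining      # rank dependent, no double counting
        full = match & orig_query       # rank independent, may double count
        if not unique:
            break

        abunds = [query_abund[h] for h in unique]
        n_unique_weighted_found = sum(abunds)
        sum_weighted_found += n_unique_weighted_found
        remaining -= unique

        rows.append({
            "name": best,
            "gather_result_rank": len(rows),
            "unique_intersect_bp": len(unique) * scaled,
            "intersect_bp": len(full) * scaled,
            "f_orig_query": len(full) / len(orig_query),
            "f_match": len(unique) / len(match),  # assumption: vs. remaining query
            "f_match_orig": len(full) / len(match),
            "f_unique_to_query": len(unique) / len(orig_query),
            "f_unique_weighted": n_unique_weighted_found / total_weighted_hashes,
            "average_abund": mean(abunds),
            "median_abund": median(abunds),
            "remaining_bp": len(remaining) * scaled,
            "n_unique_weighted_found": n_unique_weighted_found,
            "sum_weighted_found": sum_weighted_found,
            "total_weighted_hashes": total_weighted_hashes,
        })
    return rows


# Invented data: genomeA and genomeB share hashes 4 and 5.
query_abund = {1: 5, 2: 5, 3: 5, 4: 5, 5: 2, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1}
matches = {
    "genomeA": {1, 2, 3, 4, 5, 11},
    "genomeB": {4, 5, 6, 7, 12, 13},
    "genomeC": {20, 21},
}
rows = toy_gather(query_abund, matches, scaled=1000)
for r in rows:
    print(r["gather_result_rank"], r["name"], r["unique_intersect_bp"],
          r["intersect_bp"], round(r["f_unique_weighted"], 3))
```

In this toy data genomeB shares hashes 4 and 5 with genomeA, so its `intersect_bp` counts them again while its `unique_intersect_bp` does not. The per-rank `f_unique_weighted` values add up to the last `sum_weighted_found` divided by `total_weighted_hashes`, as described in the column list.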

@ctb
Contributor Author

ctb commented Jan 28, 2024

updated 1/28/24 - the gather columns should be properly and fully described now.

ctb added a commit that referenced this issue Jan 29, 2024
@ctb ctb closed this as completed in #2954 Jan 30, 2024
ctb added a commit that referenced this issue Jan 30, 2024
…2954)

This PR adds full column descriptions for `gather` and `prefetch` to
`classifying-signatures.md`. It also updates some other details in that
document, including adding a link to the published Hera et al. paper in
2023.

See [rendered
docs](https://sourmash--2954.org.readthedocs.build/en/2954/classifying-signatures.html)!

Fixes #2812
Fixes #2367

---------

Co-authored-by: Colton Baumler <[email protected]>