provide complete documentation on `gather` and `prefetch` output columns #2812

ctb · 2023-10-15T21:14:36Z

we still don't have anything comprehensive. lots of issues on different aspects of it though 😆

ctb · 2023-11-02T03:33:59Z

came up in #2833 too - more docs needed 😭

ctb · 2024-01-25T15:50:42Z

this has been sitting on my laptop for far too long so I'm going to post it here and then hopefully finish it ...soon:

https://hackmd.io/HP1iRzkHRE-2NYJqN5tSsQ?view

current content:

Prefetch CSV output columns

intersect_bp - integer: size of overlap between match and original query, estimated by multiplying the number of overlapping hashes by scaled.
jaccard - float: Jaccard similarity of the two sketches.
max_containment - float: max of f_query_match and f_match_query.
f_query_match - float: the fraction of the query contained by the match.
f_match_query - float: the fraction of the match contained by the query.
match_filename - string: filename the match sketch was loaded from.
match_name - string: full name of match sketch.
match_md5 - string: truncated md5sum of match sketch (8 char).
match_bp - integer: size of match, estimated by multiplying the sketch size by scaled.
query_filename - string: filename the query sketch was loaded from.
query_name - string: full name of query sketch.
query_md5 - string: truncated md5sum of query sketch (8 char).
query_bp - integer: size of query, estimated by multiplying the sketch size by scaled.
ksize - integer: k-mer size for the sketches used in the comparison.
moltype - string: molecule type of the sketches.
scaled - integer: scaled value at which the comparison was done.
query_n_hashes - integer: number of hashes in the query.
query_abundance - integer: median hash abundance in the sketch, if available. (CTB check: "if available"; is it the median, or just true/false?)
query_containment_ani - float: ANI estimated from the query containment in the match.
match_containment_ani - float: ANI estimated from the match containment in the query.
average_containment_ani - float: ANI estimated from the average of the query and match containment.
max_containment_ani - float: ANI estimated from the max containment between query/match.
potential_false_negative - boolean: True if the sketch size(s) were too small to give a reliable ANI estimate. False if ANI estimate is reliable.
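
To make the arithmetic behind these columns concrete, here is a minimal sketch that uses plain Python sets in place of real sketches. The function name `prefetch_columns` and its inputs are hypothetical and not part of sourmash. The ANI lines assume the simple point estimate ANI ≈ containment^(1/ksize), which the descriptions above do not spell out, so treat those values as illustrative only.

```python
# Illustrative only: plain Python sets stand in for scaled sketches.
# prefetch_columns() is a hypothetical helper, not a sourmash function.

def prefetch_columns(query_hashes, match_hashes, scaled, ksize):
    """Compute a few prefetch-style columns from two collections of hashes."""
    query = set(query_hashes)
    match = set(match_hashes)
    overlap = query & match

    f_query_match = len(overlap) / len(query)  # fraction of query in match
    f_match_query = len(overlap) / len(match)  # fraction of match in query
    max_containment = max(f_query_match, f_match_query)

    return {
        "intersect_bp": len(overlap) * scaled,
        "jaccard": len(overlap) / len(query | match),
        "max_containment": max_containment,
        "f_query_match": f_query_match,
        "f_match_query": f_match_query,
        "query_bp": len(query) * scaled,
        "match_bp": len(match) * scaled,
        "query_n_hashes": len(query),
        # ASSUMPTION: point estimate ANI ~ containment ** (1 / ksize).
        "query_containment_ani": f_query_match ** (1 / ksize),
        "match_containment_ani": f_match_query ** (1 / ksize),
        "max_containment_ani": max_containment ** (1 / ksize),
    }


# Toy example: 4 query hashes, 6 match hashes, 2 shared.
row = prefetch_columns({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8}, scaled=1000, ksize=31)
for column, value in row.items():
    print(column, value)
```

In this toy case the two sketches share 2 hashes, so intersect_bp is 2 x 1000 = 2000, f_query_match is 2/4, and f_match_query is 2/6.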

Gather CSV output columns

Here the query is typically a metagenome, and the matches are one or more genomes that collectively cover the query.

unique_intersect_bp - integer: size of overlap between match and remaining query, estimated by multiplying the number of overlapping hashes by scaled. Rank/order dependent. Does not double count hashes.
intersect_bp - integer: size of overlap between match and query, estimated by multiplying the number of overlapping hashes by scaled. Independent of rank order and will often double-count hashes.
f_orig_query - float: the fraction of the original query represented by this match. Approximates the fraction of metagenomic reads that will map to this genome.
f_match - float: the containment of the match in the query.
f_unique_to_query - float: the fraction of matching hashes (unweighted) that are unique to this query; rank dependent. Will sum to the fraction of total k-mers (unweighted) that were identified.
f_unique_weighted - float: the fraction of matching hashes (weighted by multiplicity) that are unique to this query. This will sum to the fraction of total weighted k-mers that were identified. Approximates the fraction of metagenomic reads that will map to this genome after all previous matches at lower (earlier) ranks are mapped.
average_abund - float: mean abundance of the weighted hashes unique to the intersection. Empty if query does not have abundance. Rank dependent, does not double count.
median_abund - integer: median abundance of the weighted hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.
std_abund - float: standard deviation of the abundance of the hashes unique to the intersection. Empty if query has no abundance. Rank dependent, does not double count.
filename - string: filename/location of database from which the match was loaded.
name - string: full sketch name of the match.
md5 - string: full md5sum of the match sketch.
f_match_orig - float: the fraction of the match in the full query. Rank independent.
gather_result_rank - float: rank of this match in the results.
remaining_bp - integer: how many bp remain in the query after subtracting this match, estimated by multiplying remaining hashes by scaled.
query_filename - string: the filename from which the query was loaded.
query_name - string: the query sketch name.
query_md5 - string: truncated md5sum of the query sketch.
query_bp - integer: size of the query in bp, estimated by multiplying the sketch size by scaled.
ksize - integer: k-mer size for the sketches used in the comparison.
moltype - string: molecule type of the comparison.
scaled - integer: scaled value of the comparison.
query_n_hashes - integer: number of hashes in the query sketch.
query_abundance - boolean: True if the query has abundance information; False otherwise.
query_containment_ani - float: ANI estimated from the query containment in the match.
match_containment_ani - float: ANI estimated from the match containment in the query.
average_containment_ani - float: ANI estimated from the average of the query and match containment.
max_containment_ani - float: ANI estimated from the max of the query and match containment.
potential_false_negative - boolean: True if the sketch size(s) were too small to give a reliable ANI estimate. False otherwise.
n_unique_weighted_found - integer: sum of (abundance-weighted) hashes found in this rank.
sum_weighted_found - integer: sum of the hashes x abundance found thus far, i.e. running total of n_unique_weighted_found. The last value divided by total_weighted_hashes will equal the total fraction of (weighted) k-mers identified.
total_weighted_hashes - integer: sum of hashes x abundance for the entire dataset. Constant value.
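
The rank-dependent columns are easier to follow with a toy run. The following is a rough sketch of a greedy gather-style loop over plain Python sets and dicts, not the sourmash implementation. The function name `gather_rows`, the tie-breaking, and the 0-based rank are all assumptions made for illustration.

```python
# Illustrative only: a greedy loop over plain sets, not sourmash's own code.
# gather_rows() and its inputs are hypothetical.

def gather_rows(query_abunds, matches, scaled):
    """query_abunds: hash -> abundance; matches: name -> set of hashes."""
    orig_query = set(query_abunds)
    remaining = set(orig_query)
    total_weighted_hashes = sum(query_abunds.values())
    sum_weighted_found = 0
    candidates = dict(matches)
    rows = []

    while candidates:
        # Pick the candidate with the largest overlap with the remaining query.
        name = max(candidates, key=lambda n: len(candidates[n] & remaining))
        match = candidates.pop(name)
        unique = match & remaining          # rank dependent, no double counting
        if not unique:
            break
        overlap_orig = match & orig_query   # rank independent, may double count

        n_unique_weighted_found = sum(query_abunds[h] for h in unique)
        sum_weighted_found += n_unique_weighted_found
        remaining -= unique

        rows.append({
            "name": name,
            "gather_result_rank": len(rows),  # ASSUMPTION: ranks start at 0
            "unique_intersect_bp": len(unique) * scaled,
            "intersect_bp": len(overlap_orig) * scaled,
            "f_orig_query": len(overlap_orig) / len(orig_query),
            "f_match_orig": len(overlap_orig) / len(match),
            "f_unique_to_query": len(unique) / len(orig_query),
            "f_unique_weighted": n_unique_weighted_found / total_weighted_hashes,
            "average_abund": n_unique_weighted_found / len(unique),
            "remaining_bp": len(remaining) * scaled,
            "n_unique_weighted_found": n_unique_weighted_found,
            "sum_weighted_found": sum_weighted_found,
            "total_weighted_hashes": total_weighted_hashes,
        })
    return rows


# Toy metagenome with abundances, and two genomes sharing hashes 3 and 4.
query_abunds = {1: 2, 2: 2, 3: 4, 4: 4, 5: 1, 6: 1}
matches = {"genomeA": {1, 2, 3, 4, 100}, "genomeB": {3, 4, 5}}
rows = gather_rows(query_abunds, matches, scaled=1000)
for r in rows:
    print(r["name"], r["unique_intersect_bp"], r["intersect_bp"], r["sum_weighted_found"])
```

Here genomeB's intersect_bp still counts hashes 3 and 4, which were already assigned to genomeA, while its unique_intersect_bp counts only hash 5. The last sum_weighted_found divided by total_weighted_hashes (13/14) equals the sum of f_unique_weighted across the rows.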

ctb · 2024-01-28T23:04:59Z

updated 1/28/24 - the gather columns should be properly and fully described now.

…2954)

This PR adds full column descriptions for `gather` and `prefetch` to `classifying-signatures.md`. It also updates some other details in that document, including adding a link to the Hera et al. paper published in 2023.

See [rendered docs](https://sourmash--2954.org.readthedocs.build/en/2954/classifying-signatures.html)!

Fixes #2812
Fixes #2367

---------

Co-authored-by: Colton Baumler <[email protected]>

ctb added the doc documentation content or issues label Oct 15, 2023

ctb added a commit that referenced this issue Jan 29, 2024

add full column descriptions per #2812

f0a45d7

This was referenced Jan 29, 2024

MRG: add full column descriptions for gather and prefetch output #2954

Merged

consider reorganizing gather/prefetch column documentation #2961

Open

ctb closed this as completed in #2954 Jan 30, 2024
