Prepared search databases

RefSeq microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 60,000 microbial genomes (including viral and fungal) from NCBI RefSeq.

GenBank microbial genomes - SBT

These databases are formatted for use with sourmash search and sourmash gather.

Approximately 100,000 microbial genomes (including viral and fungal) from NCBI GenBank.
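
As a rough sketch of how one of these SBT databases might be queried: the file names `query.sig` and `genbank-k31.sbt.json` below are placeholders (assumptions, not names taken from this page), and the k-mer size passed with `-k` should match the one the database was built with. The helper only prints the commands unless `DRY_RUN=0` is set, so it can be tried without a database on disk.

```shell
#!/bin/sh
# Sketch: query a prepared SBT with `sourmash search` and `sourmash gather`.
# query.sig and genbank-k31.sbt.json are hypothetical file names.

# Print each command; also execute it only when DRY_RUN=0.
run() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

query_sbt() {
    query_sig=$1   # signature computed from the query sample
    sbt=$2         # SBT index file of the prepared database
    ksize=$3       # k-mer size; must match the database

    # List database genomes similar to the query.
    run sourmash search -k "$ksize" "$query_sig" "$sbt"
    # Decompose a (meta)genome query into its best-matching genomes.
    run sourmash gather -k "$ksize" "$query_sig" "$sbt"
}

query_sbt query.sig genbank-k31.sbt.json 31
```

Running the script with `DRY_RUN=0` executes the two commands instead of only echoing them, which requires sourmash to be installed and both files to exist.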

Details

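The individual signatures for the above SBTs were calculated with the following two commands, where {input} stands for a genome sequence file and {output} for the signature file written for it: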

sourmash compute -k 4,5 \
                         -n 2000 \
                         --track-abundance \
                         --name-from-first \
                         -o {output} \
                         {input}

sourmash compute -k 21,31,51 \
                         --scaled 2000 \
                         --track-abundance \
                         --name-from-first \
                         -o {output} \
                         {input}
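
The `{input}` and `{output}` placeholders have to be filled in once per genome. A minimal sketch of that expansion for the scaled command follows; the `genomes/` directory, the `.fna.gz` extension, and the `.sig` output suffix are all assumptions made for illustration, not conventions stated on this page. The function only prints the command line, so nothing is computed until the output is passed to a shell.

```shell
#!/bin/sh
# Sketch: expand the {input}/{output} placeholders for each genome file.
# Directory layout and the ".sig" suffix are hypothetical.

compute_cmd() {
    input=$1
    output="${input}.sig"   # assumed naming: one signature file per genome
    echo "sourmash compute -k 21,31,51 --scaled 2000 --track-abundance --name-from-first -o $output $input"
}

for genome in genomes/*.fna.gz; do
    # Skip the unexpanded glob pattern when no genome files are present.
    [ -e "$genome" ] || continue
    compute_cmd "$genome"
done
```

Piping the script's output to `sh` would run the generated commands one after another.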

GenBank LCA Database

These databases are formatted for use with sourmash lca.

Approximately 87,000 microbial genomes (including viral and fungal) from NCBI GenBank.

Details

The above LCA databases were calculated as follows:

sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers

See github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy file.
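
For orientation, a hedged sketch of how an LCA database like the one built above might be used for taxonomic classification. The query file name `query.sig` is a placeholder, and the sketch assumes the query signature was computed with `--scaled` and k=21 so that it is compatible with the database. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: classify a query signature against a prepared LCA database.
# query.sig is a hypothetical file name; the database name is the one
# produced by the `sourmash lca index` command above.

# Print each command; also execute it only when DRY_RUN=0.
run() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

classify_with_lca() {
    query_sig=$1   # signature of the genome to classify
    lca_db=$2      # LCA database file

    # Assign a single lineage to the query genome.
    run sourmash lca classify --db "$lca_db" --query "$query_sig"
    # Summarize the lineages found in the query (useful for metagenomes).
    run sourmash lca summarize --db "$lca_db" --query "$query_sig"
}

classify_with_lca query.sig genbank-k21.lca.json.gz
```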