These databases are formatted for use with sourmash search and sourmash gather.
Approximately 60,000 microbial genomes (including viral and fungal) from NCBI RefSeq.
- RefSeq k=21, 2017.05.09 - 3.5 GB
- RefSeq k=31, 2017.05.09 - 3.5 GB
- RefSeq k=51, 2017.05.09 - 3.5 GB
These databases are formatted for use with sourmash search and sourmash gather.
Approximately 100,000 microbial genomes (including viral and fungal) from NCBI Genbank.
- Genbank k=21, 2017.05.09 - 4.2 GB
- Genbank k=31, 2017.05.09 - 4.2 GB
- Genbank k=51, 2017.05.09 - 4.2 GB
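The two commands these databases support answer different questions. As a rough conceptual sketch (not sourmash's implementation), search ranks each database entry by its similarity to the whole query, while gather greedily picks the entry that explains the most of the query, removes that overlap, and repeats. The function names, the toy hash sets, and the 0.1 threshold below are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query, database, threshold=0.1):
    """Rank every database entry by similarity to the whole query."""
    hits = [(jaccard(query, hashes), name) for name, hashes in database.items()]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


def gather(query, database):
    """Greedily explain the query: take the best overlap, remove it, repeat."""
    remaining = set(query)
    results = []
    while remaining:
        # Pick the entry that covers the most still-unexplained hashes.
        name, hashes = max(database.items(),
                           key=lambda item: len(remaining & item[1]))
        overlap = remaining & hashes
        if not overlap:
            break  # nothing in the database explains what is left
        results.append((name, len(overlap)))
        remaining -= overlap
    return results, remaining


# Toy data: small integers stand in for k-mer hashes (all values are made up).
database = {
    "genome_A": {1, 2, 3, 4, 5, 6},
    "genome_B": {5, 6, 7, 8},
    "genome_C": {100, 101},
}
metagenome = {1, 2, 3, 4, 5, 6, 7, 8, 9}

search_hits = search(metagenome, database)
gather_hits, unassigned = gather(metagenome, database)
```

In this toy case gather credits genome_B only with the two hashes that genome_A did not already explain, which is why gather is usually the better fit for mixed samples.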
The individual signatures for the above SBTs were calculated as follows:
sourmash compute -k 4,5 \
    -n 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}

sourmash compute -k 21,31,51 \
    --scaled 2000 \
    --track-abundance \
    --name-from-first \
    -o {output} \
    {input}
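The commands above differ in how they size a signature: -n 2000 keeps a fixed number of hashes, while --scaled 2000 keeps every hash that falls in the lowest 1/2000 of the hash space, so the signature grows with the genome. The following is a minimal sketch of that difference, under stated assumptions: SHA-1 from the standard library stands in for the MurmurHash function sourmash uses, reverse complements and --track-abundance are not modeled, and all function names are hypothetical.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def kmers(sequence, k):
    """Yield every overlapping k-mer in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hash_kmer(kmer):
    """64-bit hash of a k-mer (stand-in for the hash sourmash really uses)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def num_sketch(sequence, k, num):
    """Like -n: keep only the `num` smallest distinct hashes."""
    return sorted({hash_kmer(km) for km in kmers(sequence, k)})[:num]


def scaled_sketch(sequence, k, scaled):
    """Like --scaled: keep every distinct hash below 2**64 / scaled."""
    threshold = MAX_HASH // scaled
    hashes = {hash_kmer(km) for km in kmers(sequence, k)}
    return sorted(h for h in hashes if h < threshold)


# Made-up random "genome"; small parameters keep the example quick.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20000))

short_num = num_sketch(genome[:5000], 21, 50)
long_num = num_sketch(genome, 21, 50)
short_scaled = scaled_sketch(genome[:5000], 21, 100)
long_scaled = scaled_sketch(genome, 21, 100)
```

The fixed-size sketches stay at 50 hashes regardless of input length, whereas the scaled sketch of the full sequence holds roughly four times as many hashes as the sketch of its first quarter.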
These databases are formatted for use with sourmash lca.
Approximately 87,000 microbial genomes (including viral and fungal) from NCBI Genbank.
- Genbank k=21, 2017.11.07, 105 MB
- Genbank k=31, 2017.11.07, 118 MB
- Genbank k=51, 2017.11.07, 123 MB
The above LCA databases were built as follows (shown here for k=21):
sourmash lca index genbank-genomes-taxonomy.2017.05.29.csv \
    genbank-k21.lca.json.gz -k 21 --scaled=10000 \
    -f --traverse-directory .sbt.genbank-k21 --split-identifiers
See github.com/dib-lab/2018-ncbi-lineages for information on preparing the genbank-genomes-taxonomy file.
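To illustrate what an LCA database is for, here is a small conceptual sketch, not sourmash's code. It assumes that --split-identifiers reduces a signature name to its leading accession so it can be matched against the taxonomy file, and that each hash is then assigned the lowest common ancestor of the lineages of all genomes containing it. The accessions, lineages, hash values, and function names are made up for the example.

```python
def split_identifier(name, keep_version=False):
    """Reduce a signature name to its accession (assumed flag behavior)."""
    ident = name.split(" ")[0]        # drop the free-text description
    if not keep_version:
        ident = ident.split(".")[0]   # drop the ".1"-style version suffix
    return ident


def lowest_common_ancestor(lineages):
    """Longest shared prefix of several lineages (tuples of rank names)."""
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages disagree at this rank
        lca.append(ranks[0])
    return tuple(lca)


# Hypothetical taxonomy: accession -> lineage, truncated to three ranks.
lineages = {
    "GCA_000000001": ("Bacteria", "Proteobacteria", "Escherichia"),
    "GCA_000000002": ("Bacteria", "Proteobacteria", "Salmonella"),
    "GCA_000000003": ("Bacteria", "Firmicutes", "Bacillus"),
}

# Hypothetical index: hash value -> accessions of genomes containing it.
hash_to_idents = {
    11: {"GCA_000000001"},
    22: {"GCA_000000001", "GCA_000000002"},
    33: {"GCA_000000001", "GCA_000000003"},
}


def classify_hash(hashval):
    """Return the LCA lineage for one hash, or () if the hash is unknown."""
    idents = hash_to_idents.get(hashval, set())
    return lowest_common_ancestor([lineages[i] for i in idents])
```

A hash found in a single genome keeps that genome's full lineage, while a hash shared across phyla can only be assigned at the superkingdom level.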