find signatures by name in zipped sbt #993

Closed
bluegenes opened this issue May 21, 2020 · 9 comments
Labels
code herein lies code question


@bluegenes
Contributor

I have a list of fasta files I'd like to compare, for which I stored signatures within a zipped sbt. Titus mentioned that the sbt json contains mapping information from signature name to signature.

Is there a standard way to iterate through this list to find (and then load) signatures from the sbt by name?

@ctb
Contributor

ctb commented May 21, 2020

to add to this for @luizirber --

tessa would like to get a list of all leaf nodes and their names.

she would separately like to be able to load a specific leaf node.

I'm 99% sure this is possible from looking at the .sbt.json file, but I'm not sure if there's a good interface for doing it, or if it needs to be hacked together.

@luizirber
Member

One option is fixing the SigLeaf nodes in the .sbt.json, because now they are... pretty useless. Here is an example:

"10": {
  "filename": "f0c834bc306651d2b9321fb21d3e8d8f",
  "name": "f0c834bc306651d2b9321fb21d3e8d8f",
  "metadata": "f0c834bc306651d2b9321fb21d3e8d8f"
},

Note how filename, name and metadata are all the same! We can start setting name to be the actual name field from the signature, and this would avoid loading all the leaves data.

(filename can be something other than just the MD5, especially after #994 is merged. metadata should probably be a dict, but changing that now will probably break stuff or be annoying to deal with).
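
A minimal sketch of how leaf entries like the one above could be scanned without loading any signature data. It works on the parsed JSON only. The top-level key `signatures` and the helper name `leaf_names` are assumptions made for illustration, not confirmed parts of the format or of sourmash:

```python
import json


def leaf_names(sbt_json, leaves_key="signatures"):
    """Map leaf position -> 'name' field from parsed .sbt.json content.

    `leaves_key` is an assumption about where the leaf entries live.
    """
    leaves = sbt_json.get(leaves_key, {})
    return {int(pos): entry["name"] for pos, entry in leaves.items()}


# Toy document shaped like the example entry above.
doc = json.loads("""
{"signatures": {"10": {"filename": "f0c834bc306651d2b9321fb21d3e8d8f",
                        "name": "f0c834bc306651d2b9321fb21d3e8d8f",
                        "metadata": "f0c834bc306651d2b9321fb21d3e8d8f"}}}
""")
print(leaf_names(doc))
```

Once `name` holds the real signature name, the same dictionary would double as a name lookup table.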


So, for the use case: SBT.leaves() returns all leaves already; could use that to check if there is a name match, and then return (name, data) for that leaf. Something like

def sig_by_name(sbt, name):
    for leaf in sbt.leaves():
        # leaf.name is an attribute, not a method; comparing it does not load the signature
        if leaf.name == name:
            yield leaf.data

Note that name doesn't necessarily match a filename. In that case it would need to load the leaf data:

def sig_by_filename(sbt, filename):
    for leaf in sbt.leaves():
        if leaf.data.filename == filename:
            yield leaf.data
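
The cost difference between the two lookups can be sketched with a stand-in class. `FakeLeaf` below is hypothetical and only mimics the lazy `data` loading described in this thread; it is not the sourmash `SigLeaf`:

```python
class FakeLeaf:
    """Stand-in for a leaf: `name` is cheap, `data` is loaded lazily."""

    loads = 0  # counts how many times any leaf's data was loaded

    def __init__(self, name, filename):
        self.name = name
        self._filename = filename
        self._data = None

    @property
    def data(self):
        if self._data is None:
            FakeLeaf.loads += 1  # simulate an expensive read from storage
            self._data = {"filename": self._filename}
        return self._data


def first_by_name(leaves, name):
    # Compares in-memory names; loads data only for the matching leaf.
    return next((leaf.data for leaf in leaves if leaf.name == name), None)


def first_by_filename(leaves, filename):
    # Has to load every leaf it inspects until it finds a match.
    return next(
        (leaf.data for leaf in leaves if leaf.data["filename"] == filename),
        None,
    )


def make_leaves():
    return [FakeLeaf("A", "a.fa"), FakeLeaf("B", "b.fa"), FakeLeaf("C", "c.fa")]


FakeLeaf.loads = 0
first_by_name(make_leaves(), "C")
loads_by_name = FakeLeaf.loads

FakeLeaf.loads = 0
first_by_filename(make_leaves(), "c.fa")
loads_by_filename = FakeLeaf.loads

print(loads_by_name, loads_by_filename)
```

Matching on the name touches storage once, while matching on the filename loads each leaf in turn, which is why storing the real name in the leaf metadata matters for large trees.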

@olgabot
Collaborator

olgabot commented May 27, 2020

> Note how filename, name and metadata are all the same! We can start setting name to be the actual name field from the signature, and this would avoid loading all the leaves data.

Ah! I noticed this in #925 and changed name to be the actual name because it was easier to keep track of A, B, C type names rather than a bunch of md5 hashes. Should I revert it?

@luizirber
Member

> Ah! I noticed this in #925 and changed name to be the actual name because it was easier to keep track of A, B, C type names rather than a bunch of md5 hashes. Should I revert it?

Don't revert, I think this should be the actual behavior =]

@ctb
Contributor

ctb commented May 31, 2020

So, for now @bluegenes, it looks like you will need to hack yourself some kind of dictionary to store the filenames of what you want to load, based on the JSON file. What I don't know offhand is if there is a simple API way to load a SigLeaf once you have the name/filename/md5sum/whatever.
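
For a zipped SBT, a rough sketch of getting at that JSON file with only the standard library. It assumes the archive contains exactly one member whose name ends in `.sbt.json`; the helper name `read_sbt_json` and the toy archive contents are made up for illustration:

```python
import io
import json
import zipfile


def read_sbt_json(zip_source):
    """Return the parsed index description from a zipped SBT.

    Assumes the archive holds exactly one member ending in '.sbt.json'.
    """
    with zipfile.ZipFile(zip_source) as zf:
        members = [n for n in zf.namelist() if n.endswith(".sbt.json")]
        if len(members) != 1:
            raise ValueError(f"expected one .sbt.json, found {len(members)}")
        with zf.open(members[0]) as fp:
            return json.load(fp)


# Build a tiny in-memory archive so the sketch runs without a real index.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo.sbt.json", json.dumps({"signatures": {"3": {"name": "A"}}}))
buf.seek(0)

info = read_sbt_json(buf)
print(info)
```

`read_sbt_json` accepts a path as well as a file-like object, so it could be pointed at an actual `.sbt.zip` file.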

Ref also #994 which allows for duplicate md5sums to be stored in an SBT.

@ctb
Contributor

ctb commented May 31, 2020

A bigger question might be whether we want to include any other metadata search functionality in Index. I think it'd be easy to add and pretty lightweight, and be quite helpful with things like adding more md5sum selector facilities.

@ctb
Contributor

ctb commented May 31, 2020

Hi @bluegenes here is a more specific partial answer to enable you to do what I think you need to do, like, yesterday :) ---

#! /usr/bin/env python                                                          
import sys
import sourmash

tree = sourmash.load_sbt_index(sys.argv[1])
print(tree._leaves)

# the ._leaves attribute has the in-memory component of the SBT,                
# which is quite small.                                                         
for k in tree._leaves:
    print(k, tree._leaves[k])
    leaf = tree._leaves[k]

    print('---')
    print('leaf key', (k,))

    # accessing .data will trigger load of signature; 'data' will be            
    # a SourmashSignature                                                       
    assert leaf._data is None
    print('   ', leaf.data.name(), len(leaf.data.minhash))

So, if you wanted to have an index of names etc, I would suggest doing something like this:

#! /usr/bin/env python
import sys
import sourmash

tree = sourmash.load_sbt_index(sys.argv[1])

sig_name_to_leaf_idx = {}

# the ._leaves attribute has the in-memory component of the SBT,
# which is quite small.
for k in tree._leaves:
    #print(k, tree._leaves[k])
    leaf = tree._leaves[k]

    #print('---')
    #print('leaf key', (k,))

    # accessing .data will trigger load of signature; 'data' will be
    # a SourmashSignature
    assert leaf._data is None
    
    #print('   ', leaf.data.name(), len(leaf.data.minhash))
    name = leaf.data.name()
    assert name not in sig_name_to_leaf_idx
    sig_name_to_leaf_idx[name] = k

#
# now save / load / whatever your sig_name_to_leaf_idx dict, which has
# string keys and integer values, and then
#

tree2 = sourmash.load_sbt_index(sys.argv[1])

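SEARCH = 'NC_007951.1 Burkholderia'
found_leaf_idx = None
for name, leaf_idx in sig_name_to_leaf_idx.items():
    if SEARCH in name:                    # found?
        # compare against None: leaf index 0 is a valid match
        assert found_leaf_idx is None, 'more than one name matches'
        found_leaf_idx = leaf_idx

assert found_leaf_idx is not None, 'no name matches'

# load me!
match_leaf = tree2._leaves[found_leaf_idx]
match_sig = match_leaf.data
print(match_sig.name(), len(match_sig.minhash))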

lmk if that helps - I'm sure luiz had it all in short-term memory, but I hadn't interacted with this code in a while :)
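
The "save / load / whatever" step in the script above could be sketched with the standard `json` module, since the dict has string keys and integer values. The helper names (`save_index`, `load_index`, `find_unique`) and the toy entries are hypothetical, chosen only for this example:

```python
import json
import os
import tempfile


def save_index(mapping, path):
    """Write the name -> leaf index dict as JSON."""
    with open(path, "w") as fp:
        json.dump(mapping, fp)


def load_index(path):
    """Read the name -> leaf index dict back from JSON."""
    with open(path) as fp:
        return json.load(fp)


def find_unique(mapping, query):
    """Return the leaf index whose name contains `query`.

    Raises LookupError when zero or several names match.
    """
    hits = [idx for name, idx in mapping.items() if query in name]
    if len(hits) != 1:
        raise LookupError(f"{len(hits)} names match {query!r}")
    return hits[0]


# Toy entries; the names and indices are made up.
toy = {"NC_007951.1 Burkholderia toy entry": 0, "another toy entry": 7}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sig_name_to_leaf_idx.json")
    save_index(toy, path)
    restored = load_index(path)

print(find_unique(restored, "NC_007951.1 Burkholderia"))
```

Collecting all hits before deciding avoids truthiness checks on the index, which matters because a leaf index of 0 is a legitimate result.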

@ctb
Contributor

ctb commented Jun 19, 2021

#1590 will support this, if we go forward with it; this is generically supported by manifests (in that PR).

@ctb
Contributor

ctb commented Jun 26, 2021

closed by #1590! Also see #1075.

@ctb ctb closed this as completed Jun 26, 2021