Proposal: Importing a DTS Payload to a KBase Narrative #79
Comments
I think this proposal looks reasonable, based on previous discussions, but I do have a couple of comments and questions. IMO step 3 (user selects a manifest file and that makes a spreadsheet) doesn't really need user interaction. If all the information is already present to make the import spec spreadsheet (and thus the bulk import cell), there are a couple of options for automation already. One thought is that the staging service can detect that a manifest.json file was uploaded, and can parse it into an import spec automatically without user intervention. Alternately, we could just include the manifest files as allowed Import Specification files (along with csv, tsv, and excel). So the user would just select the manifest.json file, select "Import Specification" as the file type, and have it create the import cell from there. That would take a little modification to the staging service, I guess. That all depends on what's in the `instructions` field.
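For the first option, here's a minimal sketch of what a watcher could look like in Python. Everything here (paths, function names, the hand-off) is an assumption for illustration, not the staging service's actual internals:

```python
# Hypothetical sketch: poll the staging area for newly arrived DTS manifests
# and hand each one to an import-spec generator. All names and layout here
# are assumed, not the staging service's real API.
import json
import time
from pathlib import Path

STAGING_ROOT = Path("/staging")  # assumed mount point for user staging areas

def generate_import_spec(payload_dir: Path, instructions: dict) -> None:
    """Stub: translate `instructions` into a bulk import spec spreadsheet."""
    ...

def watch_for_manifests(poll_seconds: float = 30.0) -> None:
    seen = set()
    while True:
        for manifest in STAGING_ROOT.glob("**/manifest.json"):
            if manifest in seen:
                continue
            seen.add(manifest)
            instructions = json.loads(manifest.read_text()).get("instructions", {})
            generate_import_spec(manifest.parent, instructions)
        time.sleep(poll_seconds)
```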
Thanks, Bill! In the short term, let's just populate the `instructions` field with the information needed to generate the import spec.
I think we should definitely include file types and whatever other fields are required as part of the corresponding bulk import spec. I put some user interactivity into Step 3 above to allow a user to override the defaults for fields based on the knowledge in their heads that doesn't reside anywhere else. Much of the information in the non-required fields in the import spec spreadsheets isn't available in the file metadata (at least not the stuff we get from the JGI Data Portal), so I don't think we have the means to populate the other fields in the `instructions` field.

EDIT: I just noticed that I put all the fields in the import template spreadsheets under "required fields" in the description of this issue. Zach Crockett and I think that only the file path is required for each of these templates, and that there are reasonable defaults in place for the other fields.
Thanks for the review @briehl. Thoughts on who on the KBase side would be best able to accomplish an MVP on this, and how long it might take? 1 sprint?
I think making a UI for that would be non-trivial. IMO we can leave that blank in the generated spreadsheet, and either alert the user about it on creation, or just let them find out when they try to upload. They can always download that spreadsheet, alter it as they see fit, and re-upload.
Here's a schema-lite of the inputs for the various functions that are relevant to IMG data, like we discussed yesterday, and are the 6 data types listed in your original post above. In general, I think an instructions block that looks like this would work best:

```json
"instructions": {
    "data_type": "genbank_genome",
    "parameters": {
        "param1": "value1",
        "param2": "value2"
        ...etc
    }
}
```

This will be a mix of tricky / redundant for some of the multi-file types (gff+fasta genome, gff+fasta metagenome, non-interleaved reads), and might contain some redundant information, but that might be ok to start with. I've put a `*` in front of the required fields. Here's the list, and I'll follow up with a more realistic example:

Assembly

```json
{
    "*staging_file_subdir_path": "str, file path",
    "*assembly_name": "str, object_name",
    "type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct', 'metagenome']",
    "min_contig_length": "int"
}
```

Genbank genome

```json
{
    "*staging_file_subdir_path": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'mag', 'sag', 'virus', 'plasmid', 'construct']",
    "source": "str, ['RefSeq user', 'Ensembl user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "scientific_name": "str",
    "generate_ids_if_needed": "str",
    "generate_missing_genes": "str"
}
```

GFF+FASTA genome

```json
{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "genome_type": "str, ['draft isolate', 'finished isolate', 'fungi', 'mag', 'other Eukaryote', 'plant', 'sag', 'virus', 'plasmid', 'construct']",
    "scientific_name": "str",
    "source": "str, ['RefSeq user', 'Ensembl user', 'JGI', 'Other']",
    "taxon_wsname": "str",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}
```

GFF+FASTA metagenome

```json
{
    "*fasta_file": "str, file path",
    "*gff_file": "str, file path",
    "*genome_name": "str, object_name",
    "source": "str, ['EBI user', 'IMG user', 'JGI user', 'BGI user', 'Other']",
    "release": "str",
    "genetic_code": "int",
    "generate_missing_genes": "str"
}
```

Interleaved FASTQ reads

```json
{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```

Noninterleaved FASTQ reads

```json
{
    "*fastq_fwd_staging_file_name": "str, file path",
    "*fastq_rev_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```

SRA reads

```json
{
    "*sra_staging_file_name": "str, file path",
    "*name": "str, object_name",
    "sequencing_tech": "str, ['Illumina', 'PacBio CLR', 'PacBio CCS', 'IonTorrent', 'NanoPore', 'Unknown']",
    "single_genome": "str",
    "read_orientation_outward": "str",
    "insert_size_std_dev": "float",
    "insert_size_mean": "float"
}
```

A simple example of just the required values for, say, a genbank genome would be:

```json
"instructions": {
    "data_type": "genbank_genome",
    "parameters": {
        "staging_file_subdir_path": "path/to/some_genome.gbk",
        "genome_name": "some_genome"
    }
}
```
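For what it's worth, here's a quick Python sketch of checking an instructions block against the starred fields above. The required-field sets are transcribed from the lists; the `data_type` keys other than `genbank_genome` are hypothetical placeholders for whatever identifiers we settle on:

```python
# Sketch only: required fields per data type, transcribed from the
# schema-lite above ("*"-prefixed entries). Keys besides "genbank_genome"
# are assumed names, not settled identifiers.
REQUIRED = {
    "assembly": {"staging_file_subdir_path", "assembly_name"},
    "genbank_genome": {"staging_file_subdir_path", "genome_name"},
    "gff_fasta_genome": {"fasta_file", "gff_file", "genome_name"},
    "gff_fasta_metagenome": {"fasta_file", "gff_file", "genome_name"},
    "fastq_reads_interleaved": {"fastq_fwd_staging_file_name", "name"},
    "fastq_reads_noninterleaved": {
        "fastq_fwd_staging_file_name",
        "fastq_rev_staging_file_name",
        "name",
    },
    "sra_reads": {"sra_staging_file_name", "name"},
}

def missing_required(instructions: dict) -> set:
    """Return the required parameter names absent from an instructions block."""
    required = REQUIRED.get(instructions.get("data_type"), set())
    return required - set(instructions.get("parameters", {}))

# e.g. missing_required({"data_type": "genbank_genome", "parameters": {}})
# returns {"staging_file_subdir_path", "genome_name"}
```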
There are a couple of additional things here.
Thanks for the sketch, @briehl ! I think the redundancy is just fine, and I like that you've pushed the unstructured part of the schema into a `parameters` object.

What do you think about payloads that include several objects? A while back Zach did some digging and found some use cases that had 1-2 objects and some that had significantly more. How do you feel about something like this?

```json
"instructions": {
    "protocol": "KBase narrative import",
    "objects": [
        {
            "data_type": "genbank_genome",
            "parameters": {
                "staging_file_subdir_path": "path/to/some_genome.gbk",
                "genome_name": "some_genome"
            }
        },
        ...
        {
            "data_type": "genbank_genome",
            "parameters": {
                "staging_file_subdir_path": "path/to/some_other_genome.gbk",
                "genome_name": "some_other_genome"
            }
        }
    ]
}
```

(I added a `protocol` field to make it clear which import protocol these instructions follow.)

This structure implies, of course, that we allow more than one data type per payload. In principle this is nice and makes all kinds of things easy, but I'm not sure how it looks in practice.
I think that might work? Does this mean putting the `instructions` field at the top level of the manifest, rather than attaching instructions to each file resource?
Yes, this was my original intent. Sorry if I wasn't clear about that! Early on, we were thinking that we would restrict file transfers to single file types for simplicity. But if it's practical to Import All The Things in one manifest, the `objects` list gives us a way to do that. We're trying to make the DTS independent of any organization's internal workings, and separating the `instructions` field from the file metadata helps with that. Let me know if you want to chat more about this, and do feel free to change what I've proposed in the structure above.
Also, my example above doesn't take advantage of the fact that the manifest includes the metadata for each file in its `resources` field, which the instructions could refer to instead of repeating file paths.
I see! Yeah, that makes sense. So, just to summarize: the `instructions` field sits at the top level of the manifest, alongside `resources`, and holds a `protocol` plus a list of `objects`. So the structure is (roughly):

```json
{
    "name": "manifest",
    "resources": [{
        "id": "foo",
        "name": "bar",
        ...etc
    }, ...],
    "instructions": {
        "protocol": "KBase narrative import",
        "objects": [...what you put above]
    }
}
```

Yeah, I agree that we could do some kinds of referencing to data sources from within the instructions field, but I'm not sure how best to do that in a way that would be easy to interpret. Still, this makes sense, and should in fact be easier to process. I had misinterpreted it as having the instructions attached to each file resource.

One thing that could be done, then, is to just put all the parameters for each data type in a list, so something like:

```json
"instructions": {
    "protocol": "KBase narrative import",
    "objects": {
        "genbank_genome": [
            {
                "staging_file_subdir_path": "path/to/some_genome.gbk",
                "genome_name": "some_genome"
            },
            {
                "staging_file_subdir_path": "path/to/some_other_genome.gbk",
                "genome_name": "some_other_genome"
            }
        ]
    }
}
```

But I see the value in having a list of discrete objects, too, to keep them independent and to give each one an explicit `data_type`.
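To make the tradeoff concrete, here's a tiny hypothetical sketch that flattens either shape into the same `(data_type, parameters)` pairs; it's illustrative only, not proposed API:

```python
# Illustrative only: consume either proposed "objects" shape.
def normalize(objects):
    """Yield (data_type, parameters) pairs from either structure."""
    if isinstance(objects, list):
        # Discrete-objects form: each entry carries its own "data_type".
        for obj in objects:
            yield obj["data_type"], obj["parameters"]
    else:
        # Grouped-by-type form: the "data_type" is the dict key.
        for data_type, params_list in objects.items():
            for params in params_list:
                yield data_type, params
```

Either way the consumer ends up in the same place; the discrete list just keeps each object self-describing.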
This is a proposal to modify the KBase file staging service and its UI to recognize and parse DTS manifests and import their contents to KBase Narratives. The CE/DTS team believes this is the simplest approach to getting data into KBase from other organizations using the DTS.
Proposed Workflow
1. A user requests a transfer of data from a supported organization to KBase via the DTS.
2. The DTS delivers the payload to a folder in the user's KBase staging area. The payload includes a `manifest.json` file with file metadata and an `instructions` field conveying any information needed by the file staging service to import the files into a KBase Narrative.
3. The user selects the `manifest.json` file within the payload folder. This triggers a process within the file staging service to generate a bulk import spreadsheet appropriate for the payload. This process uses the `instructions` field within `manifest.json` to determine the appropriate file type for the spreadsheet, etc., and extracts the names of the files, using them to populate the generated spreadsheet, which appears in the user's staging directory, either within the DTS payload folder or at the top level. (A sketch of this step follows.)
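As a rough illustration of step 3, a generator along these lines could translate a manifest's instructions into import-spec TSVs, one per data type. This is a sketch under assumed structure (the `objects` list from the discussion above, and a simplified column scheme; the real import templates have fixed, per-type columns):

```python
# Hypothetical sketch of the spreadsheet-generation step.
import csv
import json
from collections import defaultdict
from pathlib import Path

def write_import_specs(manifest_path: Path) -> list:
    """Write one import-spec TSV per data type found in the manifest."""
    manifest = json.loads(manifest_path.read_text())
    by_type = defaultdict(list)
    for obj in manifest["instructions"]["objects"]:
        by_type[obj["data_type"]].append(obj["parameters"])

    written = []
    for data_type, rows in by_type.items():
        out_path = manifest_path.parent / f"{data_type}_import_spec.tsv"
        columns = sorted({key for row in rows for key in row})
        with out_path.open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=columns, delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)  # absent optional fields are left blank
        written.append(out_path)
    return written
```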
The `instructions` field in the DTS manifest is itself a JSON object that can store any required information, so its content can be tailored to any of the supported import spreadsheet templates. Down the line, we may develop something more structured if it makes sense, but we can assume for the moment that a KBase-Narrative-specific set of instructions is acceptable.

Notes From Previous Discussions
Some recent changes introduce an `instructions` field in the DTS manifest that contains a JSON object that we can use to convey information about how the payload should be treated once it reaches its destination. This issue is an exploration of a "protocol" we might use to import the contents of a payload to a KBase narrative. This import process needs information from the requesting user, because the metadata associated with the files in a payload is clearly insufficient for figuring out how the content should be translated into a narrative.

To be clear: the specification of any such protocol is not part of the DTS itself; the `instructions` field is simply a feature offered by the DTS to support the creation of these protocols. But this seems like a reasonable place to record our thoughts.

Provocative Questions
- If we can settle on reasonable defaults (stored in the `instructions` field, say), we can write a program that monitors the KBase file staging area for payloads, parses manifests, and generates import template spreadsheets with these defaults inserted. This might be all we need to get off the ground.
- The `instructions` field in the manifest can refer to an arbitrary JSON object. In the case of a KBase payload to be imported to a narrative, we can have a JSON object with a single `type` field, whose value can be one of (say) `assembly`, `genbank`, `gff-fasta`, `fastq-interleaved`, `fastq-noninterleaved`, `sra`. A minimal example follows.
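Under that scheme, the simplest possible instructions object would look something like this (shown as a Python literal purely for illustration; any keys beyond `type` are assumptions about where per-template defaults might live):

```python
# Simplest single-"type" instructions object; the "type" values come from
# the list above. The extra key is speculative, shown only as a placeholder
# for per-template defaults.
instructions = {
    "type": "assembly",
    # "min_contig_length": 500,  # hypothetical per-template default
}
```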
Types of KBase payloads
Assembly import
Required fields for each file:
- `staging_file_subdir_path`: FASTA file path
- `assembly_name`
- `type`
- `min_contig_length`
TSV import template
Genbank genome import
Required fields for each file:
- `staging_file_subdir_path`
- `genome_name`
- `genome_type`
- `source`
- `scientific_name`
- `release`
- `genetic_code`
- `generate_ids_if_needed`
- `generate_missing_genes`
TSV import template
GFF+FASTA genome and metagenome import
Required fields for each file:
- `fasta_file`
- `gff_file`
- `genome_name`
- `source`
- `release`
- `genetic_code`
- `generate_missing_genes`
TSV import template
Interleaved FASTQ reads import
Required fields for each file:
- `fastq_fwd_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`
TSV import template
Noninterleaved FASTQ reads import
Required fields for each file pair:
- `fastq_fwd_staging_file_name`
- `fastq_rev_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`
TSV import template
SRA reads import
Required fields for each file:
- `sra_staging_file_name`
- `name`
- `sequencing_tech`
- `single_genome`
- `read_orientation_outward`
- `insert_size_std_dev`
- `insert_size_mean`
TSV import template