Add the ability to parse a DTS transfer manifest into bulk import specifications #200
Comments
In general, I think this is fine. I have some thoughts, in no particular order. Just so I understand the whole workflow / user story here:
I think other folks have gone over the main constraint enough - the Narrative UI is basically inoperable for > 500 data objects at a time. This is mostly because our data import MVP never really got past MVP. The next step would be to handle that with some UI work, which should be part of this project if going through the Narrative UI is the best way. Although, I'd have to say that if you're trying to manually validate > 500 lines in anything - an xSV file or a narrative import cell - that sounds like a really tedious job, and probably prone to some user error anyway. Still, there's enough automated error detection in there for missing or malformed strings and the like that it might be useful.

Anyway, I see a few options here, with some pros and cons to each.

1. Go as outlined and use the bulk import cell. This will entail the following:
   a. Address the UI issues around having lots of file inputs at a time. I can't really make a good estimate of how much work this will be, but I don't think it's entirely trivial. Having the UI paginate the inputs down to, say, 20-100 at a time won't be terrible, but might have hidden UX caveats (see the sketch after this comment).

2. Work outside the Narrative interface. Another option is to skip the narrative altogether, or come up with some other interface. The bulk import cell is really a fancy code generator - it generates a set of inputs that start a bulk app job runner, which is an API call to the execution engine. The bulk import cell has a bunch of other helpful features, like input validation, job monitoring, and restarting. But in the end, it just sends off a ton of import jobs to the execution engine and watches them run, and we wind up with data in the narrative. Short-cutting that would also be clunky as hell and hard to monitor, but I'd be remiss if I didn't put it as an option. :)
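On the pagination idea in option 1a, the chunking itself is simple (sketched here in Python purely for illustration; the real UI work is in JavaScript and in the UX around it):

```python
def paginate(items, page_size=50):
    """Yield successive fixed-size pages from a list of import inputs."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]

# e.g. render 2,500 staged files as 50 pages of 50 inputs each
pages = list(paginate([f"file_{i}.fastq" for i in range(2500)], page_size=50))
assert len(pages) == 50
```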
Really I think a few steps here would handle most of this task:
There's also the issue of adding GFF + FASTA genome imports to the staging service, and what ripple effects that'll have on the interface, but if that's part of the MVP we have to handle that anyway.
Thanks for the thoughtful analysis, Bill! I think I can tackle item 1 fairly easily as long as we decide specifically on input (
Just to be clear, it's not the staging service that's the issue, it's the narrative import UI. I would guess the staging service could handle 100k-item manifests with no issue.
My take on this is that it's putting way too much onus on the user and adding a lot of complication when it's not necessary. If there's enough information in the DTS manifest to parse it into a bulk import manifest, there's enough info to just kick off jobs directly. What I would do is:
I agree completely that my proposal isn't great from a user's perspective. I was assuming that no one besides myself had a lot of effort to throw at this task, so I tried to figure out a way to elbow the DTS manifest into the existing machinery without adding another service. But if the lightweight service you mention could be added without disrupting anything (by me or someone else), and if there's no perceived need for a user to be able to "pick through things" before doing bulk imports, I'd love to go in the direction you suggest.
By the way, we're focused on manifests as import instructions to avoid organization-specific machinery like KBase API calls. We want to see how far we can get in the direction of an "organization-agnostic" service.
My guess is that updating the staging service to parse the manifest, plus fixing the narrative UI, would be a comparable (maybe greater) amount of work compared to putting up a simple FastAPI service to parse the manifest and kick off a job, but of course I'm looking at this from a very high-level view. Although... it might not need to be a service per se; it could just be a daemon. But making it a service makes it easier to monitor.
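For illustration, here's a minimal sketch of the kind of lightweight service described above. The staging mount point, the route, and the `start_bulk_import` hook are all assumptions, not the actual DTS or execution engine API; the real call into the execution engine would replace the placeholder.

```python
import json
from pathlib import Path

from fastapi import FastAPI, HTTPException

app = FastAPI()

STAGING_ROOT = Path("/staging")  # hypothetical mount point for user staging areas


def start_bulk_import(username: str, manifest: dict) -> str:
    """Placeholder for the call that kicks off import jobs in the execution engine."""
    raise NotImplementedError


@app.post("/import/{username}/{transfer_id}")
def import_transfer(username: str, transfer_id: str) -> dict:
    """Parse a transfer's manifest.json and launch a bulk import job."""
    manifest_path = STAGING_ROOT / username / transfer_id / "manifest.json"
    if not manifest_path.exists():
        raise HTTPException(status_code=404, detail="manifest.json not found")
    manifest = json.loads(manifest_path.read_text())
    job_id = start_bulk_import(username, manifest)
    return {"job_id": job_id}
```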
If creating a new process (daemon or service) is practical, let's do it. In my experience, hacking on existing infrastructure and changing the way things work there is much riskier than building something for a specific purpose.
I'm just catching up here. It seems like it would be ideal to reduce or remove the user interaction here, especially if the user already knows what external data they're transferring AND there's enough information about each data object to run each importer. I guess I'm saying that if it's possible to minimize the user having to sift through hundreds of lines of imports, we should.

That could be either automatic or not. Automatic would be the service that @MrCreosote describes - it watches the DTN for some directory that hosts the manifest files and launches the bulk import job based on those. A non-automatic way would be for a user to click a button to go through it somewhere, probably in the Narrative import panel. It could give a quick digest when it starts - maybe just a count of how many of each object type - and tell you how to get job status before starting up.

That might be a slightly unnecessary step, and could be decided upstream - the user selects which narrative to import data into in the DTS service, and it could do the copying to staging and start the jobs. Job monitoring, error handling, and restarts might need to be handled elsewhere - possibly via some modifications to the job browser or something in the narrative.
A few things we need to consider with whatever solutions are proposed:
It would be ideal if we could avoid sticking KBase-specific things (like narratives) into the DTS, since the DTS isn't a KBase-specific thing, but we could tuck this information into the manifest if we needed to.
We discussed this at our weekly DTS "heartbeat" meeting, and here's what we'd like to do:
So far, we've been dancing around the search/request process, avoiding discussing it in detail. I think it's now time to start nailing things down. I'd like to put together a Python DTS client that exposes the functionality we need to search for files and request specific IDs for transfer. I think this would help us understand the space of search and request parameters/filters and would allow us to start stress-testing the DTS without a lot of UI-specific time and effort. Does this seem like a reasonable course to you, @MrCreosote and @briehl?
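As a starting point, here's a skeleton of what such a client might look like. Everything below is a sketch: the endpoint paths, parameter names, and response shapes are placeholders, not the actual DTS API.

```python
import requests


class DTSClient:
    """Thin wrapper around a hypothetical DTS HTTP API (endpoint names are placeholders)."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {token}"

    def search_files(self, query: str, database: str, limit: int = 100) -> list[dict]:
        """Search an upstream database for files matching a query."""
        resp = self.session.get(
            f"{self.base_url}/files",
            params={"query": query, "database": database, "limit": limit},
        )
        resp.raise_for_status()
        return resp.json()

    def request_transfer(self, file_ids: list[str], destination: str) -> str:
        """Request a transfer of specific file IDs; returns a transfer ID."""
        resp = self.session.post(
            f"{self.base_url}/transfers",
            json={"file_ids": file_ids, "destination": destination},
        )
        resp.raise_for_status()
        return resp.json()["id"]
```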
There are libraries that allow you to react to events in a filesystem, like a file getting added to a directory.
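For example, with the Python watchdog library, a daemon could react to a manifest.json landing in a staging directory roughly like this (the watched path and the handler body are placeholders):

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class ManifestHandler(FileSystemEventHandler):
    """React to manifest.json files appearing under a watched staging directory."""

    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith("manifest.json"):
            print(f"new manifest detected: {event.src_path}")  # placeholder: kick off the import here


observer = Observer()
observer.schedule(ManifestHandler(), path="/staging", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```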
Now that the Data Transfer Service is ostensibly working, it would be convenient for the staging service to be able to parse the contents of a `manifest.json` file (deposited in a transfer-specific directory within a KBase user's staging area) into one or more xSV files that can then be imported using the existing bulk import functionality. I'm using this issue to track my investigation of how we might add such a feature.

Assessment
The KBase Narrative itself defines a `SetupImportCells` function that interacts with the staging service to parse files for import, including:

Downstream, this function calls a `bulkSpecification` method, which is mapped to the staging service's `bulk_specification` endpoint here. Upstream, the function is called by the KBase staging area viewer UI code here.
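For orientation, that downstream hop bottoms out in an HTTP call to the staging service. A rough sketch follows, with the base URL and query shape assumed rather than taken from the staging service code:

```python
import requests

# Assumed base URL and query shape; check the staging service docs for the real parameters.
STAGING_URL = "https://kbase.us/services/staging_service"


def fetch_bulk_specification(token: str, files: list[str]) -> dict:
    """Ask the staging service to parse xSV import specification files."""
    resp = requests.get(
        f"{STAGING_URL}/bulk_specification",
        params={"files": ",".join(files)},
        headers={"Authorization": token},
    )
    resp.raise_for_status()
    return resp.json()
```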
Proposed approach
We could modify the `SetupImportCells` function to add support for the intermediate parsing of a DTS transfer manifest, producing a set of import specification spreadsheets. Currently, the function produces its "import app cells" in 3 stages:

We could modify this procedure to inject logic that parses a DTS transfer manifest if one is found among the selected set of files (passed to the function in `fileInfo`):

We would have to add another endpoint to the staging service to support stage 2 (sketched below), but this seems easier than modifying existing endpoints and changing their assumptions. I'll continue to update this issue as I find out more.
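To make that stage-2 endpoint concrete, here's a rough sketch of its core logic. The manifest shape ("resources" entries with "path" and "type" fields) and the output columns are invented for illustration; the real `manifest.json` schema would drive both.

```python
import csv
import json
from collections import defaultdict
from pathlib import Path


def manifest_to_import_specs(manifest_path: Path, out_dir: Path) -> list[Path]:
    """Split a DTS transfer manifest into one CSV import specification per data type.

    Assumes a hypothetical manifest shape: {"resources": [{"path": ..., "type": ...}]}.
    """
    manifest = json.loads(manifest_path.read_text())
    by_type: dict[str, list[dict]] = defaultdict(list)
    for resource in manifest["resources"]:
        by_type[resource["type"]].append(resource)

    written = []
    for data_type, resources in by_type.items():
        spec_path = out_dir / f"{data_type}_import.csv"
        with spec_path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["file_path", "name"])  # columns invented for illustration
            for resource in resources:
                writer.writerow([resource["path"], Path(resource["path"]).stem])
        written.append(spec_path)
    return written
```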
Risks / Constraints
Because this approach is incremental and depends on the existing import specification machinery in the staging service, it carries with it all of the related limitations and drawbacks:
I'm more concerned about item 1 (can we improve the performance of the staging service without a ton of work?). But I assume that we'll eventually want something a bit more automated or less finicky for users to work with.