work needed to allow users to run analyses separate to pathogen workflows #74

jameshadfield · 2024-11-24T22:51:32Z

We recently had a general discussion (meeting notes, 2024-11-21/22) about what would be required for workflows to be runnable by external collaborators entirely from custom configs, i.e. no modification of anything in the pathogen repo. Here's a summary of what we discussed, specifically as it pertains to this guide. I don't think we're actually very far away from achieving this, and the aim of this summary is to help guide us to this goal and add things that we consider

Running workflows from a separate analysis directory. There are prototypes of this here and here, and the corresponding CLI prototype is here. Once these prototypes have solidified I expect the necessary Snakemake modifications will be added to this guide. While they still require the workflow repo directory to be managed by the user (git cloned, updated, etc.), I think that's enough for the purposes of this goal; if we can get the CLI to manage them then even better.
Adding private metadata & sequences to workflows (Provide a generic pattern for including additional user data alongside curated data #72). This is blocked on "merge: Support sequences" (augur#1579), but there's no reason we can't trial this out for metadata-only additions with augur merge right now. If we consider private metadata curation beyond the scope of this (I do) then we already have working script-based approaches (ncov, mpox) to follow.
Generalized subsampling. This is currently possible with general Snakemake rules. The only small hiccup is the need to nullify unused default subsampling names, because config dictionaries are merged rather than replaced.
Allow customisation of DTA columns etc. This guide doesn't specify the config style & Snakemake rules for these parts of the workflow, so in practice they'll most likely be copied from existing workflows. I think these are config-customisable by setting your own config["traits"]["columns"] (or similar). The desired customisations brought up in the recent meeting were all conceptually similar to this; they didn't involve toggling rules on/off or other more complex changes to the workflow. I'd consider this task "done" except for the ability to nullify values where dictionaries are used in the config (see subsampling section above).
Workflow versioning. When running analyses separate to the workflow it's crucial to make it clear when user configs are out-of-date and to provide a path to updating them. Conversely, it's desirable to know what effect a particular config value has, although that seems a harder problem to me and perhaps docs + config validation would achieve this. We can start by using one-off checks within code; however, linters and config schemas would be more powerful¹. We've talked forever about generating docs from schemas and perhaps that's a direction we could take for pathogen repos from day 1.
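To make the "nullify" hiccup in the subsampling and DTA items concrete, here is a minimal sketch of a recursive config merge in which a user config sets a key to null (None) to drop a default. This is an illustration under stated assumptions, not Snakemake's own merge code: the helper `merge_configs`, the sample names (`region`, `global`, `local`) and the "None removes the key" convention are all hypothetical.

```python
from copy import deepcopy


def merge_configs(defaults, overrides):
    """Recursively merge `overrides` into a copy of `defaults`.

    Hypothetical convention: an override value of None removes the key,
    which is one way a user could nullify an unused default sample.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if value is None:
            # Drop the default entirely instead of keeping a None value.
            merged.pop(key, None)
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dictionaries are merged, not replaced.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars and lists replace the default wholesale.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults shipped with a pathogen workflow.
defaults = {
    "subsampling": {
        "region": {"group_by": "region year", "max_sequences": 1000},
        "global": {"group_by": "country year", "max_sequences": 500},
    },
    "traits": {"columns": ["region"]},
}

# Hypothetical user config living in a separate analysis directory.
user_config = {
    "subsampling": {
        "region": None,  # nullify an unused default sample
        "local": {"group_by": "division", "max_sequences": 200},
    },
    "traits": {"columns": ["division", "country"]},
}

merged = merge_configs(defaults, user_config)
print(sorted(merged["subsampling"]))  # ['global', 'local']
print(merged["traits"]["columns"])    # ['division', 'country']
```

Without the None branch, the default `region` sample would survive the merge and still be run, which is the hiccup described above. List-valued settings such as config["traits"]["columns"] don't have this problem because they are replaced outright.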

¹ My experience with schemas in augur is that they are good at identifying invalid data but poor at explaining what's wrong, and therefore at hinting how to fix it. Presumably the schemas here will be simpler so the error messages may be more informative.
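As a rough idea of what a one-off check within code could look like for workflow versioning, here is a sketch that compares a version number in the user config against the version the workflow expects and says how to fix a mismatch. Everything here is an assumption for illustration: the `config_version` key, the `SUPPORTED_CONFIG_VERSION` constant and the wording of the messages do not come from any existing workflow.

```python
# Hypothetical: the config version this copy of the workflow understands.
SUPPORTED_CONFIG_VERSION = 2


class ConfigVersionError(Exception):
    """Raised when a user config does not match the workflow's config version."""


def check_config_version(config, supported=SUPPORTED_CONFIG_VERSION):
    """Return the config version, or raise with a hint on how to fix it."""
    found = config.get("config_version")  # hypothetical key name
    if found is None:
        raise ConfigVersionError(
            "Config has no 'config_version'; "
            f"add 'config_version: {supported}' after reviewing the changelog."
        )
    if found < supported:
        raise ConfigVersionError(
            f"Config is version {found} but the workflow expects {supported}; "
            "update the config following the workflow's changelog."
        )
    if found > supported:
        raise ConfigVersionError(
            f"Config is version {found} but the workflow only supports {supported}; "
            "update the workflow repo (e.g. git pull)."
        )
    return found


if __name__ == "__main__":
    try:
        check_config_version({"config_version": 1})
    except ConfigVersionError as err:
        print(f"ERROR: {err}")
```

A check like this only says that a config is out of date, not which values changed meaning; that part is where a schema or linter would presumably do better.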


genehack · 2024-11-27T18:53:32Z

Bit of a nitpick, but I think an important one in terms of problem scoping and definition. When you say:

what would be required for workflows to be runnable by external collaborators entirely from custom configs, i.e. no modification of the Snakefile.

…would it not be more accurate to say, "i.e., no modification of anything in the pathogen repo."?

huddlej · 2024-12-05T20:09:01Z

Regarding "Running workflows from a separate analysis directory", this analysis of seasonal flu data from Loes et al. 2024 is another example of how I've tried to set up an analysis that uses the seasonal-flu workflow but lives outside of that repo, as an external user's analysis might.

tsibley · 2024-12-05T21:38:58Z

@huddlej 👍 That's what the ncov tutorial does too.

victorlin · 2024-12-06T18:46:56Z

The ncov tutorial is based on similar usage by the old my_profiles directory, which is reflected in ncov's .gitignore.

jameshadfield added the "proposal" label (Proposals that warrant further discussion) Nov 24, 2024

jameshadfield mentioned this issue Dec 4, 2024

Allow different (multiple) inputs nextstrain/avian-flu#106

Closed
