work needed to allow users to run analyses separate to pathogen workflows #74

Open
5 tasks
jameshadfield opened this issue Nov 24, 2024 · 4 comments
Labels
proposal Proposals that warrant further discussion

Comments

@jameshadfield
Member

jameshadfield commented Nov 24, 2024

We recently had a general discussion (meeting notes, 2024-11-21/22) about what would be required for workflows to be runnable by external collaborators entirely from custom configs, i.e. no modification of anything in the pathogen repo. Here's a summary of what we discussed, specifically as it pertains to this guide. I don't think we're actually very far away from achieving this, and the aim of this summary is to help guide us to this goal and add things that we consider

  • Running workflows from a separate analysis directory. There are prototypes of this here and here, and the corresponding CLI prototype is here. Once these prototypes have solidified I expect the necessary snakemake modifications will be added to this guide. While they still require the workflow repo directory to be managed by the user (git cloned, updated etc), I think that's enough for the purposes of this goal; if we can get the CLI to manage them then even better.

  • Adding private metadata & sequences to workflows (Provide a generic pattern for including additional user data alongside curated data #72). This is blocked on "merge: Support sequences" (augur#1579), but there's no reason we can't trial this out for metadata-only additions with augur merge right now. If we consider private metadata curation beyond the scope of this (I do) then we already have working script-based approaches (ncov, mpox) to follow.

  • Generalized subsampling. This is currently possible with general Snakemake rules. The only small hiccup is the need to nullify unused default subsampling names because of the config dictionary merge.

  • Allow customisation of DTA columns etc. This guide doesn't specify the config style & snakemake rules for these parts of the workflow so in practice they'll most likely be copied from existing workflows. I think these are config-customisable by setting your own config["traits"]["columns"] (or similar). The desired customisations brought up in the recent meeting were all conceptually similar to this; they didn't involve toggling rules on/off or other more complex changes to the workflow. I'd consider this task "done" except for the ability to nullify values where dictionaries are used in the config (see subsampling section above).

  • Workflow versioning. When running analyses separate to the workflow it's crucial to make it clear when user configs are out-of-date and provide a path to updating them. Conversely, it's desirable to know what effect a particular config value has, although that seems a harder problem to me and perhaps docs + config validation would achieve this. We can start by using one-off checks within code; however, linters and config schemas would be more powerful¹. We've talked forever about generating docs from schemas and perhaps that's a direction we could take for pathogen repos from day 1.
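
To make the nullification hiccup from the subsampling and DTA points concrete, here's a minimal sketch in plain Python. It assumes the user config is merged recursively into the workflow defaults (`deep_merge` below is a stand-in written for this example, not Snakemake's own code), and that a YAML `null` arrives as `None`. The config keys (`subsample`, `global`, `recent`, `focal`, `max_sequences`) and the `active_samples` helper are hypothetical names, not taken from any existing workflow.

```python
from copy import deepcopy


def deep_merge(base, overlay):
    """Return a copy of base with overlay merged in: dicts merge key-by-key, other values replace."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def active_samples(config):
    """Return only the subsampling names that have not been nullified."""
    return {
        name: params
        for name, params in config["subsample"].items()
        if params is not None
    }


# Hypothetical defaults shipped with the workflow repo.
default_config = {
    "subsample": {
        "global": {"max_sequences": 1000},
        "recent": {"max_sequences": 500},
    },
    "traits": {"columns": ["country"]},
}

# Hypothetical user config living in a separate analysis directory.
user_config = {
    "subsample": {
        "global": None,  # nullify a default we don't want
        "recent": None,  # nullify a default we don't want
        "focal": {"max_sequences": 200},
    },
    "traits": {"columns": ["region", "host"]},  # lists replace, so no nullifying needed
}

config = deep_merge(default_config, user_config)
print(sorted(active_samples(config)))
print(config["traits"]["columns"])
```

Without the two `None` entries the merged config would still contain `global` and `recent`, which is the hiccup described above. List-valued settings such as `config["traits"]["columns"]` don't have this problem under these assumptions because the user's list replaces the default outright.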

¹ My experience with schemas in augur is that they are good at identifying invalid data but poor at explaining what's wrong, and therefore poor at hinting how to fix it. Presumably the schemas here will be simpler so the error messages may be more informative.
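
As a rough illustration of the "one-off checks within code" starting point for workflow versioning, here's a minimal sketch that reports out-of-date configs with messages that say how to fix them. Everything named here is an assumption for the example: the `config_version` key, the expected version number, the old `dta_columns` key and the "workflow changelog" mentioned in the messages are all hypothetical.

```python
# Hypothetical: the config format version this workflow expects.
EXPECTED_CONFIG_VERSION = 2

# Hypothetical: keys renamed in earlier versions, mapped to their replacements.
RENAMED_KEYS = {"dta_columns": "traits.columns"}


def check_config(config):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []

    version = config.get("config_version")
    if version is None:
        problems.append(
            f"Missing 'config_version'; this workflow expects version {EXPECTED_CONFIG_VERSION}."
        )
    elif version != EXPECTED_CONFIG_VERSION:
        problems.append(
            f"Config is version {version} but the workflow expects "
            f"{EXPECTED_CONFIG_VERSION}; see the workflow changelog for what changed."
        )

    for old_key, new_key in RENAMED_KEYS.items():
        if old_key in config:
            problems.append(f"'{old_key}' is no longer used; set '{new_key}' instead.")

    traits = config.get("traits")
    columns = traits.get("columns") if isinstance(traits, dict) else None
    if columns is not None and not (
        isinstance(columns, list) and all(isinstance(c, str) for c in columns)
    ):
        problems.append(
            "'traits.columns' must be a list of metadata column names, e.g. ['region', 'host']."
        )

    return problems


# An out-of-date user config: no version, an old key, and a string where a list is expected.
outdated = {"dta_columns": "region", "traits": {"columns": "region"}}
for problem in check_config(outdated):
    print("config error:", problem)
```

Collecting every problem before reporting, rather than failing on the first, is a design choice aimed at giving users a single list of fixes. A schema-based approach could replace the hand-written checks later while keeping the same style of message.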

@jameshadfield jameshadfield added the proposal Proposals that warrant further discussion label Nov 24, 2024
@genehack
Contributor

Bit of a nitpick, but I think an important one in terms of problem scoping and definition. When you say:

what would be required for workflows to be runnable by external collaborators entirely from custom configs, i.e. no modification of the Snakefile.

…would it not be more accurate to say, "i.e., no modification of anything in the pathogen repo."?

@huddlej

huddlej commented Dec 5, 2024

Regarding "Running workflows from a separate analysis directory", this analysis of seasonal flu data from Loes et al. 2024 is another example of how I've tried to set up an analysis that uses the seasonal-flu workflow but that lives outside of that repo, as another external user's might.

@tsibley
Member

tsibley commented Dec 5, 2024

@huddlej 👍 That's what the ncov tutorial does too.

@victorlin
Member

ncov tutorial is based on similar usage by the old my_profiles directory, reflected in ncov's .gitignore


5 participants