Examples page #7

Merged
merged 113 commits into from
Jul 5, 2023
Conversation

Yoshanuikabundi
Contributor

@Yoshanuikabundi Yoshanuikabundi commented May 31, 2023

Progress

  • Implement pre-processing for colab
  • Implement pre-processing for zip download
  • Implement pre-processing for sphinx
  • Implement Sphinx extension
  • Take Python bits that pre-process notebooks (proc_examples.py and its dependencies), separate them from changes to Sphinx, and stick them in a second PR (Pre-process notebooks from other repositories and cache in a branch #8)
  • Write GHA to manually deploy processed notebooks to a branch in second PR and merge it
  • Check the above action works
  • Get RTD building from the above in this PR
  • Update source notebooks (in toolkit, interchange, nagl) to include:
    • Good titles
      • Toolkit
      • Interchange
      • NAGL
    • Thumbnails
      • Toolkit
      • Interchange
      • NAGL
    • Category metadata
      • Toolkit
      • Interchange
      • NAGL
    • Remove per-notebook installation instructions
      • Toolkit
      • Interchange
      • NAGL
    • Fix broken links
      • Toolkit
      • Interchange
      • NAGL
    • Follow NGLView best practices and warn about HTML failings (e.g. show_file, trajectories, etc.)
      • Toolkit
      • Interchange
      • NAGL
    • Get the above into releases
      • Toolkit
      • Interchange
      • NAGL
  • Implement category system in Sphinx extension
  • Choose categories and author examples.md
  • Finish implementing thumbnail handling
  • Ensure everything works on RTD in this PR
    • Sphinx-rendered notebooks
      • Colab link
      • GitHub link
      • Download link
      • nglview
    • gallery of notebooks
      • categories
      • thumbnails
      • titles
      • CSS
  • Support PRs/branches
    • GHA
    • RTD
    • colab
  • add quarterly calendar event to clean up cache branch
  • clean up cache branch when PR is closed
  • Write automation for running the cache update action - nightly cron job
  • Trigger RTD build when cache update action runs on main
  • Source repo on gallery things
  • Warn about experimental notebooks on rendered page
  • Merge this PR and tell the world!
  • (stretch goal) Update theme CSS as needed to beautify the above
    • .alert styled like .admonition
    • Top-of-notebook-links styled as nice, spread-out buttons
  • (stretch goal) Switch away from git branch for myst-nb after update is released (UPGRADE: myst-parser 1.0 executablebooks/MyST-NB#479)
  • (stretch goal) Add prose to notebooks that are missing it
  • (stretch goal) Write specific instructions for how to run examples locally
  • (stretch goal) Distribute solved Conda environments
  • (stretch goal) Tags and genindex
  • (stretch goal) Remove NGLView JS from this repo once it's in an NGLView release
  • (stretch goal) Don't store history on _cookbook_data branch
  • (stretch goal) Split the examples Conda environment into what's needed to process the examples vs what's needed to run them and combine with mamba env update
  • (stretch goal) Get fragmenter and qcsubmit working
    • Leave open PR for testing

Goals and motivation

This is going to be a one-stop shop for all the examples, tutorials, and cookbooks in the OpenFF world.

Rendered

Notes for discussion about how to design this:

I would like:

  • All the examples in one place
  • A place for examples that don't fit neatly into a project
  • Searchable, easy, obvious navigation of examples
  • Easy user access to...
    • Dependency installation for notebooks
    • Colab with dependency installation cell
    • Binder with dependencies installed?
    • Downloading a zip of a notebook and its associated files
    • A fully-rendered and executed HTML version of each notebook

Maybe it would be nice if:

  • We separated tutorials from cookbook examples? This seems to be a common framing in other projects, and lots of people I talked to at the conference brought up the idea. The distinction is that tutorials teach you how to do something, and a cookbook just gives you fully working code to copy and paste.

Possibly surprising things we can do:

  • Store arbitrary metadata in notebooks (as long as you can de/serialize it to text)
  • Store config files and readmes and environments alongside notebooks
  • Arbitrary code in openff-docs sphinx extension
  • Integrate sphinx extensions into openff-sphinx-theme
  • Store and access arbitrary files on RTD
  • Store and access arbitrary files in a branch a la gh-pages (gross but possibly necessary for Colab)
  • Take a base notebook from a project repository and inject purpose-built cells into it. Compare the first cells of the notebook prepared for Colab (conda-colab installation), the rendered notebook (links to different versions of the notebook), and the source notebook (neither of the above). Unfortunately, the Colab link in the rendered notebook doesn't yet use this prepared notebook because Colab can only take notebooks from GitHub, but you can see it working in the OpenMM cookbook because they use GH Pages: click "Open in Google Colab".
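
As a rough illustration of the cell-injection idea in the last bullet, here is a minimal sketch that treats a notebook as plain nbformat 4 JSON and prepends a cell. The function name `inject_first_cell` and the install command are assumptions for illustration; the actual cells openff-docs injects are not shown in this thread.

```python
import copy


def inject_first_cell(notebook: dict, source: str, cell_type: str = "code") -> dict:
    """Return a copy of an nbformat 4 notebook dict with a new first cell."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        # Code cells additionally carry these two fields in nbformat 4
        cell["execution_count"] = None
        cell["outputs"] = []
    processed = copy.deepcopy(notebook)  # leave the source notebook untouched
    processed["cells"].insert(0, cell)
    return processed


# A tiny stand-in for a source notebook loaded with json.load()
source_nb = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [{"cell_type": "markdown", "metadata": {}, "source": "# Title"}],
}

# Hypothetical installation cell for the Colab variant
colab_nb = inject_first_cell(source_nb, "!pip install -q condacolab")
```

The same helper could prepend a markdown cell of links for the rendered variant, which is why the source notebook is copied rather than modified in place.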

Constraints:

  • Colab can only load notebooks from GitHub, so if we want to inject a dependency installation cell we need to push to a repo
  • If we want to store notebooks in source repositories without output, we need to execute them when we build the docs to include output in the docs (and get thumbnails automatically)
  • Docs need to be rebuilt whenever the examples change in source repositories
  • Detecting changes in examples, storing different versions of notebooks, etc requires holding state somewhere

We need to standardize and make explicit:

Thankfully most of these are already broadly the same across repositories

  • How the examples folder is laid out and where the notebooks go
  • Where files that a notebook relies on go
  • Which dependencies are used in notebooks, or at least how to find out which dependencies are used

I recommend something like:

devtools/conda-envs/examples.yaml
examples/
    deprecated/
        <notebooks_that_are_ignored>
    experimental/
        <notebooks_that_are_presented_but_experimental>
    <notebook_name>/
        <notebook_name>.ipynb
        <file_notebook_needs>
        <other_file_notebook_needs>
    <another_notebook_name>/
        <another_notebook_name>.ipynb
        thumbnail.png
    <notebook_without_files>.ipynb

This implies using the same environment for testing notebooks in CI as for running notebooks as a user, which is not current practice at least in the Toolkit. This means users might get a few dependencies they don't need, but guarantees the right dependencies are being tested.
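
To make the proposed layout concrete, here is a minimal sketch of how a processing script might walk such an examples folder. It assumes only the conventions recommended above (`deprecated/` is ignored, `experimental/` is presented but flagged); the function name `find_notebooks` is hypothetical.

```python
from pathlib import Path


def find_notebooks(examples_dir):
    """Yield (path, is_experimental) for every notebook that should be presented."""
    root = Path(examples_dir)
    for path in sorted(root.rglob("*.ipynb")):
        parts = path.relative_to(root).parts
        # Skip Jupyter autosave copies and anything under deprecated/
        if ".ipynb_checkpoints" in parts or parts[0] == "deprecated":
            continue
        yield path, parts[0] == "experimental"
```

Because the search is recursive, this sketch does not require the folder and notebook names to match, so a folder holding several notebooks would also be picked up.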

We need to decide, in broad strokes

Some dev experience design stuff:

  • Whether we want openff-docs to automatically include new notebooks
  • Whether we want openff-docs to automatically update existing notebooks
  • Whether we want openff-docs to automatically remove old notebooks
  • How the examples page layout is specified
    • Notebook metadata?
      • Tags?
      • Categories?
      • Full on navigation tree?
    • Regular old Sphinx MarkDown/ReST?

There are basically two extremes here, and intermediate states are possible. One is to do all the layout in MarkDown and have the Sphinx extension only take care of downloading and processing the notebooks (this is what the OpenMM Cookbook does, and it means that new examples only show up with a PR to openff-docs). The other is to have all the layout information in the notebook metadata and have the Sphinx extension take care of everything (a very basic version of which is currently implemented in this PR).

And some technical stuff:

  • Whether we want openff-docs to execute notebooks (and how to cache this)
  • Where modified/executed notebooks live

I have some ideas that I hate:

  1. Store all the processed notebooks in main and keep them up to date with CI
  2. Process notebooks with CI into a branch, then pull that branch in when RTD gets built (I think this is my fave option)
  3. Do the examples stuff in a separate repo, keep them on GitHub Pages, and just link them from here (domain names might be tricky, and it'll fracture the sidebar, but lots of flexibility with rebuilds)
  4. Spend hours executing notebooks every time RTD builds (this also doesn't give us Colab-specific dependency installation cells)

We need to change in existing examples

  • Everything needs a (short) title
  • Metadata probably needs to be added
  • Thumbnails?
  • Fix broken links
  • We may want to remove existing in-notebook installation instructions and rely on injected cells for that

@j-wags
Member

j-wags commented Jun 1, 2023

2023_05_31 JM/MT/JW meeting notes

Colab can only load notebooks from GitHub, so if we want to inject a dependency installation cell we need to push to a repo

We'll assume that the solution will include re-running the raw source notebooks to generate output (since we can't assume that notebooks in different repos will have their outputs stored).

Docs need to be rebuilt whenever the examples change in source repositories
Detecting changes in examples, storing different versions of notebooks, etc requires holding state somewhere

We'll have automation that checks for new releases and re-executes if the source notebooks change. The state will be stored in a branch that looks nothing like the other branches, kinda like a gh-pages situation.

Which dependencies are used in notebooks?

  • MT in favor of envs that install relatively quickly so people aren't waiting a long time for colab to start
  • JW in favor of maximalist environment, could use something like single file installer to "cache"

Result: Undecided. We'll start with a maximalist-ish solution for first iteration.

Constraints:

Colab can only load notebooks from GitHub, so if we want to inject a dependency installation cell we need to push to a repo
If we want to store notebooks in source repositories without output, we need to execute them when we build the docs to include output in the docs (and get thumbnails automatically)
Docs need to be rebuilt whenever the examples change in source repositories
Detecting changes in examples, storing different versions of notebooks, etc requires holding state somewhere

JW -- Existing constraints look good, though I don't like the expectation that folder and notebook name should be the same (means only one notebook per folder, which isn't necessarily a pattern I can commit to remembering/doesn't seem necessary)
MT -- This looks reasonable.

Organization

MT + JW -- Each notebook folder will have a thumbnail.png.

We'll automatically include new notebooks, automatically update existing notebooks, automatically remove notebooks, and organize the notebooks with categories. We'll base these updates on when things get released.

We'll tentatively assign each notebook to one or more categories. JM will come up with a tentative list of categories which the package maintainers will assign to each of their notebooks.
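
For illustration, a minimal sketch of how a gallery builder could group notebooks by category, assuming the categories live in the notebook's top-level metadata under a key named `categories`. That key name and the `uncategorized` fallback are assumptions, not something decided in these notes.

```python
from collections import defaultdict


def group_by_category(notebooks):
    """Map each category to a sorted list of notebook names.

    `notebooks` maps a notebook name to its parsed JSON (a dict).
    """
    groups = defaultdict(list)
    for name, nb in notebooks.items():
        # "categories" is a hypothetical metadata key
        categories = nb.get("metadata", {}).get("categories") or ["uncategorized"]
        for category in categories:
            groups[category].append(name)
    return {category: sorted(names) for category, names in groups.items()}
```

A notebook assigned to several categories simply appears once under each of them.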

The titles will be auto-harvested from the top-level header. Some of these will need to be abbreviated since they're currently quite long.
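
One way the title harvesting could work, as a sketch only: scan the markdown cells of the nbformat 4 JSON for the first level-one heading. The function name `harvest_title` is hypothetical, and this version would be fooled by a `# comment` line inside a fenced code block in a markdown cell.

```python
import re


def harvest_title(notebook):
    """Return the text of the first '# ' heading in a notebook dict, or None."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):
            # nbformat allows the source to be stored as a list of lines
            source = "".join(source)
        for line in source.splitlines():
            match = re.match(r"#\s+(.*\S)", line)
            if match:
                return match.group(1)
    return None
```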

We need to change in existing examples

Everything needs a title
Metadata probably needs to be added
Thumbnails?
Fix broken links
We may want to remove existing in-notebook installation instructions and rely on injected cells for that

JW + MT -- Agree (except JW may have a little technical trouble with complying with the final one, though if this replaces the Toolkit's need to support cloud-runnable notebooks maybe it'll be fine)

JM -- Should I copy all this stuff in each repo?

MT + JW -- Should centralize this in openff-docs. No benefit to copying it out.

JM -- Separate cookbooks and tutorials? People said really nice things about cookbooks at the in-person meeting so maybe they get their own category.

Undecided, will revisit later.

@Yoshanuikabundi
Contributor Author

@mattwthompson I'm curious how you would like me to handle INTERCHANGE_EXPERIMENTAL? I tried executing all the notebooks and a bunch of them failed because the environment variable wasn't set. I'm thinking I just set the environment variable when I do the executing (behind closed doors) and let users discover it on their own interactively? It'll be disruptive for users on Colab but I guess that's kinda the point.

Also my computer can run all of the examples (in parallel) in like 5 minutes. Makes the twenty minute execution times in the Toolkit CI pretty frustrating.

Also this seems to be the maximalist environment, for future reference:

channels:
    - conda-forge
    - bioconda
dependencies:
    - pip
    - python=3.10
    # Cookbook
    - gitpython
    - nbconvert
    - nbformat
    # Examples
    - openff-toolkit-examples
    - gromacs
    - lammps
    - rich
    - jax

This seems to be going too well, I should check if the notebooks are kicking up exceptions that are getting happily baked into the executed notebooks...
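
A minimal sketch of that check, assuming the executed notebooks are nbformat 4 JSON, where a failed cell stores an output with `output_type` set to `"error"`. The function name `find_errors` is hypothetical.

```python
def find_errors(notebook):
    """Return (cell_index, exception_name, message) for each error output."""
    errors = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append(
                    (index, output.get("ename", ""), output.get("evalue", ""))
                )
    return errors
```

Running this over every executed notebook and failing the job when the list is non-empty would keep baked-in tracebacks from reaching the rendered pages.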

Comment on lines 240 to 247
# Execute notebooks in parallel for rendering as HTML
if do_exec:
    # Context manager ensures the pool is correctly terminated if there's
    # an exception
    with Pool() as pool:
        # Wait a second between launching subprocesses
        # Workaround https://github.com/jupyter/nbconvert/issues/1066
        _ = [*pool.imap_unordered(execute_notebook, delay_iterator(notebooks))]
Member


Nice, I've always wanted to be able to do this in tests but I couldn't figure out how to get nbval to do it. I mean, I guess it's possible and I just never found out how. Still have to turn off pytest-randomly, probably

Contributor Author


Yeah figuring this out was not fun. There's a race condition in nbconvert (which executes the notebooks) so if you launch too many notebook kernels too quickly you sometimes get two of them on the same port. But fixing each problem that came up in the dumbest way possible seemed to work eventually!
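
The body of `delay_iterator` is not shown in the snippet under review, so the following is only a guess at its shape: a generator that sleeps between items, relying on the pool pulling items from the iterable lazily as it submits tasks.

```python
import time


def delay_iterator(iterable, delay=1.0):
    """Yield items from `iterable`, sleeping `delay` seconds between them."""
    first = True
    for item in iterable:
        if not first:
            # Space out task submission so kernels don't start simultaneously
            time.sleep(delay)
        first = False
        yield item
```

This staggers kernel launches without serializing the work itself: once a notebook has been handed to a worker, it runs in parallel with the others.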

@mattwthompson
Member

I just set it in my action: https://github.com/openforcefield/openff-interchange/blob/v0.3.4/.github/workflows/examples.yaml#L34

Using the %env INTERCHANGE_EXPERIMENTAL=1 magic should work for local execution and CI, and it should make Colab work as well, if that's desired. I haven't thought through whether the experimental examples should work on Colab out of the box or if users should need to find out how to set the magic themselves...
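
For the "behind closed doors" option, where the variable is set by whatever executes the notebooks rather than inside them, a minimal sketch could build the environment for the executing process like this. The helper name `experimental_env` is hypothetical; only the variable name and value come from this thread.

```python
import os


def experimental_env(base=None):
    """Return a copy of the environment with INTERCHANGE_EXPERIMENTAL enabled."""
    env = dict(os.environ if base is None else base)
    env["INTERCHANGE_EXPERIMENTAL"] = "1"
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, so the setting applies to the execution run without changing the notebooks users download.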

@mattwthompson
Member

I'm moving around some files in Interchange; shouldn't affect things too much and doesn't need to be the last change. Just FYI: openforcefield/openff-interchange#740

@Yoshanuikabundi
Contributor Author

OK. Now that I've been working on this for 3 weeks (:scream:), there are some things that I should note.

We agreed at our meeting that we only want to update the examples in this repository when a project makes a release. This is actually a super important fact that it's taken me a while to figure out the full implications of...

  • Since releases are rare (compared to, say, PRs or days) and examples have to be fast enough to run in project CI, we can afford to just re-execute every notebook every release and not worry about diffing things - this'll only take like an hour or something per release and massively simplify everything

  • As an added benefit, we get a guarantee that if output changes in a release that will be reflected in the rendered notebook

  • It's an open question if we should re-render every notebook every release, or just the notebooks from the updated project

  • This will mean that the branch where we're storing the executed state and colab packages and everything will have a commit per release for each of our projects that have examples... which seems like it could be valuable?

  • Given that we're going for a universal environment, it also means that at every release (from now) of every package (with examples) we will automatically generate a fully-specified Conda environment that includes ALL OpenFF software that people are likely to use to run simulations (except maybe BespokeFit for now), as well as all the software they'll want to use with it

  • We're going to want to spruce up the source examples before this PR is done so that the spruced-up examples are in a release when it goes live (thumbnails, prose, titles, etc)

  • NGLView does work in the rendered notebooks, but it's a bit fiddly and we might have some best practices for how to use it in examples

I'm hoping to get the GitHub action that does the execution and pre-processing written by the end of the week... after which point this should come together very quickly. I'm thinking I'll do it in a separate PR so we can make sure it works on a release or two before merging this one and making its results visible.

@mattwthompson
Member

It's an open question if we should re-render every notebook every release, or just the notebooks from the updated project

Weakly-held opinion is that re-running everything every time anything gets a bump seems excessive, but generally keeping things up-to-date is nice. I came across some notebooks in the toolkit that haven't been updated since openforcefield/openff-toolkit#1426, which is not super recent.

@Yoshanuikabundi
Contributor Author

Yoshanuikabundi commented Jun 20, 2023

Which dependencies are used in notebooks?

  • MT in favor of envs that install relatively quickly so people aren't waiting a long time for colab to start

  • JW in favor of maximalist environment, could use something like single file installer to "cache"

Result: Undecided. We'll start with a maximalist-ish solution for first iteration.

The current implementation runs the notebooks for rendering as web pages from a maximalist environment stored in this repo, but allows source repos to override it for the Colab link and zip download. If the source repo doesn't provide an environment for a notebook, Colab and the zip get the maximalist environment. So zippy examples can include minimalist environments where it makes a difference. I don't have handling of source repo-wide environments - it's either the openff-docs example environment, or a notebook environment.
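
The fallback rule described here can be sketched in a few lines. The per-notebook file name `environment.yaml` and the function name `choose_environment` are assumptions for illustration; the thread does not say what the override file is called.

```python
from pathlib import Path


def choose_environment(notebook_path, default_env, name="environment.yaml"):
    """Pick the environment file for a notebook's Colab link and zip download.

    Uses a file named `name` next to the notebook if present, otherwise
    falls back to the shared openff-docs examples environment.
    """
    candidate = Path(notebook_path).parent / name
    return candidate if candidate.is_file() else Path(default_env)
```

Note that a notebook sitting directly in the examples folder would pick up an environment file placed there, which would act as a repo-wide environment; a real implementation would need to rule that out to match the behaviour described above.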

Running each notebook in its own environment for the web page rendering will be slow and difficult, so I don't want to do it. On the other hand, if a release includes an example that doesn't work with the openff-docs example environment, a quick PR to update that environment should be enough to get the release working again.

@Yoshanuikabundi
Contributor Author

@j-wags @mattwthompson @lilyminium I think this is ready to merge! I would appreciate some clicking around and checking that everything works for everyone else, and a second pair of eyes on whatever you all have time for, and then I'll merge tomorrow unless something comes up.

@j-wags I haven't created a quarterly calendar event to clean up the cache because I've added an action to clean up PR folders in the cache when the PR gets merged. Combined with the fact that I've updated the action not to store histories, this should keep everything tidy. I'll test the cleanup works when this PR gets merged :/

The only automation for regenerating the cache is the scheduled nightly one; if the cache needs to be regenerated at any other time (for example, while developing a PR), then it has to be done manually. Instructions for how to do that should be in a comment on any new PR - I'll test that when I open the PR to add QCSubmit and Fragmenter. I just really didn't want to wait 30 minutes every time I make a change to any part of the repo; regenerating the cache should be relatively rare. There's also no way to trigger an RTD PR build in a GitHub action, so if the cache were automatically regenerated in PRs, you'd have to manually trigger that once it's done.

@lilyminium I still have NAGL 0.2.2 pinned here, because of that versioneer issue I raised at NAGL. To unpin, that needs to be fixed, as that's how the cookbook knows which tag/branch to get the example notebooks from - the environment file is solved by Mamba, and then the tag corresponding to the installed version is cloned to get the notebooks. Updating by hand if the next release doesn't fix that issue should be fine; the version needs to be updated in both the examples conda environment and the globals_.py module. I'm happy to quickly take you through how this all works if you want to know and don't wanna read this entire thread! Let me know.

Once this is merged I'll write up some documentation on how it works so that it can be maintained/updated while I'm at OpenFE, and I'll open that PR for QCSubmit/Fragmenter.

@Yoshanuikabundi Yoshanuikabundi marked this pull request as ready for review July 4, 2023 13:35
@mattwthompson
Member

I spent a small amount of time poking through things and didn't observe anything notably violent. In fact things seem pretty good - pages load snappily, the content I expect to be there is there, and even the 3D renders work great!

If there was one thing I could suggest as an improvement, it'd be more thumbnails. I'm partially responsible for this, so ....

One of the NAGL examples ("Prepare a dataset for training") should probably be renamed to include GCNN/NAGL in the title; when I read it for the first time I didn't know what it was about and a new user might confuse it with QC/physical property dataset curation.

Some of the cell outputs could do with pretty-ification, like this one which is a fair amount of information in a small number of wrapped black-and-white lines. Out of scope here but might be worth exploring?

image

It'd be a luxury to suppress warnings generated by the runner, like this one:

image

I could spend an hour or two going through things in great detail, the outcome of which would mostly be me wanting to re-write half of the examples and nothing to do with the automation that generates these webpages or dancing around the edges like earlier in this comment. So I think the plan to merge roughly as-is in a day or two is great.

@Yoshanuikabundi
Contributor Author

If there was one thing I could suggest as an improvement, it'd be more thumbnails.

Agreed! Thumbnails are super time consuming, happen in the source repositories, and like all updates from source repos require a release to be updated, so hopefully if we all chip in a thumbnail or two when we have a moment of inspiration we can fix this over time.

One of the NAGL examples ("Prepare a dataset for training") should probably be renamed to include GCNN/NAGL in the title

Also agreed! This will be fixed in the next release of NAGL

Some of the cell outputs could do with pretty-ification, like this one which is a fair amount of information in a small number of wrapped black-and-white lines. Out of scope here but might be worth exploring?

Definitely worth exploring!

It'd be a luxury to suppress warnings generated by the runner

I think we could just hide STDERR, but I'm not sure it's a good idea because I'd like users to know that they're not doing anything wrong if they get a warning when they run it themselves. I think the ideal resolution for each warning, in order of preference, is:

  1. Fix the warning (possibly just by tweaking the environment in some cases)
  2. Tell the user about the warning in the notebook (preferably including why it can't be fixed/when it will be fixed)
  3. Explicitly silence warnings in the notebook where appropriate (so users understand what's going on)

My priority here is helping users understand what's going on, and helping to make outputs reproducible between rendering and running. If it pushes us to fix warnings or find pathways to avoid warnings, so much the better. So again, hopefully we can improve this over time.
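
As an example of the third option, explicitly silencing a warning inside the notebook can be done with the standard `warnings` module so that readers see exactly what is being suppressed. The function `noisy` and its message are made up for illustration.

```python
import warnings


def noisy():
    """Stand-in for a library call that emits a known, harmless warning."""
    warnings.warn("feature X is deprecated", DeprecationWarning)
    return 42


with warnings.catch_warnings():
    # Silence only this specific warning, and only inside this block
    warnings.filterwarnings(
        "ignore", message="feature X", category=DeprecationWarning
    )
    result = noisy()
```

Scoping the filter to one message and one block keeps any other warning visible, which fits the goal of reproducible output between the rendered page and a user's own run.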

I could spend an hour or two going through things in great detail, the outcome of which would mostly be me wanting to re-write half of the examples and nothing to do with the automation that generates these webpages or dancing around the edges like earlier in this comment.

I think that's my assessment too - plenty of improvements to make in examples.

So I think the plan to merge roughly as-is in a day or two is great.

Woo!

@Yoshanuikabundi Yoshanuikabundi merged commit 8b3f64e into main Jul 5, 2023
@Yoshanuikabundi Yoshanuikabundi changed the title [DNM] Examples page Examples page Jul 6, 2023
3 participants