Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache engine for reticulate using dill #1210

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

leogama
Copy link

@leogama leogama commented May 13, 2022

Hello there! I bring good news.

The Python package dill —"serialize all of python"— is about to release a new version (0.3.5), that will likely include my patches for its session saving and restoring functionality. Therefore, I'm back to work in the cache feature for reticulate based on @tmastny's proposal from some years ago (#167).

My idea is then to require dill>=0.3.5 as the previous versions have too many problems to work around.

I didn't review the entire code from the original PR yet, just enough to make it work again with the current main branch, so there's probably a little more work to do (maybe even on knitr's side). Criticism and suggestions are welcome.

@t-kalinowski
Copy link
Member

t-kalinowski commented May 13, 2022

Hi, thank you for reviving this PR! I will take a closer look at this next week, but after taking a glance I have quick question: does this need to be rebased on the current main branch? It looks like there are some unrelated changes in the PR (e.g., changes to py_to_r).

@t-kalinowski
Copy link
Member

Hi, thank you for reviving this PR! I will take a closer look at this next week, but after taking a glance I have Quick question: does this need to be rebased on the current main branch? It looks like there are some unrelated changes in the PR (e.g., changes to py_to_r).

@leogama
Copy link
Author

leogama commented May 14, 2022

Question 1: Yes, it needs to be rebased on main because the merge cause conflicts in at least 2 files. I had to fix the conflicts manually.

Question 2: This is precisely one of the parts I didn't have time to review. But, yes, I spotted this py_to_r and other functions and I'm thinking of reverting these unrelated changes. They are from the original PR, which I didn't want to lose.

@t-kalinowski
Copy link
Member

I don't think that any of the changes in the python.R file (py_to_r and friends) contribute to the PR, and it looks to me like they are an artifact of merging or rebasing the main branch (if I'm mistaken, please let me know!). It'd be good to remove them.

R/knitr-engine.R Outdated Show resolved Hide resolved
R/knitr-engine.R Outdated Show resolved Hide resolved
R/knitr-engine.R Outdated Show resolved Hide resolved
R/knitr-engine.R Outdated
r_obj_exists <- "'r' in globals()"
r_is_R <- "type(r).__module__ == '__main__' and type(r).__name__ == 'R'"
if (py_eval(r_obj_exists) && py_eval(r_is_R)) {
py_run_string("del globals()['r']")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this have side effects (basically meaning the r object is no longer visible after this code is run)?

Copy link
Author

@leogama leogama May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is run at the end of a knitir block, and currently the r object in injected again in the Python namespace at the beginning of the next block. An alternative to putting this object directly in the user's __main__ namespace would be to add it to __builtins__. This would bring another advantage: the user could then create an r global variable without overwriting it, it would only be masked. Then del r would unmask it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't put the r object in __builtins__, but I did put the "R object class" in __builtins__. The r object is not removed before saving the cache anymore, it's just ignored. However, my previous suggestion is still open for discussion.

R/knitr-engine.R Outdated Show resolved Hide resolved
@leogama
Copy link
Author

leogama commented Sep 3, 2022

Hello, there! I'm back to finish this PR. The dill Python package is about to release its next version, 0.3.6, with many new features related to session saving and restoring. You guys also just released a new version of reticulate, so I guess I'll have plenty of time to implement and test this feature until the next one.

One potential problem we'll face is that dill will drop the support for Python 2 and Python 3 < 3.7 in this new release. It's possible to use the current 0.3.5 version for those, but maybe it'll need some logic to distinguish the cases, as with the upcoming 0.3.6 there's much more control of what is saved and how it's saved.

@t-kalinowski
Copy link
Member

Hi @leogama That's great to hear, this will be a great addition to reticulate!

Regarding python version compatibility, would it be possible to only enable this feature for newer versions of Python?

@leogama
Copy link
Author

leogama commented Sep 8, 2022

Regarding python version compatibility, would it be possible to only enable this feature for newer versions of Python?

Of course, I prefer to start this as simple as possible and add features incrementally without many edge cases to care about. If you are OK with Python ≥ 3.7, then let's begin with that.

I've been studying the knitr execution model and the R cache implementation in the last days. I'm worried about the various cache options (with implications to cache invalidation) and the R-Python interactions. For example, it seems that the R cache is loaded and saved for every chunk, even if its of another engine (e.g. Python), but the opposite is not true.

Here is a diagram of the execution model I've found so far, which probably has some holes:

exec-model dot

@t-kalinowski
Copy link
Member

@yihui Can you please advise on #1210 (comment)?

@yihui
Copy link
Member

yihui commented Sep 8, 2022

The diagram looks about right to me. Great job! :)

BTW, it will be great to have @tmastny look at this PR if he has time.

@leogama
Copy link
Author

leogama commented Sep 9, 2022

The diagram looks about right to me. Great job! :)

@yihui, I'm trying to generate this with roxygen2 by putting dot language statement in custom tags throughout the code and then outputting them to a file to be processed by Graphviz (I tried the Rgraphviz package, but it's too limited).

My idea is to generate something like a UML activity diagram to truthfully represent the execution model. When it's polished enough, e.g. with file/line references in the nodes, I may submit it to knitr as "developer documentation".

I'll probably also need to write down some schemes to wrap my head around the various cache options and the cache invalidation criteria, if we'd like to reproduce them for Python...

@leogama
Copy link
Author

leogama commented Sep 9, 2022

@t-kalinowski: do you think it's better to add features to the cache mechanism in this single PR or to just implement the basics here and split the features in separate PRs?

@leogama
Copy link
Author

leogama commented Sep 9, 2022

@yihui We have a small issue concerning the working directories where the cache code chunks are run. It seems that the R cache code always run in the knit() call's WD. The cache_engine() call to load/purge any cache from a non-R engine also runs there. But the cache code for saving the non-R engine state is called indirectly from the in_input_dir(engine(options)) call, which may change the WD.

Both functions input_dir() and in_input_dir() are private, and the path in options$hash is relative to the knit() WD. I see three possible solutions to make it work transparently:

  1. make the options$hash path absolute (will affect log messages)
  2. pass the original working directory to cache_engine() somehow
  3. call the non-R engine cache saving function after the call to engine(), with the WD restored

I advocate for the third option, and that cache engines should use the same API as the R cache, i.e. a list object with functions as "methods", like cache_engine()$exists(), cache_engine()$load(), cache_engine()$save(), etc. By the way, the function cache$exists() should probably also call this cache_engine()$exists() when options$engine != 'R' for consistency.

@t-kalinowski
Copy link
Member

t-kalinowski commented Sep 9, 2022

@t-kalinowski: do you think it's better to add features to the cache mechanism in this single PR or to just implement the basics here and split the features in separate PRs?

However you think is easiest. I'm happy to engage and review either way.

@yihui
Copy link
Member

yihui commented Sep 9, 2022

@leogama I agree with you. That seems to require a change in knitr, right? Please feel free to submit a PR there. Thanks!

@leogama
Copy link
Author

leogama commented Sep 9, 2022

However you think is easiest. I'm happy to engage and review either way.

Great. I think I'll restrict this PR to the basic cache mechanism and then open other PRs for extra features like the chunk options cache.vars, cache.comments, dependson and autodep. This way, people who use the dev version can start using and testing the Python cache sooner.

@leogama leogama requested a review from kevinushey September 17, 2022 01:59
@leogama
Copy link
Author

leogama commented Sep 17, 2022

We have a test file now. I think it's time to run the workflows.

@leogama
Copy link
Author

leogama commented Sep 17, 2022

Thanks for authorizing the workflows. The new test didn't run because I hadn't add dill to the list of Python modules to be installed in the testing virtualenv. Of course it won't run untill version 0.3.6 is released, but at least now I know the "skip" test works.

I'll work on the documentation next. pkgdown isn't very happy...

@leogama
Copy link
Author

leogama commented Sep 18, 2022

@kevinushey How should I generate new man/*.md files? Just call roxygen2::roxygenize() or devtools::document()?

@leogama
Copy link
Author

leogama commented Sep 19, 2022

I added the cache.vars chunk option here with the basic implementation because dill isn't able to save all kinds of objects. Therefore, having at least a simple cache.vars working helps to deal with unpickleable variables. But the next dill release will have a much more granular control over which variables are saved. This feature can be integrated (and documented) in the knitr engine later.

@kevinushey
Copy link
Collaborator

devtools::document() should suffice for generating documentation.

@t-kalinowski
Copy link
Member

Hi @leogama, is this ready to merge?

@leogama
Copy link
Author

leogama commented Oct 11, 2022

Hi @leogama, is this ready to merge?

Not yet. It's waiting for a new release of https://github.com/uqfoundation/dill
I'll update you when it's ready (maybe it'll need some changes here).

@leogama leogama marked this pull request as draft October 11, 2022 13:34
@leogama
Copy link
Author

leogama commented Oct 27, 2022

@t-kalinowski dill v0.3.6 was finally released. I'll update the PR (it needs some changes) and finish this by the weekend.

@leogama
Copy link
Author

leogama commented Dec 13, 2022

Hi, @t-kalinowski. Sorry for the hiatus. I've adapted the code and tests for the released dill v0.3.6. Could you trigger a new workflow run*?

(*) I think it'll be necessary to somehow install my approved (but not merged) branch from knitr for the tests to succeed.

@leogama leogama marked this pull request as ready for review December 13, 2022 17:31
@t-kalinowski
Copy link
Member

Hi @leogama, welcome back! I'm glad to help get this into main.

In the interim, we can add a github actions workflow step that installs the appropriate knitr branch in the runners, just to confirm everything passes.

Once we're happy we can help coordinate getting the knitr PR merged and into the next CRAN release, and then merge this PR and out to CRAN. It'll have to be done in stages, with knitr going to CRAN first I believe.

@leogama
Copy link
Author

leogama commented Dec 16, 2022

Sure! How do I set up a custom package installation from a GitHub branch in Workflows? I have absolutely no idea.

@leogama
Copy link
Author

leogama commented Dec 16, 2022

@t-kalinowski I've added the knitr PR branch as an "extra package" to the R dependencies in the workflow. I'm not sure if this will overwrite or conflict with the DESCRIPTION's knitr dependency. Couldn't find any documentation or examples about that... So let's try?

@t-kalinowski
Copy link
Member

@leogama I'd like to help get this ready to merge into main.

I think that all the packages where this branch had dependencies on development versions of have since been released to CRAN, so it should simplify fixing the CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants