Cache engine for `reticulate` using `dill` #1210

leogama · 2022-05-13T16:52:29Z

Hello there! I bring good news.

The Python package dill ("serialize all of python") is about to release a new version (0.3.5), which will likely include my patches for its session saving and restoring functionality. Therefore, I'm back to work on the cache feature for reticulate, based on @tmastny's proposal from some years ago (#167).

My idea is then to require dill>=0.3.5 as the previous versions have too many problems to work around.

I didn't review the entire code from the original PR yet, just enough to make it work again with the current main branch, so there's probably a little more work to do (maybe even on knitr's side). Criticism and suggestions are welcome.


t-kalinowski · 2022-05-13T22:44:39Z

Hi, thank you for reviving this PR! I will take a closer look at this next week, but after taking a glance I have a quick question: does this need to be rebased on the current main branch? It looks like there are some unrelated changes in the PR (e.g., changes to py_to_r).

leogama · 2022-05-14T14:33:38Z

Question 1: Yes, it needs to be rebased on main because the merge caused conflicts in at least 2 files. I had to fix the conflicts manually.

Question 2: This is precisely one of the parts I didn't have time to review. But, yes, I spotted this py_to_r and other functions and I'm thinking of reverting these unrelated changes. They are from the original PR, which I didn't want to lose.

t-kalinowski · 2022-05-16T12:58:05Z

I don't think that any of the changes in the python.R file (py_to_r and friends) contribute to the PR, and it looks to me like they are an artifact of merging or rebasing the main branch (if I'm mistaken, please let me know!). It'd be good to remove them.

kevinushey · 2022-05-16T17:47:02Z

R/knitr-engine.R

+  r_obj_exists <- "'r' in globals()"
+  r_is_R <- "type(r).__module__ == '__main__' and type(r).__name__ == 'R'"
+  if (py_eval(r_obj_exists) && py_eval(r_is_R)) {
+    py_run_string("del globals()['r']")


Wouldn't this have side effects (basically meaning the r object is no longer visible after this code is run)?

leogama · 2022-05-17

This is run at the end of a knitr block, and currently the r object is injected again into the Python namespace at the beginning of the next block. An alternative to putting this object directly in the user's __main__ namespace would be to add it to __builtins__. This would bring another advantage: the user could then create an r global variable without overwriting it; it would only be masked. Then del r would unmask it.

leogama · 2022-09-14

I didn't put the r object in __builtins__, but I did put the "R object class" in __builtins__. The r object is no longer removed before saving the cache; it's just ignored. However, my previous suggestion is still open for discussion.
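
The masking behaviour described in this review thread can be illustrated with plain Python. This is only a sketch of the name-resolution rule, not the code in the PR: the `R` class below is a hypothetical stand-in for the object reticulate injects, and the `namespace` dict plays the role of the user's `__main__`.

```python
import builtins


class R:
    """Hypothetical stand-in for the object reticulate exposes as `r`."""

    def __repr__(self):
        return "<R interface>"


# Install the shared object in builtins instead of the user's globals.
builtins.r = R()

# A fresh globals dict that plays the role of the user's __main__.
namespace = {"__builtins__": builtins}

# With no global named `r`, the lookup falls through to builtins.
exec("seen_before = repr(r)", namespace)

# A user-defined global masks the builtin without overwriting it.
exec("r = 2.5\nseen_masked = r", namespace)

# Deleting the global unmasks the builtin again.
exec("del r\nseen_after = repr(r)", namespace)

# No `r` is left among the globals, so dumping the namespace would not
# include the interface object at all.
r_in_globals = "r" in namespace

del builtins.r  # undo the demonstration's change to builtins
```

Here `seen_before` and `seen_after` both resolve to the object stored in builtins, while `seen_masked` is the user's own value.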

leogama · 2022-09-03T17:56:03Z

Hello, there! I'm back to finish this PR. The dill Python package is about to release its next version, 0.3.6, with many new features related to session saving and restoring. You guys also just released a new version of reticulate, so I guess I'll have plenty of time to implement and test this feature until the next one.

One potential problem we'll face is that dill will drop the support for Python 2 and Python 3 < 3.7 in this new release. It's possible to use the current 0.3.5 version for those, but maybe it'll need some logic to distinguish the cases, as with the upcoming 0.3.6 there's much more control of what is saved and how it's saved.

t-kalinowski · 2022-09-06T20:08:19Z

Hi @leogama That's great to hear, this will be a great addition to reticulate!

Regarding python version compatibility, would it be possible to only enable this feature for newer versions of Python?

leogama · 2022-09-08T02:26:21Z

> Regarding python version compatibility, would it be possible to only enable this feature for newer versions of Python?

Of course, I prefer to start this as simple as possible and add features incrementally without many edge cases to care about. If you are OK with Python ≥ 3.7, then let's begin with that.
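
A version gate of the kind discussed here could look roughly like the following. It is a sketch under assumptions: the function names are hypothetical, the Python floor of 3.7 comes from this thread, and the dill floor of 0.3.6 is assumed from the later comments rather than taken from the PR's code.

```python
import sys

MIN_PYTHON = (3, 7)   # floor agreed on in this thread
MIN_DILL = (0, 3, 6)  # assumption: the release with the new session features


def parse_version(text):
    """Turn '0.3.6' into (0, 3, 6), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def cache_supported(python_version, dill_version):
    """Return True only when both the interpreter and dill are new enough."""
    python_ok = tuple(python_version[:2]) >= MIN_PYTHON
    dill_ok = parse_version(dill_version) >= MIN_DILL
    return python_ok and dill_ok


# Check the running interpreter against a given dill version string.
supported_here = cache_supported(sys.version_info, "0.3.6")
```

Keeping the check in one place means a single "skip" path for older interpreters, instead of version-specific branches spread through the engine.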

I've been studying the knitr execution model and the R cache implementation over the last few days. I'm worried about the various cache options (with implications for cache invalidation) and the R-Python interactions. For example, it seems that the R cache is loaded and saved for every chunk, even if it's of another engine (e.g. Python), but the opposite is not true.

Here is a diagram of the execution model I've found so far, which probably has some holes:

t-kalinowski · 2022-09-08T20:32:35Z

@yihui Can you please advise on #1210 (comment)?

yihui · 2022-09-08T22:19:30Z

The diagram looks about right to me. Great job! :)

BTW, it will be great to have @tmastny look at this PR if he has time.

leogama · 2022-09-09T02:06:10Z

> The diagram looks about right to me. Great job! :)

@yihui, I'm trying to generate this with roxygen2 by putting DOT language statements in custom tags throughout the code and then outputting them to a file to be processed by Graphviz (I tried the Rgraphviz package, but it's too limited).

My idea is to generate something like a UML activity diagram to truthfully represent the execution model. When it's polished enough, e.g. with file/line references in the nodes, I may submit it to knitr as "developer documentation".

I'll probably also need to write down some schemes to wrap my head around the various cache options and the cache invalidation criteria, if we'd like to reproduce them for Python...

leogama · 2022-09-09T02:10:05Z

@t-kalinowski: do you think it's better to add features to the cache mechanism in this single PR or to just implement the basics here and split the features in separate PRs?

leogama · 2022-09-09T17:49:56Z

@yihui We have a small issue concerning the working directories where the cache code chunks are run. It seems that the R cache code always runs in the knit() call's WD. The cache_engine() call to load/purge any cache from a non-R engine also runs there. But the cache code for saving the non-R engine state is called indirectly from the in_input_dir(engine(options)) call, which may change the WD.

Both functions input_dir() and in_input_dir() are private, and the path in options$hash is relative to the knit() WD. I see three possible solutions to make it work transparently:

1. make the options$hash path absolute (will affect log messages)
2. pass the original working directory to cache_engine() somehow
3. call the non-R engine cache saving function after the call to engine(), with the WD restored

I advocate for the third option, and that cache engines should use the same API as the R cache, i.e. a list object with functions as "methods", like cache_engine()$exists(), cache_engine()$load(), cache_engine()$save(), etc. By the way, the function cache$exists() should probably also call this cache_engine()$exists() when options$engine != 'R' for consistency.
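
The third option and the "functions as methods" interface can be sketched in Python as an analogy. knitr's real API is in R and is not reproduced here; `CacheEngine`, `run_chunk` and `in_dir` are hypothetical names, and `pickle` stands in for whatever serializer the engine would use.

```python
import os
import pickle
import tempfile
from contextlib import contextmanager


@contextmanager
def in_dir(path):
    """Run a block with `path` as the working directory, then restore it."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


class CacheEngine:
    """Hypothetical engine cache with the same verbs as the R cache."""

    def __init__(self, hash_path):
        self.file = hash_path + ".pkl"  # relative path, like options$hash

    def exists(self):
        return os.path.exists(self.file)

    def save(self, state):
        os.makedirs(os.path.dirname(self.file) or ".", exist_ok=True)
        with open(self.file, "wb") as fh:
            pickle.dump(state, fh)

    def load(self):
        with open(self.file, "rb") as fh:
            return pickle.load(fh)


def run_chunk(cache, input_dir, code):
    """Run the code in the input dir, but save only after the WD is restored."""
    if cache.exists():
        return cache.load()
    state = {}
    with in_dir(input_dir):
        exec(code, state)
    state.pop("__builtins__", None)  # added by exec, not part of the session
    cache.save(state)  # back in the caller's WD, so the relative path resolves
    return state


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as knit_dir:
    input_dir = os.path.join(knit_dir, "docs")
    os.makedirs(input_dir)
    with in_dir(knit_dir):  # plays the role of the knit() call's WD
        cache = CacheEngine(os.path.join("cache", "chunk_abc123"))
        first = run_chunk(cache, input_dir, "x = 6 * 7")
        saved_in_knit_dir = os.path.exists(
            os.path.join(knit_dir, "cache", "chunk_abc123.pkl")
        )
        second = run_chunk(cache, input_dir, "x = 0")  # cache hit, code not run
```

Because `save()`, `load()` and `exists()` all run from the same directory, the relative cache path never has to be made absolute.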

t-kalinowski · 2022-09-09T17:56:38Z

> @t-kalinowski: do you think it's better to add features to the cache mechanism in this single PR or to just implement the basics here and split the features in separate PRs?

However you think is easiest. I'm happy to engage and review either way.

yihui · 2022-09-09T19:37:03Z

@leogama I agree with you. That seems to require a change in knitr, right? Please feel free to submit a PR there. Thanks!

leogama · 2022-09-09T22:03:40Z

> However you think is easiest. I'm happy to engage and review either way.

Great. I think I'll restrict this PR to the basic cache mechanism and then open other PRs for extra features like the chunk options cache.vars, cache.comments, dependson and autodep. This way, people who use the dev version can start using and testing the Python cache sooner.

leogama · 2022-09-17T02:00:04Z

We have a test file now. I think it's time to run the workflows.

leogama · 2022-09-17T17:31:37Z

Thanks for authorizing the workflows. The new test didn't run because I hadn't added dill to the list of Python modules to be installed in the testing virtualenv. Of course it won't run until version 0.3.6 is released, but at least now I know the "skip" test works.

I'll work on the documentation next. pkgdown isn't very happy...

leogama · 2022-09-18T16:27:28Z

@kevinushey How should I generate new man/*.Rd files? Just call roxygen2::roxygenize() or devtools::document()?

leogama · 2022-09-19T16:35:55Z

I added the cache.vars chunk option here with the basic implementation because dill isn't able to save all kinds of objects. Therefore, having at least a simple cache.vars working helps to deal with unpickleable variables. But the next dill release will have a much more granular control over which variables are saved. This feature can be integrated (and documented) in the knitr engine later.
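
The idea behind a basic cache.vars can be shown with the standard library alone. This is a sketch, not the PR's implementation: `save_selected` is a hypothetical helper, and `pickle` is used in place of dill so the example has no third-party dependency (dill handles more object types, but some objects, such as locks, remain unpicklable).

```python
import io
import pickle
import threading


def save_selected(namespace, cache_vars=None):
    """Pickle only the names in `cache_vars` (every name if it is None)."""
    names = list(namespace) if cache_vars is None else list(cache_vars)
    selected = {name: namespace[name] for name in names if name in namespace}
    buffer = io.BytesIO()
    pickle.dump(selected, buffer)
    return buffer.getvalue()


session = {
    "totals": [1, 2, 3],
    "label": "run-1",
    "lock": threading.Lock(),  # lock objects cannot be pickled
}

# Saving everything fails because of the single unpicklable variable.
try:
    save_selected(session)
    full_dump_ok = True
except TypeError:
    full_dump_ok = False

# Restricting the saved names sidesteps the problem.
restored = pickle.loads(save_selected(session, cache_vars=["totals", "label"]))
```

Listing the variables explicitly lets a chunk be cached even when its session holds an object the serializer cannot handle.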

kevinushey · 2022-09-20T17:59:22Z

devtools::document() should suffice for generating documentation.

t-kalinowski · 2022-10-07T15:44:26Z

Hi @leogama, is this ready to merge?

leogama · 2022-10-11T13:34:23Z

> Hi @leogama, is this ready to merge?

Not yet. It's waiting for a new release of https://github.com/uqfoundation/dill
I'll update you when it's ready (maybe it'll need some changes here).

leogama · 2022-10-27T10:29:49Z

@t-kalinowski dill v0.3.6 was finally released. I'll update the PR (it needs some changes) and finish this by the weekend.

leogama · 2022-12-13T17:31:20Z

Hi, @t-kalinowski. Sorry for the hiatus. I've adapted the code and tests for the released dill v0.3.6. Could you trigger a new workflow run*?

(*) I think it'll be necessary to somehow install my approved (but not merged) branch from knitr for the tests to succeed.

t-kalinowski · 2022-12-15T23:31:41Z

Hi @leogama, welcome back! I'm glad to help get this into main.

In the interim, we can add a github actions workflow step that installs the appropriate knitr branch in the runners, just to confirm everything passes.

Once we're happy, we can help coordinate getting the knitr PR merged and into the next CRAN release, and then merge this PR and release it to CRAN. It'll have to be done in stages, with knitr going to CRAN first, I believe.

leogama · 2022-12-16T01:51:55Z

Sure! How do I set up a custom package installation from a GitHub branch in Workflows? I have absolutely no idea.

leogama · 2022-12-16T13:30:02Z

@t-kalinowski I've added the knitr PR branch as an "extra package" to the R dependencies in the workflow. I'm not sure if this will overwrite or conflict with the DESCRIPTION's knitr dependency. Couldn't find any documentation or examples about that... So let's try?

t-kalinowski · 2023-06-21T18:14:59Z

@leogama I'd like to help get this ready to merge into main.

I think all the packages whose development versions this branch depended on have since been released to CRAN, which should simplify fixing the CI.

tmastny and others added 9 commits April 18, 2018 18:20

added unit tests that cover knitr rstudio#1505

228624b

added cache_eng_python to add Python session caching between chunks. Addresses knitr rstudio#1505

b52ecaa

dill caching engine for knitr, with tests

c345ce2

changes from feedback on knitr rstudio#1518 with updated tests

ff39889

fixed testing utils source in dill tests

8e07779

Merge 'rstudio/main' with 'tmastny/master' into branch 'cache-engine'

638a4e7

cache engine: update 'r' object identification logic

9753870

fix 'cache_path' when 'output.dir' is different from 'knitr:::input_dir()'

02c1771

cache loading should run in the input directory

dbebab3

This is what cache saving does, so load_session() must run in the same directory or it won't find local Python modules.

kevinushey reviewed May 16, 2022

leogama added 3 commits May 26, 2022 19:50

remove duplicated conversion functions

bd29f84

Merge branch 'main' into cache-engine

f8497a0

remove trailing whitespaces and empty line

fe4cd9f

leogama mentioned this pull request Sep 12, 2022

Changes to the cache API required by the new reticulate cache implementation yihui/knitr#2170

Open

First version of cache implementation with new knitr API

5d6f7a7

leogama added 3 commits September 15, 2022 20:06

Set environment() as default argument in eng_python_initialize()

445a5ca

Basic test for knitr engine cache

c6a88ad

minor

975c1b0

leogama requested a review from kevinushey September 17, 2022 01:59

leogama added 2 commits September 17, 2022 14:33

Workflows: install module dill in the testing virtualenv

401b1ba

Docs: remove @params from cache_eng_python, add it to pkgdown index

62c77d8

leogama added 2 commits September 18, 2022 22:11

Correctly initialize Python in knitr, honoring 'engine.path'

f487b52

Implement the 'cache.vars' chunk option; some style changes

cb9ee1f

Remove unused 'envir' parameter from 'eng_python_initialize*' functions

7d4eeec

leogama marked this pull request as draft October 11, 2022 13:34

leogama added 2 commits December 13, 2022 14:16

update cache engine docs

395627e

cache: adapt code and tests to dill package v0.3.6

55d1e03

leogama marked this pull request as ready for review December 13, 2022 17:31

fix typo, update generated documentation

d43b593

Workflow: use PR branch from knitr for testing

f354f60

leogama and others added 2 commits December 19, 2022 21:33

fix typo

38ef3ce

Merge branch 'main' into cache-engine

79b9732
