
RFC - Caching of non-Flyte offloaded objects #1893

Merged · 11 commits into master · Dec 16, 2021

Conversation

eapolinario (Contributor)

```python
v = bar(df=df)
```

It's worth noting that this is a strictly opt-in feature, controlled at the level of Type Transformers. In other words, annotating types for which Type Transformers are not marked as opted in will be a no-op.
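To make the opt-in mechanism concrete, here is a minimal, self-contained sketch (not the actual flytekit API; `HashMethod`, `rows_hash`, and `foo` are illustrative names) of how a per-type hash annotation could be attached via `typing.Annotated` and discovered by an opted-in Type Transformer:

```python
# Illustrative sketch only: a hypothetical HashMethod marker carrying a
# user-supplied hash function, attached to a parameter via Annotated.
import hashlib
from typing import Annotated, get_args, get_type_hints

class HashMethod:
    """Hypothetical marker carrying a user-supplied hash function."""
    def __init__(self, fn):
        self.fn = fn

def rows_hash(rows):
    # Deterministic content hash of a list-of-tuples "dataframe" stand-in.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def foo(df: Annotated[list, HashMethod(rows_hash)]) -> list:
    return df

# A Type Transformer that opted in could introspect the annotation and
# compute the content hash when the value is produced:
hints = get_type_hints(foo, include_extras=True)
marker = next(a for a in get_args(hints["df"]) if isinstance(a, HashMethod))
print(marker.fn([(1, "a"), (2, "b")]))  # a stable content hash
```

Types whose transformers do not look for the marker simply ignore it, which is what makes the annotation a no-op for non-opted-in transformers.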
Contributor:

I think a big question for me is whether it should be opt-in for certain types... like Pandas (for which data has already been loaded in memory)... I understand that the proposal, at least from a purity perspective, leaves it as an opt-in feature for all... but I would like to question that a bit... would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?

Contributor (Author):

> would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?

That's true, but it happened almost by accident, i.e. we never exposed (nor enforced) the notion that values were being cached by value or by reference.

Also, we shouldn't be too prescriptive about hashing complex types, for two reasons:

  1. Cost: calculating hashes might be expensive.
  2. There is no single way of calculating the hash, although we can certainly offer a few default implementations (and be upfront about the probability of collisions, etc.).
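The "no single way" point can be seen with a tiny example (illustrative only; neither function is a real flytekit default): two plausible default hash implementations can disagree on the same logical value.

```python
# Two plausible "default" hash implementations for a dict-like value.
import hashlib
import json

def hash_json_canonical(obj):
    # Canonicalize first (sorted keys), then hash: order-insensitive.
    payload = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def hash_json_naive(obj):
    # Hash the serialization as-is: sensitive to dict insertion order.
    return hashlib.sha256(json.dumps(obj).encode()).hexdigest()

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same logical value, different insertion order

print(hash_json_canonical(a) == hash_json_canonical(b))  # True
print(hash_json_naive(a) == hash_json_naive(b))          # False
```

Both are defensible defaults, which is why leaving the choice (and the cost) to the user via opt-in annotations is attractive.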

Does this make sense? Can you clarify what we would gain if we made caching all types by value the default?

Contributor:

> Can you clarify what we would gain if we made caching all types by value the default?

Mainly that, from the feedback we heard a few times (and maybe it's overblown), this is what the existing UX suggested to them... the reason I want to discuss this now is that if we decide to offer it by default for some types, we will need an opt-out experience... and I think it might be worth writing down a few examples of each...

Contributor (Author):

This is the first time we're even making the distinction between the two caching semantics. I think it's fair to assume (and make clear in the docs) which semantics apply by default, and that users will be able to opt in to cache-by-value semantics for specific types.


How does this affect @dynamic tasks, subworkflows, and launchplans?

What's the observability provided by data catalog?
Contributor:

I think as part of this work we should enrich the caching data sent by Propeller to indicate whether caching by value or by reference occurred... not sure of the specific syntax, but conveying this information, I believe, is important...

Especially in the scenario where you call like this:

```python
@workflow
def wf1() -> Something:
    df = query_data()  # produces hash-annotated dataframes
    return expensive_compute(df)

@workflow
def wf2(df: pd.DataFrame) -> Something:
    return expensive_compute(df)
```

In which case, it's very plausible that wf2() gets called with a non-hash-annotated DataFrame (maybe called from the UI with an s3 path) and will NOT match the cache key produced/used within the wf1() execution... and conveying this distinction/expectation to users, I think, is important to avoid confusion...
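The mismatch described above can be sketched in a few lines (key format and names are illustrative, not Flyte's actual catalog key scheme): when the hash is present the key is content-based, and when it is absent the key falls back to the reference, so identical data can still miss the cache.

```python
# Sketch: hash-annotated vs. non-annotated inputs yield different cache keys.
def cache_key(task_name, value_ref, content_hash=None):
    # Cache by value when a content hash is available, by reference otherwise.
    if content_hash is not None:
        return f"{task_name}:hash:{content_hash}"
    return f"{task_name}:ref:{value_ref}"

# wf1: query_data() attached a content hash to the dataframe literal.
key_wf1 = cache_key("expensive_compute", "s3://bucket/a/df.parquet", "abc123")
# wf2: the same data launched from the UI as a plain s3 path, no hash.
key_wf2 = cache_key("expensive_compute", "s3://bucket/a/df.parquet", None)
print(key_wf1 == key_wf2)  # False: a cache miss despite identical content
```

Surfacing which of the two key forms was used on each lookup is exactly the observability being asked for here.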

Contributor (Author):

That's a really good catch.

Do you think it's worth annotating the type in the declaration of expensive_compute as having the hash overridden? Something like:

```python
def expensive_compute(df: Annotated[pd.DataFrame, HashOverridden]):
    ...
```

This way, callers of expensive_compute would see a message in the logs if they call that task with dataframes missing the hash annotation.
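A minimal sketch of what such a check could look like, assuming the hypothetical `HashOverridden` marker from the comment above and an assumed `hash` attribute on hash-annotated values (none of these names are real flytekit API):

```python
# Sketch: inspect a task's Annotated metadata at invocation time and
# collect warnings for inputs that were expected to carry a hash.
from typing import Annotated, get_args, get_type_hints

class HashOverridden:
    """Hypothetical marker: the task author expects hash-annotated inputs."""

def expensive_compute(df: Annotated[list, HashOverridden]):
    ...

def check_inputs(task_fn, **kwargs):
    hints = get_type_hints(task_fn, include_extras=True)
    warnings = []
    for name, value in kwargs.items():
        meta = get_args(hints.get(name, None) or type(None))
        if any(m is HashOverridden for m in meta):
            # Assumed convention: hash-annotated values expose a .hash attribute.
            if not hasattr(value, "hash"):
                warnings.append(f"{name}: task author expected a hash-annotated value")
    return warnings

print(check_inputs(expensive_compute, df=[1, 2, 3]))
```

The same introspection could run once at registration time instead of per call, which is the direction the thread converges on below.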

Contributor:

Not sure what it means to have an input with that annotation. Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?

Contributor (Author):

> Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?

Exactly.

Contributor:

If we leverage tags, as we seem to be leaning towards, maybe this is a moot point, because by-reference lookup will continue to work as expected? I think if we surface the information to the user (that we did a lookup by reference, not by hash) maybe that is enough... not as a warning or anything, just stating a fact...
We will have to thoroughly document all of this... it's a good pattern if we can make sure users understand what's going on...

Contributor (Author):

I still think it's worth showing the user, whenever we are able to, that cache-by-value semantics were used.

One idea we talked about is to catch cases as early as during registration phase and at least inform the user that their intent might not match the code. I updated the RFC with two examples of such ideas.

Signed-off-by: Eduardo Apolinario <[email protected]>
@EngHabu (Contributor) left a comment:

Awesome! LGTM

@eapolinario eapolinario merged commit 546431b into master Dec 16, 2021
@eapolinario eapolinario deleted the caching-of-offloaded-objects-rfc branch December 16, 2021 21:55

3 participants