
RFC - Caching of non-Flyte offloaded objects #1893

Merged · 11 commits into master · Dec 16, 2021

Conversation

eapolinario (Contributor)

```python
v = bar(df=df)
```

It's worth noting that this is a strictly opt-in feature, controlled at the level of Type Transformers. In other words, annotating types for which Type Transformers are not marked as opted in will be a no-op.
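To make the opt-in mechanism concrete, here is a minimal, self-contained sketch (not the actual flytekit API; `HashMethod`, `rows_hash`, and `foo` are illustrative names) of how a per-type hash annotation could be attached via `typing.Annotated` and discovered by an opted-in Type Transformer:

```python
# Illustrative sketch only: a hypothetical HashMethod marker carrying a
# user-supplied hash function, attached to a parameter via Annotated.
import hashlib
from typing import Annotated, get_args, get_type_hints

class HashMethod:
    """Hypothetical marker carrying a user-supplied hash function."""
    def __init__(self, fn):
        self.fn = fn

def rows_hash(rows):
    # Deterministic content hash of a list-of-tuples "dataframe" stand-in.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def foo(df: Annotated[list, HashMethod(rows_hash)]) -> list:
    return df

# A Type Transformer that opted in could introspect the annotation and
# compute the content hash when the value is produced:
hints = get_type_hints(foo, include_extras=True)
marker = next(a for a in get_args(hints["df"]) if isinstance(a, HashMethod))
print(marker.fn([(1, "a"), (2, "b")]))  # a stable content hash
```

Types whose transformers do not look for the marker simply ignore it, which is what makes the annotation a no-op for non-opted-in transformers.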
Contributor:

I think a big question for me is whether it should be opt-in for certain types... like Pandas (for which data has already been loaded in memory)... I understand that the proposal, at least from a purity perspective, leaves it as an opt-in feature for all... but I would like to question that a bit... would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?

Contributor (Author):

> would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?

That's true, but it happened almost by accident, i.e. we never exposed (nor enforced) the notion that values were being cached by value or by reference.

Also, we shouldn't be too prescriptive about hashing complex types, for two reasons:

  1. Cost: calculating hashes might be expensive.
  2. There is no single way of calculating the hash, although we can certainly offer a few default implementations (and be upfront about the probability of collisions, etc.).
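The "no single way" point can be seen with a tiny example (illustrative only; neither function is a real flytekit default): two plausible default hash implementations can disagree on the same logical value.

```python
# Two plausible "default" hash implementations for a dict-like value.
import hashlib
import json

def hash_json_canonical(obj):
    # Canonicalize first (sorted keys), then hash: order-insensitive.
    payload = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def hash_json_naive(obj):
    # Hash the serialization as-is: sensitive to dict insertion order.
    return hashlib.sha256(json.dumps(obj).encode()).hexdigest()

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same logical value, different insertion order

print(hash_json_canonical(a) == hash_json_canonical(b))  # True
print(hash_json_naive(a) == hash_json_naive(b))          # False
```

Both are defensible defaults, which is why leaving the choice (and the cost) to the user via opt-in annotations is attractive.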

Does this make sense? Can you clarify what we would gain if we made caching all types by value the default?

Contributor:

> Can you clarify what we would gain if we made caching all types by value the default?

Mainly that, from the feedback we heard a few times (and maybe it's overblown), this is what the existing UX suggested to them... the reason I want to discuss this now is that if we decide to offer it by default for some types, we will need an opt-out experience... and I think it might be worth writing down a few examples of each...

Contributor (Author):

This is the first time we're even making the distinction between the two caching semantics. I think it's fair to assume (and make clear in the docs) which semantics apply by default, and that users will be able to opt in to cache-by-value semantics for specific types.


How does this affect @dynamic tasks, subworkflows, and launchplans?

What's the observability provided by data catalog?
Contributor:

I think as part of this work we should enrich the caching data sent by Propeller to indicate whether caching by value or by reference occurred... not sure of the specific syntax, but conveying this information, I believe, is important...

Especially in the scenario where you call like this:

```python
@workflow
def wf1() -> Something:
    df = query_data()  # produces hash-annotated dataframes
    return expensive_compute(df)

@workflow
def wf2(df: pd.DataFrame) -> Something:
    return expensive_compute(df)
```

In which case, it's very plausible that wf2() gets called with a non-hash-annotated DataFrame (maybe called from the UI with an s3 path) and will NOT match the cache key produced/used within the wf1() execution... and conveying this distinction/expectation to users, I think, is important to avoid confusion...
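The mismatch described above can be sketched in a few lines (key format and names are illustrative, not Flyte's actual catalog key scheme): when the hash is present the key is content-based, and when it is absent the key falls back to the reference, so identical data can still miss the cache.

```python
# Sketch: hash-annotated vs. non-annotated inputs yield different cache keys.
def cache_key(task_name, value_ref, content_hash=None):
    # Cache by value when a content hash is available, by reference otherwise.
    if content_hash is not None:
        return f"{task_name}:hash:{content_hash}"
    return f"{task_name}:ref:{value_ref}"

# wf1: query_data() attached a content hash to the dataframe literal.
key_wf1 = cache_key("expensive_compute", "s3://bucket/a/df.parquet", "abc123")
# wf2: the same data launched from the UI as a plain s3 path, no hash.
key_wf2 = cache_key("expensive_compute", "s3://bucket/a/df.parquet", None)
print(key_wf1 == key_wf2)  # False: a cache miss despite identical content
```

Surfacing which of the two key forms was used on each lookup is exactly the observability being asked for here.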

Contributor (Author):

That's a really good catch.

Do you think it's worth annotating the type in the declaration of expensive_compute as having the hash overridden? Something like:

```python
def expensive_compute(df: Annotated[pd.DataFrame, HashOverridden]):
    ...
```

This way, callers of expensive_compute would see a message in the logs if they call that task with dataframes missing the hash annotation.
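A minimal sketch of what such a check could look like, assuming the hypothetical `HashOverridden` marker from the comment above and an assumed `hash` attribute on hash-annotated values (none of these names are real flytekit API):

```python
# Sketch: inspect a task's Annotated metadata at invocation time and
# collect warnings for inputs that were expected to carry a hash.
from typing import Annotated, get_args, get_type_hints

class HashOverridden:
    """Hypothetical marker: the task author expects hash-annotated inputs."""

def expensive_compute(df: Annotated[list, HashOverridden]):
    ...

def check_inputs(task_fn, **kwargs):
    hints = get_type_hints(task_fn, include_extras=True)
    warnings = []
    for name, value in kwargs.items():
        meta = get_args(hints.get(name, None) or type(None))
        if any(m is HashOverridden for m in meta):
            # Assumed convention: hash-annotated values expose a .hash attribute.
            if not hasattr(value, "hash"):
                warnings.append(f"{name}: task author expected a hash-annotated value")
    return warnings

print(check_inputs(expensive_compute, df=[1, 2, 3]))
```

The same introspection could run once at registration time instead of per call, which is the direction the thread converges on below.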

Contributor:

Not sure what it means to have an input with that annotation. Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?

Contributor (Author):

> Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?

Exactly.

Contributor:

If we leverage tags, as we seem to be leaning towards, maybe this is a moot point, because by-reference lookup will continue to work as expected? I think if we surface the information to the user (that we did a lookup by reference, not by hash) maybe that is enough... not as a warning or anything, just stating a fact...
We will have to thoroughly document all of this... it's a good pattern if we can make sure users understand what's going on...

Contributor (Author):

I still think it's worth showing the user, whenever we are able to, that cache-by-value semantics were used.

One idea we talked about is to catch cases as early as during registration phase and at least inform the user that their intent might not match the code. I updated the RFC with two examples of such ideas.

Signed-off-by: Eduardo Apolinario <[email protected]>
@EngHabu (Contributor) left a comment:

Awesome! LGTM

@eapolinario eapolinario merged commit 546431b into master Dec 16, 2021
@eapolinario eapolinario deleted the caching-of-offloaded-objects-rfc branch December 16, 2021 21:55

3 participants