RFC - Caching of non-Flyte offloaded objects #1893
Conversation
Signed-off-by: Eduardo Apolinario <[email protected]>
v = bar(df=df)
```
It's worth noting that this is a strictly opt-in feature, controlled at the level of Type Transformers. In other words, annotating types whose Type Transformers have not opted in will be a no-op.
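To make the opt-in/no-op behavior concrete, here is a minimal stdlib-only sketch. The names `HashMethod`, `TypeTransformer`, and `supports_hash_override` are illustrative, not the actual flytekit API: the annotated hash is honored only when the transformer has opted in; otherwise the annotation is silently ignored and the default (reference-style) key is used.

```python
# Hypothetical sketch of the opt-in semantics; names are illustrative only.
import hashlib
from typing import Annotated, Any, Callable, get_args, get_origin


class HashMethod:
    """Wraps a user-supplied hash function attached via typing.Annotated."""
    def __init__(self, fn: Callable[[Any], str]):
        self.fn = fn


class TypeTransformer:
    # Transformers must explicitly opt in to hash overriding.
    supports_hash_override = False


class ListTransformer(TypeTransformer):
    supports_hash_override = True


def cache_key(value, annotation, transformer: TypeTransformer) -> str:
    """Use the annotated hash only when the transformer opted in;
    otherwise fall back to a default key (simulated here)."""
    if get_origin(annotation) is Annotated and transformer.supports_hash_override:
        for meta in get_args(annotation)[1:]:
            if isinstance(meta, HashMethod):
                return meta.fn(value)
    # default: key not derived from the value's content
    return hashlib.sha256(repr(value).encode()).hexdigest()


def hash_list(v: list) -> str:
    return hashlib.sha256(",".join(map(str, sorted(v))).encode()).hexdigest()


annotation = Annotated[list, HashMethod(hash_list)]
key_opted_in = cache_key([3, 1, 2], annotation, ListTransformer())
key_no_optin = cache_key([3, 1, 2], annotation, TypeTransformer())
print(key_opted_in == hash_list([3, 1, 2]))  # override applied
print(key_no_optin == hash_list([3, 1, 2]))  # annotation is a no-op
```

Note that the no-op case fails silently here, which is exactly the registration-time observability concern raised later in this thread.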
I think a big question for me is whether it should be opt-in for certain types... like Pandas (for which the data has already been loaded in memory)... I understand that the proposal, at least from a purity perspective, leaves it as an opt-in feature for all... but I would like to question that a bit... would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?
> would it be a fair assumption, on the users' side, to say that "built-in" types are always by value?
That's true, but it happened almost by accident, i.e. we never exposed (nor enforced) the notion of values being cached by value vs. by reference.
Also, we shouldn't be too prescriptive about hashing complex types, for two reasons:
- Cost: calculating hashes might be expensive.
- There is no single way of calculating the hash. We can certainly offer a few default implementations (and be upfront about the probability of collisions, etc.).

Does this make sense? Can you clarify what we could gain if we made caching by value the default for all types?
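The "no single way" point can be made concrete with a hypothetical stdlib-only example: two defensible hash functions over the same tabular data produce different keys, and the cheaper one admits collisions, so whichever default we ship has documented trade-offs.

```python
# Two defensible ways to hash the same tabular data (illustrative, stdlib only).
import hashlib

rows = [("a", 1), ("b", 2), ("c", 3)]


def hash_full_content(rows) -> str:
    """Exact: hashes every cell; cost grows with the size of the data."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()


def hash_schema_and_head(rows, n=2) -> str:
    """Cheap: hashes only the first n rows plus the row count; much
    faster for large data, but two datasets sharing a prefix and a
    length collide."""
    h = hashlib.sha256()
    h.update(str(len(rows)).encode())
    for row in rows[:n]:
        h.update(repr(row).encode())
    return h.hexdigest()


print(hash_full_content(rows) == hash_schema_and_head(rows))  # different keys
# A prefix-sharing dataset collides under the cheap scheme:
other = [("a", 1), ("b", 2), ("c", 999)]
print(hash_schema_and_head(rows) == hash_schema_and_head(other))  # collision
```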
> Can you clarify what we could gain if we made caching by value the default for all types?
Mainly that, from the feedback we heard a few times (and maybe it's overblown), this was what the existing UX suggested to them... the reason I want to discuss this now is that if we decide to offer it by default for some types, we will need an opt-out experience... and I think it might be worth writing down a few examples of each...
This is the first time we're even making the distinction between the two caching semantics. I think it's fair to assume (and make it clear in the docs) that we use cache-by-reference everywhere unless specified otherwise, in which case users will be able to opt in to cache-by-value semantics for specific types.
How does this affect @dynamic tasks, subworkflows, and launchplans?

What's the observability provided by data catalog?
I think as part of this work we should enrich the caching data sent by propeller to indicate whether caching by value or by reference occurred... not sure of the specific syntax, but conveying this information, I believe, is important...
Especially in a scenario where you call it like this:
```python
@workflow
def wf1() -> Something:
    df = query_data()  # produces hash-annotated dataframes
    return expensive_compute(df)

@workflow
def wf2(df: pd.DataFrame) -> Something:
    return expensive_compute(df)
```
In which case it's very plausible that wf2() gets called with a non-hash-annotated DataFrame (maybe called from the UI with an s3 path) and will NOT match the cache key produced/used within the wf1() execution... and conveying this distinction/expectation to users, I think, is important to avoid confusion...
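A toy sketch of that mismatch (the key derivation below is hypothetical, not propeller's actual logic): when the literal carries a hash, the key is derived from it (by value); otherwise it falls back to the storage URI (by reference), so the two invocations can never share a cache entry even though the underlying data is identical.

```python
# Hypothetical cache-key derivation illustrating the wf1/wf2 mismatch.
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Literal:
    """Simulates an offloaded dataframe literal: a storage URI plus an
    optional hash set by a hash-annotated upstream task."""
    uri: str
    hash: Optional[str] = None


def cache_key(task: str, lit: Literal) -> str:
    # By value when a hash is present, by reference otherwise.
    discriminator = lit.hash if lit.hash is not None else lit.uri
    return hashlib.sha256(f"{task}:{discriminator}".encode()).hexdigest()


content_hash = hashlib.sha256(b"col_a,col_b\n1,2\n").hexdigest()

# wf1: query_data() produced a hash-annotated dataframe.
from_wf1 = Literal(uri="s3://bucket/run-1/df.parquet", hash=content_hash)
# wf2: the same data handed in from the UI as a bare s3 path, no hash.
from_wf2 = Literal(uri="s3://bucket/uploads/df.parquet")

print(cache_key("expensive_compute", from_wf1) ==
      cache_key("expensive_compute", from_wf2))  # cache miss in wf2
```

Surfacing which branch was taken (hash vs. URI) is exactly the observability enrichment proposed above.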
That's a really good catch.
Do you think it's worth annotating the type in the declaration of `expensive_compute` as having the hash overridden? Something like:
```python
def expensive_compute(df: Annotated[pd.DataFrame, HashOverriden]):
    ...
```
This way, callers of `expensive_compute` might see in the logs if they call that task with dataframes missing the hash annotation.
Not sure what it means to have an input with that annotation. Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?
> Do you mean flyte/flytekit would log that "task author expected a hash and you provided none"?
Exactly.
If we leverage tags, as we seem to be leaning towards, maybe this is a moot point then? Because lookup by reference will continue to work as expected? I think if we surface the information to the user (that we did a lookup by reference, not by hash), maybe that is enough? Not as a warning or anything... just stating a fact...
We will have to thoroughly document all of this... it's a good pattern if we can make sure users understand what's going on...
I still think it's worth showing the user, whenever we are able to, that the cache-by-value semantics were used.
One idea we talked about is to catch such cases as early as the registration phase and at least inform the user that their intent might not match the code. I updated the RFC with two examples of such ideas.
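One shape such a registration-time check could take is sketched below; `HashMethod`, `OPTED_IN`, and `check_task_signature` are all hypothetical names, not flytekit API. The idea is simply to scan a task's signature at registration and warn when a hash annotation would be a no-op because the underlying type's transformer has not opted in.

```python
# Hypothetical registration-time lint for no-op hash annotations.
import warnings
from typing import Annotated, get_args, get_origin, get_type_hints


class HashMethod:
    def __init__(self, fn):
        self.fn = fn


# Which base types have transformers that opted in (simulated).
OPTED_IN = {list}


def check_task_signature(fn) -> list:
    """Return a warning message for every input whose hash annotation
    is attached to a type that has not opted in."""
    problems = []
    for name, hint in get_type_hints(fn, include_extras=True).items():
        if get_origin(hint) is Annotated:
            base, *meta = get_args(hint)
            if any(isinstance(m, HashMethod) for m in meta) and base not in OPTED_IN:
                problems.append(
                    f"input '{name}': hash annotation on {base.__name__} is a "
                    f"no-op (transformer has not opted in)"
                )
    return problems


def expensive_compute(df: Annotated[dict, HashMethod(lambda d: str(d))]) -> int:
    return len(df)


# At "registration" time, surface the mismatch between intent and code:
for msg in check_task_signature(expensive_compute):
    warnings.warn(msg)
```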
Awesome! LGTM
#1581