RFC for Eviction of cached task outputs #2633
Signed-off-by: Nick Müller <[email protected]>
We also see a use case for this cache eviction functionality from a privacy perspective: the right to be forgotten. GDPR allows 30 days to complete a deletion request; CCPA allows 45 days.
When processing personally identifiable information (PII) to adhere to, e.g., GDPR, we generally only retain data for 30 days in our workflows. We do this to ensure that any deletion requests from users whose data we are processing will be fulfilled "automagically" within the expected timeframe for the "right to be forgotten". With the cache eviction API we would be able to build a system that could evict the cache for certain types of workflows or tasks within a given timeframe. With that said, and at the risk of adding scope creep to this RFC, it might be even better to have the ability to set a TTL on the cache as an attribute, e.g.:
# Might make sense to only allow for a quite coarse granularity
cache_ttl_hours = 30 * 24
cache_ttl_days = 30 |
@paulbes Interestingly enough, we talked about automatic expiration of cached values after a certain timespan internally just yesterday 😄 Our discussion was mainly focused on a housekeeping/cache size perspective rather than GDPR related, but I agree it could be useful for that as well.
I agree that'd be great to have, but I'm not sure either if we want to include it in this RFC or keep it separate/add afterwards so we don't extend the scope too much... |
A few thoughts on this. I know we have discussed including a I think implementation of a cache eviction API does open possibilities for automated eviction. To support automated eviction it sounds like a separate service as @paulbes suggested or an additional component into one of the Flyte core services that periodically scrapes and GCs the cache is required if I am understanding this correctly? IMO it is a deep enough topic to require a separate discussion. I think it could be a very useful feature, maybe open a new issue to track it? |
@hamersaw Yes, you're correct, the addition mentioned by @paulbes (and the one we talked about internally) would require an additional service/component within Flyte to periodically check all cached entries and evict them should they have crossed a certain expiry threshold.
I agree, I'll open another issue to track this idea once the RFC gets accepted! Would definitely be a useful feature to have in Flyte itself, I'd say. |
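To make the periodic check discussed above concrete, here is a minimal sketch of the expiry test such a component might run on each pass. It is not part of the RFC: `CacheEntry`, `find_expired`, and the in-memory list of entries are hypothetical names used only for illustration, and a real implementation would query the cache backend and call the eviction API instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CacheEntry:
    """Hypothetical stand-in for a cached task output record."""
    key: str
    created_at: datetime


def find_expired(entries, ttl, now=None):
    """Return the keys of all entries whose age exceeds `ttl`."""
    now = now or datetime.now(timezone.utc)
    return [e.key for e in entries if now - e.created_at > ttl]


# Fixed reference time so the example is deterministic.
now = datetime(2022, 7, 1, tzinfo=timezone.utc)
entries = [
    CacheEntry("task-a", now - timedelta(days=45)),
    CacheEntry("task-b", now - timedelta(days=3)),
]

# A 30-day TTL, matching the retention window mentioned in the thread.
expired = find_expired(entries, ttl=timedelta(days=30), now=now)
print(expired)  # ['task-a']
```

A scheduled job could pass the returned keys to whatever eviction endpoint the RFC ends up defining; the scheduling itself is out of scope for this sketch.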
Do we see Intratask Checkpoints as a cache? If so it might make sense to include them as part of an eviction. |
@sbrunk Good point 🤔 Yes, we should probably also clear out all these values when we're evicting the cache for a task as we might still be accessing cached values otherwise even though it looks like we should be re-computing everything. Since at least some of the handling is done outside flyteadmin/propeller's context (inside the actual code executed), we might have to remove these values before the execution (instead of afterwards, as suggested for the cached output). |
this is awesome, thank you for the write-up!
@katrogan any idea/input on this? I'm honestly not quite sure how flytekit/the |
cc @kumare3 for the intra-task checkpointing bits |
Signed-off-by: Nick Müller <[email protected]>
@kumare3 Any comment on the Intratask Checkpoints topic? I believe we should remove/skip those values as well, but not sure what the best way to do so would be? @katrogan @pmahindrakar-oss I've added some of the comments from this thread to the RFC doc. Please take another look at the changes to see if they satisfy your comments/questions or if there's something we should clarify in more detail. |
Intra-task checkpoints should only be applied during consecutive retries of a task within the same workflow execution. So if there was a new workflow execution with the |
Ah, great, didn't know about that, thanks for the info @hamersaw 👍 I agree we should keep it out of the RFC, if possible, then. Might be worth adding a follow-up issue for the sake of completeness when cleaning up cached data, but otherwise we might be increasing an already quite extended scope even more. |
Checkpoints are localized to a single execution today and are not shared through the cache. Maybe we should do that in the future.
Signed-off-by: Nick Müller <[email protected]>
Looks good to me, @MorpheusXAUT, for the suggested changes |
LGTM. Can we make sure to update to include details on recursively evicting dynamic and subworkflow nodes?
Sure thing, will update tomorrow morning 👍 |
Signed-off-by: Nick Müller <[email protected]>
@katrogan Added some details about dynamic/workflow nodes and partial failures and slightly re-arranged the doc to emphasize we prefer extending the existing endpoints/adding a new |
thank you @MorpheusXAUT looks great! |
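For readers following the point about recursively evicting dynamic and subworkflow nodes with partial failures, the idea can be sketched roughly as below. This is an illustration under assumptions, not the RFC's design: `Node`, `evict_recursive`, and the `evict_fn` callback are hypothetical, and the tree is a simplified model of nested executions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical execution node; `children` models subworkflow/dynamic nodes."""
    node_id: str
    cached: bool = False
    children: list = field(default_factory=list)


def evict_recursive(node, evict_fn, errors=None):
    """Evict cached output for `node` and all descendants.

    A failure on one node is recorded and traversal continues, so a
    single error does not block eviction of the remaining nodes.
    """
    errors = [] if errors is None else errors
    if node.cached:
        try:
            evict_fn(node.node_id)
        except Exception as exc:
            errors.append((node.node_id, str(exc)))
    for child in node.children:
        evict_recursive(child, evict_fn, errors)
    return errors


evicted = []


def fake_evict(node_id):
    # Hypothetical backend call; fails for one node to show partial failure.
    if node_id == "n2":
        raise RuntimeError("backend unavailable")
    evicted.append(node_id)


tree = Node("root", children=[
    Node("n1", cached=True),
    Node("sub", children=[Node("n2", cached=True), Node("n3", cached=True)]),
])

errors = evict_recursive(tree, fake_evict)
print(evicted)  # ['n1', 'n3']
print(errors)   # [('n2', 'backend unavailable')]
```

Collecting errors rather than aborting lets the caller report exactly which nodes still hold cached data and retry only those.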
https://hackmd.io/qOztkaj4Rb6ypodvGEowAg?view
Comments already present on the HackMD doc are from our internal team and have been left in for clarification/further discussion.
Initial discussion on Slack