RFC - Cache Reservation API #1461
Thank you for opening this pull request! 🙌
@hamersaw should we merge this?
Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
2e31f0d to c066da3
User-side functionality will require the introduction of a cacheReservation Flyte task annotation. An elementary example of how this may look is provided below:

```python
@task(cacheReservation=True)
```
@wild-endeavor @kumare3 Can you give your opinion about this flag?
@task(cacheReservation=True)
@task(cache_serialized=True)
maybe?
this was actually one of the slight differences. the flytekit integration is currently using cache_reservable to induce the API. the flag is used in tandem with the cache flag or it doesn't do anything. so tasks with this functionality enabled will be annotated as follows:
@task(cache=True, cache_reservable=True, cache_version="1.0")
this seemed fairly idiomatic, seeing as how the reservation functionality introduces a layer on top of the existing cache scheme, but very open to feedback. i can certainly make changes in the flytekit PR (once it's ready).
cc @eapolinario
Ya I don't like cacheReservation. cache_serialized=True is better.
Is there a timeout?
Sounds good, I'll make the changes. Do we want to require the cache flag to be set as well? Or is setting cache_serialized enough to enable caching?
Is the timeout you mention for executions waiting on the cache to be populated? Is introducing another parameter the best solution here? The existing timeout parameter will cover the single executing task but not others that are waiting.
Edit: Maybe this is a terrible idea. The catalog key currently includes the cache_version parameter. Without it, we might get undesired behavior where different versions of a task are executed in serial.
Is there any advantage to generalizing this more? We can change the parameter to just "serialized" which means that only a single instance of the task can run at a time (obviously defined by inputs). The backend implementations will be very similar to what they are now, mostly changing the naming to better clarify what is happening.
@task(serialized=True)
And the obvious use case is for cacheable tasks
@task(cache=True, serialized=True)
And we can add a timeout which fails tasks that are waiting for others to complete
@task(serialized=True, serialize_timeout=30m)
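A rough sketch of the proposed timeout behavior, assuming a waiting task should fail once it has waited longer than `serialize_timeout` for the running instance to finish. `run_serialized` and `SerializeTimeout` are hypothetical names for illustration, and a `threading.Lock` again stands in for the reservation; none of this is existing flytekit API.

```python
import threading
from typing import Any, Callable


class SerializeTimeout(Exception):
    """Hypothetical error for a task that waited too long for its turn."""


def run_serialized(lock: threading.Lock, fn: Callable[[], Any], serialize_timeout: float) -> Any:
    # Wait for the other instance for at most serialize_timeout seconds.
    if not lock.acquire(timeout=serialize_timeout):
        raise SerializeTimeout(f"waited more than {serialize_timeout}s")
    try:
        return fn()
    finally:
        lock.release()


lock = threading.Lock()

# No other instance is running, so the task executes immediately.
first = run_serialized(lock, lambda: "done", serialize_timeout=0.1)

# Simulate another instance that is still holding the reservation.
lock.acquire()
try:
    run_serialized(lock, lambda: "never runs", serialize_timeout=0.05)
    timed_out = False
except SerializeTimeout:
    timed_out = True
finally:
    lock.release()
```

The existing task `timeout` would bound how long `fn` itself runs, while `serialize_timeout` here bounds only the waiting, which matches the distinction raised earlier in the thread.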
Signed-off-by: Daniel Rammer <[email protected]>
random comments @kumare3
I'm able to comment on a few of these.
That was my initial intuition as well, in our domain serialize is more commonly associated with the aforementioned rather than executing a sequence of tasks in serial.
I believe so,
We should be careful with this. As I noted ^^^, I believe we would need to add a "version" parameter to each task similar to the existing "cache_version" parameter. Otherwise different versions of tasks (predicated on more than input / output values) may be executed in serial. We could deprecate the existing cache_version in favor of plain version, but I'm not sure if this is feasible.
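The concern about versions can be made concrete with a small sketch. `reservation_key` is a hypothetical helper, and the key layout (task name, version, sorted inputs) is an assumption for illustration rather than the actual catalog key format.

```python
from typing import Any, Mapping, Optional, Tuple


def reservation_key(
    task_name: str,
    inputs: Mapping[str, Any],
    version: Optional[str] = None,
) -> Tuple[Any, ...]:
    """Build a hashable key; two executions serialize only if their keys match."""
    return (task_name, version, tuple(sorted(inputs.items())))


# Without a version, an old and a new implementation of "train" share a key,
# so they would be forced to run one after the other.
old_impl = reservation_key("train", {"x": 1})
new_impl = reservation_key("train", {"x": 1})
collides = old_impl == new_impl

# With a version in the key, the two implementations no longer block each other.
v1 = reservation_key("train", {"x": 1}, version="1.0")
v2 = reservation_key("train", {"x": 1}, version="2.0")
independent = v1 != v2
```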
Currently reservation timeouts are handled using a grace period multiplier (configured in datacatalog) on the reservation extension heartbeat interval. This heartbeat interval is the same as the workflow reevaluation loop. For my local testing the reeval is set to 5s and the multiplier to 3, so a deadlock lasts 15 seconds. It seems production reeval is commonly on the order of 30s; with the same multiplier of 3 that would be 1 minute 30 seconds.
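The arithmetic above, written out as a sketch. `reservation_expiry` is a hypothetical helper name; the only assumption is that the expiry is the heartbeat interval multiplied by the grace period multiplier, as the comment describes.

```python
from datetime import timedelta


def reservation_expiry(heartbeat_interval: timedelta, grace_multiplier: int) -> timedelta:
    """Time after which a reservation with no heartbeat is considered abandoned."""
    return heartbeat_interval * grace_multiplier


# Local testing values from the comment: 5s reevaluation loop, multiplier of 3.
local = reservation_expiry(timedelta(seconds=5), 3)

# Production-like values: 30s reevaluation loop, same multiplier.
production = reservation_expiry(timedelta(seconds=30), 3)
```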
@hamersaw have we reached a resolution here?
I believe so, I updated the RFC to reflect the current state. @wild-endeavor says as long as you and @kumare3 are in consensus he's happy (https://unionai.slack.com/archives/C02CMUNT4PQ/p1633111921106700). The only two issues were: As long as that sounds good, it's ready to move forward.
Awesome! Let's get it merged then.
hackmd.io note