Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix failure expiry race condition #67

Merged
merged 3 commits into from
Nov 1, 2022
Merged

Conversation

njhill
Copy link
Member

@njhill njhill commented Oct 31, 2022

Motivation

A bug was encountered whereby model load failure records would not be cleaned up after expiration while there were continuous (failing) requests for that model, meaning no reload attempt occurred and those requests failed indefinitely.

Modifications

  • Add unit test to reproduce the problem, which required
    • Making expiry time configurable
    • Modifying the behaviour of the dummy model so that triggered loading failures only occur on the first attempt
  • Add some log statements
  • Fix the main bug: When a failed cache entry without any record in the registry is encountered during a load attempt, remove it rather than treating as a load failure
  • Fix some other logic related to failure expiry
  • Use a shorter expiry time (2/3 of configured time) for recently-requested models

Result

Failures expire as intended in all situations

Signed-off-by: Nick Hill [email protected]

Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @njhill -- I did run the test, as well as just the new/changed tests with and without the corresponding main source changes so functionally your changes look good as far as I can surmise. I just have two small nit-picky questions/comments

} else {
lastUsed = ce != null ? runtimeCache.getLastUsedTime(modelId) : -1L;
// Use shorter expiry age if model was used in last 3 minutes
final long expiryAge = (lastUsed > 0 && (now - lastUsed) < 180_000L)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the last 3 minutes (180_000L) here be a defined constant/env variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it could be ... let me push a change with that

Motivation

A bug was encountered whereby model load failure records would not be cleaned up after expiration while there were continuous (failing) requests for that model, meaning no reload attempt occurred and those requests failed indefinitely.

Modifications

- Add unit test to reproduce the problem, which required
   - Making expiry time configurable
   - Modifying the behaviour of the dummy model so that triggered loading failures only occur on the first attempt
- Add some log statements
- Fix the main bug: When a failed cache entry without any record in the registry is encountered during a load attempt, remove it rather than treating as a load failure
- Fix some other logic related to failure expiry
- Use a shorter expiry time (2/3 of configured time) for recently-used models

Result

Failures expire as intended in all situations

Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
@njhill
Copy link
Member Author

njhill commented Nov 1, 2022

Thanks @ckadner, I pushed an update with the constant, PTAL

Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ckadner, njhill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kserve-oss-bot kserve-oss-bot merged commit 1c2de08 into main Nov 1, 2022
@njhill njhill deleted the fail-expire-race branch November 1, 2022 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants