Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ensure intra-cluster InternalExceptions are propagated properly #58

Merged
merged 2 commits into from
Oct 15, 2022

Conversation

njhill
Copy link
Member

@njhill njhill commented Sep 8, 2022

Motivation

In certain error edge cases, InternalExceptions can be thrown on the inferencing path, but these currently don't propagate properly because they aren't declared to be thrown by the internal thrift applyModel(Multi) method. Because of this, they are treated as generic errors meaning that the error detail (message) is lost and when returned to the user can have a confusing message like "Nowhere available to load".

Furthermore, there was a bug when checking the type of remote exceptions, where in certain places they were not first being unwrapped as intended.

Modifications

  • Add InternalException to internal applyModel thrift rpc method definitions
  • Ensure remote exceptions to be examined are unwrapped from InvocationTargetException first
  • Add unit test for this case
  • Also update to use latest version of thrift

Result

InternalExceptions will propagate properly between modelmesh pods

@ckadner
Copy link
Member

ckadner commented Sep 9, 2022

Hi @njhill -- Your changes look good as far as I understand from reading the code. But would you have an example use case for me to test this before and after your change?

@njhill
Copy link
Member Author

njhill commented Sep 12, 2022

@ckadner good question... it's kind of an edge case but will try to craft a unit test to cover it.

Motivation

In certain error edge cases, InternalExceptions can be thrown on the inferencing path, but these currently don't propagate properly because they aren't declared to be thrown by the internal thrift applyModel(Multi) method. Because of this, they are treated as generic errors meaning that the error detail (message) is lost and when returned to the user can have a confusing message like "Nowhere available to load".

Modifications

- Add InternalException to internal applyModel thrift rpc method definitions
- Also update to use latest version of thrift

Result

InternalExceptions will propagate properly between modelmesh pods

Signed-off-by: Nick Hill <[email protected]>
@njhill
Copy link
Member Author

njhill commented Oct 3, 2022

@ckadner have added a unit test and also fixed a related bug that it exposed, PTAL!

@njhill njhill marked this pull request as ready for review October 3, 2022 22:33
Copy link
Member

@ckadner ckadner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ckadner, njhill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kserve-oss-bot kserve-oss-bot merged commit 949aadc into main Oct 15, 2022
@ckadner
Copy link
Member

ckadner commented Oct 15, 2022

Sorry for the delay @njhill -- I ran the unit tests including running your new test without your code changes to verify they would fail. I also built a new modelmesh image from your code and "deployed" that to my modelmesh cluster to make sure nothing breaks.

@njhill njhill deleted the propagate-ie branch October 15, 2022 05:30
ruivieira pushed a commit to ruivieira/modelmesh that referenced this pull request Jul 10, 2024
Create Workflow for release and tag with Changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants