Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Eliminate the exception during shutdown #3826

Merged
merged 4 commits into from
May 5, 2023

Conversation

frostming
Copy link
Contributor

@frostming frostming commented May 4, 2023

Signed-off-by: Frost Ming [email protected]

What does this PR address?

Fixes #3813

A bug that causes the runner to keep running after it is terminated. This is because the CancelledError is swallowed silently, making the tasks never be cancelled by the runner.

BTW, since CancelledError is subclassing BaseException it won't be caught by the except Exception block. We can remove it.

Before submitting:

@frostming frostming requested a review from a team as a code owner May 4, 2023 05:00
@frostming frostming requested review from aarnphm and removed request for a team May 4, 2023 05:00
aarnphm
aarnphm previously approved these changes May 4, 2023
aarnphm
aarnphm previously approved these changes May 4, 2023
@sauyon
Copy link
Contributor

sauyon commented May 4, 2023

Just tried this and if you kill the server in training you get

  File "/home/sauyon/devel/bentoml/src/bentoml/_internal/marshal/dispatcher.py", line 206, in train_optimizer
    await self._wake_event.wait_for(self._queue.__len__)
  File "/usr/lib/python3.7/asyncio/locks.py", line 381, in wait_for
    await self.wait()
  File "/usr/lib/python3.7/asyncio/locks.py", line 354, in wait
    await fut
concurrent.futures._base.CancelledError
 (trace=ae8eb131d85f182410e65752fac63ced,span=6b02be23ca37ea62,sampled=0)
Traceback (most recent call last):
  File "/home/sauyon/devel/bentoml/src/bentoml/_internal/marshal/dispatcher.py", line 206, in train_optimizer
    await self._wake_event.wait_for(self._queue.__len__)
  File "/usr/lib/python3.7/asyncio/locks.py", line 381, in wait_for
    await self.wait()
  File "/usr/lib/python3.7/asyncio/locks.py", line 354, in wait
    await fut
concurrent.futures._base.CancelledError
2023-05-03T23:05:35-0700 [ERROR] [runner:iris_clf:1] Traceback (most recent call last):
  File "/home/sauyon/devel/bentoml/src/bentoml/_internal/marshal/dispatcher.py", line 206, in train_optimizer
    await self._wake_event.wait_for(self._queue.__len__)
  File "/usr/lib/python3.7/asyncio/locks.py", line 381, in wait_for
    await self.wait()
  File "/usr/lib/python3.7/asyncio/locks.py", line 354, in wait
    await fut
concurrent.futures._base.CancelledError
 (trace=ae8eb131d85f182410e65752fac63ced,span=6b02be23ca37ea62,sampled=0)
Traceback (most recent call last):
  File "/home/sauyon/devel/bentoml/src/bentoml/_internal/marshal/dispatcher.py", line 206, in train_optimizer
    await self._wake_event.wait_for(self._queue.__len__)
  File "/usr/lib/python3.7/asyncio/locks.py", line 381, in wait_for
    await self.wait()
  File "/usr/lib/python3.7/asyncio/locks.py", line 354, in wait
    await fut
concurrent.futures._base.CancelledError

instead, which doesn't seem much better?

@frostming
Copy link
Contributor Author

Just tried this and if you kill the server in training you get

@sauyon Thanks for that info. I tested on Python 3.7(which I didn't do previously), it looks good after I reraise the exception.

@sauyon
Copy link
Contributor

sauyon commented May 5, 2023

Looks better to me! Should we be worried about the return on L316 and on L180?

@frostming
Copy link
Contributor Author

frostming commented May 5, 2023

Looks better to me! Should we be worried about the return on L316 and on L180?

L180 is intentional to return a fallback value. L316 doesn't make a difference since it is a top async task on its own, but IMO it has a slight mismatch in semantic: a cancellation should not be a normal return. I will change it.

Copy link
Contributor

@sauyon sauyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this LGTM! Not sure why I didn't reraise in the training code and added the raise elsewhere...

@frostming
Copy link
Contributor Author

Ok, this LGTM! Not sure why I didn't reraise in the training code and added the raise elsewhere...

NVM, it may come from a refactor when the logic is extracted into a function

@sauyon
Copy link
Contributor

sauyon commented May 5, 2023

@aarnphm are unit tests broken?

@aarnphm
Copy link
Contributor

aarnphm commented May 5, 2023

Oh we need to install Pillow in our tests. Let me make a quick PR for that

@aarnphm aarnphm merged commit 5248eee into bentoml:main May 5, 2023
@frostming frostming deleted the fix/terminate-runner branch May 5, 2023 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bugs: Exception during shutdown
3 participants