Don't hang RPC when netty-tcnative .so fails to load due to (musl) linking errors #2599
Comments
There are many other types of linking errors (besides musl) that can cause this situation too. We encountered it back when we upgraded google-cloud-java from 0.15.0 to 1.0.1 because the netty-tcnative dependency had to move in step with it, and we hadn't updated netty-tcnative. |
I agree that every linking error is equivalent for this issue. |
Usually such errors manifest as an exception that goes into some cubby, never to be read from. Did you get anything from logs that would indicate this happened? |
Yes, if you turn on FINE logging, you will see errors like this:
|
@garrettjonesgoogle When building your server or channel, GrpcSslContexts checks that either openssl is available, or that jetty alpn is. If both of those fail, it shouldn't even be possible to start a server or channel. So, I guess the question is, how did you even get to handling RPCs if it should have failed much earlier? |
The first attempt to make an RPC in the app encounters this issue. I'm not sure why there are no failures before that point; I would have expected gRPC to throw an exception, but it doesn't. |
We also face this issue, both with the SSL lib and when a proto class is missing. So upvoting this bug. |
I wanted to poke this issue. I verified that it still happens in gRPC 1.3.x. Here is a relevant stack trace:
This failure does not propagate back to the initial RPC, so the initial RPC eventually fails with DEADLINE_EXCEEDED instead of the real cause. Is there any way to propagate this error to the RPC? |
@garrettjonesgoogle, that back trace is really useful. I thought the SslContext was created in the NettyChannelBuilder before/during build(). It looks like it is accidentally being delayed. We can fix that pretty easily. |
@ejona86 that's great to hear! |
Creating the SslContext can throw, generally due to broken ALPN. We want that to propagate to the caller of build(), instead of within the channel where it could easily cause hangs. We still delay creation until actual build() time, since TLS is not guaranteed to work and the application may be configuring plaintext or similar later before calling build() where SslContext is unnecessary. The only externally-visible change should be the exception handling. I'd add a test, but the things throwing are static and trying to inject them would be pretty messy. Fixes grpc#2599
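The behavior described in this commit message can be illustrated with a minimal, self-contained sketch. It does not use gRPC or Netty: `FakeSslContext`, `FakeChannel`, `FakeChannelBuilder`, and `brokenFactory` are hypothetical stand-ins, and the `IllegalStateException` only simulates a linking or ALPN failure. The point it shows is that context creation waits until `build()` (so a plaintext configuration can skip it) but happens no later than `build()`, so the caller sees the exception directly.

```java
import java.util.function.Supplier;

/** Illustration only: every type here is a hypothetical stand-in, not a gRPC or Netty class. */
public class EagerBuildSketch {

    /** Stand-in for a TLS context object. */
    static final class FakeSslContext {}

    /** Stand-in for a channel; a null context means plaintext. */
    static final class FakeChannel {
        private final FakeSslContext sslContext;

        FakeChannel(FakeSslContext sslContext) {
            this.sslContext = sslContext;
        }

        boolean usesTls() {
            return sslContext != null;
        }
    }

    /** Stand-in builder that creates the TLS context inside build(). */
    static final class FakeChannelBuilder {
        private final Supplier<FakeSslContext> sslContextFactory;
        private boolean plaintext;

        FakeChannelBuilder(Supplier<FakeSslContext> sslContextFactory) {
            this.sslContextFactory = sslContextFactory;
        }

        FakeChannelBuilder usePlaintext() {
            plaintext = true;
            return this;
        }

        FakeChannel build() {
            // Creation is deferred until build() so that plaintext setups never
            // touch TLS, but it is not deferred past build(): any exception
            // thrown by the factory reaches the caller of build() directly.
            FakeSslContext context = plaintext ? null : sslContextFactory.get();
            return new FakeChannel(context);
        }
    }

    /** Simulates a broken native library or ALPN setup. */
    static FakeSslContext brokenFactory() {
        throw new IllegalStateException("simulated linking error while creating SslContext");
    }

    public static void main(String[] args) {
        try {
            new FakeChannelBuilder(EagerBuildSketch::brokenFactory).build();
            System.out.println("unexpected: build() succeeded");
        } catch (IllegalStateException e) {
            System.out.println("build() failed fast: " + e.getMessage());
        }

        // Plaintext never invokes the broken factory, so build() succeeds.
        FakeChannel channel =
            new FakeChannelBuilder(EagerBuildSketch::brokenFactory).usePlaintext().build();
        System.out.println("plaintext channel built, usesTls=" + channel.usesTls());
    }
}
```

With this shape, a broken TLS setup surfaces at channel construction time rather than as a DEADLINE_EXCEEDED on the first RPC.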
@garrettjonesgoogle, I just sent out #3060 which has the pretty small fix in 1c0f826. We were going to be releasing 1.4.0 tomorrow. This has been a big enough problem for users that you'd probably like us to include it in 1.4.0, right? |
@ejona86 absolutely - we probably get a new issue filed for this every couple weeks, and each person who encounters this probably burns tens of minutes to possibly hours trying to figure it out. |
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.0.3
What JVM are you using (java -version)?
openjdk version "1.8.0_102"
OpenJDK Runtime Environment (build 1.8.0_102)
OpenJDK 64-Bit Server VM (build 25.102-b01, mixed mode)
What did you do?
If possible, provide a recipe for reproducing the error.
https://github.com/garrettjonesgoogle/gcloud-java/tree/deadline-exceeded-issue/google-cloud-example-docker-gradle-alpine
./gradlew jar shadowJar
docker build .
Then deploy to a GCE instance and run it.
sudo docker run -it YOUR_DOCKER_BUILD_ID_HERE sh
java -Djava.util.logging.config.file=logging.properties -cp google-cloud-example-docker-gradle-alpine-all.jar com.google.cloud.pubsub.spi.v1.PublisherSmokeTest --project_id YOUR_PROJECT_ID_HERE
What did you expect to see?
An exception indicating that the netty dependency was unsatisfied
What did you see instead?
After the call times out, DEADLINE_EXCEEDED
Notes
If a user has a high timeout, it can take a long time for them to discover something is wrong. Then when they receive DEADLINE_EXCEEDED, they have no idea why - it doesn't guide them to the problem with the dependency. They have to know to turn on FINE logging and go log spelunking to find the root cause. Example user-filed issue: googleapis/google-cloud-java#1430
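For readers who need to turn on FINE logging while chasing this, here is a minimal sketch that does it programmatically with `java.util.logging`, as an alternative to the `logging.properties` file used in the reproduction steps. The logger names `io.grpc` and `io.netty` are assumptions based on the libraries' package names, and `enableFine` is a hypothetical helper; adjust the names to whatever your stack actually logs under.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustration only: enableFine is a hypothetical helper, not part of any library. */
public class FineLoggingSketch {

    /**
     * Enables FINE output for one logger namespace. The logger is returned so
     * the caller can hold a strong reference and keep the configuration alive.
     */
    static Logger enableFine(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.FINE);

        // ConsoleHandler defaults to INFO, so its level must be lowered too.
        // Records at INFO and above may print twice, because the root
        // logger's default console handler still receives them.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        return logger;
    }

    public static void main(String[] args) {
        // Assumed namespaces, derived from package names.
        Logger grpc = enableFine("io.grpc");
        Logger netty = enableFine("io.netty");

        grpc.fine("FINE logging enabled for io.grpc");
        netty.fine("FINE logging enabled for io.netty");
    }
}
```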