Number of threads created in Gateway for blocking identity calls causing "too many concurrent streams" error #23853

Closed
berkaycanbc opened this issue Oct 22, 2024 · 11 comments · Fixed by #24196
Assignees
Labels
component/zeebe Related to the Zeebe component/team kind/bug Categorizes an issue or PR as a bug support Marks an issue as related to a customer support request version:8.4.13 version:8.5.9 version:8.6.5

Comments

@berkaycanbc
Contributor

berkaycanbc commented Oct 22, 2024

Describe the bug

After upgrading from Camunda 8.5.2 to Camunda 8.6.2, a customer started to see frequent 500 errors coming from Identity. The error message is java.io.IOException: too many concurrent streams, thrown by the Identity SDK's HTTP client. The error originates in the Zeebe Gateway.

This is due to the Zeebe Gateway switching to virtual threads in place of a thread pool with a configured number of threads (see #18697). Each blocking Identity call now runs on its own virtual thread, and the number of such threads is unbounded. Since the Zeebe Gateway no longer limits the number of threads used for execution, the Identity SDK can open an unlimited number of concurrent streams, which the SDK's HttpClient cannot handle.
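
To illustrate the difference, here is a minimal sketch (not Gateway code; `PeakConcurrencyDemo` and its numbers are hypothetical). It measures how many blocking tasks run at once on a fixed pool versus a thread-per-task executor. Platform threads stand in for virtual threads here so the sketch also runs on JDKs older than 21; the point about the missing upper bound is the same.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class PeakConcurrencyDemo {

  /** Runs {@code tasks} blocking tasks and returns the highest number seen running at once. */
  static int peakConcurrency(Executor executor, int tasks) {
    AtomicInteger running = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    for (int i = 0; i < tasks; i++) {
      executor.execute(() -> {
        // Record the new high-water mark of concurrently running tasks.
        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
        try {
          Thread.sleep(100); // stands in for a blocking Identity call
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          running.decrementAndGet();
          done.countDown();
        }
      });
    }
    try {
      done.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for tasks", e);
    }
    return peak.get();
  }

  public static void main(String[] args) {
    // Bounded: at most 4 blocking calls are ever in flight.
    ExecutorService fixed = Executors.newFixedThreadPool(4);
    System.out.println("fixed pool peak: " + peakConcurrency(fixed, 32));
    fixed.shutdown();

    // Unbounded: every task gets its own thread, so nothing caps the in-flight calls.
    Executor threadPerTask = task -> new Thread(task).start();
    System.out.println("thread-per-task peak: " + peakConcurrency(threadPerTask, 32));
  }
}
```

With the fixed pool the peak can never exceed the pool size. With one thread per task it grows with the incoming load, which is how the number of concurrent HTTP streams can exceed what the client accepts.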

Expected behavior

The Identity SDK's HTTP client shouldn't fail with too many concurrent streams.

Log/Stacktrace

Full Stacktrace

Exception:
java.util.concurrent.CompletionException
Message:
java.io.IOException: too many concurrent streams
Stacktrace:
java.util.concurrent.CompletableFuture.encodeRelay
java.util.concurrent.CompletableFuture.uniComposeStage
java.util.concurrent.CompletableFuture.thenCompose
jdk.internal.net.http.MultiExchange.responseAsyncImpl
jdk.internal.net.http.MultiExchange.lambda$responseAsync0$2
com.dynatrace.agent.introspection.threading.completable.FunctionWrapper.apply(FunctionWrapper.java:22)
java.util.concurrent.CompletableFuture$UniCompose.tryFire
java.util.concurrent.CompletableFuture.postComplete
java.util.concurrent.CompletableFuture$AsyncSupply.run
jdk.internal.net.http.HttpClientImpl$DelegatingExecutor.execute
java.util.concurrent.CompletableFuture.completeAsync
jdk.internal.net.http.MultiExchange.responseAsync
jdk.internal.net.http.HttpClientImpl.sendAsync
jdk.internal.net.http.HttpClientImpl.send
jdk.internal.net.http.HttpClientFacade.send
io.camunda.identity.sdk.impl.rest.RestClient.send(RestClient.java:119)
io.camunda.identity.sdk.impl.rest.RestClient.request(RestClient.java:106)
io.camunda.identity.sdk.impl.TenantsImpl.forToken(TenantsImpl.java:38)
jdk.internal.reflect.DirectMethodHandleAccessor.invoke
java.lang.reflect.Method.invoke
io.camunda.identity.sdk.annotation.AnnotationProcessor.lambda$apply$0(AnnotationProcessor.java:33)
jdk.proxy2.$Proxy134.forToken
io.camunda.zeebe.gateway.interceptors.impl.IdentityInterceptor.interceptCall(IdentityInterceptor.java:99)
io.grpc.ServerInterceptors$InterceptCallHandler.startCall(ServerInterceptors.java:269)
io.grpc.internal.ServerImpl$ServerTransportListenerImpl.startWrappedCall(ServerImpl.java:701)
io.grpc.internal.ServerImpl$ServerTransportListenerImpl.access$2200(ServerImpl.java:408)
io.grpc.internal.ServerImpl$ServerTransportListenerImpl$1HandleServerCall.runInternal(ServerImpl.java:613)
io.grpc.internal.ServerImpl$ServerTransportListenerImpl$1HandleServerCall.runInContext(ServerImpl.java:603)
io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
java.util.concurrent.ThreadPerTaskExecutor$TaskRunner.run
java.lang.VirtualThread.run

Related support case: SUPPORT-24047

@berkaycanbc berkaycanbc added kind/bug Categorizes an issue or PR as a bug component/zeebe Related to the Zeebe component/team labels Oct 22, 2024
@github-actions github-actions bot added the support Marks an issue as related to a customer support request label Oct 22, 2024
@megglos
Contributor

megglos commented Oct 22, 2024

@npepinpe If I'm getting it right, a potential workaround could be scaling the Zeebe Gateway to handle the throughput? That is, until you need to scale the Identity deployment itself as well.

@npepinpe
Member

that would mitigate it, but could just push the problem downstream. worth a try

@npepinpe
Member

imo we should fix it, and the error message should reflect the possible actions to take. ultimately it can also mean identity backend is not fast enough to handle the load

@npepinpe
Member

ℹ️ If scaling up the gateways doesn't help, it could be that Identity is the bottleneck - in which case more gateways means more requests hitting it, and they will all block fairly fast. So you may also need to scale up your Identity pods, whether vertically or horizontally (or a combination of both).

@npepinpe
Member

We will schedule this and attempt to get it into the next wave of patches in November (though it's a very tight deadline).

@megglos
Contributor

megglos commented Oct 24, 2024

Solution approach proposal by @npepinpe , within the identity SDK

basically use a semaphore to control contention of the SDK (or at least, usage of that method, since I guess not all of the SDK needs to be locked), and fail gracefully:

  • wait X seconds for the lock instead of failing immediately
  • if failing to grab it, indicate in the error message Identity is currently unable to handle the load

it will greatly improve the UX, indicate a possible solution, and hopefully also reduce the rate of the error by keeping Identity snappy (but that's a theory).
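
A minimal sketch of this proposal, assuming a hypothetical `ThrottledCall` helper (the class name, permit count, and timeout are illustrative and not taken from the Identity SDK). It waits a bounded time for a permit using `Semaphore.tryAcquire` and, if none becomes free, fails with a message that points at Identity being overloaded:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class ThrottledCall {
  private final Semaphore permits;
  private final long timeoutMillis;

  ThrottledCall(int maxConcurrent, long timeoutMillis) {
    this.permits = new Semaphore(maxConcurrent);
    this.timeoutMillis = timeoutMillis;
  }

  /** Runs the request once a permit is free, or fails if none frees up within the timeout. */
  <T> T call(Supplier<T> request) {
    boolean acquired;
    try {
      // Wait up to the timeout instead of failing immediately.
      acquired = permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for an Identity request slot", e);
    }
    if (!acquired) {
      // Tell the caller what is going on and what could be done about it.
      throw new IllegalStateException(
          "Identity is currently unable to handle the load; consider scaling Identity or retrying later");
    }
    try {
      return request.get();
    } finally {
      permits.release(); // always hand the permit back, even if the request failed
    }
  }
}
```

Only the call that reaches Identity needs to go through `call`; the rest of the SDK can stay unthrottled.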

@megglos
Contributor

megglos commented Oct 25, 2024

@koevskinikola will block a time slot next week. @npepinpe can you offer pairing on this to make this slot most effective potentially creating a fix?

@npepinpe
Member

@koevskinikola please schedule anything on my calendar 🙃

@koevskinikola
Member

koevskinikola commented Oct 29, 2024

Solution plan:

  1. We introduce a small bounded cache (invalidated after, e.g., a default of 5s)
    • The cache timeout should be configurable, preferably in some experimental properties that can be deprecated or removed once Identity is integrated in Zeebe.
  2. We introduce a semaphore to throttle the number of concurrent requests
    • Implementation is similar to the ZeebeJavaClient#BlockingExecutor implementation.
    • The maximum in-flight requests, and request timeout should be configurable.
    • If the semaphore "queue" is full, we throw an exception which results in an UNAVAILABLE error. Zeebe clients can use this to perform backoff.
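
On the client side, the backoff mentioned in the last point could look roughly like the sketch below. This is not the Zeebe client's actual retry code: `BackoffRetry` and `UnavailableException` are hypothetical stand-ins (the latter for a gRPC UNAVAILABLE status), and the doubling delay is an assumed policy.

```java
import java.util.function.Supplier;

class BackoffRetry {

  /** Hypothetical stand-in for a gRPC UNAVAILABLE status; not a real Zeebe or gRPC type. */
  static final class UnavailableException extends RuntimeException {
    UnavailableException(String message) {
      super(message);
    }
  }

  /** Retries the request on UnavailableException, doubling the delay after each failed attempt. */
  static <T> T callWithBackoff(Supplier<T> request, int maxAttempts, long initialDelayMillis) {
    long delayMillis = initialDelayMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return request.get();
      } catch (UnavailableException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up and surface the last error
        }
      }
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted during backoff", e);
      }
      delayMillis *= 2; // exponential backoff (assumed policy)
    }
  }
}
```

Because the throttle reports a distinct, retryable error, clients can slow down rather than treat the failure as a generic internal error.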

The idea behind this solution proposal is that:

  1. The cache reduces the number of concurrent requests by providing tenant ids immediately.
  2. If there is a large number of "tenant-unique" requests, the semaphore controls the number of in-flight requests and excess requests are handled by the existing backoff mechanism in the Zeebe clients.
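
The cache half of this idea can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `ExpiringTenantCache`, its parameters, and the injected clock are hypothetical, and keying by token is assumed from the `forToken` call in the stack trace.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

class ExpiringTenantCache {

  private static final class CachedTenants {
    final List<String> tenantIds;
    final long expiresAtMillis;

    CachedTenants(List<String> tenantIds, long expiresAtMillis) {
      this.tenantIds = tenantIds;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final Map<String, CachedTenants> entries;
  private final long ttlMillis;
  private final LongSupplier clock;

  ExpiringTenantCache(final int maxSize, long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
    // Insertion-ordered map that drops its oldest entry once maxSize is exceeded.
    this.entries = new LinkedHashMap<String, CachedTenants>(16, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, CachedTenants> eldest) {
        return size() > maxSize;
      }
    };
  }

  /** Returns cached tenant ids for the token, calling the loader only on a miss or after expiry. */
  List<String> get(String token, Function<String, List<String>> loader) {
    long now = clock.getAsLong();
    synchronized (entries) {
      CachedTenants cached = entries.get(token);
      if (cached != null && now < cached.expiresAtMillis) {
        return cached.tenantIds;
      }
    }
    // Load outside the lock so a slow backend does not block cache hits;
    // concurrent misses for the same token may each call the loader.
    List<String> loaded = loader.apply(token);
    synchronized (entries) {
      entries.remove(token); // re-insert so the refreshed entry counts as the newest
      entries.put(token, new CachedTenants(loaded, now + ttlMillis));
    }
    return loaded;
  }
}
```

In production the clock would be `System::currentTimeMillis`; injecting it keeps the expiry logic testable. Repeated requests with the same token inside the TTL never reach Identity, so only cache misses compete for semaphore permits.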

Technical plan:

  • We will attempt to reproduce the scenario with an integration test. It might be difficult to determine the right amount of concurrent Identity requests to hit the IOException, so the implementation of the test will be timeboxed.
  • We will implement an identity-sdk wrapper on the Gateway side. The wrapper will contain:
    • The tenantId cache.
    • The Identity request semaphore.
    • The semaphore will return an UNAVAILABLE error, so that it can be easily picked up by backoff mechanisms in the Zeebe clients.
  • We will provide configuration options in the "experimental" properties.
    • Once Identity is integrated with Zeebe, these properties can be more easily removed.
  • The identity-sdk wrapper will be provided in 8.7-SNAPSHOT and implemented for the gRPC IdentityInterceptor and the REST API Identity Spring bean.
    • The implementation will be backported until version 8.4.

github-merge-queue bot pushed a commit that referenced this issue Oct 31, 2024
## Description

<!-- Describe the goal and purpose of this PR. -->

Fixes an issue where too many tenant requests to Identity cause the
service to fail with an `IOException` due to too many concurrent
requests.

## Checklist

<!--- Please delete options that are not relevant. Boxes should be
checked by reviewer. -->
- [ ] for CI changes:
- [ ] structural/foundational changes signed off by [CI
DRI](https://github.com/cmur2)
- [ ]
[ci.yml](https://github.com/camunda/camunda/blob/main/.github/workflows/ci.yml)
modifications comply with ["Unified CI"
requirements](https://github.com/camunda/camunda/wiki/CI-&-Automation#workflow-inclusion-criteria)

## Related issues

closes #23853
github-merge-queue bot pushed a commit that referenced this issue Nov 1, 2024
# Description
Backport of #24196 to `stable/8.4`.

relates to #23853
original author: @koevskinikola
github-merge-queue bot pushed a commit that referenced this issue Nov 4, 2024
# Description
Backport of #24196 to `stable/8.5`.

relates to #23853
original author: @koevskinikola
@npepinpe
Member

@koevskinikola - can we close this?

@koevskinikola
Member

Yup, thanks for the ping.
