-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PulsarProxy does not always refresh authentication tokens received from client #10816
Comments
We have tested this on 2.7.0, 2.7.1 and 2.7.2 |
Nice description and proposal |
@eolivelli I tried looking for exact piece of code where this happens but I find code a little bit difficult for me to follow (at least on github). Probably I can find a solution with some help from someone pointing me in the right direction. |
The issue had no activity for 30 days, mark with Stale label. |
@nodece @codelipenghui I think that there's a major gap with the solution in #17517, #17831 or #18130 if the intention is to fix token refresh with the Pulsar Proxy. One detail of the proxy is that after the proxy has connected, it switches to a mode where plain bytes are proxied without decoding the protocol messages. pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java Line 392 in 82237d3
pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java Lines 246 to 281 in 82237d3
This is where the frameDecoder is removed: pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/DirectProxyHandler.java Lines 447 to 449 in 82237d3
There's no information when a Pulsar binary protocol message starts and ends, and it's not possible to intercept or inject any messages between the client and the broker in the current solution. If any messages are injected, there's a chance that it corrupts the traffic since the message might get injected in a middle of another message since the proxy isn't doing any framing. Solving this issue will be tricky to solve (while keeping the same performance characteristics). It needs more design discussions and documenting the design. It feels that there's no real need to inject traffic in the proxy or do any special token refresh. The proxy stops authenticating and simply passes the bytes back and forth after the connection has been established. |
Thank @lhotari for your explanation.
I know this, so the proxy authentication data should not contain an expiration time, and then just refresh the client authentication data, and we also should notice we have a client on the proxy, which is used to lookup the topic URL and metadata, sometimes the client is ignored. #17831 didn't refresh the proxy's client authentication data, here should be the user's authentication data. The proxy's client passes two authentication data, one from the self, and one from the actual user. #18130 fixes a bug that updates the authentication variable, this should be an important fix. |
I agree with @lhotari that the solution in #17831 will lead to corruption. To take a step back, I think we need to clarify what the actual issue. Once the client's connection through the proxy is established with the broker, the broker's auth refresh challenge protocol takes over, assuming everything is configured correctly. What appears to be the problem is that the |
I am not sure that I think this any more. The only connections that appear to be the problem are the ones with a pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java Lines 545 to 546 in f360807
|
I think there is still an additional issue we'll need to solve. I need to sign off now, but I have a partial fix and will continue working on it in a few hours. |
One problem I see is that the auth data is hard coded: pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java Lines 322 to 324 in f360807
As a result, when a client connects for a lookup with an existing client/proxy connection but triggers a new proxy/broker connection, the auth data may be expired. I have a draft solution for this issue here: #19026. I am still working on a test. |
@nodece What would be helpful is to describe the scenario and sketch sequence diagrams of the scenario where refreshing of the authentication tokens would be needed. We don't currently have a sufficient description of the problem so that there would be a reasonable solution to the problem. The connection pool in the proxy is bound to the connection from the client to the proxy. pulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConnectionPool.java Lines 189 to 207 in 82237d3
Pulsar Go client: https://github.com/apache/pulsar-client-go/blob/d92fb1407d3d39c8a498dd7c7abdc0bbb3fc7e1a/pulsar/internal/connection_pool.go#L75-L76 Pulsar C++ client: https://github.com/apache/pulsar-client-cpp/blob/fa3ac76eddb08c3eb9d865332214af5aa5a5fe88/lib/ConnectionPool.cc#L64-L65 Because of this reason, the connection pool held within the ProxyConnection should contain only a single connection to do lookups. pulsar/pulsar-proxy/src/main/java/org/apache/pulsar/proxy/server/ProxyConnection.java Lines 86 to 87 in f360807
I wonder if we could come up with a better description of the problem and the solution. |
I fixed the lookup action by #17831.
You're referring to the final process, not lookup process, there are two connections between |
@nodece I'm not exactly sure what you mean with this comment. The ProxyConnection is one of these states: enum State {
Init,
// Connecting between user client and proxy server.
// Mutual authn needs verify between client and proxy server several times.
Connecting,
// Proxy the lookup requests to a random broker
// Follow redirects
ProxyLookupRequests,
// Connecting to the broker
ProxyConnectingToBroker,
// If we are proxying a connection to a specific broker, we
// are just forwarding data between the 2 connections, without
// looking into it
ProxyConnectionToBroker,
Closing,
Closed,
} Yes, there's a separate connection for lookups and after the state goes to ProxyConnectingToBroker/ProxyConnectionToBroker, the connection(s) used for lookups, will no longer be needed. This is what I meant with my comment "When the proxy switches to the mode where plain bytes are proxies, the proxy should never use the authentication data anymore. This is why it's unnecessary to keep it up-to-date after the proxy switches the connection to proxying mode." I see that you know this, but my point is that it's not described in the PR descriptions. For example, iterating through the connections in the connection pool (used for lookup connections) doesn't seem like the correct approach at all: https://github.com/apache/pulsar/pull/17831/files#diff-bbca5aac9dede7618187e91b91ffd0c6c8ffb836cd79ff2d104439e8cf5fc0daR545-R561 Perhaps this works for most cases where there's a single connection in the connection pool. What if there are multiple connections? It would also be helpful to add integration tests where this would happen. |
Thank you for your detailed explanation and guidance.
This scenario would be one client connecting different brokers by proxy, which means they use the same authentication data on the broker, and I send the new authentication data to each broker, maybe there is a performance issue here. |
I identified an addition issue with #17831. Specifically, we go to Here is my comment on the PR: |
Fixes: #10816 PIP: #19771 Supersedes: #19026 Depends on: #20062 ### Motivation The Pulsar Proxy does not properly handle authentication data refresh when in state `ProxyLookupRequests`. The consequence is described in #10816. Essentially, the problem is that the proxy caches stale authentication data and sends it to the broker leading to connection failures. #17831 attempted to fix the underlying problem, but it missed an important edge cases. Specifically, it missed the case that the `ConnectionPool` will have multiple connections when a lookup gets redirected. As such, the following problem exists (and is fixed by this PR): 1. Client opens connection to perform lookups. 2. Proxy connects to broker 1 to get the topic ownership info. 3. Time passes. 4. Client does an additional lookup, and this topic is on a newly created broker 2. In this case, the proxy opens a new connection with the stale client auth data. 5. Broker 2 rejects the connection because it fails with expired authentication. ### Modifications * Remove some of the implementation from #17831. This new implementation still allows a broker to challenge the client through the proxy, but notably, it limits the number of challenges sent to the client. Further, the proxy does not challenge the client when the auth data is not expired. * Introduce authentication refresh in the proxy so that the proxy challenges the client any time the auth data is expired. * Update the `ProxyClientCnx` to get the `clientAuthData` from the `ProxyConnection` to ensure that it gets new authentication data. * Add clock skew to the `AuthenticationProviderToken`. This is necessary to make some of the testing not flaky and it will also be necessary for users to configure in their clusters. ### Verifying this change The `ProxyRefreshAuthTest` covers the existing behavior and I expanded it to cover the edge case described above. Additionally, testing this part of the code will be much easier to test once we implement #19624. ### Documentation - [x] `doc-not-needed` ### Matching PR in forked repository PR in forked repository: the relevant tests pass locally, so I am going to skip the forked tests.
Fixes: apache#10816 PIP: apache#19771 Supersedes: apache#19026 Depends on: apache#20062 The Pulsar Proxy does not properly handle authentication data refresh when in state `ProxyLookupRequests`. The consequence is described in apache#10816. Essentially, the problem is that the proxy caches stale authentication data and sends it to the broker leading to connection failures. apache#17831 attempted to fix the underlying problem, but it missed an important edge cases. Specifically, it missed the case that the `ConnectionPool` will have multiple connections when a lookup gets redirected. As such, the following problem exists (and is fixed by this PR): 1. Client opens connection to perform lookups. 2. Proxy connects to broker 1 to get the topic ownership info. 3. Time passes. 4. Client does an additional lookup, and this topic is on a newly created broker 2. In this case, the proxy opens a new connection with the stale client auth data. 5. Broker 2 rejects the connection because it fails with expired authentication. * Remove some of the implementation from apache#17831. This new implementation still allows a broker to challenge the client through the proxy, but notably, it limits the number of challenges sent to the client. Further, the proxy does not challenge the client when the auth data is not expired. * Introduce authentication refresh in the proxy so that the proxy challenges the client any time the auth data is expired. * Update the `ProxyClientCnx` to get the `clientAuthData` from the `ProxyConnection` to ensure that it gets new authentication data. * Add clock skew to the `AuthenticationProviderToken`. This is necessary to make some of the testing not flaky and it will also be necessary for users to configure in their clusters. The `ProxyRefreshAuthTest` covers the existing behavior and I expanded it to cover the edge case described above. Additionally, testing this part of the code will be much easier to test once we implement apache#19624. - [x] `doc-not-needed` PR in forked repository: the relevant tests pass locally, so I am going to skip the forked tests. (cherry picked from commit 075b625)
Describe the bug
We have Pulsar deployed as a single cluster with a Pulsar Proxy, 3 brokers, 3 bookies and 3 zookeepers.
We use token authentication. Tokens expire every 5 minutes and they are refreshed 1 minute before they expire. We use
Supplier
interface inPulsarClient
to provide a valid token.The thing is that if we connect directly to brokers token refreshing works perfectly (we use a Docker Swarm and deploy broker as a service with multiple instances, so swarm manages balancing of connections). No
ClientCnx
erros are displayed (see #8922) and newReader
,Consumer
orProducer
can be created from same client instance.Te problem comes when we use Pulsar Proxy. It seems that Pulsar Proxy keeps a connection pool against the broker while it manages connections from Pulsar Client. Connection pool against brokers eventually requires creating new connections, same as connections between Client and Proxy. Connection between Client and Proxy always has valid tokens, since when a new connection in pool is required
Supplier
inPulsarClient
returns a currently valid token.The connection between the proxy and the brokers seems to be another story as when at least 5 minutes have elapsed since first token was used for first
Producer
,Reader
orConsumer
new connections may fail between Proxy and brokers. But not in all cases.Below are some logs from a custom test that consumes data from a topic using a
Reader
and produces data every 30 seconds (using a newProducer
each time). The token is refreshed 2 times but after second one creating newProducer
fails.Look at
RefreshableCredentialsSupplier
logs that inform when a token is refreshed and after it is destroyed when was it last accessed byPulsarClient
.If we look at corresponding broker logs (broker3) we see that connection is being stablished using an expired token (at a time AFTER the refresh):
Notice the time 2021-06-03T10:46:56Z that is the expiration time of the second token. First refresh worked OK as broker did not fail past first 5 minutes. Error occurs on broker at
12:47:47.109
reporting an expired token at 2021-06-03T10:46:56Z. However that token had already been refreshed one minute before at2021-06-03 12:45:59.301
(local time is CEST or GMT+2):After that moment following producers were created at following moments (using third token)
2021-06-03 12:46:11.809
,2021-06-03 12:46:43.532
,2021-06-03 12:47:15.302
. However at2021-06-03 12:47:38.882
client reported connection failed (using second token) for error occurring on broker at12:47:47.109
.Expected behavior
All previous tokens belonging to same client must be refreshed at proxy connection pool once they are obtained through connections from client.
This is a blocker issue that prevents us from using Pulsar Proxy on our clusters.
The text was updated successfully, but these errors were encountered: