Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][cpp] Support retry and apply operation timeout for lookup requests #17410

Merged

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Sep 1, 2022

Motivation

Currently the operation timeout only works for requests other than
lookup, like SEND and FLOW. However, the lookup requests, which are sent
by LookupService, should also apply the operation timeout, see

AtomicLong opTimeoutMs = new AtomicLong(conf.getLookupTimeoutMs());
Backoff backoff = new BackoffBuilder()
.setInitialTime(conf.getInitialBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
.setMandatoryStop(opTimeoutMs.get() * 2, TimeUnit.MILLISECONDS)
.setMax(conf.getMaxBackoffIntervalNanos(), TimeUnit.NANOSECONDS)
.create();
getPartitionedTopicMetadata(topicName, backoff, opTimeoutMs,

In addition, no attempts would be retried if lookup failed even due to a
retryable reason. For example, if some of all configured brokers were
down, the C++ client would fail immediately.

Therefore, this PR intends to retry lookup for some certain cases:

  • The connection cannot be established, except connection timeout, which
    is controlled by the connection timeout.
  • A ServiceNotReady error is received, except it's caused by
    PulsarServerException, e.g. the listener name is wrong.

Then, apply the operation timeout to avoid infinite retries.

Modifications

  • Add a ResultRetryable error code, which should only be used
    internally. Complete the futures with this error in the cases said
    previously.
  • Add a RetryableLookupService implementation to support retries based
    on the backoff policy. Replace the directly usages of LookupService
    implementations with this class in ClientImpl.

The following tests are added:

  • ClientTest.testMultiBrokerUrl: verify when multiple brokers are
    configured, even if one of them is not available, the creation of
    producer or consumer could still succeed.
  • LookupService.testRetry: verify all lookup methods could be retried.
  • LookupService.testTimeout: verify all lookup methods could be
    completed with ResultTimeout if no brokers are available.

TODO

In future, we should add lookup timeout instead of operation timeout for
lookup requests and separate lookup connection pool, see PIP-91.

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@BewareMyPower BewareMyPower added type/bug The PR fixed a bug or issue reported a bug component/client-c++ labels Sep 1, 2022
@BewareMyPower BewareMyPower added this to the 2.12.0 milestone Sep 1, 2022
@BewareMyPower BewareMyPower self-assigned this Sep 1, 2022
@BewareMyPower BewareMyPower marked this pull request as draft September 1, 2022 17:05
@github-actions
Copy link

github-actions bot commented Sep 1, 2022

@BewareMyPower Please provide a correct documentation label for your PR.
Instructions see Pulsar Documentation Label Guide.

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Sep 1, 2022
@BewareMyPower BewareMyPower marked this pull request as ready for review September 2, 2022 04:28
@BewareMyPower
Copy link
Contributor Author

@merlimat @Demogorgon314 @RobertIndie PTAL

@BewareMyPower BewareMyPower force-pushed the bewaremypower/cpp-lookup-timeout branch from a4ea7b8 to 765fb3b Compare September 13, 2022 08:52
### Motivation

Currently the operation timeout only works for requests other than
lookup, like SEND and FLOW. However, the lookup requests, which are sent
by `LookupService`, should also apply the operation timeout, see

https://github.com/apache/pulsar/blob/7075a5ce0d4a70f52625ac8c3d0c48894442b72a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L1019-L1025

In addition, no attempts would be retried if lookup failed even due to a
retryable reason. For example, if some of all configured brokers were
down, the C++ client would fail immediately.

Therefore, this PR intends to retry lookup for some certain cases:
- The connection cannot be established, except connection timeout, which
  is controlled by the connection timeout.
- A `ServiceNotReady` error is received, except it's caused by
  `PulsarServerException`, e.g. the listener name is wrong.

Then, apply the operation timeout to avoid infinite retries.

### Modifications

- Add a `ResultRetryable` error code, which should only be used
  internally. Complete the futures with this error in the cases said
  previously.
- Add a `RetryableLookupService` implementation to support retries based
  on the backoff policy. Replace the directly usages of `LookupService`
  implementations with this class in `ClientImpl`.

The following tests are added:
- `ClientTest.testMultiBrokerUrl`: verify when multiple brokers are
  configured, even if one of them is not available, the creation of
  producer or consumer could still succeed.
- `LookupService.testRetry`: verify all lookup methods could be retried.
- `LookupService.testTimeout`: verify all lookup methods could be
  completed with `ResultTimeout` if no brokers are available.

### TODO

In future, we should add lookup timeout instead of operation timeout for
lookup requests and separate lookup connection pool, see PIP-91.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/cpp-lookup-timeout branch from 765fb3b to 191cbaa Compare September 13, 2022 14:21
@BewareMyPower
Copy link
Contributor Author

@Demogorgon314 Demogorgon314 merged commit 86ea31b into apache:master Sep 15, 2022
@BewareMyPower BewareMyPower deleted the bewaremypower/cpp-lookup-timeout branch September 15, 2022 06:44
@BewareMyPower BewareMyPower modified the milestones: 2.12.0, 2.11.0 Sep 15, 2022
BewareMyPower added a commit that referenced this pull request Sep 15, 2022
…sts (#17410)

### Motivation

Currently the operation timeout only works for requests other than
lookup, like SEND and FLOW. However, the lookup requests, which are sent
by `LookupService`, should also apply the operation timeout, see

https://github.com/apache/pulsar/blob/7075a5ce0d4a70f52625ac8c3d0c48894442b72a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/PulsarClientImpl.java#L1019-L1025

In addition, no attempts would be retried if lookup failed even due to a
retryable reason. For example, if some of all configured brokers were
down, the C++ client would fail immediately.

Therefore, this PR intends to retry lookup for some certain cases:
- The connection cannot be established, except connection timeout, which
  is controlled by the connection timeout.
- A `ServiceNotReady` error is received, except it's caused by
  `PulsarServerException`, e.g. the listener name is wrong.

Then, apply the operation timeout to avoid infinite retries.

### Modifications

- Add a `ResultRetryable` error code, which should only be used
  internally. Complete the futures with this error in the cases said
  previously.
- Add a `RetryableLookupService` implementation to support retries based
  on the backoff policy. Replace the directly usages of `LookupService`
  implementations with this class in `ClientImpl`.

The following tests are added:
- `ClientTest.testMultiBrokerUrl`: verify when multiple brokers are
  configured, even if one of them is not available, the creation of
  producer or consumer could still succeed.
- `LookupService.testRetry`: verify all lookup methods could be retried.
- `LookupService.testTimeout`: verify all lookup methods could be
  completed with `ResultTimeout` if no brokers are available.

### TODO

In future, we should add lookup timeout instead of operation timeout for
lookup requests and separate lookup connection pool, see PIP-91.

* Handle ResultRetryable in handleDisconnection

(cherry picked from commit 86ea31b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.11 doc-not-needed Your PR changes do not impact docs release/2.11.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants