Flaky-test: C++ ClientTest.testReferenceCount #14848

Closed
BewareMyPower opened this issue Mar 24, 2022 · 11 comments · Fixed by #14854, #15271, #17325 or #17645

Comments

@BewareMyPower
Contributor

BewareMyPower commented Mar 24, 2022

ClientTest.testReferenceCount is flaky. It fails sporadically.

example failure

2022-03-23 08:18:48.488 INFO  [140121845906176] ClientConnection:182 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-03-23 08:18:48.488 INFO  [140121845906176] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-03-23 08:18:48.492 INFO  [140121726842624] ClientConnection:368 | [127.0.0.1:32784 -> 127.0.0.1:6650] Connected to broker
2022-03-23 08:18:48.494 INFO  [140121726842624] HandlerBase:64 | [persistent://public/default/client-test-reference-count-1648023528, ] Getting connection from pool
2022-03-23 08:18:48.504 INFO  [140121726842624] ProducerImpl:188 | [persistent://public/default/client-test-reference-count-1648023528, ] Created producer on broker [127.0.0.1:32784 -> 127.0.0.1:6650]
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:192: Failure
Value of: 1
Expected: producers[0].use_count()
Which is: 7
2022-03-23 08:18:48.525 INFO  [140385848068864] ClientConnection:182 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-03-23 08:18:48.525 INFO  [140385848068864] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-03-23 08:18:48.527 INFO  [140385729005312] ClientConnection:368 | [127.0.0.1:32786 -> 127.0.0.1:6650] Connected to broker
2022-03-23 08:18:48.530 INFO  [140385729005312] HandlerBase:64 | [persistent://public/default/client-test-reference-count-1648023528, ] Getting connection from pool
2022-03-23 08:18:48.536 INFO  [140385729005312] ProducerImpl:188 | [persistent://public/default/client-test-reference-count-1648023528, ] Created producer on broker [127.0.0.1:32786 -> 127.0.0.1:6650]
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:192: Failure
Value of: 1
Expected: producers[0].use_count()
Which is: 7
BewareMyPower self-assigned this Mar 24, 2022
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Mar 24, 2022
Fixes apache#14848

Fixes apache#14719

### Motivation

apache#7793 introduced a `testReferenceLeak` to avoid a cyclic reference of the
reader. However, it added an unused field `readerImplWeakPtr_` only for
tests. Access to this field is not thread safe: the write happens in
`handleConsumerCreated` while the read can happen anywhere via the getter.
So there is a small chance that `readerPtr` in `testReferenceLeak` doesn't
point to the right object.

In addition, we should only guarantee that the reference count becomes 0
after the producer, consumer or reader goes out of its scope. apache#14797
added a `ClientTest.testReferenceCount`, but it's also flaky. This is because
the shared pointer of `ProducerImpl` is published to another thread via
`shared_from_this()`, while the test has a strong expectation that the
reference count is exactly 1.

### Modifications

- Remove `readerImplWeakPtr_` from `ReaderImpl` and get the weak pointer
  from `Reader` directly by adding a method to `PulsarFriend`.
- Add a check of the reader's reference count to `testReferenceCount` and
  remove the redundant `testReferenceLeak`.
- Instead of asserting that the reference count of the producer/consumer/reader
  is 1, just assert that it's greater than 0 (see the sketch below).
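
For illustration, here is a minimal standalone sketch (the type and function names below are hypothetical, not the actual client or test code) of why asserting `use_count() == 1` is racy while `use_count() > 0` is not:

```
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for ProducerImpl, which derives from
// std::enable_shared_from_this in the real client.
struct ProducerImplSketch : std::enable_shared_from_this<ProducerImplSketch> {
    void publishToAnotherThread() {
        // The client copies the shared_ptr into a callback running on an
        // internal executor thread, so use_count() is temporarily > 1.
        auto self = shared_from_this();
        std::thread([self] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }).detach();
    }
};

int main() {
    auto producer = std::make_shared<ProducerImplSketch>();
    producer->publishToAnotherThread();
    // Flaky form: the detached thread may still hold a copy at this point.
    // assert(producer.use_count() == 1);
    // Relaxed form, as described in the modifications above:
    assert(producer.use_count() > 0);
    return 0;
}
```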
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Mar 31, 2022
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 1, 2022
lhotari pushed a commit that referenced this issue Apr 1, 2022
BewareMyPower added a commit that referenced this issue Apr 2, 2022
(cherry picked from commit f84ff57)
BewareMyPower added a commit that referenced this issue Apr 2, 2022
(cherry picked from commit f84ff57)
BewareMyPower added a commit that referenced this issue Apr 2, 2022
(cherry picked from commit f84ff57)
@RobertIndie
Member

After #14854 got merged, I still get this error.

example failure

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
[ RUN      ] ClientTest.testReferenceCount
2022-04-11 04:43:29.498 INFO  [139942018497280] ClientConnection:189 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-04-11 04:43:29.498 INFO  [139942018497280] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-04-11 04:43:29.500 INFO  [139941896357632] ClientConnection:375 | [127.0.0.1:45924 -> 127.0.0.1:6650] Connected to broker
2022-04-11 04:43:29.503 INFO  [139941896357632] HandlerBase:64 | [persistent://public/default/client-test-reference-count-1649652209, ] Getting connection from pool
2022-04-11 04:43:29.538 INFO  [139941896357632] ProducerImpl:189 | [persistent://public/default/client-test-reference-count-1649652209, ] Created producer on broker [127.0.0.1:45924 -> 127.0.0.1:6650]
2022-04-11 04:43:29.538 INFO  [139942018497280] ClientTest:210 | Reference count of the producer: 1
2022-04-11 04:43:29.538 INFO  [139942018497280] Client:88 | Subscribing on Topic :client-test-reference-count-1649652209
2022-04-11 04:43:29.540 INFO  [139941896357632] HandlerBase:64 | [persistent://public/default/client-test-reference-count-1649652209, my-sub, 0] Getting connection from pool
2022-04-11 04:43:29.558 INFO  [139941896357632] ConsumerImpl:224 | [persistent://public/default/client-test-reference-count-1649652209, my-sub, 0] Created consumer on broker [127.0.0.1:45924 -> 127.0.0.1:6650]
2022-04-11 04:43:29.560 INFO  [139942018497280] ClientTest:216 | Reference count of the consumer: 7
2022-04-11 04:43:29.562 INFO  [139941896357632] HandlerBase:64 | [persistent://public/default/client-test-reference-count-1649652209-reader, reader-0fd8e7ba92, 1] Getting connection from pool
2022-04-11 04:43:29.597 INFO  [139941896357632] ConsumerImpl:224 | [persistent://public/default/client-test-reference-count-1649652209-reader, reader-0fd8e7ba92, 1] Created consumer on broker [127.0.0.1:45924 -> 127.0.0.1:6650]
2022-04-11 04:43:29.598 INFO  [139942018497280] ClientTest:224 | Reference count of the reader's underlying consumer: 3
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:227: Failure
Value of: 1
Expected: readerWeakPtr.use_count()
Which is: 5
2022-04-11 04:43:29.598 WARN  [139942018497280] ConsumerImpl:116 | [persistent://public/default/client-test-reference-count-1649652209, my-sub, 0] Destroyed consumer which was not properly closed
2022-04-11 04:43:29.598 INFO  [139942018497280] ConsumerImpl:124 | [persistent://public/default/client-test-reference-count-1649652209, my-sub, 0] Closed consumer for race condition: 0
2022-04-11 04:43:29.598 INFO  [139942018497280] ProducerImpl:628 | Producer - [persistent://public/default/client-test-reference-count-1649652209, standalone-0-203] , [batchMessageContainer = { BatchMessageContainer [size = 0] [bytes = 0] [maxSize = 1000] [maxBytes = 131072] [topicName = persistent://public/default/client-test-reference-count-1649652209] [numberOfBatchesSent_ = 0] [averageBatchSize_ = 0] }]
2022-04-11 04:43:29.598 WARN  [139942018497280] ProducerImpl:114 | [persistent://public/default/client-test-reference-count-1649652209, standalone-0-203] Destroyed producer which was not properly closed
2022-04-11 04:43:29.598 INFO  [139942018497280] ClientConnection:1560 | [127.0.0.1:45924 -> 127.0.0.1:6650] Connection closed
[  FAILED  ] ClientTest.testReferenceCount (101 ms)
[----------] 1 test from ClientTest (101 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (101 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ClientTest.testReferenceCount

PTAL @BewareMyPower

RobertIndie reopened this Apr 11, 2022
@BewareMyPower
Contributor Author

@RobertIndie Good point. I'll fix it soon.

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this issue Apr 20, 2022
@BewareMyPower
Contributor Author

Another failure

2022-04-21 07:18:25.262 INFO  [140554697825024] ClientTest:211 | Reference count of the reader's underlying consumer: 3
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:214: Failure
Value of: 1
Expected: readerWeakPtr.use_count()
Which is: 5

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Apr 22, 2022
Fixes apache#13849
Fixes apache#14848

### Motivation

apache#11570 added a `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers` for
the case that `sendAsync` calls invoked after `closeAsync` is called in
another thread must complete with `ResultAlreadyClosed`. It's flaky because
the synchronization between the two threads is not strict. This test uses
`sendStartLatch` to order `sendAsync` and `closeAsync`:

```
sendAsync 0,1,...,9 -> sendStartLatch is done -> closeAsync
```

However, it cannot guarantee that the remaining `sendAsync` calls happen
after `closeAsync` is called. If they happen before, all `sendAsync` calls
will complete with `ResultOk`.

On the other hand, this test is meaningless because it requires strict
synchronization between the two threads, in which case there is no need to
run `sendAsync` and `closeAsync` in two threads at all.

The verification in this test is also wrong, see
apache#13849 (comment).
When `closeAsync` is called, the previous `sendAsync` calls might not have
completed, so all `sendAsync` calls will complete with `ResultAlreadyClosed`,
not only those called after `closeAsync`.

In addition, this PR also tries to fix the flaky `testReferenceCount`,
whose assumptions are too strict.

### Modifications

- Remove `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers`
- Only check the reference count is greater than 0 instead of equal to 1
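
For illustration, a standalone sketch of that race (the `sendAsync`/`closeAsync` functions and the latch below are stand-ins, not the real client API): the latch only guarantees that `closeAsync` starts after the first ten `sendAsync` calls; it says nothing about the remaining ones.

```
#include <future>
#include <thread>

// Stand-ins for the producer operations under test.
void sendAsync(int /* messageIndex */) { /* pretend to enqueue a message */ }
void closeAsync() { /* pretend to start closing the producer */ }

int main() {
    std::promise<void> sendStartLatch;  // completed after the first 10 sends
    auto latchDone = sendStartLatch.get_future();

    std::thread sender([&] {
        for (int i = 0; i < 10; i++) sendAsync(i);
        sendStartLatch.set_value();  // from here on closeAsync may start...
        for (int i = 10; i < 20; i++) {
            sendAsync(i);  // ...but nothing orders these sends against closeAsync
        }
    });

    latchDone.wait();
    closeAsync();  // may run before, between or after the second batch of sends

    sender.join();
    return 0;
}
```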
BewareMyPower added a commit that referenced this issue Apr 22, 2022
codelipenghui pushed a commit that referenced this issue Apr 28, 2022
(cherry picked from commit eeea9ca)
nicoloboschi pushed a commit to datastax/pulsar that referenced this issue May 9, 2022
(cherry picked from commit eeea9ca)
(cherry picked from commit 83b6833)
BewareMyPower added a commit that referenced this issue May 24, 2022
(cherry picked from commit eeea9ca)
BewareMyPower added a commit that referenced this issue Jul 27, 2022
(cherry picked from commit eeea9ca)
@codelipenghui
Contributor

@BewareMyPower Looks like the test can still fail: https://github.com/apache/pulsar/runs/7948115233?check_suite_focus=true

codelipenghui reopened this Aug 22, 2022
@BewareMyPower
Contributor Author

BewareMyPower commented Aug 23, 2022

This flaky test can be reproduced by

./tests/main --gtest_filter='ClientTest.testReferenceCount' --gtest_repeat=20

(the repeat count cannot be too large, e.g. 100, otherwise it might fail with ConnectError)

Repeating all tests (iteration 10) . . .

Note: Google Test filter = ClientTest.testReferenceCount
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
[ RUN      ] ClientTest.testReferenceCount
2022-08-23 15:58:05.988 INFO  [0x116028600] ClientConnection:189 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-08-23 15:58:05.988 INFO  [0x116028600] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-08-23 15:58:05.989 INFO  [0x70000c17a000] ClientConnection:375 | [127.0.0.1:59906 -> 127.0.0.1:6650] Connected to broker
2022-08-23 15:58:05.990 INFO  [0x70000c17a000] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661241485, ] Getting connection from pool
2022-08-23 15:58:05.993 INFO  [0x70000c17a000] ProducerImpl:186 | [persistent://public/default/client-test-reference-count-1661241485, ] Created producer on broker [127.0.0.1:59906 -> 127.0.0.1:6650] 
2022-08-23 15:58:05.993 INFO  [0x116028600] ClientTest:210 | Reference count of the producer: 1
2022-08-23 15:58:05.993 INFO  [0x116028600] Client:88 | Subscribing on Topic :client-test-reference-count-1661241485
2022-08-23 15:58:05.995 INFO  [0x70000c17a000] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661241485, my-sub, 0] Getting connection from pool
2022-08-23 15:58:05.997 INFO  [0x70000c17a000] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1661241485, my-sub, 0] Created consumer on broker [127.0.0.1:59906 -> 127.0.0.1:6650] 
2022-08-23 15:58:05.997 INFO  [0x116028600] ClientTest:216 | Reference count of the consumer: 1
2022-08-23 15:58:05.998 INFO  [0x70000c17a000] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661241485-reader, reader-3414259df0, 1] Getting connection from pool
2022-08-23 15:58:06.000 INFO  [0x70000c17a000] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1661241485-reader, reader-3414259df0, 1] Created consumer on broker [127.0.0.1:59906 -> 127.0.0.1:6650] 
/Users/xuyunze/github.com/BewareMyPower/pulsar/pulsar-client-cpp/tests/ClientTest.cc:222: Failure
Expected equality of these values:
  consumers.size()
    Which is: 1
  2

@Demogorgon314
Member

Another failure

2022-08-26 01:19:26.474 INFO  [140275677335232] ClientConnection:189 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-08-26 01:19:26.474 INFO  [140275677335232] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-08-26 01:19:26.477 INFO  [140275674232576] ClientConnection:375 | [127.0.0.1:48584 -> 127.0.0.1:6650] Connected to broker
2022-08-26 01:19:26.478 INFO  [140275674232576] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661476766, ] Getting connection from pool
2022-08-26 01:19:26.486 INFO  [140275674232576] ProducerImpl:186 | [persistent://public/default/client-test-reference-count-1661476766, ] Created producer on broker [127.0.0.1:48584 -> 127.0.0.1:6650] 
2022-08-26 01:19:26.486 INFO  [140275677335232] ClientTest:210 | Reference count of the producer: 7
2022-08-26 01:19:26.486 INFO  [140275677335232] Client:88 | Subscribing on Topic :client-test-reference-count-1661476766
2022-08-26 01:19:26.486 INFO  [140275674232576] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661476766, my-sub, 0] Getting connection from pool
2022-08-26 01:19:26.490 INFO  [140275674232576] ConsumerImpl:224 | [persistent://public/default/client-test-reference-count-1661476766, my-sub, 0] Created consumer on broker [127.0.0.1:48584 -> 127.0.0.1:6650] 
2022-08-26 01:19:26.490 INFO  [140275677335232] ClientTest:216 | Reference count of the consumer: 7
2022-08-26 01:19:26.491 INFO  [140275674232576] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661476766-reader, reader-0dc8501272, 1] Getting connection from pool
2022-08-26 01:19:26.496 INFO  [140275674232576] ConsumerImpl:224 | [persistent://public/default/client-test-reference-count-1661476766-reader, reader-0dc8501272, 1] Created consumer on broker [127.0.0.1:48584 -> 127.0.0.1:6650] 
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:222: Failure
Expected equality of these values:
  consumers.size()
    Which is: 1
  2
2022-08-26 01:19:26.496 WARN  [140275677335232] ConsumerImpl:118 | [persistent://public/default/client-test-reference-count-1661476766, my-sub, 0] Destroyed consumer which was not properly closed
2022-08-26 01:19:26.496 INFO  [140275677335232] ConsumerImpl:126 | [persistent://public/default/client-test-reference-count-1661476766, my-sub, 0] Closed consumer for race condition: 0
2022-08-26 01:19:26.496 INFO  [140275677335232] ProducerImpl:619 | Producer - [persistent://public/default/client-test-reference-count-1661476766, standalone-0-213] , [batchMessageContainer = { BatchMessageContainer [size = 0] [bytes = 0] [maxSize = 1000] [maxBytes = 131072] [topicName = persistent://public/default/client-test-reference-count-1661476766] [numberOfBatchesSent_ = 0] [averageBatchSize_ = 0] }]
2022-08-26 01:19:26.496 WARN  [140275677335232] ProducerImpl:114 | [persistent://public/default/client-test-reference-count-1661476766, standalone-0-213] Destroyed producer which was not properly closed

@coderzc
Member

coderzc commented Aug 29, 2022

BewareMyPower pushed a commit that referenced this issue Aug 29, 2022
…CreatedCallback_`. (#17325)

Fixes #14848

### Motivation

We should execute `callback` before executing `readerCreatedCallback_`, otherwise we may get the wrong consumers size. See:

https://github.com/apache/pulsar/blob/e23d312c04da1d82d35f9e2faf8a446f8e8a4eeb/pulsar-client-cpp/lib/ReaderImpl.cc#L84-L92

https://github.com/apache/pulsar/blob/c48a3243287c7d775459b6437d9f4b24ed44cf4c/pulsar-client-cpp/lib/ClientImpl.cc#L250-L254

### Modifications

Execute `callback` before executing `readerCreatedCallback_`.
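
For illustration, a standalone sketch of the ordering problem (the names below are illustrative, not the actual `ClientImpl`/`ReaderImpl` code): if the caller-visible callback runs before the client records the new consumer internally, a test that inspects the consumer list right after `createReader` returns can see the wrong size.

```
#include <cassert>
#include <functional>
#include <vector>

int main() {
    std::vector<int> consumers;  // stands in for the client's internal consumer list

    // Internal completion that registers the newly created consumer.
    std::function<void()> callback = [&consumers] { consumers.push_back(1); };

    // Completion that unblocks the caller of createReader().
    bool readerCreated = false;
    std::function<void()> readerCreatedCallback = [&readerCreated] { readerCreated = true; };

    // Buggy order: the caller is unblocked before the bookkeeping happens, so
    // it can observe a stale consumers.size() right after createReader returns:
    //   readerCreatedCallback(); callback();

    // Fixed order, as described in the modifications above:
    callback();
    readerCreatedCallback();

    assert(readerCreated && consumers.size() == 1);
    return 0;
}
```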
@RobertIndie
Member

This issue still exists when I use the latest master branch code:
https://github.com/apache/pulsar/runs/8086575635?check_suite_focus=true#step:11:256

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
[ RUN      ] ClientTest.testReferenceCount
2022-08-30 08:17:10.954 INFO  [140089060101824] ClientConnection:189 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-08-30 08:17:10.954 INFO  [140089060101824] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-08-30 08:17:10.956 INFO  [140089056999168] ClientConnection:375 | [127.0.0.1:41008 -> 127.0.0.1:6650] Connected to broker
2022-08-30 08:17:10.957 INFO  [140089056999168] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661847430, ] Getting connection from pool
2022-08-30 08:17:10.960 INFO  [140089056999168] ProducerImpl:186 | [persistent://public/default/client-test-reference-count-1661847430, ] Created producer on broker [127.0.0.1:41008 -> 127.0.0.1:6650]
2022-08-30 08:17:10.960 INFO  [140089060101824] ClientTest:210 | Reference count of the producer: 7
2022-08-30 08:17:10.960 INFO  [140089060101824] Client:88 | Subscribing on Topic :client-test-reference-count-1661847430
2022-08-30 08:17:10.961 INFO  [140089056999168] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661847430, my-sub, 0] Getting connection from pool
2022-08-30 08:17:10.962 INFO  [140089056999168] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1661847430, my-sub, 0] Created consumer on broker [127.0.0.1:41008 -> 127.0.0.1:6650]
2022-08-30 08:17:10.962 INFO  [140089060101824] ClientTest:216 | Reference count of the consumer: 7
2022-08-30 08:17:10.962 INFO  [140089056999168] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1661847430-reader, reader-a635288687, 1] Getting connection from pool
2022-08-30 08:17:10.964 INFO  [140089056999168] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1661847430-reader, reader-a635288687, 1] Created consumer on broker [127.0.0.1:41008 -> 127.0.0.1:6650]
2022-08-30 08:17:10.964 INFO  [140089060101824] ClientTest:224 | Reference count of the reader's underlying consumer: 3
2022-08-30 08:17:10.964 INFO  [140089060101824] ClientTest:228 | Reference count of the reader: 4
2022-08-30 08:17:10.964 WARN  [140089060101824] ConsumerImpl:116 | [persistent://public/default/client-test-reference-count-1661847430, my-sub, 0] Destroyed consumer which was not properly closed
2022-08-30 08:17:10.964 INFO  [140089060101824] ConsumerImpl:124 | [persistent://public/default/client-test-reference-count-1661847430, my-sub, 0] Closed consumer for race condition: 0
2022-08-30 08:17:10.964 INFO  [140089060101824] ProducerImpl:619 | Producer - [persistent://public/default/client-test-reference-count-1661847430, standalone-0-213] , [batchMessageContainer = { BatchMessageContainer [size = 0] [bytes = 0] [maxSize = 1000] [maxBytes = 131072] [topicName = persistent://public/default/client-test-reference-count-1661847430] [numberOfBatchesSent_ = 0] [averageBatchSize_ = 0] }]
2022-08-30 08:17:10.964 WARN  [140089060101824] ProducerImpl:114 | [persistent://public/default/client-test-reference-count-1661847430, standalone-0-213] Destroyed producer which was not properly closed
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:235: Failure
Expected equality of these values:
  consumers[1].use_count()
    Which is: 3
  0
[  FAILED  ] ClientTest.testReferenceCount (10 ms)
[----------] 1 test from ClientTest (10 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (10 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ClientTest.testReferenceCount

@BewareMyPower
Contributor Author

Another failure

2022-09-05 15:55:00.990 INFO  [139735811178176] ClientConnection:189 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2022-09-05 15:55:00.990 INFO  [139735811178176] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2022-09-05 15:55:00.991 INFO  [139735808075520] ClientConnection:375 | [127.0.0.1:54416 -> 127.0.0.1:6650] Connected to broker
2022-09-05 15:55:00.992 INFO  [139735808075520] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1662393300, ] Getting connection from pool
2022-09-05 15:55:01.000 INFO  [139735808075520] ProducerImpl:186 | [persistent://public/default/client-test-reference-count-1662393300, ] Created producer on broker [127.0.0.1:54416 -> 127.0.0.1:6650]
2022-09-05 15:55:01.000 INFO  [139735811178176] ClientTest:198 | Reference count of the producer: 7
2022-09-05 15:55:01.000 INFO  [139735811178176] Client:88 | Subscribing on Topic :client-test-reference-count-1662393300
2022-09-05 15:55:01.001 INFO  [139735808075520] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1662393300, my-sub, 0] Getting connection from pool
2022-09-05 15:55:01.002 INFO  [139735808075520] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1662393300, my-sub, 0] Created consumer on broker [127.0.0.1:54416 -> 127.0.0.1:6650]
2022-09-05 15:55:01.002 INFO  [139735811178176] ClientTest:204 | Reference count of the consumer: 7
2022-09-05 15:55:01.003 INFO  [139735808075520] HandlerBase:61 | [persistent://public/default/client-test-reference-count-1662393300-reader, reader-462cf2b281, 1] Getting connection from pool
2022-09-05 15:55:01.007 INFO  [139735808075520] ConsumerImpl:220 | [persistent://public/default/client-test-reference-count-1662393300-reader, reader-462cf2b281, 1] Created consumer on broker [127.0.0.1:54416 -> 127.0.0.1:6650]
2022-09-05 15:55:01.007 INFO  [139735811178176] ClientTest:212 | Reference count of the reader's underlying consumer: 3
2022-09-05 15:55:01.007 INFO  [139735811178176] ClientTest:216 | Reference count of the reader: 4
2022-09-05 15:55:01.007 WARN  [139735811178176] ConsumerImpl:116 | [persistent://public/default/client-test-reference-count-1662393300, my-sub, 0] Destroyed consumer which was not properly closed
2022-09-05 15:55:01.007 INFO  [139735811178176] ConsumerImpl:124 | [persistent://public/default/client-test-reference-count-1662393300, my-sub, 0] Closed consumer for race condition: 0
2022-09-05 15:55:01.007 INFO  [139735811178176] ProducerImpl:619 | Producer - [persistent://public/default/client-test-reference-count-1662393300, standalone-0-204] , [batchMessageContainer = { BatchMessageContainer [size = 0] [bytes = 0] [maxSize = 1000] [maxBytes = 131072] [topicName = persistent://public/default/client-test-reference-count-1662393300] [numberOfBatchesSent_ = 0] [averageBatchSize_ = 0] }]
2022-09-05 15:55:01.007 WARN  [139735811178176] ProducerImpl:114 | [persistent://public/default/client-test-reference-count-1662393300, standalone-0-204] Destroyed producer which was not properly closed
/pulsar/pulsar-client-cpp/tests/ClientTest.cc:223: Failure
Expected equality of these values:
  consumers[1].use_count()
    Which is: 3
  0

BewareMyPower self-assigned this Sep 8, 2022
@BewareMyPower
Contributor Author

Now it cannot be reproduced in my local env (macOS 12.3.1, Apple clang version 13.1.6), but it can be reproduced in an Ubuntu 20.04 container (Ubuntu 9.4.0-1ubuntu1~20.04.1). All the error logs are the same: the reference counts of the producer, consumer and reader are 7, 7 and 4.

I will investigate this flaky test further.

coderzc added a commit to coderzc/pulsar that referenced this issue Sep 14, 2022
…CreatedCallback_`. (apache#17325)

(cherry picked from commit 3bc50a4)
@BewareMyPower
Contributor Author

Steps to reproduce

First, start a container:

mvn clean install -DskipTests -Pcore-modules,-main
docker run -v $PWD:/pulsar -it apachepulsar/pulsar-build:ubuntu-20.04 /bin/bash

Then, run the following commands inside the container:

# Build the unit tests (increase the Docker memory if a segfault happens, or remove the -j4 option)
cd /pulsar/pulsar-client-cpp/
cmake -B _builds -DBUILD_DYNAMIC_LIB=OFF -DBUILD_PERF_TOOLS=OFF -DBUILD_PYTHON_WRAPPER=OFF
cmake --build _builds -j4

# Run the standalone
cd /pulsar/distribution/server/target/
tar zxf apache-pulsar-*.tar.gz
cd apache-pulsar-*-SNAPSHOT
./bin/pulsar-daemon start standalone -nss -nfw

# Run the test for multiple times
cd /pulsar/pulsar-client-cpp/_builds
./tests/main --gtest_filter='ClientTest.testReferenceCount' --gtest_repeat=10

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this issue Sep 14, 2022
Fixes apache#14848

### Motivation

There were several fixes on `ClientTest.testReferenceCount` but it's
still very flaky.

The root cause is that even after a `Reader` went out of scope and was
destructed, there was still a `Reader` object alive in the event loop
thread. See
https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88

To verify this point, I added some logs and saw:

```
2022-09-14 03:52:28.427 INFO  [140046042864960] Reader:39 | Reader ctor 0x7fffd2a7c110
# ...
2022-09-14 03:52:28.444 INFO  [140046039774976] Reader:42 | Reader ctor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
# ...
2022-09-14 03:52:28.445 INFO  [140046042864960] ClientTest:217 | Reference count of the reader: 4
# ...
2022-09-14 03:52:28.445 INFO  [140046042864960] ClientImpl:490 | Closing Pulsar client with 1 producers and 2 consumers
2022-09-14 03:52:28.445 INFO  [140046039774976] Reader:55 | Reader dtor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
```

The first `Reader` object 0x7fffd2a7c110 was constructed in main thread
140046042864960. However, it was destructed before another `Reader` object,
0x7f5f0273d720, which was constructed in event loop thread 140046039774976.
When the callback passed to `createReaderAsync` completed the promise,
`createReader` returned immediately, while the `Reader` object in the
callback was still in scope and not yet destructed.

Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds
a `shared_ptr<ConsumerImpl>`, if we check the reference count too
quickly, the reference count of the underlying consumer is still
positive because that `Reader` has not been destructed yet.

### Modifications

Since we cannot determine the precise destruction time of that `Reader`
object in the event loop thread, we have to wait for a while. This PR adds a
`waitUntil` utility function that waits, up to a given timeout, until a
condition is met, and uses it to wait until the reference count becomes 0
after the `Reader` object goes out of scope.

Replace `ASSERT_EQ` with `EXPECT_EQ` to let the test continue if an
expectation fails.
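
For illustration, a minimal sketch of such a wait helper (the exact signature added by the PR may differ): it polls a predicate until it returns true or the timeout elapses.

```
#include <chrono>
#include <functional>
#include <thread>

// Poll `condition` every 10 ms until it holds or `timeout` elapses.
// Returns whether the condition was eventually met.
template <typename Rep, typename Period>
bool waitUntil(std::chrono::duration<Rep, Period> timeout, std::function<bool()> condition) {
    auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (condition()) {
            return true;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return condition();
}
```

The test can then wait with something like `waitUntil(std::chrono::seconds(5), [&] { return consumerWeakPtr.use_count() == 0; })` after the `Reader` goes out of scope, instead of asserting the count immediately.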

### Verifying this change

Follow the steps here to reproduce:
apache#14848 (comment)

The test never failed even with `--gtest_repeat=100`.
merlimat pushed a commit that referenced this issue Sep 14, 2022
BewareMyPower pushed a commit that referenced this issue Sep 16, 2022
…CreatedCallback_`. (#17325) (#17629)

(cherry picked from commit 3bc50a4)
congbobo184 pushed a commit that referenced this issue Nov 8, 2022
(cherry picked from commit 4ef8dc5)
Technoboy- pushed a commit that referenced this issue Nov 15, 2022
congbobo184 pushed a commit that referenced this issue Nov 30, 2022
(cherry picked from commit 4ef8dc5)
liangyepianzhou pushed a commit that referenced this issue Dec 5, 2022
(cherry picked from commit 4ef8dc5)
nicoloboschi pushed a commit to datastax/pulsar that referenced this issue Dec 6, 2022
(cherry picked from commit 4ef8dc5)
(cherry picked from commit 380031d)
nicoloboschi pushed a commit to datastax/pulsar that referenced this issue Jan 10, 2023
…CreatedCallback_`. (apache#17325)

(cherry picked from commit 3bc50a4)
(cherry picked from commit 2672446)
nicoloboschi pushed a commit to datastax/pulsar that referenced this issue Jan 11, 2023
…CreatedCallback_`. (apache#17325)

(cherry picked from commit 3bc50a4)
(cherry picked from commit 2672446)