Flaky-test: C++ ClientTest.testReferenceCount #14848
Comments
Fixes apache#14848
Fixes apache#14719

### Motivation

apache#7793 introduced `testReferenceLeak` to guard against a cyclic reference in the reader. However, it added a field, `readerImplWeakPtr_`, that is used only by tests. Access to this field is not thread safe: the write happens in `handleConsumerCreated`, while the read can happen anywhere via the getter. So there is a small chance that `readerPtr` in `testReferenceLeak` doesn't point to the right object. In addition, we should only guarantee that the reference count becomes 0 after the producer, consumer or reader goes out of scope.

apache#14797 added `ClientTest.testReferenceCount`, but it is also flaky. The cause is that the shared pointer of `ProducerImpl` is published to another thread via `shared_from_this()`, while the test expects the reference count to be exactly 1.

### Modifications

- Remove `readerImplWeakPtr_` from `ReaderImpl` and get the weak pointer from `Reader` directly by adding a method to `PulsarFriend`.
- Add a check of the reader's reference count to `testReferenceCount` and remove the redundant `testReferenceLeak`.
- Instead of asserting that the reference count of the producer/consumer/reader is 1, only check that it is greater than 0.
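The idea behind the relaxed check can be sketched with plain `std::shared_ptr` and `std::weak_ptr`, independent of the Pulsar classes. `FakeImpl`, `publish` and `releasedAfterScopeExit` are hypothetical names used only for illustration; this is not the actual test code.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for an implementation class such as ProducerImpl.
struct FakeImpl : std::enable_shared_from_this<FakeImpl> {
    // Simulates handing an extra strong reference to another owner
    // (in a real client this could be a callback queued on another thread).
    std::shared_ptr<FakeImpl> publish() { return shared_from_this(); }
};

// Returns true if the object is released once every owner is gone.
bool releasedAfterScopeExit(int extraOwners) {
    std::weak_ptr<FakeImpl> weak;
    {
        auto impl = std::make_shared<FakeImpl>();
        weak = impl;
        std::vector<std::shared_ptr<FakeImpl>> others;
        for (int i = 0; i < extraOwners; ++i) {
            others.push_back(impl->publish());
        }
        // The exact count depends on how many owners exist at this moment,
        // so "greater than 0" is the only stable expectation in scope.
        assert(weak.use_count() > 0);
    }
    // All owners are destroyed here, so the count must have dropped to 0.
    return weak.use_count() == 0 && weak.expired();
}
```

The sketch is single-threaded, so the count is deterministic here; with a second thread holding a copy, only the "greater than 0" and "0 after scope exit" checks remain reliable.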
Fixes #14848 Fixes #14719 (cherry picked from commit f84ff57)
After #14854 was merged, I still get this error.
PTAL @BewareMyPower
@RobertIndie Good point. I'll fix it soon.
Fixes apache#13849
Fixes apache#14848

### Motivation

apache#11570 added `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers` for the case where `sendAsync` calls invoked after `closeAsync` has been called in another thread must complete with `ResultAlreadyClosed`. It is flaky because the synchronization between the two threads is not strict.

The test uses `sendStartLatch` to order `sendAsync` and `closeAsync`:

```
sendAsync 0,1,...,9 -> sendStartLatch is done -> closeAsync
```

However, it cannot guarantee that the remaining `sendAsync` calls happen after `closeAsync` is called. If they do not, every `sendAsync` call may complete with `ResultOk`.

Moreover, the test is meaningless: it requires strict synchronization between the two threads, in which case there is no need to run `sendAsync` and `closeAsync` in two threads.

The verification in this test is also wrong, see apache#13849 (comment). When `closeAsync` is called, the previous `sendAsync` calls might not have completed, so all `sendAsync` calls will complete with `ResultAlreadyClosed`, not only those called after `closeAsync`.

In addition, this PR also tries to fix the flaky `testReferenceCount`, whose assertion is too strict.

### Modifications

- Remove `testSendAsyncCloseAsyncConcurrentlyWithLazyProducers`.
- Only check that the reference count is greater than 0 instead of equal to 1.
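The verification problem described above can be illustrated with a toy, single-threaded model. `FakeProducer`, `ackAll` and `countAlreadyClosed` are hypothetical and only mimic the behaviour stated in the commit message (pending sends fail with "already closed" when the producer is closed); they are not the Pulsar client API.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

enum class Result { Ok, AlreadyClosed };
using SendCallback = std::function<void(Result)>;

// Hypothetical producer model: sends stay pending until acknowledged.
class FakeProducer {
   public:
    void sendAsync(SendCallback cb) {
        assert(cb && "a callback is required");
        if (closed_) {
            cb(Result::AlreadyClosed);  // sent after close
            return;
        }
        pending_.push_back(std::move(cb));
    }
    // Simulates the broker acknowledging every pending message.
    void ackAll() {
        for (auto& cb : pending_) cb(Result::Ok);
        pending_.clear();
    }
    // Closing fails every send that has not completed yet.
    void closeAsync() {
        closed_ = true;
        for (auto& cb : pending_) cb(Result::AlreadyClosed);
        pending_.clear();
    }

   private:
    std::vector<SendCallback> pending_;
    bool closed_ = false;
};

// Sends 10 messages, optionally acks them, closes, then sends 10 more.
// Returns how many of the 20 sends completed with AlreadyClosed.
int countAlreadyClosed(bool ackBeforeClose) {
    FakeProducer producer;
    int alreadyClosed = 0;
    auto cb = [&alreadyClosed](Result r) {
        if (r == Result::AlreadyClosed) ++alreadyClosed;
    };
    for (int i = 0; i < 10; ++i) producer.sendAsync(cb);
    if (ackBeforeClose) producer.ackAll();
    producer.closeAsync();
    for (int i = 0; i < 10; ++i) producer.sendAsync(cb);
    return alreadyClosed;
}
```

In this model the number of failed sends depends on whether the first batch completed before the close, which is why a test cannot assume that only the sends issued after `closeAsync` fail.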
Fixes #13849 Fixes #14848 (cherry picked from commit eeea9ca)
Fixes apache#13849 Fixes apache#14848 (cherry picked from commit eeea9ca) (cherry picked from commit 83b6833)
@BewareMyPower It looks like the test can still fail: https://github.com/apache/pulsar/runs/7948115233?check_suite_focus=true
This flaky test can be reproduced with `./tests/main --gtest_filter='ClientTest.testReferenceCount' --gtest_repeat=20` (the repeat count cannot be too large, e.g. 100, otherwise it might fail with
…CreatedCallback_`. (#17325)

Fixes #14848

### Motivation

We should execute `callback` before executing `readerCreatedCallback_`; otherwise, we may get the wrong consumers size. For more details, see:

https://github.com/apache/pulsar/blob/e23d312c04da1d82d35f9e2faf8a446f8e8a4eeb/pulsar-client-cpp/lib/ReaderImpl.cc#L84-L92
https://github.com/apache/pulsar/blob/c48a3243287c7d775459b6437d9f4b24ed44cf4c/pulsar-client-cpp/lib/ClientImpl.cc#L250-L254

### Modifications

Execute `callback` before executing `readerCreatedCallback_`.
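Why the order of the two callbacks matters can be sketched with a small single-threaded model. It assumes that `callback` is what registers the consumer with the client and that `readerCreatedCallback_` is what lets the caller observe the result, as the phrase "wrong consumers size" suggests. `FakeClient` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model of a client that tracks its consumers.
struct FakeClient {
    std::vector<int> consumers;    // stands in for the client's consumer list
    std::size_t observedSize = 0;  // what the caller sees when it is notified

    // Runs both callbacks in the requested order and returns the consumer
    // count that the "reader created" notification observed.
    std::size_t create(bool registerFirst) {
        std::function<void()> callback = [this] { consumers.push_back(1); };
        std::function<void()> readerCreatedCallback = [this] {
            observedSize = consumers.size();
        };
        if (registerFirst) {
            callback();
            readerCreatedCallback();
        } else {
            readerCreatedCallback();
            callback();
        }
        assert(consumers.size() == 1);  // registered in both orders eventually
        return observedSize;
    }
};
```

With registration first the notification sees one consumer; in the reverse order it sees zero, even though the consumer is registered a moment later.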
This issue still exists when I use the latest master branch code:
It can no longer be reproduced in my local environment (macOS 12.3.1, Apple clang version 13.1.6), but it can be reproduced in an Ubuntu 20.04 container (Ubuntu 9.4.0-1ubuntu1~20.04.1). All the error logs are the same: the reference counts of the producer, consumer and reader are 7, 7 and 4. I will investigate this flaky test further.
…CreatedCallback_`. (apache#17325) (cherry picked from commit 3bc50a4)
Steps to reproduce

First, build Pulsar and start a container:

    mvn clean install -DskipTests -Pcore-modules,-main
    docker run -v $PWD:/pulsar -it apachepulsar/pulsar-build:ubuntu-20.04 /bin/bash

Then, run the following commands inside the container:

    # Build the unit test (increase the Docker memory if a segfault happens, or remove the -j4 option)
    cd /pulsar/pulsar-client-cpp/
    cmake -B _builds -DBUILD_DYNAMIC_LIB=OFF -DBUILD_PERF_TOOLS=OFF -DBUILD_PYTHON_WRAPPER=OFF
    cmake --build _builds -j4
    # Run the standalone
    cd /pulsar/distribution/server/target/
    tar zxf apache-pulsar-*.tar.gz
    cd apache-pulsar-*-SNAPSHOT
    ./bin/pulsar-daemon start standalone -nss -nfw
    # Run the test multiple times
    cd /pulsar/pulsar-client-cpp/_builds
    ./tests/main --gtest_filter='ClientTest.testReferenceCount' --gtest_repeat=10
Fixes apache#14848

### Motivation

There were several fixes for `ClientTest.testReferenceCount`, but it is still very flaky. The root cause is that even after a `Reader` has gone out of scope and been destructed, another `Reader` object still exists in the event loop thread. See https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88

To verify this, I added some logs and saw:

```
2022-09-14 03:52:28.427 INFO [140046042864960] Reader:39 | Reader ctor 0x7fffd2a7c110
# ...
2022-09-14 03:52:28.444 INFO [140046039774976] Reader:42 | Reader ctor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
# ...
2022-09-14 03:52:28.445 INFO [140046042864960] ClientTest:217 | Reference count of the reader: 4
# ...
2022-09-14 03:52:28.445 INFO [140046042864960] ClientImpl:490 | Closing Pulsar client with 1 producers and 2 consumers
2022-09-14 03:52:28.445 INFO [140046039774976] Reader:55 | Reader dtor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
```

The first `Reader` object, 0x7fffd2a7c110, was constructed in the main thread 140046042864960. However, it was destructed before another `Reader` object, 0x7f5f0273d720, which was constructed in the event loop thread 140046039774976. When the callback passed to `createReaderAsync` completes the promise, `createReader` returns immediately, while the `Reader` object in the callback is still in scope and not yet destructed.

Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds a `shared_ptr<ConsumerImpl>`, if we check the reference count too quickly, the reference count of the underlying consumer is still positive because that `Reader` has not been destructed yet.

### Modifications

Since that `Reader` object lives in the event loop thread, we cannot determine precisely when it is destructed, so we have to wait for a while. This PR adds a `waitUntil` utility function that waits for at most a given time until a condition is met. The test then waits until the reference count becomes 0 after the `Reader` object goes out of scope.

Replace `ASSERT_EQ` with `EXPECT_EQ` to let the test continue if it fails.

### Verifying this change

Follow the steps here to reproduce: apache#14848 (comment)

The test never failed, even with `--gtest_repeat=100`.
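A helper like `waitUntil` could look roughly like the sketch below. The signature, the 10 ms polling interval and the `demoThirdPollSucceeds` function are assumptions made for illustration; the actual utility added by the PR may differ.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Polls `condition` until it returns true or `timeout` has elapsed.
// Returns true if the condition was met, false on timeout.
// (Assumed signature; the polling interval is an arbitrary choice.)
template <typename Rep, typename Period>
bool waitUntil(std::chrono::duration<Rep, Period> timeout,
               const std::function<bool()>& condition) {
    assert(condition && "condition must be callable");
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (condition()) {
            return true;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // One last check so a condition met right at the deadline still counts.
    return condition();
}

// Example: the condition becomes true on the third poll.
bool demoThirdPollSucceeds() {
    int polls = 0;
    return waitUntil(std::chrono::seconds(1),
                     [&polls] { return ++polls >= 3; });
}
```

In a test, the condition would typically be a lambda that reads the reference count through a `std::weak_ptr`, followed by an `EXPECT_EQ` on the final count, so that a timeout is reported as a normal test failure.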
…CreatedCallback_`. (#17325) (#17629) (cherry picked from commit 3bc50a4)
Fixes #14848 ### Motivation There were several fixes on `ClientTest.testReferenceCount` but it's still very flaky. The root cause is even after a `Reader` went out of the scope and destructed, there was still a `Reader` object existed in the thread of the event loop. See https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88 To verify this point, I added some logs and saw: ``` 2022-09-14 03:52:28.427 INFO [140046042864960] Reader:39 | Reader ctor 0x7fffd2a7c110 # ... 2022-09-14 03:52:28.444 INFO [140046039774976] Reader:42 | Reader ctor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3) # ... 2022-09-14 03:52:28.445 INFO [140046042864960] ClientTest:217 | Reference count of the reader: 4 # ... 2022-09-14 03:52:28.445 INFO [140046042864960] ClientImpl:490 | Closing Pulsar client with 1 producers and 2 consumers 2022-09-14 03:52:28.445 INFO [140046039774976] Reader:55 | Reader dtor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3) ``` The first `Reader` object 0x7fffd2a7c110 was constructed in main thread 140046042864960. However, it destructed before another `Reader` object 0x0x7f5f0273d720 that was constructed in event loop thread 140046039774976. When the callback passed to `createReaderAsync` completed the promise, the `createReader` immediately returns, at the same time the `Reader` object in the callback was still in the scope and not destructed. Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds a `shared_ptr<ConsumerImpl>`, if we check the reference count too quickly, the reference count of the underlying consumer is still positive because the `Reader` was not destructed at the moment. ### Modifications Since we cannot determine the precise destructed time point because that `Reader` object is in the event loop thread, we have to wait for a while. This PR adds a `waitUntil` utility function to wait for at most some time until the condition is met. 
Then wait until the reference count becomes 0 after the `Reader` object goes out of scope. Replace `ASSERT_EQ` with `EXPECT_EQ` to let the test continue if it failed. ### Verifying this change Following the steps here to reproduce: #14848 (comment) The test never failed even with `--gtest_repeat=100`. (cherry picked from commit 4ef8dc5)
Fixes #14848 ### Motivation There were several fixes on `ClientTest.testReferenceCount` but it's still very flaky. The root cause is even after a `Reader` went out of the scope and destructed, there was still a `Reader` object existed in the thread of the event loop. See https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88 To verify this point, I added some logs and saw: ``` 2022-09-14 03:52:28.427 INFO [140046042864960] Reader:39 | Reader ctor 0x7fffd2a7c110 # ... 2022-09-14 03:52:28.444 INFO [140046039774976] Reader:42 | Reader ctor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3) # ... 2022-09-14 03:52:28.445 INFO [140046042864960] ClientTest:217 | Reference count of the reader: 4 # ... 2022-09-14 03:52:28.445 INFO [140046042864960] ClientImpl:490 | Closing Pulsar client with 1 producers and 2 consumers 2022-09-14 03:52:28.445 INFO [140046039774976] Reader:55 | Reader dtor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3) ``` The first `Reader` object 0x7fffd2a7c110 was constructed in main thread 140046042864960. However, it destructed before another `Reader` object 0x0x7f5f0273d720 that was constructed in event loop thread 140046039774976. When the callback passed to `createReaderAsync` completed the promise, the `createReader` immediately returns, at the same time the `Reader` object in the callback was still in the scope and not destructed. Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds a `shared_ptr<ConsumerImpl>`, if we check the reference count too quickly, the reference count of the underlying consumer is still positive because the `Reader` was not destructed at the moment. ### Modifications Since we cannot determine the precise destructed time point because that `Reader` object is in the event loop thread, we have to wait for a while. This PR adds a `waitUntil` utility function to wait for at most some time until the condition is met. 
Then wait until the reference count becomes 0 after the `Reader` object goes out of scope. Replace `ASSERT_EQ` with `EXPECT_EQ` to let the test continue if it failed. ### Verifying this change Following the steps here to reproduce: #14848 (comment) The test never failed even with `--gtest_repeat=100`.
Fixes #14848

### Motivation

There were several fixes to `ClientTest.testReferenceCount`, but it is still very flaky. The root cause is that even after a `Reader` went out of scope and was destructed, another `Reader` object still existed in the event loop thread. See https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88

To verify this point, I added some logs and saw:

```
2022-09-14 03:52:28.427 INFO [140046042864960] Reader:39 | Reader ctor 0x7fffd2a7c110
# ...
2022-09-14 03:52:28.444 INFO [140046039774976] Reader:42 | Reader ctor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
# ...
2022-09-14 03:52:28.445 INFO [140046042864960] ClientTest:217 | Reference count of the reader: 4
# ...
2022-09-14 03:52:28.445 INFO [140046042864960] ClientImpl:490 | Closing Pulsar client with 1 producers and 2 consumers
2022-09-14 03:52:28.445 INFO [140046039774976] Reader:55 | Reader dtor 0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
```

The first `Reader` object 0x7fffd2a7c110 was constructed in the main thread 140046042864960. However, it was destructed before another `Reader` object 0x7f5f0273d720 that was constructed in the event loop thread 140046039774976. When the callback passed to `createReaderAsync` completes the promise, `createReader` returns immediately, while the `Reader` object in the callback is still in scope and not yet destructed. Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds a `shared_ptr<ConsumerImpl>`, if we check the reference count too quickly, the reference count of the underlying consumer is still positive because that `Reader` has not been destructed yet.

### Modifications

Since that `Reader` object lives in the event loop thread, we cannot determine the precise time at which it is destructed, so we have to wait for a while. This PR adds a `waitUntil` utility function that waits, up to a timeout, until a condition is met. The test then waits until the reference count becomes 0 after the `Reader` object goes out of scope. In addition, `ASSERT_EQ` is replaced with `EXPECT_EQ` so that the test continues after a failed check.

### Verifying this change

Following the reproduction steps in #14848 (comment), the test never failed even with `--gtest_repeat=100`.

(cherry picked from commit 4ef8dc5)
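
The waiting approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `waitUntil` signature, the 10 ms polling interval, and the `FakeImpl` and `demoLingeringCopy` names are all assumptions made for the example. A background thread stands in for the event loop thread that briefly keeps a copy of the shared pointer alive.

```cpp
#include <chrono>
#include <functional>
#include <memory>
#include <thread>

// Hypothetical helper: poll `condition` until it holds or `timeout` elapses.
// Returns true if the condition was met in time.
template <typename Rep, typename Period>
bool waitUntil(std::chrono::duration<Rep, Period> timeout, const std::function<bool()>& condition) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (condition()) {
            return true;
        }
        // Assumed polling interval; any small value works for the sketch.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return condition();  // one last check at the deadline
}

// Hypothetical stand-in for the implementation object behind a reader.
struct FakeImpl {};

// Mimics the race: the owner goes out of scope while another thread still
// holds a copy of the shared pointer for a short time.
bool demoLingeringCopy() {
    std::weak_ptr<FakeImpl> weak;
    std::thread background;
    {
        auto owner = std::make_shared<FakeImpl>();
        weak = owner;
        // The background thread plays the role of the event loop thread.
        background = std::thread([copy = owner]() mutable {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            copy.reset();  // the lingering copy is released a bit later
        });
    }
    // `owner` is gone here, but checking use_count() right away could still
    // see the background copy, so wait instead of asserting immediately.
    const bool released =
        waitUntil(std::chrono::seconds(3), [&weak] { return weak.use_count() == 0; });
    background.join();
    return released;
}
```

In a test, the result of such a helper would typically be checked with `EXPECT_TRUE` or compared via `EXPECT_EQ`, so that a timeout is reported without aborting the remaining checks.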
Fixes apache#14848 (cherry picked from commit 4ef8dc5) (cherry picked from commit 380031d)
…CreatedCallback_`. (apache#17325)

Fixes apache#14848

### Motivation

We should execute `callback` before executing `readerCreatedCallback_`; otherwise, we may get the wrong consumers size. For more details, see:

https://github.com/apache/pulsar/blob/e23d312c04da1d82d35f9e2faf8a446f8e8a4eeb/pulsar-client-cpp/lib/ReaderImpl.cc#L84-L92

https://github.com/apache/pulsar/blob/c48a3243287c7d775459b6437d9f4b24ed44cf4c/pulsar-client-cpp/lib/ClientImpl.cc#L250-L254

### Modifications

Execute `callback` before executing `readerCreatedCallback_`.

(cherry picked from commit 3bc50a4)
(cherry picked from commit 2672446)
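
Why the order matters can be illustrated with a simplified, single-threaded sketch. It assumes, as one reading of the linked code, that `callback` is the internal step that registers the reader's consumer with the client, while `readerCreatedCallback_` is the step that hands the reader to the user. `FakeClient` and `observedConsumers` are hypothetical names, and nothing here is the actual Pulsar code.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for the client's bookkeeping of live consumers.
struct FakeClient {
    std::vector<int> consumers;
};

// Runs the two callbacks in the requested order and returns the number of
// consumers that the user-facing callback observed.
std::size_t observedConsumers(bool internalFirst) {
    FakeClient client;
    std::size_t observed = 0;

    // Assumed role of `callback`: register the reader's consumer.
    std::function<void()> callback = [&client] { client.consumers.push_back(1); };
    // Assumed role of `readerCreatedCallback_`: notify the user, who may
    // immediately inspect the client's state.
    std::function<void()> readerCreatedCallback = [&client, &observed] {
        observed = client.consumers.size();
    };

    if (internalFirst) {
        callback();
        readerCreatedCallback();
    } else {
        readerCreatedCallback();
        callback();
    }
    return observed;
}
```

With the internal step first, the user-facing callback sees the registered consumer; in the reverse order it sees an empty list, which corresponds to the "wrong consumers size" mentioned above.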
ClientTest.testReferenceCount is flaky. It fails sporadically.
example failure