HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose #7099

Merged (2 commits) on Aug 20, 2024

jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose

Please describe your PR in detail:
The assertion assumes that the buffer pool has allocated exactly 4 buffers.

After HDDS-9844, block output stream buffer allocation is asynchronous, so fewer buffers may be allocated than before: some responses may already have come back, and their associated buffers can be reused.
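
To illustrate the race described above, here is a minimal sketch. It is not Ozone's real buffer pool or the actual patch: `SketchBufferPool`, `BufferPoolSketch`, and `allocatedAfterWrites` are hypothetical names, and the asynchronous response is simulated with a boolean flag rather than real threads. It only shows why the number of allocated buffers depends on whether responses arrive before the next write.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a buffer pool; not Ozone's real implementation. */
class SketchBufferPool {
  private final int capacity;
  private final Deque<byte[]> free = new ArrayDeque<>();
  private int allocated; // number of buffers ever created

  SketchBufferPool(int capacity) {
    this.capacity = capacity;
  }

  /** Reuses a released buffer if one is available, otherwise creates a new one. */
  byte[] acquire() {
    byte[] reused = free.poll();
    if (reused != null) {
      return reused;
    }
    if (allocated >= capacity) {
      throw new IllegalStateException("pool exhausted");
    }
    allocated++;
    return new byte[16];
  }

  /** Returns a buffer to the pool, as would happen once its write is acknowledged. */
  void release(byte[] buffer) {
    free.push(buffer);
  }

  int getAllocated() {
    return allocated;
  }
}

class BufferPoolSketch {
  /**
   * Writes the given number of chunks. When ackEarly is true, each chunk's
   * response "arrives" before the next write, so its buffer is released first.
   */
  static int allocatedAfterWrites(int chunks, boolean ackEarly) {
    SketchBufferPool pool = new SketchBufferPool(chunks);
    for (int i = 0; i < chunks; i++) {
      byte[] buffer = pool.acquire();
      if (ackEarly) {
        pool.release(buffer);
      }
    }
    return pool.getAllocated();
  }

  public static void main(String[] args) {
    System.out.println(allocatedAfterWrites(4, false)); // no early responses
    System.out.println(allocatedAfterWrites(4, true));  // every response arrives early
  }
}
```

Under these assumptions, a test asserting that exactly 4 buffers were allocated passes only when no response arrives early. Asserting an upper bound instead (at most 4) would tolerate both orderings; whether the actual patch takes that approach is not shown in this description.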

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11325

How was this patch tested?

Unit test change only.

…#testContainerClose.

Change-Id: Ie3cad994ad003bac75b1cae82c63cb8d9b45684f
Change-Id: I9be6fad05c1c859779499ccc962767dfd57fc3d0
@jojochuang jojochuang added the hbase HBase on Ozone support label Aug 20, 2024
Contributor

@duongkame duongkame left a comment

LGTM. Thanks for the fix @jojochuang .
Can you also run a repeated test on TestBlockOutputStreamWithFailures#testContainerClose to confirm that this fixes the intermittent failure?

To run the repeated test on your branch, you have to update your fork's master branch; a workflow will then be available at https://github.com/jojochuang/ozone/actions/workflows/intermittent-test-check.yml

You can refer to the repeating test submissions here

@jojochuang
Contributor Author

Cool beans! Running it now.

@jojochuang
Contributor Author

Here's the 10x10 test run, with no failures:
https://github.com/jojochuang/ozone/actions/runs/10467209337

@jojochuang jojochuang merged commit 2d3ba80 into apache:master Aug 20, 2024
28 checks passed
@jojochuang
Contributor Author

Thanks @duongkame, merged.

errose28 added a commit to errose28/ozone that referenced this pull request Aug 21, 2024
* master: (50 commits)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  HDDS-11340. Avoid extra PubBlock call when a full block is closed (apache#7094)
  HDDS-11155. Improve Volumes page UI (apache#7048)
  HDDS-11324. Negative value preOpLatencyMs in DN audit log (apache#7093)
  HDDS-11246. [Recon] Use optional chaining instead of explicit undefined check for Objects in Container and Pipeline Module. (apache#7037)
  HDDS-11323. Mark TestLeaseRecovery as flaky
  HDDS-11338. Bump zstd-jni to 1.5.6-4 (apache#7085)
  HDDS-11337. Bump Spring Framework to 5.3.39 (apache#7084)
  HDDS-11327. [hsync] Revert config default ozone.fs.hsync.enabled to false (apache#7079)
  HDDS-11325. Mark testWriteMoreThanMaxFlushSize as flaky
  HDDS-11336. Bump slf4j to 2.0.16 (apache#7086)
  HDDS-11335. Bump exec-maven-plugin to 3.4.1 (apache#7087)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
@errose28
Contributor

I'm trying to assess the current health of the master branch before bringing it into our feature branch, and I'm confused by this change:

  • The Jira title says this is a fix for TestBlockOutputStreamWithFailures#testContainerClose
  • The code change is a fix for TestBlockOutputStream#testWriteMoreThanMaxFlushSize
  • The flaky annotation was not removed from TestBlockOutputStream#testWriteMoreThanMaxFlushSize. I'm not sure if that means it was run in the test split or not.

So is this PR supposed to fix failures in TestBlockOutputStream#testWriteMoreThanMaxFlushSize, TestBlockOutputStreamWithFailures#testContainerClose, or both?

@jojochuang
Contributor Author

It's supposed to fix both. TestBlockOutputStreamWithFailures#testContainerClose calls TestBlockOutputStream#testWriteMoreThanMaxFlushSize.

@errose28
Contributor

> TestBlockOutputStreamWithFailures#testContainerClose calls TestBlockOutputStream#testWriteMoreThanMaxFlushSize

IntelliJ shows no callers of TestBlockOutputStream#testWriteMoreThanMaxFlushSize other than the test annotation. TestBlockOutputStreamWithFailures#testContainerClose calls TestBlockOutputStreamWithFailures#testWriteMoreThanMaxFlushSize, which is a different, private method in TestBlockOutputStreamWithFailures. I'm looking at recent master fb43023.
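
The name clash described above can be sketched as follows. This is a simplified, hypothetical illustration, not the real test code: `BaseStreamTest` stands in for TestBlockOutputStream and `FailureStreamTest` for TestBlockOutputStreamWithFailures, and the methods are made static and return a string only so the call target is visible.

```java
/** Hypothetical stand-in for TestBlockOutputStream. */
class BaseStreamTest {
  static String testWriteMoreThanMaxFlushSize() {
    return "BaseStreamTest";
  }
}

/** Hypothetical stand-in for TestBlockOutputStreamWithFailures. */
class FailureStreamTest {
  // Same name as the method above, but a separate private method of this class.
  private static String testWriteMoreThanMaxFlushSize() {
    return "FailureStreamTest";
  }

  // An unqualified call resolves to this class's own private method.
  static String testContainerClose() {
    return testWriteMoreThanMaxFlushSize();
  }

  public static void main(String[] args) {
    System.out.println(testContainerClose());
    System.out.println(BaseStreamTest.testWriteMoreThanMaxFlushSize());
  }
}
```

In this sketch, editing the method in `BaseStreamTest` has no effect on what `testContainerClose` runs, which is the situation the comment above describes.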

The flaky annotations on both tests mentioned in my original comment are also still present, and they now point to a resolved Jira, so it is still unclear what was fixed here.

@jojochuang
Contributor Author

Aaah, OK. I should reopen and redo it.

errose28 added a commit to errose28/ozone that referenced this pull request Aug 26, 2024
…an-on-error

* HDDS-10239-container-reconciliation: (428 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
errose28 added a commit to errose28/ozone that referenced this pull request Aug 28, 2024
…rrupt-files

* HDDS-10239-container-reconciliation: (61 commits)
  HDDS-11081. Use thread-local instance of FileSystem in Freon tests (apache#7091)
  HDDS-11333. Avoid hard-coded current version in upgrade/xcompat tests (apache#7089)
  Mark TestPipelineManagerMXBean#testPipelineInfo as flaky
  Mark TestAddRemoveOzoneManager#testForceBootstrap as flaky
  HDDS-11352. HDDS-11353. Mark TestOzoneManagerHAWithStoppedNodes as flaky
  HDDS-11354. Mark TestOzoneManagerSnapshotAcl#testLookupKeyWithNotAllowedUserForPrefixAcl as flaky
  HDDS-11355. Mark TestMultiBlockWritesWithDnFailures#testMultiBlockWritesWithIntermittentDnFailures as flaky
  HDDS-11227. Use server default key provider to encrypt/decrypt keys from multiple OMs. (apache#7081)
  HDDS-11316. Improve Create Key and Chunk IO Dashboards (apache#7075)
  HDDS-11239. Fix KeyOutputStream's exception handling when calling hsync concurrently (apache#7047)
  HDDS-11254. Reconcile commands should be handled by datanode ReplicationSupervisor (apache#7076)
  HDDS-11331. Fix Datanode unable to report for a long time (apache#7090)
  HDDS-11346. FS CLI gives incorrect recursive volume deletion prompt (apache#7102)
  HDDS-11349. Add NullPointer handling when volume/bucket tables are not initialized (apache#7103)
  HDDS-11209. Avoid insufficient EC pipelines in the container pipeline cache (apache#6974)
  HDDS-11284. refactor quota repair non-blocking while upgrade (apache#7035)
  HDDS-9790. Add tests for Overview page (apache#6983)
  HDDS-10904. [hsync] Enable PutBlock piggybacking and incremental chunk list by default (apache#7074)
  HDDS-11322. [hsync] Block ECKeyOutputStream from calling hsync and hflush (apache#7098)
  HDDS-11325. Intermittent failure in TestBlockOutputStreamWithFailures#testContainerClose (apache#7099)
  ...

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerChecksumTreeManager.java