
HDDS-11463. Track and display failed DataNode storage locations in SCM. #7266

Open

slfan1989 wants to merge 4 commits into master from the HDDS-11463 branch

Conversation

slfan1989 (Contributor) commented Oct 3, 2024

What changes were proposed in this pull request?

Currently, SCM offers no tooling to track failed disks on DataNodes, even though DataNodes already report this information; we only need to surface it.

This PR displays the failed volumes reported by each DataNode. The information can be rendered in the default format, as JSON, or as a table.

Default format

Datanode Volume Failures (5 Volumes)

Node         : localhost-62.238.104.185 (de97aaf3-99ad-449d-ad92-2c4f5a744b49) 
Failed Volume: /data0/ozonedata/hdds 
Capacity Lost: 7430477791683 B (6.76 TB) 
Failure Date : Thu Oct 03 09:25:16 +0800 2024 

Node         : localhost-163.120.165.68 (cf40e987-8952-4f7a-88b7-096e6b285243) 
Failed Volume: /data1/ozonedata/hdds 
Capacity Lost: 7430477791683 B (6.76 TB) 
Failure Date : Thu Oct 03 09:25:16 +0800 2024 

Node         : localhost-253.243.206.120 (0cc77921-489d-4cf0-a036-475faa16d443) 
Failed Volume: /data2/ozonedata/hdds 
Capacity Lost: 7430477791683 B (6.76 TB) 
Failure Date : Thu Oct 03 09:25:16 +0800 2024 

Node         : localhost-136.194.243.81 (5cb6430d-0ce5-4204-b265-179ee38fb30e) 
Failed Volume: /data3/ozonedata/hdds 
Capacity Lost: 7430477791683 B (6.76 TB) 
Failure Date : Thu Oct 03 09:25:16 +0800 2024 

Node         : localhost-48.253.209.226 (f99a8374-edb0-419d-9cba-cfab9d9e8a2e) 
Failed Volume: /data4/ozonedata/hdds 
Capacity Lost: 7430477791683 B (6.76 TB) 
Failure Date : Thu Oct 03 09:25:16 +0800 2024 

JSON format

[ {
  "node" : "localhost-161.170.151.131 (155bb574-7ed8-41cd-a868-815f4c2b0d60)",
  "volumeName" : "/data0/ozonedata/hdds",
  "failureDate" : 1727918794694,
  "capacityLost" : 7430477791683
}, {
  "node" : "localhost-67.218.46.23 (520d29eb-8387-4cda-bcb1-8727fdddd451)",
  "volumeName" : "/data1/ozonedata/hdds",
  "failureDate" : 1727918794695,
  "capacityLost" : 7430477791683
}, {
  "node" : "localhost-30.151.88.21 (d66cab50-bbf8-4199-9d7f-82da84a30137)",
  "volumeName" : "/data2/ozonedata/hdds",
  "failureDate" : 1727918794695,
  "capacityLost" : 7430477791683
}, {
  "node" : "localhost-78.50.38.217 (a673f50a-6f74-4e62-8c0c-f7337d5f3ce5)",
  "volumeName" : "/data3/ozonedata/hdds",
  "failureDate" : 1727918794695,
  "capacityLost" : 7430477791683
}, {
  "node" : "localhost-138.205.52.25 (84b7e49a-9bd4-4115-96fa-69f2d259343c)",
  "volumeName" : "/data4/ozonedata/hdds",
  "failureDate" : 1727918794695,
  "capacityLost" : 7430477791683
} ]
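
For reference, JSON in the shape above can be produced by serializing a plain value object with Jackson's default pretty printer. A minimal sketch, assuming a simple POJO with field names taken from the sample output (not the patch's actual classes):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

// Plain value object with the field names taken from the JSON sample.
public class VolumeFailureJsonDemo {
  public String node;
  public String volumeName;
  public long failureDate;  // epoch millis
  public long capacityLost; // bytes

  public static void main(String[] args) throws Exception {
    VolumeFailureJsonDemo info = new VolumeFailureJsonDemo();
    info.node = "localhost-161.170.151.131 (155bb574-7ed8-41cd-a868-815f4c2b0d60)";
    info.volumeName = "/data0/ozonedata/hdds";
    info.failureDate = 1727918794694L;
    info.capacityLost = 7430477791683L;
    // Jackson's default pretty printer yields the "[ { ... } ]" layout
    // with " : " separators seen in the sample above.
    System.out.println(new ObjectMapper().writerWithDefaultPrettyPrinter()
        .writeValueAsString(List.of(info)));
  }
}
```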

Table format

+-------------------------------------------------------------------------------------------------------------------------------------------+
|                                                         Datanode Volume Failures                                                          |
+------------------------------------------------------------------+-----------------------+---------------+--------------------------------+
|                               Node                               |      Volume Name      | Capacity Lost |          Failure Date          |
+------------------------------------------------------------------+-----------------------+---------------+--------------------------------+
|  localhost-83.212.219.28 (8b6addb1-759a-49e9-99fb-0d1a6cfb2d7f)  | /data0/ozonedata/hdds |    6.76 TB    | Sat Oct 05 17:47:47 +0800 2024 |
| localhost-103.199.236.47 (0dbe503a-3382-4753-b95a-447bab5766c4)  | /data1/ozonedata/hdds |    6.76 TB    | Sat Oct 05 17:47:47 +0800 2024 |
|  localhost-178.123.46.32 (2017076a-e763-4f47-abce-78535b5770a3)  | /data2/ozonedata/hdds |    6.76 TB    | Sat Oct 05 17:47:47 +0800 2024 |
| localhost-123.112.235.228 (aaebb6a7-6b62-4160-9934-b16b8fdde65e) | /data3/ozonedata/hdds |    6.76 TB    | Sat Oct 05 17:47:47 +0800 2024 |
| localhost-249.235.216.19 (cbc7c0b5-5ae0-4e40-91b8-1d9c419a007c)  | /data4/ozonedata/hdds |    6.76 TB    | Sat Oct 05 17:47:47 +0800 2024 |
+------------------------------------------------------------------+-----------------------+---------------+--------------------------------+

What is the link to the Apache JIRA

JIRA: HDDS-11463 (https://issues.apache.org/jira/browse/HDDS-11463). Track and display failed DataNode storage locations in SCM.

How was this patch tested?

Added JUnit tests and verified in a test environment.

@slfan1989 slfan1989 marked this pull request as ready for review October 5, 2024 11:48
@slfan1989 slfan1989 marked this pull request as draft October 6, 2024 01:46
@slfan1989 slfan1989 closed this Oct 22, 2024
@slfan1989 slfan1989 reopened this Oct 23, 2024
@slfan1989 slfan1989 marked this pull request as ready for review October 23, 2024 04:41
slfan1989 (Contributor Author):

@errose28 Could you please help review this PR? Thank you very much! We discussed the relevant implementation together in HDDS-11463.

errose28 (Contributor) left a comment:

Thanks for working on this @slfan1989; this looks like a useful addition. I only had time for a quick high-level look for now.

* Handler of ozone admin scm volumesfailure command.
*/
@Command(
name = "volumesfailure",
Contributor:

For the CLI, we should probably use something like ozone admin datanode volume list. The datanode subcommand is already used to retrieve information about datanodes from SCM. Splitting the commands so that volume has its own subcommand gives us more options in the future.

To distinguish failed and healthy volumes and filter out different nodes, we can either add some kind of filter flag, or leave it up to grep/jq to be applied to the output.
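
A minimal picocli sketch of that layout, with hypothetical class names and a --failed filter flag (none of this is taken from the patch):

```java
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;

// Hypothetical layout: "volume" becomes its own subcommand under
// "ozone admin datanode", so "ozone admin datanode volume list" can
// gain more operations later (names are illustrative only).
@Command(name = "volume",
    description = "Commands for datanode volumes",
    subcommands = {VolumeListSubcommand.class})
class VolumeCommands {
}

@Command(name = "list",
    description = "List volumes reported by datanodes to SCM")
class VolumeListSubcommand implements Runnable {

  // One of the two filtering approaches discussed above: a flag on the
  // command itself (the alternative is leaving it to grep/jq).
  @Option(names = "--failed",
      description = "Show only failed volumes")
  private boolean failedOnly;

  @Override
  public void run() {
    // Would fetch volume reports from SCM and print them here.
  }
}
```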

Contributor:

This also means we should make the RPC more generic to support pulling all volume information.

Contributor Author:

Thank you for helping to review this PR! I will continue to improve the relevant code based on your suggestions.

@@ -382,6 +383,7 @@ public abstract static class Builder<T extends Builder<T>> {
private boolean failedVolume = false;
private String datanodeUuid;
private String clusterID;
private long failureDate;
Contributor:

Let's use failureTime. I'm assuming this is being stored as millis since epoch, so it will have date and time information.

Contributor Author:

I have improved the relevant code.

// Ensure it is set only once,
// which is the time when the failure was first detected.
if (failureDate == 0L) {
setFailureDate(Time.now());
Contributor:

Let's use Instant.now() per HDDS-7911.
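
Putting the two comments together, the snippet above might end up looking roughly like this; the enclosing class and method are illustrative, not the patch's actual code:

```java
import java.time.Instant;

// Sketch with both review comments applied: the field is renamed from
// failureDate to failureTime, and the timestamp comes from
// java.time.Instant per HDDS-7911.
class VolumeFailureTracker {
  private long failureTime; // epoch millis; 0 until a failure is seen

  void setFailureTime(long failureTime) {
    this.failureTime = failureTime;
  }

  void markFailed() {
    // Ensure it is set only once, when the failure is first detected.
    if (failureTime == 0L) {
      setFailureTime(Instant.now().toEpochMilli());
    }
  }
}
```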

Contributor Author:

@errose28 Can you help review this PR again? Thank you very much!

@adoroszlai adoroszlai marked this pull request as draft November 5, 2024 18:30
adoroszlai (Contributor):

Thanks @slfan1989 for working on this. Converted it to draft because there is a failing test:

[ERROR] org.apache.hadoop.hdds.scm.node.TestSCMNodeManager.tesVolumeInfoFromNodeReport  Time elapsed: 1.105 s  <<< ERROR!
java.lang.UnsupportedOperationException
	at java.base/java.util.AbstractList.add(AbstractList.java:153)
	at java.base/java.util.AbstractList.add(AbstractList.java:111)
	at org.apache.hadoop.hdds.scm.node.DatanodeInfo.updateStorageReports(DatanodeInfo.java:186)
	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.processNodeReport(SCMNodeManager.java:674)
	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:423)
	at org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:360)
	at org.apache.hadoop.hdds.scm.node.TestSCMNodeManager.tesVolumeInfoFromNodeReport(TestSCMNodeManager.java:1591)

https://github.com/slfan1989/ozone/actions/runs/11471452180
https://github.com/slfan1989/ozone/actions/runs/11476535807
https://github.com/slfan1989/ozone/actions/runs/11625983369
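
For context, AbstractList.add throwing UnsupportedOperationException usually means add() was called on a fixed-size or unmodifiable list, such as one returned by Arrays.asList or Collections.unmodifiableList; the usual fix is to copy the elements into a mutable ArrayList before mutating. A standalone illustration (not the actual DatanodeInfo code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FixedSizeListDemo {
  public static void main(String[] args) {
    // Arrays.asList returns a fixed-size view backed by the array;
    // add() is inherited from AbstractList and throws.
    List<String> reports = Arrays.asList("/data0/ozonedata/hdds");
    try {
      reports.add("/data1/ozonedata/hdds");
    } catch (UnsupportedOperationException e) {
      System.out.println("add() unsupported on fixed-size list");
    }

    // Copying into a mutable ArrayList before modifying avoids the error.
    List<String> mutable = new ArrayList<>(reports);
    mutable.add("/data1/ozonedata/hdds");
    System.out.println(mutable); // [/data0/ozonedata/hdds, /data1/ozonedata/hdds]
  }
}
```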

@slfan1989 slfan1989 force-pushed the HDDS-11463 branch 2 times, most recently from a83a8f7 to b1df492 Compare November 6, 2024 09:26
slfan1989 (Contributor Author):

> Thanks @slfan1989 for working on this. Converted it to draft because there is a failing test: […]

@adoroszlai Thank you for reviewing this PR! I am currently making improvements, and once the changes pass the CI tests in my branch, I will reopen the PR.

cc: @errose28

slfan1989 (Contributor Author):

@adoroszlai Thank you for reviewing this PR! I will also pay closer attention to CI issues in future development. I understand that CI testing resources are valuable.

I have made improvements to the code based on @errose28's suggestions and also fixed the related unit test errors. The CI for my branch has passed (https://github.com/slfan1989/ozone/actions/runs/11719380711), and I have updated the PR status to 'Ready for Review'.

@slfan1989 slfan1989 marked this pull request as ready for review November 7, 2024 23:48
@adoroszlai adoroszlai requested a review from errose28 November 8, 2024 04:59
slfan1989 (Contributor Author):

@errose28 Could you please help review this PR again? Thank you very much! I've made some additional improvements to this PR, since we wanted to print all the disk information. However, as there can be quite a lot of disk data, I've added pagination functionality.
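
A rough sketch of what such pagination options could look like with picocli; the flag names and helper below are assumptions for illustration, not necessarily what this PR implements:

```java
import java.util.List;
import picocli.CommandLine.Option;

// Hypothetical paging options for the volume listing command.
class PagingOptions {
  @Option(names = "--start", defaultValue = "0",
      description = "Index of the first record to print")
  int start;

  @Option(names = "--count", defaultValue = "20",
      description = "Maximum number of records to print")
  int count;

  // Returns the requested page as a view of the full result list.
  <T> List<T> page(List<T> all) {
    int from = Math.min(Math.max(start, 0), all.size());
    int to = Math.min(from + count, all.size());
    return all.subList(from, to);
  }
}
```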
