HDDS-11463. Track and display failed DataNode storage locations in SCM. #7266
Conversation
@errose28 Could you please help review this PR? Thank you very much! We discussed the relevant implementation together in HDDS-11463.
Thanks for working on this @slfan1989, this looks like a useful addition. I only had time for a quick high level look for now.
 * Handler of ozone admin scm volumesfailure command.
 */
@Command(
    name = "volumesfailure",
For the CLI, we should probably use something like ozone admin datanode volume list. The datanode subcommand is already used to retrieve information about datanodes from SCM. Splitting the commands so that volume has its own subcommand gives us more options in the future.

To distinguish failed and healthy volumes and filter out different nodes, we can either add some kind of filter flag, or leave it up to grep/jq to be applied to the output.
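A minimal picocli sketch of that layout, assuming the split suggested above; the class names, the wiring, and the --failed flag are illustrative, not code from this PR:

import java.util.concurrent.Callable;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;

// Illustrative only: a "volume" subcommand with its own "list" command,
// rather than a single "volumesfailure" handler.
@Command(name = "volume",
    description = "Commands for datanode volumes",
    subcommands = {VolumeListSubcommand.class})
class VolumeCommands {
}

@Command(name = "list",
    description = "List volumes reported by datanodes to SCM")
class VolumeListSubcommand implements Callable<Void> {

  // Optional filter flag; filtering could also be left to grep/jq
  // applied to the output.
  @Option(names = {"--failed"}, description = "Show only failed volumes")
  private boolean failedOnly;

  @Override
  public Void call() throws Exception {
    // The real command would query SCM for volume reports here
    // and print them, honoring the filter.
    return null;
  }
}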
This also means we should make the RPC more generic to support pulling all volume information.
Thank you for helping to review this PR! I will continue to improve the relevant code based on your suggestions.
@@ -382,6 +383,7 @@ public abstract static class Builder<T extends Builder<T>> {
  private boolean failedVolume = false;
  private String datanodeUuid;
  private String clusterID;
  private long failureDate;
Let's use failureTime. I'm assuming this is being stored as millis since epoch, so it will have date and time information.
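For illustration, a millis-since-epoch value does carry both pieces of information (a standalone sketch, not code from the PR):

import java.time.Instant;

public class FailureTimeDemo {
  public static void main(String[] args) {
    // Assuming failureTime is stored as millis since epoch.
    long failureTime = System.currentTimeMillis();
    // Prints an ISO-8601 UTC timestamp, e.g. 2024-10-23T12:34:56.789Z,
    // so both the date and the time are recoverable from the one field.
    System.out.println(Instant.ofEpochMilli(failureTime));
  }
}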
I have improved the relevant code.
// Ensure it is set only once,
// which is the time when the failure was first detected.
if (failureDate == 0L) {
  setFailureDate(Time.now());
Let's use Instant.now() per HDDS-7911.
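A sketch of the quoted snippet with both review suggestions applied (failureDate renamed to failureTime, Instant.now() instead of Time.now()); the surrounding class is a stand-in, not the PR's actual builder:

import java.time.Instant;

// Illustrative rewrite of the quoted snippet; not the PR's final code.
class VolumeInfoSketch {
  private long failureTime;

  void markFailed() {
    // Ensure it is set only once,
    // which is the time when the failure was first detected.
    if (failureTime == 0L) {
      setFailureTime(Instant.now().toEpochMilli());
    }
  }

  void setFailureTime(long failureTime) {
    this.failureTime = failureTime;
  }
}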
@errose28 Can you help review this PR again? Thank you very much!
Thanks @slfan1989 for working on this. Converted it to draft because there is a failing test: https://github.com/slfan1989/ozone/actions/runs/11471452180
Force-pushed from a83a8f7 to b1df492.
@adoroszlai Thank you for reviewing this PR! I am currently making improvements, and once the changes pass the CI tests in my branch, I will reopen the PR. cc: @errose28
@adoroszlai Thank you for reviewing this PR! I will also pay closer attention to CI issues in future development; I understand that CI testing resources are valuable. I have made improvements to the code based on @errose28's suggestions and fixed the related unit test errors. The CI for my branch has passed (https://github.com/slfan1989/ozone/actions/runs/11719380711), and I have updated the PR status to 'Ready for Review'.
@errose28 Could you please help review this PR again? Thank you very much! I've made some additional improvements to this PR: since we wanted to print all the disk information and there is quite a lot of disk data, I've added pagination functionality.
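A minimal sketch of the pagination idea; the helper name and the 1-based page numbering are assumptions, not the PR's actual implementation:

import java.util.Arrays;
import java.util.List;

public class PaginationSketch {
  // Returns the requested page of items; pages are 1-based in this sketch.
  static <T> List<T> page(List<T> items, int pageNum, int pageSize) {
    int from = Math.min((pageNum - 1) * pageSize, items.size());
    int to = Math.min(from + pageSize, items.size());
    return items.subList(from, to);
  }

  public static void main(String[] args) {
    List<String> volumes = Arrays.asList("/data1", "/data2", "/data3");
    // Second page with a page size of 2 prints [/data3].
    System.out.println(page(volumes, 2, 2));
  }
}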
What changes were proposed in this pull request?
Currently, we lack tools on the SCM side to track failed disks on DataNodes. DataNodes already report this information; we just need to display it.

In this PR, we display the failed disks on each DataNode. The information can be shown in JSON format or in the default format.
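A minimal sketch of the two output modes, assuming Jackson for the JSON path and a hypothetical FailedVolume shape (field names are illustrative, not the PR's schema):

import com.fasterxml.jackson.databind.ObjectMapper;

public class FailedVolumeOutputSketch {
  // Hypothetical shape of one failed-volume entry.
  public static class FailedVolume {
    public String datanodeUuid = "dn-uuid-1234";
    public String volumePath = "/data/hdds1";
    public long failureTime = 1700000000000L; // millis since epoch
  }

  public static void main(String[] args) throws Exception {
    FailedVolume v = new FailedVolume();
    boolean json = args.length > 0 && "--json".equals(args[0]);
    if (json) {
      // Machine-readable output, easy to post-process with jq.
      System.out.println(new ObjectMapper().writeValueAsString(v));
    } else {
      // Default human-readable output.
      System.out.printf("%s %s failureTime=%d%n",
          v.datanodeUuid, v.volumePath, v.failureTime);
    }
  }
}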
What is the link to the Apache JIRA
JIRA: HDDS-11463. Track and display failed DataNode storage locations in SCM.
How was this patch tested?
Added JUnit tests and tested in a test environment.