HDDS-10666. Replication Manager 2: Architecture and concepts #88

Open · wants to merge 4 commits into base: HDDS-9225-website-v2

Conversation

sodonnel commented Apr 8, 2024

What changes were proposed in this pull request?

Added a page related to Replication Manager V2 to the new docs website.

I have added it under Core Concepts. Some of it outlines design decisions, some covers internal workings, config parameters, etc. It may make sense to split it up into different sections, but it would be good to keep as much of the information as possible available.

What is the link to the Apache Jira?

https://issues.apache.org/jira/browse/HDDS-10666

How was this patch tested?

I ran the site locally and ensured the new page is working and the headings are correct.

github-actions bot added the website-v2 label Apr 8, 2024
sodonnel (Author) commented Apr 8, 2024

Unsure how to handle the spelling check, which is highlighting some words in code blocks and things like "mis-replicated" as incorrect. Is there any way to get it to ignore these words and code blocks?

errose28 self-requested a review April 8, 2024 16:38
errose28 (Contributor) commented Apr 8, 2024

Yeah the job documents how to fix false-positive spelling errors, but unfortunately this tip only shows up on the job summary page, which is not the first thing visible when you click on the failed job. I wish GitHub would improve this, but I can amend it to dump this message to stderr as well so it shows up when clicking on the failed job.

In this case BCSID and mis-replicated should be added to the global words config in cspell.yaml. I tried to get as many Ozone terms as I could think of into this list initially but I missed these ones.

mis and outofservice should probably be added using inline directives since they are specific to this page. Note that cspell splits on camel case, which is why ecMisReplicationCheckHandler flags Mis as the spelling error. Looks like ec is already defined in some dictionaries being used, according to `pnpm cspell trace ec`.
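
For reference, cspell's inline directives are just comments in the page source; a minimal sketch of what they'd look like in a markdown file (whether this site picks them up from in-page comments or front matter may differ):

```markdown
<!-- cspell:ignore outofservice -->
<!-- cspell:words mis -->
```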

errose28 commented Apr 8, 2024

Also, not sure if you've seen #83, but as one of the first people writing a page for the new site, it would be great if you could read through that quick start guide and see if there's anything else that would help there.

errose28 commented Apr 8, 2024

RE: spelling

I had some other changes saved on a local branch as well so I went ahead and pushed the global spelling updates needed to #89. We can merge that and you can rebase off of that for global settings. Then just add the front matter spelling directives for the relevant files in this change.

errose28 left a comment

Thanks for writing this up @sodonnel. Overall the explanations are nice and most comments are just minor things about formatting. I haven't finished reading all the content yet but I'm posting what I have so far.

> It may make sense to split it up into different sections, but it would be good to keep as much of the information as possible available.

Yes we should definitely split this into multiple sections. Here's what I'm thinking (with links to the spots on the staging site for reference):

  • Admin CLI
    I stubbed out some of the sections with the possibility of splitting them into multiple pages if needed once docs writing started. I think this is one of those sections. We can change the docs tree to look like this, and add the container report CLI to the Containers page:
Administrator Guide
| Operations
| | Observability
| | | CLI
| | | | Containers
| | | | Pipelines
| | | | Datanodes
| | | | ...
| | | | <page for each admin subcommand>

From the current CLI doc, I think we should move the explanations of container states to this System Internals page and the explanations of health states to Troubleshooting Storage Containers. We can link to both pages from this one. This centralizes the information since it is more general across Ozone and not specific to this CLI. For example, degraded containers show up in Recon too, and we don't want to duplicate the explanations of over/under/mis-replicated etc. on that page.

  • Replication Manager System Internals
    Anything mentioning Java classes or threads should go here. I would probably put task throttling here as well. We should try to keep the admin guide distilled down to the most common things admins would care about on their cluster, otherwise the system seems unmanageable. Advanced users can proceed to the System Internals section.

  • Configurations?
    If you look under the Admin Guide/Configuration section, it only has sections for the configs whose default values would likely need to be adjusted. I'm not sure that applies to any of the configs listed here. I think these sorts of configurations could be better documented by generating a table of all configs and descriptions from ozone-default.xml, similar to Hadoop, although we can parse our XML file into a markdown table and add it to the site as part of the build (see the sketch after this list). I can file a Jira to implement this for reference.

  • Future Ideas?
    I think these types of things are better suited for Jira issues where they can be tracked, categorized, implemented, and resolved. People will not think to look for issues or improvements in the website's developer guide.
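
For illustration, a build step along these lines could generate that table (a minimal sketch, assuming ozone-default.xml uses the standard Hadoop configuration layout of property/name/value/description elements):

```python
import xml.etree.ElementTree as ET

def configs_to_markdown(xml_path: str) -> str:
    """Render a Hadoop-style configuration XML file as a markdown table."""
    root = ET.parse(xml_path).getroot()
    lines = ["| Name | Default | Description |", "| --- | --- | --- |"]
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        # Collapse whitespace so multi-line descriptions fit in one table cell.
        desc = " ".join(prop.findtext("description", default="").split())
        lines.append(f"| `{name}` | `{value}` | {desc} |")
    return "\n".join(lines)

if __name__ == "__main__":
    print(configs_to_markdown("ozone-default.xml"))
```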

errose28 commented on the heading `## ReplicationManager Report`:

Suggested change
## ReplicationManager Report
## Replication Manager Report


## ReplicationManager Report

Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command “ozone admin container report”. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.
errose28 commented:
Suggested change
Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command ozone admin container report. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.
Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command `ozone admin container report`. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.


#### Quasi Closed
errose28 commented:
I'm not sure whether the plain text version of quasiClosed is "quasi closed" or "quasi-closed". I'm leaning towards "quasi-closed" though. We can adjust website configuration files to only allow whichever version we choose as correct.


#### Closing

Closing containers are in the process of being closed. They will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.
errose28 commented:
Suggested change
Closing containers are in the process of being closed. They will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.
Closing containers are in the process of being closed. Containers will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.


#### Closed

Closed containers have successfully transitioned from closing to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.
errose28 commented:
Suggested change
Closed containers have successfully transitioned from closing to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.
Closed containers have successfully transitioned from closing or quasi-closed to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.


#### Under Replicated

Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, or due to failed disks or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the Unhealthy section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.
errose28 commented:
Suggested change
Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, or due to failed disks or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the Unhealthy section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.
Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, failed disks, or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the [Unhealthy](#unhealthy) section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.

This should convert it to a link directly to the referenced heading. I haven't tried it rendered yet though.


#### Mis-Replicated

If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container placement policy, the container is Mis-Replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess.
errose28 commented:
Suggested change
If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container placement policy, the container is Mis-Replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess.
If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container's network topology placement policy, the container is mis-replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess. See [Configuring Network Topology](administrator-guide/configuration/performance/topology) for more information.

We can link to other pages on the site even if they are not complete yet. If the page gets moved or renamed in the meantime, this link will break and Docusaurus will fail the build.


#### Missing

A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. Eg, for a RS-6-3 container, having less than 6 replicas online would render it missing. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers from SCM and any related keys from OM, as the data will not be accessible.
errose28 commented:
Suggested change
A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. Eg, for a RS-6-3 container, having less than 6 replicas online would render it missing. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers from SCM and any related keys from OM, as the data will not be accessible.
A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. For example a container created with replication configuration `rs-6-3-1024` with less than 6 replicas online would render it missing. See [Erasure Coding](core-concepts/replication/erasure-coding) for more details. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers' keys from OM, as the data will not be accessible.
  • Since rs-6-3-1024 is the replication config used at the command line, I guess we can use that in the docs too. Or some other standard way to reference EC replication types.
  • Currently non-empty containers cannot be removed from SCM. The only option is to delete the affected keys from OM.
    • We can link to a section in the troubleshooting guide here when we have it, but there's not a relevant page for this right now.
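
For what it's worth, the missing check described above reduces to a simple comparison against the minimum replicas needed to read the data; a minimal illustrative sketch of that rule (not Ozone's actual implementation):

```python
def is_missing(replication_type: str, available_replicas: int, data_count: int = 1) -> bool:
    """A container is missing when too few replicas remain to read its data."""
    if replication_type == "RATIS":
        # Any single Ratis replica is enough to read the container.
        return available_replicas == 0
    if replication_type == "EC":
        # An EC container needs at least "data number" replicas, e.g. 6 for rs-6-3.
        return available_replicas < data_count
    raise ValueError(f"unknown replication type: {replication_type}")

# An rs-6-3 container with only 5 replicas online is missing:
assert is_missing("EC", available_replicas=5, data_count=6)
assert not is_missing("RATIS", available_replicas=1)
```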


#### Unhealthy Container Samples

To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the “ozone admin container info ID” command.
errose28 commented Apr 11, 2024:
Suggested change
To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the “ozone admin container info ID” command.
To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the `ozone admin container info` command. See [Troubleshooting Containers](troubleshooting/storage-containers) for more information.

Defaults are given in brackets after the parameter.

* `hdds.scm.replication.datanode.replication.limit` - (20) Total number of replication commands that can be queued on a Datanode. The limit is made up of `number_of_replication_commands + reconstruction_weight * number_of_reconstruction_commands`
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to multiple reconstruction commands before adding to the Datanode.replication.limit.
errose28 commented:
Suggested change
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to multiple reconstruction commands before adding to the Datanode.replication.limit.
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to each reconstruction command before adding it to `hdds.scm.replication.datanode.replication.limit`.

Capitalization enforcement is ignored in code blocks.
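
For illustration, with the defaults above a Datanode that has 5 queued replication commands and 4 queued reconstruction commands counts as 5 + 3 × 4 = 17 against the limit of 20, leaving headroom of 3: three more replication commands, or a single reconstruction command at weight 3 (if the limit is inclusive).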

errose28 added the docs label May 10, 2024