HDDS-10666. Replication Manager 2: Architecture and concepts #88

Open · wants to merge 4 commits into base: HDDS-9225-website-v2

Conversation

sodonnel commented Apr 8, 2024

What changes were proposed in this pull request?

Added a page related to Replication Manager V2 to the new docs website.

I have added it under Core Concepts. Some of it outlines design decisions, some covers internal workings, config parameters, etc. It may make sense to split it up into different sections, but it would be good to keep as much of the information as possible available.

What is the link to the Apache Jira?

https://issues.apache.org/jira/browse/HDDS-10666

How was this patch tested?

I ran the site locally and ensured the new page is working and the headings are correct.

github-actions bot added the website-v2 label Apr 8, 2024
sodonnel (Author) commented Apr 8, 2024

Unsure how to handle the spelling check, which is highlighting some words in code blocks and things like "mis-replicated" as incorrect. Is there any way to get it to ignore these words and code blocks?

errose28 self-requested a review April 8, 2024 16:38
errose28 (Contributor) commented Apr 8, 2024

Yeah the job documents how to fix false-positive spelling errors, but unfortunately this tip only shows up on the job summary page, which is not the first thing visible when you click on the failed job. I wish GitHub would improve this, but I can amend it to dump this message to stderr as well so it shows up when clicking on the failed job.

In this case BCSID and mis-replicated should be added to the global words config in cspell.yaml. I tried to get as many Ozone terms as I could think of into this list initially but I missed these ones.

mis and outofservice should probably be added using inline directives since they are specific to this page. Note that cspell splits on camel case, which is why ecMisReplicationCheckHandler flags Mis as the spelling error. Looks like ec is already defined in some dictionaries being used, according to `pnpm cspell trace ec`.
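
For reference, cspell's inline directives are just comments in the page source; a minimal sketch of what they'd look like in a markdown file (whether this site picks them up from in-page comments or front matter may differ):

```markdown
<!-- cspell:ignore outofservice -->
<!-- cspell:words mis -->
```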

errose28 commented Apr 8, 2024

Also, not sure if you've seen #83, but as one of the first people writing a page for the new site, it would be great if you could read through that quick start guide and see if there's anything else that would help there.

errose28 commented Apr 8, 2024

RE: spelling

I had some other changes saved on a local branch as well so I went ahead and pushed the global spelling updates needed to #89. We can merge that and you can rebase off of that for global settings. Then just add the front matter spelling directives for the relevant files in this change.

errose28 left a comment

Thanks for writing this up @sodonnel. Overall the explanations are nice and most comments are just minor things about formatting. I haven't finished reading all the content yet but I'm posting what I have so far.

> It may make sense to split it up into different sections, but it would be good to keep as much of the information as possible available.

Yes we should definitely split this into multiple sections. Here's what I'm thinking (with links to the spots on the staging site for reference):

  • Admin CLI
    I stubbed out some of the sections with the possibility of splitting them into multiple pages if needed once docs writing started. I think this is one of those sections. We can change the docs tree to look like this, and add the container report CLI to the Containers page:
Administrator Guide
| Operations
| | Observability
| | | CLI
| | | | Containers
| | | | Pipelines
| | | | Datanodes
| | | | ...
| | | | <page for each admin subcommand>

From the current CLI doc, I think we should move the explanations of container states to this System Internals page and the explanations of health states to Troubleshooting Storage Containers. We can link to both pages from this one. This centralizes the information since it is more general across Ozone and not specific to this CLI. For example, degraded containers show up in Recon too, and we don't want to duplicate the explanations of over/under/mis-replicated etc. on that page.

  • Replication Manager System Internals
    Anything mentioning Java classes or threads should go here. I would probably put task throttling here as well. We should try to keep the admin guide distilled down to the most common things admins would care about on their cluster, otherwise the system seems unmanageable. Advanced users can proceed to the System Internals section.

  • Configurations?
    If you look under the Admin Guide/Configuration section, it only has sections for the configs whose default values would likely need to be adjusted. I'm not sure that applies to any of the configs listed here. I think these sorts of configurations could be better documented by generating a table of all configs and descriptions from ozone-default.xml, similar to Hadoop, although we can parse our XML file into a markdown table and add it to the site as part of the build (see the sketch after this list). I can file a Jira to implement this for reference.

  • Future Ideas?
    I think these types of things are better suited for Jira issues where they can be tracked, categorized, implemented, and resolved. People will not think to look for issues or improvements in the website's developer guide.
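
For illustration, a build step along these lines could generate that table (a minimal sketch, assuming ozone-default.xml uses the standard Hadoop configuration layout of property/name/value/description elements):

```python
import xml.etree.ElementTree as ET

def configs_to_markdown(xml_path: str) -> str:
    """Render a Hadoop-style configuration XML file as a markdown table."""
    root = ET.parse(xml_path).getroot()
    lines = ["| Name | Default | Description |", "| --- | --- | --- |"]
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        # Collapse whitespace so multi-line descriptions fit in one table cell.
        desc = " ".join(prop.findtext("description", default="").split())
        lines.append(f"| `{name}` | `{value}` | {desc} |")
    return "\n".join(lines)

if __name__ == "__main__":
    print(configs_to_markdown("ozone-default.xml"))
```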

errose28 commented on the heading `## ReplicationManager Report`:

Suggested change
## ReplicationManager Report
## Replication Manager Report


## ReplicationManager Report

Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command “ozone admin container report”. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.
errose28 commented:
Suggested change
Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command ozone admin container report. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.
Each time the check phase of the Replication Manager runs, it generates a report which can be accessed via the command `ozone admin container report`. This report provides details of all containers in the cluster which have an issue to be corrected by the Replication Manager.


#### Quasi Closed
errose28 commented:
I'm not sure whether the plain text version of quasiClosed is "quasi closed" or "quasi-closed". I'm leaning towards "quasi-closed" though. We can adjust website configuration files to only allow whichever version we choose as correct.


#### Closing

Closing containers are in the process of being closed. They will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.
errose28 commented:
Suggested change
Closing containers are in the process of being closed. They will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.
Closing containers are in the process of being closed. Containers will transition to closing when they have enough data to be considered full, or there is a problem with the write pipeline, such as a Datanode going down.


#### Closed

Closed containers have successfully transitioned from closing to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.
errose28 commented:
Suggested change
Closed containers have successfully transitioned from closing to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.
Closed containers have successfully transitioned from closing or quasi-closed to closed. This is a normal state for containers to move to, and the majority of containers in the cluster should be in this state.


#### Under Replicated

Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, or due to failed disks or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the Unhealthy section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.
errose28 commented:
Suggested change
Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, or due to failed disks or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the Unhealthy section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.
Under-Replicated containers have less than the number of expected replicas. This could be caused by decommissioning or maintenance mode transitions on the Datanode, failed disks, or failed nodes within the cluster. Unhealthy replicas also make a container under-replicated, as they have a problem which must be corrected. See the [Unhealthy](#unhealthy) section below for more details on the unhealthy state. The Replication Manager will schedule commands to make additional copies of the under replicated containers.

This should convert it to a link directly to the referenced heading. I haven't tried it rendered yet though.


#### Mis-Replicated

If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container placement policy, the container is Mis-Replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess.
errose28 commented:
Suggested change
If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container placement policy, the container is Mis-Replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess.
If the container has the correct number of replicas, but they are not spread across sufficient racks to meet the requirements of the container's network topology placement policy, the container is mis-replicated. Again, Replication Manager will work to move replicas to additional racks by making new copies of the relevant replicas and then removing the excess. See [Configuring Network Topology](administrator-guide/configuration/performance/topology) for more information.

We can link to other pages on the site even if they are not complete yet. If the page gets moved or renamed in the meantime, this link will break and Docusaurus will fail the build.


#### Missing

A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. Eg, for a RS-6-3 container, having less than 6 replicas online would render it missing. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers from SCM and any related keys from OM, as the data will not be accessible.
errose28 commented:
Suggested change
A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. Eg, for a RS-6-3 container, having less than 6 replicas online would render it missing. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers from SCM and any related keys from OM, as the data will not be accessible.
A container is missing if there are not enough replicas available to read it. For a Ratis container, that would mean zero copies are online. For an EC container, it is marked as missing if less than “data number” of replicas are available. For example a container created with replication configuration `rs-6-3-1024` with less than 6 replicas online would render it missing. See [Erasure Coding](core-concepts/replication/erasure-coding) for more details. For missing containers, the Replication Manager cannot do anything to correct them. Manual intervention will be needed to bring lost nodes back into the cluster, or take steps to remove the containers' keys from OM, as the data will not be accessible.
  • Since rs-6-3-1024 is the replication config used at the command line, I guess we can use that in the docs too. Or some other standard way to reference EC replication types.
  • Currently non-empty containers cannot be removed from SCM. The only option is to delete the affected keys from OM.
    • We can link to a section in the troubleshooting guide here when we have it, but there's not a relevant page for this right now.
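
For what it's worth, the missing check described above reduces to a simple comparison against the minimum replicas needed to read the data; a minimal illustrative sketch of that rule (not Ozone's actual implementation):

```python
def is_missing(replication_type: str, available_replicas: int, data_count: int = 1) -> bool:
    """A container is missing when too few replicas remain to read its data."""
    if replication_type == "RATIS":
        # Any single Ratis replica is enough to read the container.
        return available_replicas == 0
    if replication_type == "EC":
        # An EC container needs at least "data number" replicas, e.g. 6 for rs-6-3.
        return available_replicas < data_count
    raise ValueError(f"unknown replication type: {replication_type}")

# An rs-6-3 container with only 5 replicas online is missing:
assert is_missing("EC", available_replicas=5, data_count=6)
assert not is_missing("RATIS", available_replicas=1)
```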


#### Unhealthy Container Samples

To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the “ozone admin container info ID” command.
errose28 commented Apr 11, 2024:
Suggested change
To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the “ozone admin container info ID” command.
To facilitate investigating problems with degraded containers, the report includes a sample of the first 100 container IDs in each state and includes them in the report. Given these IDs, it is possible to see if the same containers are continuously stuck, and also get more information about the container via the `ozone admin container info` command. See [Troubleshooting Containers](troubleshooting/storage-containers) for more information.

Defaults are given in brackets after the parameter.

* `hdds.scm.replication.datanode.replication.limit` - (20) Total number of replication commands that can be queued on a Datanode. The limit is made up of `number_of_replication_commands + reconstruction_weight * number_of_reconstruction_commands`
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to multiple reconstruction commands before adding to the Datanode.replication.limit.
errose28 commented:
Suggested change
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to multiple reconstruction commands before adding to the Datanode.replication.limit.
* `hdds.scm.replication.datanode.reconstruction.weight` - (3) The weight to apply to each reconstruction command before adding it to `hdds.scm.replication.datanode.replication.limit`.

Capitalization enforcement is ignored in code blocks.
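
For illustration, with the defaults above a Datanode that has 5 queued replication commands and 4 queued reconstruction commands counts as 5 + 3 × 4 = 17 against the limit of 20, leaving headroom of 3: three more replication commands, or a single reconstruction command at weight 3 (if the limit is inclusive).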

errose28 added the docs label May 10, 2024