Create new "health check" method #2809

mDuo13 · 2018-12-14T00:41:06Z

The server_info command is very cluttered, and in many cases it's hard to tell from its output whether the server is healthy. A simple "health check" method could make it easier to monitor rippled with industry-standard tooling and to diagnose problems manually.

Ideally:

- It would be an HTTP GET call, to maximize compatibility with common monitoring software. (Maybe using the same port as "crawl"?)
- It would report a health score ("Healthy", "Warning", or "Critical") depending on several factors:
  - Age of last validated ledger is <7s (healthy), <=20s (warning), or >20s (critical)
  - Amendment blocked (critical), or amendments unknown to the server currently have a majority in the network (warning)
  - Number of peers is >7 (healthy), <5 (warning unless configured w/ peer_private), or 0 (critical)
  - server_state is full/validating/proposing (healthy), syncing/tracking/connected (warning), or disconnected (critical)
  - Maybe a load_factor based warning, too? Not sure what thresholds to use there.
- If the health score is not "healthy", it reports which factor(s) are not healthy. If it's healthy, the word "HEALTHY" is the only thing in the response.

ximinez · 2018-12-14T14:36:51Z

Speaking as an engineer, not an end-user, I think that reporting the values of the individual factors along with the health score could be valuable.
Reasons:

- Some people just like to watch the blinkenlites, and seeing just these top factors may be interesting.
- Advanced users may want to monitor how healthy their node is. For example, if I'm running a hub, I would expect hundreds of peers, not just 7. It would be handy to get this information in one place so I can confirm the node considers itself healthy, while keeping an extra watch on the number of peers and raising a local alert if it drops below my own threshold. Yes, an advanced user can get this info from server_info, but including it here, when we also return it when there are problems, saves an RPC call.
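To make the hub-operator case concrete, here is a minimal sketch of a local alert layered on top of the health report. The report shape (a dict with hypothetical `status` and `peers` keys) and the `local_alerts` helper are assumptions for illustration only; nothing in this thread fixes the actual response format.

```python
def local_alerts(report, min_peers=100):
    """Return alert strings for a health report.

    `report` is a hypothetical dict such as {"status": "Healthy", "peers": 40}.
    `min_peers` is the operator's own threshold, stricter than the built-in one.
    """
    alerts = []
    status = report.get("status")
    if status != "Healthy":
        alerts.append(f"node reports {status}")
    peers = report.get("peers")
    # The node may consider itself healthy while still being below what
    # this particular operator expects for a hub.
    if peers is not None and peers < min_peers:
        alerts.append(f"peers {peers} below local threshold {min_peers}")
    return alerts
```

With a report that always carries the factor values, a single call is enough to check both the node's own verdict and the operator's stricter limit.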

MarkusTeufelberger · 2018-12-14T15:15:31Z

It might be useful to split this up into "liveness" and "readiness" (see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ for example). Also, I'm not convinced that a server with a constant 6-second lag is "healthy".
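One way the liveness/readiness split could look is sketched below. The function names, the choice of which states count as ready, and the `max_age_s` parameter are all assumptions made for illustration, not something agreed in this thread.

```python
# States the original proposal lists as healthy; treated here as "ready".
READY_STATES = {"full", "validating", "proposing"}


def is_live(server_state):
    """Liveness: the process answered at all and is not disconnected."""
    return server_state is not None and server_state != "disconnected"


def is_ready(server_state, ledger_age_s, max_age_s=7):
    """Readiness: fully synced and the last validated ledger is fresh.

    `max_age_s` is a hypothetical knob; an operator who does not accept a
    constant 6-second lag could lower it.
    """
    return server_state in READY_STATES and ledger_age_s < max_age_s
```

A syncing server would then pass the liveness probe (no restart needed) while failing readiness (no traffic routed to it yet).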

garay-daniel · 2018-12-14T16:05:53Z

Agree with Ed's point. I think this also pushes operators to understand what metrics are important to monitor.

If we make this compatible with crawl, then it would be easy for anyone to scrape this data to build dashboards and monitoring of the network, which would be great.

MarkusTeufelberger · 2018-12-14T16:53:55Z

If you want people to build dashboards, implement https://openmetrics.io/ instead of the current statsd-only metrics framework.

I currently collect metrics via server_status and a custom Prometheus exporter, so most/all of the data mentioned is already available anyway. It is just clunky to extract.

carlhua · 2020-03-23T13:54:30Z

Let's discuss this as part of 1.6. I added a tag to this issue. @mayurbhandary

* Gives a summary of the health of the node:
  Healthy, Warning, or Critical
* Last validated ledger age:
  <7s is Healthy,
  7s to 20s is Warning,
  >20s is Critical
* If amendment blocked, Critical
* Number of peers:
  >7 is Healthy,
  1 to 7 is Warning,
  0 is Critical
* server state:
  One of full, validating or proposing is Healthy,
  One of syncing, tracking or connected is Warning,
  All other states are Critical
* load factor:
  <=100 is Healthy,
  101 to 999 is Warning,
  >=1000 is Critical
* If not Healthy, info field contains data that is considered not Healthy.

Fixes: XRPLF#2809
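The thresholds in the commit summary above can be sketched as a small classifier. This is an illustrative Python rendering, not the rippled implementation: the `health_check` signature, the `Health` enum, and the key names in the returned `info` dict are assumptions.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


HEALTHY_STATES = {"full", "validating", "proposing"}
WARNING_STATES = {"syncing", "tracking", "connected"}


def health_check(ledger_age_s, amendment_blocked, peers, server_state, load_factor):
    """Return (overall Health, info), where info holds only unhealthy factors."""
    info = {}
    worst = Health.HEALTHY

    def flag(level, key, value):
        # Record the offending value and keep the most severe level seen.
        nonlocal worst
        info[key] = value
        worst = max(worst, level)

    if ledger_age_s >= 7:
        level = Health.CRITICAL if ledger_age_s > 20 else Health.WARNING
        flag(level, "validated_ledger", ledger_age_s)
    if amendment_blocked:
        flag(Health.CRITICAL, "amendment_blocked", True)
    if peers <= 7:
        level = Health.CRITICAL if peers == 0 else Health.WARNING
        flag(level, "peers", peers)
    if server_state not in HEALTHY_STATES:
        level = Health.WARNING if server_state in WARNING_STATES else Health.CRITICAL
        flag(level, "server_state", server_state)
    if load_factor > 100:
        level = Health.CRITICAL if load_factor >= 1000 else Health.WARNING
        flag(level, "load_factor", load_factor)
    return worst, info
```

The overall score is simply the worst individual factor, and a healthy node yields an empty `info`, matching the last bullet of the summary.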

mDuo13 added the API Change and Feature Request labels Dec 14, 2018

mDuo13 mentioned this issue Dec 14, 2018

Log message indicating successful sync #2802

Open

MarkusTeufelberger mentioned this issue Mar 19, 2020

Respond 200 on empty HTTP request to support GCE ingress health check #3308

Closed

carlhua assigned HowardHinnant Mar 23, 2020

HowardHinnant mentioned this issue Apr 17, 2020

Create health_check rpc #3365

Closed

HowardHinnant added the Passed label (passed code review & PR owner thinks it's ready to merge; perf sign-off may still be required) May 21, 2020

manojsdoshi closed this as completed in 0290d0b Jun 1, 2020
