
Create health_check rpc #3365

Closed (wants to merge 5 commits)

Conversation

@HowardHinnant (Contributor) commented Apr 17, 2020

  • Gives a summary of the health of the node:
    Healthy, Warning, or Critical

  • Last validated ledger age:
    <7s is Healthy,
    7s to 20s is Warning
    > 20s is Critical

  • If amendment blocked, Critical

  • Number of peers:
    > 7 is Healthy
    1 to 7 is Warning
    0 is Critical

  • server state:
    One of full, validating or proposing is Healthy
    One of syncing, tracking or connected is Warning
    All other states are Critical

  • load factor:
    <= 100 is Healthy
    101 to 999 is Warning
    >= 1000 is Critical

  • If not Healthy, the info field contains the data that is considered not Healthy.

Fixes: #2809

For reviews, I'm looking for more than code reviews. I need to know:

  1. Is this helpful at all?
  2. Are these the right metrics to be checking?
  3. Are the limits on the metrics set to the right values?

Access this just like server_info: health_check. I find watching it in a loop helpful:

watch -n .5 ../build/rippled -q  health_check
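For reference, here is a minimal sketch of how the thresholds above could be combined, assuming the server_info JSON and a set_health helper that only ever raises the severity (illustrative only, not the exact code in this PR):

auto const info = getServerInfo();
int const last_validated_ledger_age = info["validated_ledger"]["age"].asInt();
bool const amendment_blocked = info.isMember("amendment_blocked");
int const number_peers = info["peers"].asInt();

enum Health { healthy, warning, critical };
Health health = healthy;
auto const set_health = [&health](Health h) {
    if (h > health)  // only ever raise the severity
        health = h;
};

if (last_validated_ledger_age >= 7)
    set_health(last_validated_ledger_age > 20 ? critical : warning);
if (amendment_blocked)
    set_health(critical);
if (number_peers <= 7)
    set_health(number_peers == 0 ? critical : warning);
// ...same pattern for server_state and load_factor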


if (number_peers <= 7)
{
    ret[jss::info]["peers"] = number_peers;
Contributor:

I kind of think we should populate "info" with the data even when that data is healthy. Meaning always return the number of peers. Same for the rest of these metrics.

Contributor Author:

I thought about that, and we can do that. But it makes it more difficult to know what is wrong when the status isn't Healthy.

Collaborator:

I haven't actually played with the tool, but my intuition aligns with @HowardHinnant's. I think we want to present as little information as possible to most users. If it's Healthy that's really all they want to know. But, since there are folks like @cjcobb23 who always want all the information, we might consider a --verbose argument?

Contributor Author:

We kind of already have it, but it is called server_info. All of this information is collected from that command, and this command just filters that and spits it back out with attitude.

Contributor:

@HowardHinnant I guess you are right, that all of that information is in server_info. I'm fine then with leaving this command as you have implemented it here.

@scottschurr (Collaborator) commented Apr 17, 2020

Just looking at the description, I wonder whether the amount of time that the server has been running should be taken into account? A server that's been up for less than 60 seconds probably is not at full yet, but it could still be perfectly healthy. Maybe we return an Unknown result for the first 15 minutes (or so) of up time? Any other time-based heuristic might get too complicated and hard to explain.
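A sketch of what that heuristic might look like, layered on top of the other checks; the 15-minute cutoff and the extra "unknown" state are only the suggestion above, and I'm assuming server_info's uptime field (reported in seconds):

// Hypothetical: withhold a verdict while the server is still warming up.
if (info.isMember("uptime") && info["uptime"].asInt() < 15 * 60)
    health = unknown;  // would need a fourth state next to healthy/warning/critical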

@HowardHinnant (Contributor, Author):

I kind of expected server_state to have the value "syncing" during this time, but it says "connected" instead. In any event this state has the Warning status (as opposed to Healthy or Critical), which seems about right. But if the server_state said "syncing" that would be more intuitive. I can look into why it is not saying that.

@scottschurr (Collaborator):

One other thought is about the heuristics that are used. When possible it's nice if a health check identifies the possible cause of a problem. For example, if the server is amendment blocked the obvious answer is that the server code is too old.

We have (I think) seven typical sources of problems:

  1. The server code must be updated.
  2. Insufficient disk speed (do we distinguish between read and write here?)
  3. Insufficient RAM.
  4. Insufficient compute power (cores or core speed).
  5. Insufficient network bandwidth.
  6. A database type mismatch (e.g., NuDB on a spinning disk or RocksDB without enough RAM).
  7. There's some problem with the UNL.

There may be others? But I feel like those are the top seven.

It would be great if we could identify heuristics that provide guidance regarding which of those are the likely problems. Load factor and server state are great heuristics for general health, but I don't think either of them help to diagnose the source of the issue without drilling in further.

I don't personally know what to watch to make any of those diagnoses. But I think there are folks who do. Hopefully one of them will chime in and offer some guidance.

If we pick the right heuristics to watch, then @mDuo13 could publish a flow chart that takes the heuristics and produces a likely diagnosis.

@codecov-io:

Codecov Report

Merging #3365 into develop will decrease coverage by 0.07%.
The diff coverage is 0.00%.


@@             Coverage Diff             @@
##           develop    #3365      +/-   ##
===========================================
- Coverage    70.41%   70.34%   -0.08%     
===========================================
  Files          683      684       +1     
  Lines        54775    54832      +57     
===========================================
  Hits         38570    38570              
- Misses       16205    16262      +57     
Impacted Files Coverage Δ
src/ripple/app/main/Main.cpp 40.13% <ø> (ø)
src/ripple/net/impl/RPCCall.cpp 94.91% <ø> (ø)
src/ripple/rpc/handlers/HealthCheck.cpp 0.00% <0.00%> (ø)
src/ripple/rpc/impl/Handler.cpp 96.49% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 020b285...c608ad0.

@MarkusTeufelberger (Collaborator) commented Apr 18, 2020

As I already wrote in that ticket, please take a look at liveness and readiness probes: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

Having "healthy/warning/critical" states sounds an awful lot like Nagios/Icinga, not like anything modern... ;-)

If this returns a HTTP 200 with "critical", it will be quite a bit of extra work to add this as a health check e.g. in Docker or Kubernetes. If rippled isn't healthy, it should just send an error code in the 400 or 500 range instead of forcing users to parse text after querying.

Please don't try to be extra smart by looking at environment data in this program. For example, it is very common to let rippled run on a server that has hundreds of GB of RAM but to restrict its RAM usage via cgroups (aka "container limits"). Unless you correctly identify that this is the case, you're not able to know why you're suddenly getting OOM-killed even though there's a ton of RAM free from your program's point of view. Monitoring this stuff is not the job of rippled but of the administrators of the machine(s) it runs on, and I'd argue it is hardly possible to do system monitoring from a program that doesn't run as root in the first place.

Lastly: Peers, amendment blocks and server state are not really needed in your heuristic. There are perfectly fine reasons to only have a single or very few peers on a server that don't necessitate a warning. Amendment blocked servers won't make progress anyways, so this will be already caught by the ledger age. Same with server state.

Tl;dr:
A healthy server is synced (to within a few seconds of real-time) and not overloaded. If this isn't the case, it is not healthy (not "warning", "critical", "alert" or anything else - just "unhealthy") and it shouldn't answer with an HTTP success code on a health check call (this would be a "readiness probe" btw., a "liveness probe" could be the same call or just a call to server_info, to make sure the server still responds to RPC calls in general and isn't deadlocked/overloaded).
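(For illustration: if the endpoint did return a 4xx/5xx code when unhealthy, as suggested above, a readiness-style probe could reduce to a plain status-code check with curl's --fail flag against the peer-port URL that appears later in this thread:)

curl -fsk https://127.0.0.1:51235/health > /dev/null || echo unhealthy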

health = state;
};

if (last_validated_ledger_age >= 7)
Contributor:

I'm curious about the rationale behind these static values. After all, since a ledger currently closes every 2-4 seconds, doesn't 7 seconds seem a bit excessive / long to report as being healthy? Was this based on data collected from actual propagation times to read-only nodes and such?

Also, since close time is variable and changes on a ledger-by-ledger basis, and the algorithm itself is subject to change (see: #2958), does it make sense to have these hard-coded values in this rpc method? Furthermore, node operators leveraging hardware w/ different performance capabilities may want to map different performance levels to the healthy, warning, critical states. Thoughts on moving these static / hard-coded values to the config file or similar?

Contributor:

Remember that the validated ledger age is calculated from the close time, which is quantized and doesn't have second resolution. So you can't really look at the age and compare it to the ledger interarrival time. Even 7 seconds is likely to flap between "healthy" and "warning" for no reason, since at 4 seconds you will, on average, have two ledgers every "interval", so their close times will differ by precisely one second.

@movitto (Contributor) commented Apr 19, 2020

@scottschurr @MarkusTeufelberger an alternate suggestion would be to provide a separate tool that could be launched by an administrator upon the reporting of a 'non-healthy' health check to examine the underlying operational environment of the server and report inadequacies and/or suggested operational changes (e.g., 'insufficient disk speed' after running an I/O test, or 'insufficient memory for huge node, suggest changing to medium').

This would be in line with the unix philosophy of tools doing only one thing well, and could be pluggable / more modular to account for different hardware and runtime polling / allocation mechanisms available for different operating systems, etc.

Thoughts? Dev Null Productions could dedicate some cycles to helping make that happen with the support of ripple & the community.

@nbougalis (Contributor) left a comment

This is cool, but @MarkusTeufelberger is raising some interesting questions about health checks. We should investigate options before we merge this and end up with another public API endpoint that we have to support essentially in perpetuity.

if (info.isMember("validated_ledger"))
    last_validated_ledger_age = info["validated_ledger"]["age"].asInt();
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
Contributor:

It would be more useful to provide a warning if an amendment the server doesn't support has received majority.


@nbougalis (Contributor):

> @scottschurr @MarkusTeufelberger an alternate suggestion would be to provide a separate tool that could be launched by an administrator upon the reporting of a 'non-healthy' health check to examine the underlying operational environment of the server and report inadequacies and/or suggested operational changes (e.g., 'insufficient disk speed' after running an I/O test, or 'insufficient memory for huge node, suggest changing to medium').
>
> This would be in line with the unix philosophy of tools doing only one thing well, and could be pluggable / more modular to account for different hardware and runtime polling / allocation mechanisms available for different operating systems, etc.
>
> Thoughts? Dev Null Productions could dedicate some cycles to helping make that happen with the support of ripple & the community.

I think that having a sort of "troubleshooting" script that can be launched on demand and walks through a series of steps to attempt to divine what the issue is and offer suggestions on how to resolve the situation would be tremendously useful to node operators.

@MarkusTeufelberger (Collaborator):

A separate tool that helps with sizing a node correctly, benchmarking drives, etc. would certainly be useful for one-time debugging or initial checks, but I don't see how this would be directly related to a health check API. If benchmarks were launched automatically after a failed health check, you'd risk tipping the node over due to the additional load, so this must be a manual process anyways. An interactive "debugging/troubleshooting/benchmarking" script would surely be nice to have.

I'm against solutions, by the way, that use more than one API call for a single metric. If there's a very fast-moving problem, you might not be able to catch it by the time the second API call is issued (e.g. you get an "unhealthy" response, then run server_state, but by that time the server has already synced up in the time it took for the RPC call to be handled, so you don't see any error condition in there). This is not the case currently, just something to keep in mind in case you want to pursue the "let's find out what actually is going wrong here" route.

By the way: All the metrics you're suggesting could also just be exposed in a nagios plugin script that calls server_state (https://exchange.nagios.org/about), there's no need to integrate this into rippled if you want to go the "healthy/warn/alert" route and help people that alert in this style.

If you want to read more about monitoring, I recommend the chapter https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ from the SRE book, especially the "four golden signals" section.

@mDuo13 (Collaborator) left a comment

The original task mentions that the request should be available as an HTTP GET method (not POST as is normal for JSON-RPC). That would be a "special" case, like the peer crawler and /vl/ method. I notice that this method does not work that way. Normally I am in favor of reducing the number of special cases, but in this case the specialness is warranted.

I also think, as @MarkusTeufelberger pointed out, it would be a big improvement if the method returned a non-200 response when the status is not Healthy. I suggest 503 Service Unavailable as the general "it's not healthy" code.

I don't think the method should return details about the healthy stats. If you want to "watch the blinkenlights" as @ximinez put it, server_info has you covered. So in that regard, 👍 to the current behavior.

I don't think we should let the pursuit of perfection stop us from improving on this point, but I really do think making it available via the HTTP GET method and returning a non-200 status code should be hard requirements.
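A minimal sketch of that status mapping, assuming the handler fills in a Boost.Beast response like the msg object quoted in the diff hunks below; the 503 choice is just the suggestion above, not what the PR currently does:

// Illustrative only: derive the HTTP status from the computed health.
msg.result(
    health == healthy ? boost::beast::http::status::ok
                      : boost::beast::http::status::service_unavailable);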

@intelliot (Collaborator) left a comment

I agree with @mDuo13

@miguelportilla (Contributor):

@MarkusTeufelberger agreed.

@HowardHinnant (Contributor, Author):

This is ready to be re-reviewed per our conversation last week. This can be tested with curl like so:

curl -ki https://127.0.0.1:51235/health

@mDuo13 (Collaborator) commented May 14, 2020

I am using a relatively powerful home workstation, and my load_factor seems to bounce around from a baseline of ~130 when tracking/proposing, to momentarily ~1500 when handling RPC commands. I don't know how high the load factor can go, but maybe increasing those numbers a bit would work.

It would probably be better if the load-measuring system itself had a more stable number for use in cases like this, but that's probably beyond the scope of this task. But if we did, I would want maybe an average of the past 2s window? And the health check could just return critical if the uptime isn't enough to have a 2s load average.

@MarkusTeufelberger (Collaborator):

I'd leave the interpretation of the failed health checks up to the monitoring system rather than arbitrarily smooth (=invent fictional) data.

If there's a lot of single checks that fail even though the system is fine, then make sure to only alert if 2 or 3 checks fail in a row.

Maybe instead of only the load_factor, the ratio between load_factor and load_base should be considered? If load_base is 9999, a load_factor of 1500 is no problem; if it is 2, a factor of 1500 is a BIG problem!
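A sketch of that idea; note that load_base comes from server_state, so whether it is available to this handler is an assumption, and the 1x/10x cutoffs below are placeholders rather than agreed values:

// Hypothetical: judge load by its ratio to load_base rather than the raw number.
// (load_base comes from server_state; its availability here is an assumption.)
auto const load_ratio =
    info["load_factor"].asDouble() / info["load_base"].asDouble();
if (load_ratio > 1.0)
    set_health(load_ratio >= 10.0 ? critical : warning);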

@scottschurr (Collaborator) left a comment
I think this code does what it's intended to do. I left a number of thoughts for your consideration, but none of them are mandatory. You can see most of those thoughts implemented here: scottschurr@5989fc9

The bigger consideration, which I don't have a great answer for, is that none of this code is touched by any of our unit tests. Zero. Zilch.

That said, the unit test coverage in all of OverlayImpl.cpp is miserable. I suspect (but don't know) that's because it's hard to test. But if you can figure out a way to get some degree of coverage of the code you're adding I think that would be great.

Comment on lines +1092 to +1097
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
amendment_blocked = true;
int number_peers = info["peers"].asInt();
std::string server_state = info["server_state"].asString();
auto load_factor = info["load_factor"].asDouble();
Collaborator:
I think most of these local variables can be const.
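For instance, restating the suggestion against the hunk above (folding the amendment_blocked check into its initializer so it can be const too):

auto const info = getServerInfo();
bool const amendment_blocked = info.isMember("amendment_blocked");
int const number_peers = info["peers"].asInt();
std::string const server_state = info["server_state"].asString();
auto const load_factor = info["load_factor"].asDouble();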

auto info = getServerInfo();

int last_validated_ledger_age = std::numeric_limits<int>::max();
if (info.isMember("validated_ledger"))
Collaborator:
Since all of the string literals in this pull request are vaguely JSON-like, my inclination is to declare them in jss.h and refer to them from there.
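If I'm remembering the jss.h pattern correctly (treat the exact macro and names here as an assumption), that would look roughly like:

// In jss.h, where the JSS macro defines a Json::StaticString constant:
JSS(amendment_blocked);
JSS(validated_ledger);

// At the use site:
if (info.isMember(jss::amendment_blocked))
    amendment_blocked = true;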

set_health(critical);
}

if (amendment_blocked)
Collaborator:
When possible, I'd be inclined to declare the locals inside the scope of the if using if initializers.
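For example, a sketch of the C++17 if-initializer form applied to the amendment check from this hunk:

if (bool const amendment_blocked = info.isMember("amendment_blocked");
    amendment_blocked)
{
    set_health(critical);
}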


if (last_validated_ledger_age >= 7)
{
    msg.body()[jss::info]["validated_ledger"] = last_validated_ledger_age;
Collaborator:
Do we really want to set ["validated_ledger"] to std::numeric_limits<int>::max() if the server_info does not return a ["validated_ledger"]["age"]? That doesn't feel like the right thing to tell a user.

Contributor:
I would just use some sort of string here, such as "Server has no validated ledger".



@milonite:

How come this was implemented just as a peer method (https://xrpl.org/peer-port-methods.html)? It would be useful also as a normal JSON-RPC call to check the status of our node.

@mDuo13 (Collaborator) commented Oct 23, 2020

@milonite It's on the peer port so that you can use HTTP GET instead of making a POST request. That's supposed to make it easier, though I suppose in a way that makes it harder to query since you've got to bypass the self-signed TLS certificate (or configure your server with a proper TLS cert).

Labels: API Change, Documentation, Passed
Linked issue: Create new "health check" method