Create health_check rpc #3365
Conversation
if (number_peers <= 7)
{
    ret[jss::info]["peers"] = number_peers;
I kind of think we should populate "info" with the data even when that data is healthy. Meaning always return the number of peers. Same for the rest of these metrics.
I thought about that, and we can do that. But it makes it more difficult to know what is wrong when the status isn't Healthy.
I haven't actually played with the tool, but my intuition aligns with @HowardHinnant's. I think we want to present as little information as possible to most users. If it's Healthy that's really all they want to know. But, since there are folks like @cjcobb23 who always want all the information, we might consider a --verbose argument?
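For what it's worth, a rough sketch of how a --verbose flag might interact with the "only report unhealthy data" behavior (hypothetical; verbose and the surrounding variables are assumptions, not code from this PR):

    // Record the metric unconditionally when verbose is requested,
    // otherwise only when it falls outside the Healthy range.
    if (verbose || number_peers <= 7)
        ret[jss::info]["peers"] = number_peers;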
We kind of already have it, but it is called server_info. All of this information is collected from that command, and this command just filters that and spits it back out with attitude.
@HowardHinnant I guess you are right, that all of that information is in server_info. I'm fine then with leaving this command as you have implemented it here.
Just looking at the description, I wonder whether the amount of time that the server has been running should be taken into account? A server that's been up for less than 60 seconds probably is not at …
I kind of expected …
One other thought is about the heuristics that are used. When possible it's nice if a health check identifies the possible cause of a problem. For example, if the server is amendment blocked, the obvious answer is that the server code is too old. We have (I think) seven typical sources of problems: …

There may be others? But I feel like those are the top seven. It would be great if we could identify heuristics that provide guidance regarding which of those are the likely problems. Load factor and server state are great heuristics for general health, but I don't think either of them helps to diagnose the source of the issue without drilling in further. I don't personally know what to watch to make any of those diagnoses. But I think there are folks who do. Hopefully one of them will chime in and offer some guidance. If we pick the right heuristics to watch, then @mDuo13 could publish a flow chart that takes the heuristics and produces a likely diagnosis.
Codecov Report
@@ Coverage Diff @@
## develop #3365 +/- ##
===========================================
- Coverage 70.41% 70.34% -0.08%
===========================================
Files 683 684 +1
Lines 54775 54832 +57
===========================================
Hits 38570 38570
- Misses 16205 16262 +57
Continue to review full report at Codecov.
As I already wrote in that ticket, please take a look at liveness and readiness probes: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes Having "healthy/warning/critical" states sounds an awful lot like Nagios/Icinga, not like anything modern... ;-)

If this returns an HTTP 200 with "critical", it will be quite a bit of extra work to add this as a health check e.g. in Docker or Kubernetes. If …

Please don't try to be extra smart by looking at environment data in this program. For example, it would be very common to let …

Lastly: Peers, amendment blocks, and server state are not really needed in your heuristic. There are perfectly fine reasons to only have a single or very few peers on a server that don't necessitate a warning. Amendment-blocked servers won't make progress anyway, so this will already be caught by the ledger age. Same with server state.

Tl;dr: …
health = state;
};

if (last_validated_ledger_age >= 7)
I'm curious as to the rationale that went into these static values; after all, since a ledger currently closes every 2-4 seconds, doesn't 7 seconds seem a bit excessive / long to report as being healthy? Was this based on data collected from actual propagation times to read-only nodes and such?

Also, since close time is variable and changes on a ledger-by-ledger basis, and the algorithm itself is subject to change (see: #2958), does it make sense to have these hard-coded values in this rpc method? Furthermore, node operators leveraging hardware w/ different performance capabilities may want to map different performance levels to the healthy, warning, and critical states. Thoughts on moving these static / hard-coded values to the config file or similar?
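A sketch of the configurable-thresholds idea (purely hypothetical; the [health_check] section and key names below are invented for illustration and do not exist in rippled):

    #include <map>
    #include <string>

    // Thresholds defaulted to the PR's hard-coded values, overridable from a
    // hypothetical [health_check] config section parsed into a key/value map.
    struct HealthThresholds
    {
        int ledgerAgeWarnSeconds = 7;
        int ledgerAgeCriticalSeconds = 20;
        int minHealthyPeers = 8;
    };

    inline HealthThresholds
    loadThresholds(std::map<std::string, std::string> const& section)
    {
        HealthThresholds t;
        if (auto it = section.find("ledger_age_warn"); it != section.end())
            t.ledgerAgeWarnSeconds = std::stoi(it->second);
        if (auto it = section.find("ledger_age_critical"); it != section.end())
            t.ledgerAgeCriticalSeconds = std::stoi(it->second);
        if (auto it = section.find("min_healthy_peers"); it != section.end())
            t.minHealthyPeers = std::stoi(it->second);
        return t;
    }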
Remember that the validated ledger age is calculated from the close time, which is quantized and doesn't have second resolution. So you can't really look at the age and compare it to the ledger interarrival time. Even 7 seconds is likely to flap between "healthy" and "warning" for no reason, since at 4 seconds you will, on average, have two ledgers every "interval", so their close times will differ by precisely one second.
@scottschurr @MarkusTeufelberger An alternate suggestion would be to provide a separate tool that could be launched by an administrator upon the reporting of a 'non-healthy' health check to examine the underlying operational environment of the server and report inadequacies and/or suggested operational changes (eg, 'insufficient disk speed' after running an I/O test, or 'insufficient memory for huge node, suggest changing to medium'). This would be in line with the unix philosophy of tools doing only one thing well, and could be pluggable / more modular to account for different hardware and runtime polling / allocation mechanisms available for different operating systems, etc. Thoughts? Dev Null Productions could dedicate some cycles to helping make that happen with the support of Ripple & the community.
This is cool, but @MarkusTeufelberger is raising some interesting questions about health checks. We should investigate options before we merge this and end up with another public API endpoint that we have to support essentially in perpetuity.
if (info.isMember("validated_ledger"))
    last_validated_ledger_age = info["validated_ledger"]["age"].asInt();
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
It would be more useful to provide a warning if an amendment the server doesn't support has received majority.
I think that having a sort of "troubleshooting" script that can be launched on demand and walks through a series of steps to attempt to divine what the issue is and offer suggestions on how to resolve the situation would be tremendously useful to node operators.
A separate tool that helps with sizing a node correctly and benchmarking drives etc. would certainly be useful for one-time debugging or initial checks, but I don't see how this would be directly related to a health check API. If benchmarks were automatically launched after a failed health check, you would risk tipping the node over due to the additional load, so this must be a manual process anyway. An interactive "debugging/troubleshooting/benchmarking" script would surely be nice to have.

I'm against solutions, by the way, that use more than one API call for a single metric. If there's a very fast moving problem, you might not be able to catch it by the time the second API call is issued (e.g. you're getting an "unhealthy" response, then run …).

By the way: All the metrics you're suggesting could also just be exposed in a nagios plugin script that calls …

If you want to read more about monitoring, I recommend the chapter https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ from the SRE book, especially the "four golden signals" section.
The original task mentions that the request should be available as an HTTP GET method (not POST as is normal for JSON-RPC). That would be a "special" case, like the peer crawler and /vl/ method. I notice that this method does not work that way. Normally I am in favor of reducing the number of special cases, but in this case the specialness is warranted.

I also think, as @MarkusTeufelberger pointed out, it would be a big improvement if the method returned a non-200 response when the status is not Healthy. I suggest 503 Service Unavailable as the general "it's not healthy" code.

I don't think the method should return details about the healthy stats. If you want to "watch the blinkenlites" as @ximinez put it, server_info has you covered. So in that regard, 👍 to the current behavior.

I don't think we should let the pursuit of perfection stop us from improving on this point, but I really do think making it available via an HTTP GET method and returning a non-200 status code when unhealthy should be hard requirements.
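A minimal sketch of the status-code mapping being suggested (illustrative only; the enum and function names are assumptions, not the PR's code):

    #include <boost/beast/http/status.hpp>

    // Mirrors the states discussed in this PR.
    enum Health { healthy, warning, critical };

    // Anything that is not Healthy gets a non-200 code; 503 Service Unavailable
    // is the general "it's not healthy" code suggested above.
    inline boost::beast::http::status
    httpStatusFor(Health h)
    {
        return h == healthy ? boost::beast::http::status::ok
                            : boost::beast::http::status::service_unavailable;
    }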
I agree with @mDuo13
* Gives a summary of the health of the node: Healthy, Warning, or Critical
* Last validated ledger age: <7s is Healthy, 7s to 20s is Warning, >20s is Critical
* If amendment blocked, Critical
* Number of peers: >7 is Healthy, 1 to 7 is Warning, 0 is Critical
* server state: one of full, validating or proposing is Healthy; one of syncing, tracking or connected is Warning; all other states are Critical
* load factor: <=100 is Healthy, 101 to 999 is Warning, >=1000 is Critical
* If not Healthy, info field contains data that is considered not Healthy.

Fixes: XRPLF#2809
@MarkusTeufelberger agreed.
This is ready to be re-reviewed per our conversation last week. This can be tested with curl like so: …
I am using a relatively powerful home workstation, and my load_factor seems to bounce around from a baseline of ~130 when tracking/proposing, to momentarily ~1500 when handling RPC commands. I don't know how high the load factor can go, but maybe increasing those numbers a bit would work. It would probably be better if the load-measuring system itself had a more stable number for use in cases like this, but that's probably beyond the scope of this task. But if we did, I would want maybe an average of the past 2s window? And the health check could just return critical if the uptime isn't enough to have a 2s load average.
I'd leave the interpretation of the failed health checks up to the monitoring system rather than arbitrarily smooth (= invent fictional) data. If there are a lot of single checks that fail even though the system is fine, then make sure to only alert if 2 or 3 checks fail in a row. Maybe instead of only the load_factor, the ratio between load_factor and load_base should be considered? If load_base were 9999, a load_factor of 1500 is no problem; if it is 2, a factor of 1500 is a BIG problem!
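To make the load_factor/load_base idea concrete, a small sketch (illustrative only; the cutoffs are made up for demonstration and are not proposed thresholds):

    #include <string>

    // A ratio near 1 means the server is at its baseline cost:
    // 1500/9999 is well under baseline, 1500/2 is massively over it.
    inline std::string
    classifyLoadRatio(double load_factor, double load_base)
    {
        double const ratio = load_base > 0 ? load_factor / load_base : 1.0;
        if (ratio <= 1.0)
            return "Healthy";
        if (ratio < 10.0)
            return "Warning";
        return "Critical";
    }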
I think this code does what it's intended to do. I left a number of thoughts for your consideration, but none of them are mandatory. You can see most of those thoughts implemented here: scottschurr@5989fc9
The bigger consideration, which I don't have a great answer for, is that none of this code is touched by any of our unit tests. Zero. Zilch.
That said, the unit test coverage in all of OverlayImpl.cpp is miserable. I suspect (but don't know) that's because it's hard to test. But if you can figure out a way to get some degree of coverage of the code you're adding I think that would be great.
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
    amendment_blocked = true;
int number_peers = info["peers"].asInt();
std::string server_state = info["server_state"].asString();
auto load_factor = info["load_factor"].asDouble();
I think most of these local variables can be const.
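For reference, a sketch of that suggestion applied to the lines quoted above (illustrative, not code from the PR; info is the server_info Json::Value used in the quoted lines):

    // isMember() already yields the bool we want, so no mutation is needed.
    bool const amendment_blocked = info.isMember("amendment_blocked");
    int const number_peers = info["peers"].asInt();
    std::string const server_state = info["server_state"].asString();
    auto const load_factor = info["load_factor"].asDouble();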
auto info = getServerInfo();

int last_validated_ledger_age = std::numeric_limits<int>::max();
if (info.isMember("validated_ledger"))
Since all of the string literals in the pull request are vaguely JSON-like, my inclination is to declare them in jss.h and refer to them from there.
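A sketch of that suggestion (illustrative; it assumes the existing JSS macro in jss.h, and some of these names may already be declared there):

    // In jss.h, JSS(x) declares a Json::StaticString named x.
    JSS(amendment_blocked);
    JSS(validated_ledger);
    JSS(server_state);
    JSS(load_factor);

    // Call sites then refer to the declared constants, e.g.:
    //   if (info.isMember(jss::amendment_blocked)) ...
    //   std::string const server_state = info[jss::server_state].asString();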
set_health(critical);
}

if (amendment_blocked)
When possible, I'd be inclined to declare the locals inside the scope of the if using if initializers.
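For example, applied to the peer-count check from this diff (a sketch only; info, ret, jss::info, and set_health come from the PR's surrounding code):

    // C++17 if-initializer keeps number_peers scoped to the check that uses it.
    if (int const number_peers = info["peers"].asInt(); number_peers <= 7)
    {
        ret[jss::info]["peers"] = number_peers;
        set_health(number_peers == 0 ? critical : warning);
    }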
if (last_validated_ledger_age >= 7)
{
    msg.body()[jss::info]["validated_ledger"] = last_validated_ledger_age;
Do we really want to set the ["validated_ledger"] to std::numeric_limits<int>::max() if the server_info does not return a ["validated_ledger"]["age"]? That doesn't feel like the right thing to tell a user.
I would just use some sort of string here, such as "Server has no validated ledger".
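A sketch of what that could look like (illustrative; msg, set_health, and the field names follow the diff above):

    if (last_validated_ledger_age == std::numeric_limits<int>::max())
    {
        // server_info reported no validated ledger at all.
        msg.body()[jss::info]["validated_ledger"] =
            "Server has no validated ledger";
        set_health(critical);
    }
    else if (last_validated_ledger_age >= 7)
    {
        msg.body()[jss::info]["validated_ledger"] = last_validated_ledger_age;
        set_health(last_validated_ledger_age > 20 ? critical : warning);
    }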
How come this was implemented just as a peer method (https://xrpl.org/peer-port-methods.html)? It would be useful also as a normal JSON-RPC call to check the status of our node.
@milonite It's on the peer port so that you can use HTTP GET instead of making a POST request. That's supposed to make it easier, though I suppose in a way that makes it harder to query since you've got to bypass the self-signed TLS certificate (or configure your server with a proper TLS cert).
Gives a summary of the health of the node:
Healthy, Warning, or Critical
(A rough sketch of this classification logic follows the list below.)
Last validated ledger age:
<7s is Healthy,
7s to 20s is Warning,
>20s is Critical
If amendment blocked, Critical
Number of peers:
>7 is Healthy,
1 to 7 is Warning,
0 is Critical
server state:
One of full, validating or proposing is Healthy;
one of syncing, tracking or connected is Warning;
all other states are Critical
load factor:
<=100 is Healthy,
101 to 999 is Warning,
>=1000 is Critical
If not Healthy, the info field contains the data that is considered not
Healthy.
Fixes: #2809
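A minimal, self-contained sketch of the thresholds listed above (illustrative only, not the PR's code; the struct, enum, and function names are hypothetical, and the field values mirror server_info output):

    #include <algorithm>
    #include <limits>
    #include <string>

    enum class Health { Healthy, Warning, Critical };

    struct Metrics
    {
        int ledgerAgeSeconds = std::numeric_limits<int>::max();
        bool amendmentBlocked = false;
        int peers = 0;
        std::string serverState;
        double loadFactor = 0.0;
    };

    // Returns the worst severity across the heuristics in the list above.
    inline Health
    classify(Metrics const& m)
    {
        auto worst = Health::Healthy;
        auto raise = [&worst](Health h) { worst = std::max(worst, h); };

        if (m.ledgerAgeSeconds > 20)
            raise(Health::Critical);
        else if (m.ledgerAgeSeconds >= 7)
            raise(Health::Warning);

        if (m.amendmentBlocked)
            raise(Health::Critical);

        if (m.peers == 0)
            raise(Health::Critical);
        else if (m.peers <= 7)
            raise(Health::Warning);

        bool const healthyState = m.serverState == "full" ||
            m.serverState == "validating" || m.serverState == "proposing";
        bool const warningState = m.serverState == "syncing" ||
            m.serverState == "tracking" || m.serverState == "connected";
        if (!healthyState)
            raise(warningState ? Health::Warning : Health::Critical);

        if (m.loadFactor >= 1000)
            raise(Health::Critical);
        else if (m.loadFactor > 100)
            raise(Health::Warning);

        return worst;
    }

The enumerators are declared in order of increasing severity, which is what lets std::max pick the worst state seen so far.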
For reviews, I'm looking for more than code reviews. I need to know: …

Access this just like server_info: health_check. I find watching it with a loop is helpful: …