
Create health_check rpc #3365

Closed (wants to merge 5 commits)

Conversation

@HowardHinnant (Contributor) commented Apr 17, 2020

  • Gives a summary of the health of the node:
    Healthy, Warning, or Critical

  • Last validated ledger age:
    <7s is Healthy,
    7s to 20s is Warning
    > 20s is Critical

  • If amendment blocked, Critical

  • Number of peers:
    > 7 is Healthy
    1 to 7 is Warning
    0 is Critical

  • server state:
    One of full, validating or proposing is Healthy
    One of syncing, tracking or connected is Warning
    All other states are Critical

  • load factor:
    <= 100 is Healthy
    101 to 999 is Warning
    >= 1000 is Critical

  • If not Healthy, the info field contains the data that is considered not Healthy.

Fixes: #2809

For reviews, I'm looking for more than code reviews. I need to know:

  1. Is this helpful at all?
  2. Are these the right metrics to be checking?
  3. Are the limits on the metrics set to the right values?

Access this just like server_info: health_check. I find watching it in a loop helpful:

watch -n .5 ../build/rippled -q  health_check
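For reference, here is a minimal sketch of how the thresholds above could be combined, assuming the server_info JSON and a set_health helper that only ever raises the severity (illustrative only, not the exact code in this PR):

auto const info = getServerInfo();
int const last_validated_ledger_age = info["validated_ledger"]["age"].asInt();
bool const amendment_blocked = info.isMember("amendment_blocked");
int const number_peers = info["peers"].asInt();

enum Health { healthy, warning, critical };
Health health = healthy;
auto const set_health = [&health](Health h) {
    if (h > health)  // only ever raise the severity
        health = h;
};

if (last_validated_ledger_age >= 7)
    set_health(last_validated_ledger_age > 20 ? critical : warning);
if (amendment_blocked)
    set_health(critical);
if (number_peers <= 7)
    set_health(number_peers == 0 ? critical : warning);
// ...same pattern for server_state and load_factor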


if (number_peers <= 7)
{
    ret[jss::info]["peers"] = number_peers;
Contributor:

I kind of think we should populate "info" with the data even when that data is healthy. Meaning always return the number of peers. Same for the rest of these metrics.

Contributor Author:

I thought about that, and we can do that. But it makes it more difficult to know what is wrong when the status isn't Healthy.

Collaborator:

I haven't actually played with the tool, but my intuition aligns with @HowardHinnant's. I think we want to present as little information as possible to most users. If it's Healthy that's really all they want to know. But, since there are folks like @cjcobb23 who always want all the information, we might consider a --verbose argument?

Contributor Author:

We kind of already have it, but it is called server_info. All of this information is collected from that command, and this command just filters that and spits it back out with attitude.

Contributor:

@HowardHinnant I guess you are right, that all of that information is in server_info. I'm fine then with leaving this command as you have implemented it here.

@scottschurr (Collaborator) commented Apr 17, 2020

Just looking at the description, I wonder whether the amount of time that the server has been running should be taken into account? A server that's been up for less than 60 seconds probably is not at full yet, but it could still be perfectly healthy. Maybe we return an Unknown result for the first 15 minutes (or so) of up time? Any other time-based heuristic might get too complicated and hard to explain.
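A sketch of what that heuristic might look like, layered on top of the other checks; the 15-minute cutoff and the extra "unknown" state are only the suggestion above, and I'm assuming server_info's uptime field (reported in seconds):

// Hypothetical: withhold a verdict while the server is still warming up.
if (info.isMember("uptime") && info["uptime"].asInt() < 15 * 60)
    health = unknown;  // would need a fourth state next to healthy/warning/critical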

@HowardHinnant (Contributor, Author):

I kind of expected server_state to have the value "syncing" during this time, but it says "connected" instead. In any event this state has the Warning status (as opposed to Healthy or Critical), which seems about right. But if the server_state said "syncing" that would be more intuitive. I can look into why it is not saying that.

@scottschurr (Collaborator):

One other thought is about the heuristics that are used. When possible it's nice if a health check identifies the possible cause of a problem. For example, if the server is amendment blocked the obvious answer is that the server code is too old.

We have (I think) seven typical sources of problems:

  1. The server code must be updated.
  2. Insufficient disk speed (do we distinguish between read and write here?)
  3. Insufficient RAM.
  4. Insufficient compute power (cores or core speed).
  5. Insufficient network bandwidth.
  6. A database type mismatch (e.g., NuDB on a spinning disk or RocksDB without enough RAM).
  7. There's some problem with the UNL.

There may be others? But I feel like those are the top seven.

It would be great if we could identify heuristics that provide guidance regarding which of those are the likely problems. Load factor and server state are great heuristics for general health, but I don't think either of them help to diagnose the source of the issue without drilling in further.

I don't personally know what to watch to make any of those diagnoses. But I think there are folks who do. Hopefully one of them will chime in and offer some guidance.

If we pick the right heuristics to watch, then @mDuo13 could publish a flow chart that takes the heuristics and produces a likely diagnosis.

@codecov-io:

Codecov Report

Merging #3365 into develop will decrease coverage by 0.07%.
The diff coverage is 0.00%.


@@             Coverage Diff             @@
##           develop    #3365      +/-   ##
===========================================
- Coverage    70.41%   70.34%   -0.08%     
===========================================
  Files          683      684       +1     
  Lines        54775    54832      +57     
===========================================
  Hits         38570    38570              
- Misses       16205    16262      +57     
Impacted Files Coverage Δ
src/ripple/app/main/Main.cpp 40.13% <ø> (ø)
src/ripple/net/impl/RPCCall.cpp 94.91% <ø> (ø)
src/ripple/rpc/handlers/HealthCheck.cpp 0.00% <0.00%> (ø)
src/ripple/rpc/impl/Handler.cpp 96.49% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 020b285...c608ad0.

@MarkusTeufelberger (Collaborator) commented Apr 18, 2020

As I already wrote in that ticket, please take a look at liveness and readiness probes: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

Having "healthy/warning/critical" states sounds an awful lot like Nagios/Icinga, not like anything modern... ;-)

If this returns a HTTP 200 with "critical", it will be quite a bit of extra work to add this as a health check e.g. in Docker or Kubernetes. If rippled isn't healthy, it should just send an error code in the 400 or 500 range instead of forcing users to parse text after querying.

Please don't try to be extra smart by looking at environment data in this program. For example, it is very common to let rippled run on a server that has hundreds of GB of RAM but to restrict its RAM usage via cgroups (aka "container limits"). Unless you correctly identify that this is the case, you're not able to know why you're suddenly getting OOM-killed even though there's a ton of RAM free from your program's point of view. Monitoring this stuff is not the job of rippled but of the administrators of the machine(s) it runs on, and I'd argue it is hardly possible to do system monitoring from a program that doesn't run as root in the first place.

Lastly: Peers, amendment blocks and server state are not really needed in your heuristic. There are perfectly fine reasons to only have a single or very few peers on a server that don't necessitate a warning. Amendment blocked servers won't make progress anyways, so this will be already caught by the ledger age. Same with server state.

Tl;dr:
A healthy server is synced (to within a few seconds of real-time) and not overloaded. If this isn't the case, it is not healthy (not "warning", "critical", "alert" or anything else - just "unhealthy") and it shouldn't answer with an HTTP success code on a health check call (this would be a "readiness probe" btw., a "liveness probe" could be the same call or just a call to server_info, to make sure the server still responds to RPC calls in general and isn't deadlocked/overloaded).
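(For illustration: if the endpoint did return a 4xx/5xx code when unhealthy, as suggested above, a readiness-style probe could reduce to a plain status-code check with curl's --fail flag against the peer-port URL that appears later in this thread:)

curl -fsk https://127.0.0.1:51235/health > /dev/null || echo unhealthy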

health = state;
};

if (last_validated_ledger_age >= 7)
Contributor:

I'm curious about the rationale behind these static values. After all, since a ledger currently closes every 2-4 seconds, doesn't 7 seconds seem a bit excessive / long to report as being healthy? Was this based on data collected from actual propagation times to read-only nodes and such?

Also, since close time is variable and changes on a ledger-by-ledger basis, and the algorithm itself is subject to change (see: #2958), does it make sense to have these hard-coded values in this rpc method? Furthermore, node operators leveraging hardware w/ different performance capabilities may want to map different performance levels to the healthy, warning, critical states. Thoughts on moving these static / hard-coded values to the config file or similar?

Contributor:

Remember that the validated ledger age is calculated from the close time, which is quantized and doesn't have second resolution. So you can't really look at the age and compare it to the ledger interarrival time. Even 7 seconds is likely to flap between "healthy" and "warning" for no reason, since at 4 seconds you will, on average, have two ledgers every "interval", so their close times will differ by precisely one second.

@movitto (Contributor) commented Apr 19, 2020

@scottschurr @MarkusTeufelberger an alternate suggestion would be to provide a separate tool that could be launched by an administrator upon the reporting of a 'non-healthy' health check to examine the underlying operational environment of the server and report inadequacies and/or suggested operational changes (e.g., 'insufficient disk speed' after running an I/O test, or 'insufficient memory for huge node, suggest changing to medium').

This would be in line with the unix philosophy of tools doing only one thing well, and could be pluggable / more modular to account for different hardware and runtime polling / allocation mechanisms available for different operating systems, etc.

Thoughts? Dev Null Productions could dedicate some cycles to helping make that happen with the support of ripple & the community.

@nbougalis (Contributor) left a comment

This is cool, but @MarkusTeufelberger is raising some interesting questions about health checks. We should investigate options before we merge this and end up with another public API endpoint that we have to support essentially in perpetuity.

if (info.isMember("validated_ledger"))
    last_validated_ledger_age = info["validated_ledger"]["age"].asInt();
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
Contributor:

It would be more useful to provide a warning if an amendment the server doesn't support has received majority.


@nbougalis (Contributor):

> @scottschurr @MarkusTeufelberger an alternate suggestion would be to provide a separate tool that could be launched by an administrator upon the reporting of a 'non-healthy' health check to examine the underlying operational environment of the server and report inadequacies and/or suggested operational changes (e.g., 'insufficient disk speed' after running an I/O test, or 'insufficient memory for huge node, suggest changing to medium').
>
> This would be in line with the unix philosophy of tools doing only one thing well, and could be pluggable / more modular to account for different hardware and runtime polling / allocation mechanisms available for different operating systems, etc.
>
> Thoughts? Dev Null Productions could dedicate some cycles to helping make that happen with the support of ripple & the community.

I think that having a sort of "troubleshooting" script that can be launched on demand and walks through a series of steps to attempt to divine what the issue is and offer suggestions on how to resolve the situation would be tremendously useful to node operators.

@MarkusTeufelberger (Collaborator):

A separate tool that helps with sizing a node correctly, benchmarking drives, etc. would certainly be useful for one-time debugging or initial checks, but I don't see how this would be directly related to a health check API. If benchmarks were launched automatically after a failed health check, you'd risk tipping the node over due to the additional load, so this must be a manual process anyways. An interactive "debugging/troubleshooting/benchmarking" script would surely be nice to have.

I'm against solutions, by the way, that use more than one API call for a single metric. If there's a very fast-moving problem, you might not be able to catch it by the time the second API call is issued (e.g. you get an "unhealthy" response, then run server_state, but by that time the server has already synced up in the time it took for the RPC call to be handled, so you don't see any error condition in there). This is not the case currently, just something to keep in mind in case you want to pursue the "let's find out what actually is going wrong here" route.

By the way: All the metrics you're suggesting could also just be exposed in a nagios plugin script that calls server_state (https://exchange.nagios.org/about), there's no need to integrate this into rippled if you want to go the "healthy/warn/alert" route and help people that alert in this style.

If you want to read more about monitoring, I recommend the chapter https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/ from the SRE book, especially the "four golden signals" section.

@mDuo13 (Collaborator) left a comment

The original task mentions that the request should be available as an HTTP GET method (not POST as is normal for JSON-RPC). That would be a "special" case, like the peer crawler and /vl/ method. I notice that this method does not work that way. Normally I am in favor of reducing the number of special cases, but in this case the specialness is warranted.

I also think, as @MarkusTeufelberger pointed out, it would be a big improvement if the method returned a non-200 response when the status is not Healthy. I suggest 503 Service Unavailable as the general "it's not healthy" code.

I don't think the method should return details about the healthy stats. If you want to "watch the blinkenlights" as @ximinez put it, server_info has you covered. So in that regard, 👍 to the current behavior.

I don't think we should let the pursuit of perfection stop us from improving on this point, but I really do think making it available via the HTTP GET method and returning a non-200 status code should be hard requirements.
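A minimal sketch of that status mapping, assuming the handler fills in a Boost.Beast response like the msg object quoted in the diff hunks below; the 503 choice is just the suggestion above, not what the PR currently does:

// Illustrative only: derive the HTTP status from the computed health.
msg.result(
    health == healthy ? boost::beast::http::status::ok
                      : boost::beast::http::status::service_unavailable);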

@intelliot (Collaborator) left a comment

I agree with @mDuo13

@miguelportilla (Contributor):

@MarkusTeufelberger agreed.

@HowardHinnant (Contributor, Author):

This is ready to be re-reviewed per our conversation last week. This can be tested with curl like so:

curl -ki https://127.0.0.1:51235/health

@mDuo13 (Collaborator) commented May 14, 2020

I am using a relatively powerful home workstation, and my load_factor seems to bounce around from a baseline of ~130 when tracking/proposing, to momentarily ~1500 when handling RPC commands. I don't know how high the load factor can go, but maybe increasing those numbers a bit would work.

It would probably be better if the load-measuring system itself had a more stable number for use in cases like this, but that's probably beyond the scope of this task. But if we did, I would want maybe an average of the past 2s window? And the health check could just return critical if the uptime isn't enough to have a 2s load average.

@MarkusTeufelberger (Collaborator):

I'd leave the interpretation of the failed health checks up to the monitoring system rather than arbitrarily smooth (=invent fictional) data.

If there's a lot of single checks that fail even though the system is fine, then make sure to only alert if 2 or 3 checks fail in a row.

Maybe instead of only the load_factor, the ratio between load_factor and load_base should be considered? If load_base is 9999, a load_factor of 1500 is no problem; if it is 2, a factor of 1500 is a BIG problem!
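A sketch of that idea; note that load_base comes from server_state, so whether it is available to this handler is an assumption, and the 1x/10x cutoffs below are placeholders rather than agreed values:

// Hypothetical: judge load by its ratio to load_base rather than the raw number.
// (load_base comes from server_state; its availability here is an assumption.)
auto const load_ratio =
    info["load_factor"].asDouble() / info["load_base"].asDouble();
if (load_ratio > 1.0)
    set_health(load_ratio >= 10.0 ? critical : warning);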

@scottschurr (Collaborator) left a comment
I think this code does what it's intended to do. I left a number of thoughts for your consideration, but none of them are mandatory. You can see most of those thoughts implemented here: scottschurr@5989fc9

The bigger consideration, which I don't have a great answer for, is that none of this code is touched by any of our unit tests. Zero. Zilch.

That said, the unit test coverage in all of OverlayImpl.cpp is miserable. I suspect (but don't know) that's because it's hard to test. But if you can figure out a way to get some degree of coverage of the code you're adding I think that would be great.

Comment on lines +1092 to +1097
bool amendment_blocked = false;
if (info.isMember("amendment_blocked"))
amendment_blocked = true;
int number_peers = info["peers"].asInt();
std::string server_state = info["server_state"].asString();
auto load_factor = info["load_factor"].asDouble();
Collaborator:
I think most of these local variables can be const.
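For instance, restating the suggestion against the hunk above (folding the amendment_blocked check into its initializer so it can be const too):

auto const info = getServerInfo();
bool const amendment_blocked = info.isMember("amendment_blocked");
int const number_peers = info["peers"].asInt();
std::string const server_state = info["server_state"].asString();
auto const load_factor = info["load_factor"].asDouble();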

auto info = getServerInfo();

int last_validated_ledger_age = std::numeric_limits<int>::max();
if (info.isMember("validated_ledger"))
Collaborator:
Since all of the string literals in this pull request are vaguely JSON-like, my inclination is to declare them in jss.h and refer to them from there.
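If I'm remembering the jss.h pattern correctly (treat the exact macro and names here as an assumption), that would look roughly like:

// In jss.h, where the JSS macro defines a Json::StaticString constant:
JSS(amendment_blocked);
JSS(validated_ledger);

// At the use site:
if (info.isMember(jss::amendment_blocked))
    amendment_blocked = true;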

set_health(critical);
}

if (amendment_blocked)
Collaborator:
When possible, I'd be inclined to declare the locals inside the scope of the if using if initializers.
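For example, a sketch of the C++17 if-initializer form applied to the amendment check from this hunk:

if (bool const amendment_blocked = info.isMember("amendment_blocked");
    amendment_blocked)
{
    set_health(critical);
}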


if (last_validated_ledger_age >= 7)
{
    msg.body()[jss::info]["validated_ledger"] = last_validated_ledger_age;
Collaborator:
Do we really want to set ["validated_ledger"] to std::numeric_limits<int>::max() if the server_info does not return a ["validated_ledger"]["age"]? That doesn't feel like the right thing to tell a user.

Contributor:
I would just use some sort of string here, such as "Server has no validated ledger".



@milonite:

How come this was implemented just as a peer method (https://xrpl.org/peer-port-methods.html)? It would be useful also as a normal JSON-RPC call to check the status of our node.

@mDuo13 (Collaborator) commented Oct 23, 2020

@milonite It's on the peer port so that you can use HTTP GET instead of making a POST request. That's supposed to make it easier, though I suppose in a way that makes it harder to query since you've got to bypass the self-signed TLS certificate (or configure your server with a proper TLS cert).

Labels: API Change, Documentation, Passed
Linked issue: Create new "health check" method