load_factor appears to be using wrong units in health check (Version: 1.6.0-b8) #3486
Just to clarify, the reported load factor should be normalized by `load_base`? And do you have recommendations for warning and critical limits?
The reported load factor should be normalized by `load_base`, yes.

I don't have enough data to make recommendations. Perhaps @mayurbhandary's StatsD collector has enough data? Otherwise, I guess the 100/1000 limits are probably fine as long as the value is normalized by `load_base` first.

Slightly off-topic: ledger age discussion

Also, while we're on the subject of the health check, I had a realization that might be worth further discussion. Basically, I discovered that the warning-level and critical-level checks on validated ledger age misbehave when there is no validated ledger at all: the server reports the age as MAXINT, which immediately trips the critical threshold. I wasn't one of the reviewers who raised this issue at the time, but now that I've seen it in action, I think I agree that this isn't ideal behavior.

It's mildly annoying for me to document the MAXINT behavior since it's technically architecture-dependent. If you're processing the response, there's an implicit/practical threshold between "really old validated ledger" and "implausibly big value that must mean there is no validated ledger", but it's not obvious where that threshold is. I think the right response would be to return a string value such as "unavailable" or "none" instead, or to use a special value like `-1`.

The bigger question is whether "no validated ledger" should automatically be a "critical" state. I think we might want the server to report a "warning" state, not critical, when the server is syncing. My intuition is that a "critical" state should prompt stronger interventions like restarting hardware or escalating to a human administrator, whereas a "warning" state might be better served by milder interventions like directing API traffic away from the affected server on the assumption that it has a good chance to recover on its own.

I believe that the "no validated ledger" case only occurs during startup, before the server has synced; if the server was synced but has since lost sync, it should still have a validated ledger from back when it was synced, just an increasingly old one. So I think moving the special case of "no validated ledger" from critical to warning serves the purpose of having the server report its health status as warning, not critical, when syncing during startup; and this still preserves the critical status report if the validated ledger actually does get too old.

Thing is, if we're thinking about the health check in those terms, I think we'd want the "warning" level to report a non-200 HTTP status code, so that unsophisticated load balancers could use the status code to detect the warning state and actually direct traffic away. The right code for the warning status is probably 503 Service Unavailable: in my experience, 503 is typically used for issues that are temporary and can be resolved by a "wait and retry" strategy, especially load issues. Taking this idea a step further, I think it would also make sense to change the HTTP status code for the "critical" level to something else, such as 500 Internal Server Error. 500 errors seem to be more often associated with configuration issues or code errors that aren't likely to resolve themselves.

Summary of my (updated) thinking on the statuses:

* Healthy: HTTP 200. No action needed.
* Warning: HTTP 503 Service Unavailable. A temporary condition, such as syncing during startup, that has a good chance of resolving on its own; direct traffic away and retry later.
* Critical: HTTP 500 Internal Server Error. A problem that is unlikely to resolve itself and probably needs stronger intervention, such as restarting hardware or escalating to a human administrator.
Most of the existing thresholds, I think, lend themselves pretty well to these definitions already, so it would just be a matter of:

* changing the HTTP status code for the warning level from 200 to 503 Service Unavailable;
* changing the HTTP status code for the critical level to 500 Internal Server Error;
* moving the "no validated ledger" special case from critical to warning, and reporting it with a special value such as `-1` instead of MAXINT.
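To make the proposed mapping concrete, here is a minimal C++ sketch under the assumptions above (the `Health` enum and `httpStatusFor` function are hypothetical names for illustration, not rippled's actual API):

```cpp
// Hypothetical status -> HTTP code mapping for the health check.
enum class Health { Healthy, Warning, Critical };

int httpStatusFor(Health h)
{
    switch (h)
    {
        case Health::Healthy:
            return 200;  // OK: no action needed
        case Health::Warning:
            return 503;  // Service Unavailable: temporary (e.g. syncing);
                         // direct traffic away and retry later
        case Health::Critical:
            return 500;  // Internal Server Error: unlikely to resolve
                         // without operator intervention
    }
    return 500;  // defensive default for non-enumerated values
}
```

This matches the return statuses that the linked fix below describes (ok / service_unavailable / internal_server_error).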
* Fixes XRPLF#3486
* load factor computation normalized by load_base.
* last validated ledger age set to -1 while syncing.
* Return status changed:
  * healthy -> ok
  * warning -> service_unavailable
  * critical -> internal_server_error
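As a rough illustration of the "-1 while syncing" change, here is a hedged C++ sketch (the function and parameter names are hypothetical; the actual implementation in the fix differs):

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Report the age of the last validated ledger in seconds, or the
// sentinel -1 when there is no validated ledger yet (still syncing),
// instead of an architecture-dependent MAXINT.
std::int64_t validatedLedgerAgeSeconds(
    std::optional<std::chrono::seconds> lastValidatedAge)
{
    if (!lastValidatedAge)
        return -1;
    return lastValidatedAge->count();
}
```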
Issue Description
While testing the health check method (#3365) with v1.6.0-b8, I've noticed that the `load_factor` number it reports appears to be the server_state version, which is relative to `load_base`, but the thresholds appear to be based on the server_info version of the metric. See the source code for the load_factor thresholds.
Basically, the thresholds of > 100 (warning) and >= 1000 (critical) make sense if you're using the server_info version of load_factor, where "no load" is 1, but don't make sense if you're using the server_state version of load_factor, where "no load" is 256.
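To illustrate the unit mismatch, here is a small self-contained C++ sketch (variable names are illustrative, not from rippled's source): the server_state flavor reports raw units where `load_base` (256) means "no load", while the server_info flavor divides by `load_base` so that 1 means "no load". The health check thresholds only make sense against the normalized flavor.

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    std::uint32_t const loadBase = 256;       // "no load" in server_state units
    std::uint32_t const loadFactorRaw = 256;  // value currently reported by the health check

    // server_info-style value: 1.0 means "no load".
    double const loadFactorNormalized =
        static_cast<double>(loadFactorRaw) / loadBase;

    // Thresholds of > 100 (warning) and >= 1000 (critical) should be
    // compared against the normalized value, not the raw one.
    std::cout << "raw: " << loadFactorRaw
              << ", normalized: " << loadFactorNormalized << '\n';
}
```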
Steps to Reproduce
Start rippled and wait for it to sync. Compare the results of `server_info` and the health check:
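For example (assuming a default configuration, where the health check is served on the peer port, 51235 by default): run `rippled server_info` from the command line, send an HTTP GET request to the `/health` path on the peer port, and compare the `load_factor` values in the two responses.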
Expected Result

If the server is fully healthy, the `load_factor` in `server_info` should be 1, and the `load_factor` should be omitted from the health check.

Actual Result
The health check response contains `{"info":{"load_factor":256}}`, which indicates it considers this load factor to be in a warning state.

Environment
rippled 1.6.0-b8, self-compiled, on Arch Linux
Synced to mainnet. Unit tests pass.