
load_factor appears to be using wrong units in health check (Version: 1.6.0-b8) #3486

Closed
mDuo13 opened this issue Jul 1, 2020 · 2 comments


mDuo13 commented Jul 1, 2020

Issue Description

While testing the health check method (#3365) with v1.6.0-b8, I've noticed that the load_factor number it reports appears to be the server_state version, which is relative to load_base, but the thresholds appear to be based on the server_info version of the metric.

See the source code for load_factor thresholds.

Basically, the thresholds of > 100 (warning) and >= 1000 (critical) make sense if you're using the server_info versions of load_factor where "no load" is 1, but don't make sense if you're using the server_state version of load_factor where "no load" is 256.
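To make the mismatch concrete, here is a minimal Python sketch. The function name and structure are illustrative, not rippled's actual code; the thresholds (> 100 warning, >= 1000 critical) and the load_base of 256 come from the discussion above.

```python
LOAD_BASE = 256  # server_state "no load" baseline

def health_severity(load_factor_server_state: int) -> str:
    """Classify load on the server_info-style scale, where 1 means "no load".

    Hypothetical helper illustrating the intended behavior: normalize the
    server_state value by load_base before comparing against thresholds.
    """
    normalized = load_factor_server_state / LOAD_BASE
    if normalized >= 1000:
        return "critical"
    if normalized > 100:
        return "warning"
    return "healthy"

# The reported bug: comparing the raw server_state value (256) against
# thresholds meant for the normalized scale flags an idle server.
raw = 256
assert health_severity(raw) == "healthy"  # normalized check: 256/256 = 1
assert raw > 100                          # un-normalized check: false warning
```

With normalization, an idle server (load_factor = load_base = 256) lands exactly at 1 and stays well below both thresholds.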

Steps to Reproduce

Start rippled and wait for it to sync. Compare the results of server_info and health check:

$ curl -ki https://localhost:51235/health
HTTP/1.1 200 OK
Server: rippled-1.6.0-b8
Content-Type: application/json
Connection: close
Transfer-Encoding: chunked

{"info":{"load_factor":256}}
$ ./rippled server_info
... (trimmed) ...
         "load_factor" : 1,
... (trimmed) ...
$ ./rippled server_state
... (trimmed) ...
         "load_base" : 256,
         "load_factor" : 256,
         "load_factor_fee_escalation" : 256,
         "load_factor_fee_queue" : 256,
         "load_factor_fee_reference" : 256,
         "load_factor_server" : 256,
... (trimmed) ...

Expected Result

If the server is fully healthy, load_factor in server_info should be 1, and the health check response should omit load_factor entirely.

Actual Result

The health check response contains {"info":{"load_factor":256}} which indicates it considers this load factor to be in a warning state.

Environment

rippled 1.6.0-b8, self-compiled, on Arch Linux
Synced to mainnet. Unit tests pass.

@HowardHinnant
Contributor

Just to clarify, the reported load factor should be load_factor/load_base?

And do you have recommendations for warning and critical limits?


mDuo13 commented Jul 6, 2020

Just to clarify, the reported load factor should be load_factor/load_base?

The reported load_factor should match the server_info load_factor. Which, yes, if you're using the "server_state" definitions, that would be load_factor / load_base.

And do you have recommendations for warning and critical limits?

I don't have enough data to make recommendations. Perhaps @mayurbhandary's StatsD collector has enough data? Otherwise, I guess the 100/1000 limits are probably fine as long as load_factor is using the server_info definition. Possibly it would be better to go with more conservative thresholds of 1000/10,000 instead to reduce the rate of false positives.

Slightly off-topic: ledger age discussion

Also, while we're on the subject of the health check, I had a realization that might be worth further discussion. Basically, I discovered that the warning-level server_state values never apply. That's because all the "warning" states are for when the server is syncing and doesn't yet have the latest validated ledger. In this case, the health check reports it in a critical state because the validated_ledger age is MAXINT.

I wasn't one of the reviewers who raised this issue at the time but now that I've seen it in action I think I agree that this isn't ideal behavior. It's mildly annoying for me to document the MAXINT behavior since it's technically architecture-dependent. If you're processing the response, there's an implicit/practical threshold between "realllly old validated ledger" and "implausibly big value must mean there is no validated ledger" but it's not obvious where that threshold is. I think the right response would be to return a string value such as "unavailable" or "none" or something instead, or to use a special value like -1 so that this field is always a consistent data type (signed int). I kind of like the implication that a negative age indicates that the validated ledger will be acquired in the future.
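Either sentinel choice would let a client branch on the field without guessing where "implausibly large" begins. A sketch of what client-side handling could look like if -1 were used (the function and the concrete age thresholds here are assumptions for illustration, not rippled's actual values):

```python
def classify_ledger_age(age_seconds: int) -> str:
    """Hypothetical client-side classification of validated_ledger age.

    A negative age is the proposed sentinel for "no validated ledger yet"
    (server still syncing), treated as a warning rather than critical.
    The 7s/20s thresholds below are illustrative assumptions only.
    """
    if age_seconds < 0:
        return "warning"   # syncing at startup; likely to recover on its own
    if age_seconds >= 20:
        return "critical"  # ledger genuinely too old
    if age_seconds >= 7:
        return "warning"
    return "healthy"
```

Note that a sentinel keeps the field a consistent signed integer, so a client never has to pick an arbitrary cutoff between "very old" and "nonexistent".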

The bigger question is whether "no validated ledger" should automatically be a "critical" state. I think we might want the server to report a "warning" state, not critical, when the server is syncing. My intuition is that a "critical" state should prompt stronger interventions like restarting hardware or escalating to a human administrator, whereas a "warning" state might be better served by milder interventions like directing API traffic away from the affected server on the assumption that it has a good chance to recover on its own. I believe that the "no validated ledger" case only occurs during startup, before the server has synced; if the server was synced but now has lost sync, it should still have a validated ledger from back when it was synced, just an increasingly old one. So I think moving the special case of "no validated ledger" from critical to warning serves the purpose of having the server report its health status as warning, not critical, when syncing during startup; and this still preserves the critical status report if the validated ledger actually does get too old.

Thing is, if we're thinking about the health check in those terms, I think we'd want the "warning" level to report a non-200 HTTP status code so that unsophisticated load balancers could use the status code to detect the warning state in order to actually direct traffic away like that. The right code for the warning status is probably 503 Service Unavailable: in my experience, 503 is typically used for issues that are temporary and can be resolved by a "wait and retry" strategy, especially load issues. Taking this idea a step further, I think it would also make sense to change the HTTP status code for "critical" level to something else such as 500 Internal Server Error. 500 errors seem to be more often associated with configuration issues or code errors that aren't likely to resolve themselves.

Summary of my (updated) thinking on the statuses:

| Health Status | HTTP Status Code | Typical Interventions |
|---|---|---|
| healthy | 200 OK | None. |
| warning | 503 Service Unavailable | Wait and see; redirect or reduce traffic; software restart. |
| critical | 500 Internal Server Error | Hardware restart; automatic re-provisioning; reconfigure; escalate to higher-rank administrator. |

Most of the existing thresholds, I think, lend themselves pretty well to these definitions already, so it would just be a matter of:

  • changing the response code for the warning level
  • changing the reported value for the "no validated ledger" special case from MAXINT to a different value that more explicitly indicates a special case
  • changing the "no validated ledger" special case from critical to warning status
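The proposed mapping is small enough to sketch directly (the names here are illustrative; the codes are the ones from the table above):

```python
# Proposed health-status -> HTTP status code mapping.
HTTP_CODE = {
    "healthy": 200,   # OK: no intervention needed
    "warning": 503,   # Service Unavailable: transient; wait/retry or shed traffic
    "critical": 500,  # Internal Server Error: needs stronger intervention
}

def http_status_for(health: str) -> int:
    """Hypothetical lookup a health-check handler might use when
    choosing the response's HTTP status code."""
    return HTTP_CODE[health]
```

The appeal of this split is that even a load balancer that only inspects status codes can distinguish "temporarily degraded, route around it" (503) from "persistently broken, alert someone" (500).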

HowardHinnant added a commit to HowardHinnant/rippled that referenced this issue Jul 7, 2020
* Fixes XRPLF#3486
* load factor computation normalized by load_base.
* last validated ledger age set to -1 while syncing.
* Return status changed:
*    healthy  -> ok
*    warning  -> service_unavailable
*    critical -> internal_server_error