
load_factor appears to be using wrong units in health check (Version: 1.6.0-b8) #3486

Closed
mDuo13 opened this issue Jul 1, 2020 · 2 comments


mDuo13 commented Jul 1, 2020

Issue Description

While testing the health check method (#3365) with v1.6.0-b8, I've noticed that the load_factor number it reports appears to be the server_state version, which is relative to load_base, but the thresholds appear to be based on the server_info version of the metric.

See the source code for load_factor thresholds.

Basically, the thresholds of > 100 (warning) and >= 1000 (critical) make sense if you're using the server_info versions of load_factor where "no load" is 1, but don't make sense if you're using the server_state version of load_factor where "no load" is 256.
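To make the mismatch concrete, here is a minimal Python sketch. The function name and structure are illustrative, not rippled's actual code; the thresholds (> 100 warning, >= 1000 critical) and the load_base of 256 come from the discussion above.

```python
LOAD_BASE = 256  # server_state "no load" baseline

def health_severity(load_factor_server_state: int) -> str:
    """Classify load on the server_info-style scale, where 1 means "no load".

    Hypothetical helper illustrating the intended behavior: normalize the
    server_state value by load_base before comparing against thresholds.
    """
    normalized = load_factor_server_state / LOAD_BASE
    if normalized >= 1000:
        return "critical"
    if normalized > 100:
        return "warning"
    return "healthy"

# The reported bug: comparing the raw server_state value (256) against
# thresholds meant for the normalized scale flags an idle server.
raw = 256
assert health_severity(raw) == "healthy"  # normalized check: 256/256 = 1
assert raw > 100                          # un-normalized check: false warning
```

With normalization, an idle server (load_factor = load_base = 256) lands exactly at 1 and stays well below both thresholds.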

Steps to Reproduce

Start rippled and wait for it to sync. Compare the results of server_info and health check:

$ curl -ki https://localhost:51235/health
HTTP/1.1 200 OK
Server: rippled-1.6.0-b8
Content-Type: application/json
Connection: close
Transfer-Encoding: chunked

{"info":{"load_factor":256}}
$ ./rippled server_info
... (trimmed) ...
         "load_factor" : 1,
... (trimmed) ...
$ ./rippled server_state
... (trimmed) ...
         "load_base" : 256,
         "load_factor" : 256,
         "load_factor_fee_escalation" : 256,
         "load_factor_fee_queue" : 256,
         "load_factor_fee_reference" : 256,
         "load_factor_server" : 256,
... (trimmed) ...

Expected Result

If the server is fully healthy, load_factor in server_info should be 1, and the health check response should omit load_factor entirely.

Actual Result

The health check response contains {"info":{"load_factor":256}} which indicates it considers this load factor to be in a warning state.

Environment

rippled 1.6.0-b8, self-compiled, on Arch Linux
Synced to mainnet. Unit tests pass.

@HowardHinnant
Contributor

Just to clarify, the reported load factor should be load_factor/load_base?

And do you have recommendations for warning and critical limits?


mDuo13 commented Jul 6, 2020

Just to clarify, the reported load factor should be load_factor/load_base?

The reported load_factor should match the server_info load_factor. Which, yes, if you're using the "server_state" definitions, that would be load_factor / load_base.

And do you have recommendations for warning and critical limits?

I don't have enough data to make recommendations. Perhaps @mayurbhandary's StatsD collector has enough data? Otherwise, I guess the 100/1000 limits are probably fine as long as load_factor is using the server_info definition. Possibly it would be better to go with more conservative thresholds of 1000/10,000 instead to reduce the rate of false positives.

Slightly off-topic: ledger age discussion

Also, while we're on the subject of the health check, I had a realization that might be worth further discussion. Basically, I discovered that the warning-level server_state values never apply. That's because all the "warning" states are for when the server is syncing and doesn't yet have the latest validated ledger. In this case, the health check reports it in a critical state because the validated_ledger age is MAXINT.

I wasn't one of the reviewers who raised this issue at the time but now that I've seen it in action I think I agree that this isn't ideal behavior. It's mildly annoying for me to document the MAXINT behavior since it's technically architecture-dependent. If you're processing the response, there's an implicit/practical threshold between "realllly old validated ledger" and "implausibly big value must mean there is no validated ledger" but it's not obvious where that threshold is. I think the right response would be to return a string value such as "unavailable" or "none" or something instead, or to use a special value like -1 so that this field is always a consistent data type (signed int). I kind of like the implication that a negative age indicates that the validated ledger will be acquired in the future.
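Either sentinel choice would let a client branch on the field without guessing where "implausibly large" begins. A sketch of what client-side handling could look like if -1 were used (the function and the concrete age thresholds here are assumptions for illustration, not rippled's actual values):

```python
def classify_ledger_age(age_seconds: int) -> str:
    """Hypothetical client-side classification of validated_ledger age.

    A negative age is the proposed sentinel for "no validated ledger yet"
    (server still syncing), treated as a warning rather than critical.
    The 7s/20s thresholds below are illustrative assumptions only.
    """
    if age_seconds < 0:
        return "warning"   # syncing at startup; likely to recover on its own
    if age_seconds >= 20:
        return "critical"  # ledger genuinely too old
    if age_seconds >= 7:
        return "warning"
    return "healthy"
```

Note that a sentinel keeps the field a consistent signed integer, so a client never has to pick an arbitrary cutoff between "very old" and "nonexistent".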

The bigger question is whether "no validated ledger" should automatically be a "critical" state. I think we might want the server to report a "warning" state, not critical, when the server is syncing. My intuition is that a "critical" state should prompt stronger interventions like restarting hardware or escalating to a human administrator, whereas a "warning" state might be better served by milder interventions like directing API traffic away from the affected server on the assumption that it has a good chance to recover on its own. I believe that the "no validated ledger" case only occurs during startup, before the server has synced; if the server was synced but now has lost sync, it should still have a validated ledger from back when it was synced, just an increasingly old one. So I think moving the special case of "no validated ledger" from critical to warning serves the purpose of having the server report its health status as warning, not critical, when syncing during startup; and this still preserves the critical status report if the validated ledger actually does get too old.

Thing is, if we're thinking about the health check in those terms, I think we'd want the "warning" level to report a non-200 HTTP status code so that unsophisticated load balancers could use the status code to detect the warning state in order to actually direct traffic away like that. The right code for the warning status is probably 503 Service Unavailable: in my experience, 503 is typically used for issues that are temporary and can be resolved by a "wait and retry" strategy, especially load issues. Taking this idea a step further, I think it would also make sense to change the HTTP status code for "critical" level to something else such as 500 Internal Server Error. 500 errors seem to be more often associated with configuration issues or code errors that aren't likely to resolve themselves.

Summary of my (updated) thinking on the statuses:

| Health Status | HTTP Status Code | Typical Interventions |
|---|---|---|
| healthy | 200 OK | None. |
| warning | 503 Service Unavailable | Wait and see; redirect or reduce traffic; software restart. |
| critical | 500 Internal Server Error | Hardware restart; automatic re-provisioning; reconfigure; escalate to higher-rank administrator. |

Most of the existing thresholds, I think, lend themselves pretty well to these definitions already, so it would just be a matter of:

  • changing the response code for the warning level
  • changing the reported value for the "no validated ledger" special case from MAXINT to a different value that more explicitly indicates a special case
  • changing the "no validated ledger" special case from critical to warning status
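The proposed mapping is small enough to sketch directly (the names here are illustrative; the codes are the ones from the table above):

```python
# Proposed health-status -> HTTP status code mapping.
HTTP_CODE = {
    "healthy": 200,   # OK: no intervention needed
    "warning": 503,   # Service Unavailable: transient; wait/retry or shed traffic
    "critical": 500,  # Internal Server Error: needs stronger intervention
}

def http_status_for(health: str) -> int:
    """Hypothetical lookup a health-check handler might use when
    choosing the response's HTTP status code."""
    return HTTP_CODE[health]
```

The appeal of this split is that even a load balancer that only inspects status codes can distinguish "temporarily degraded, route around it" (503) from "persistently broken, alert someone" (500).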

HowardHinnant added a commit to HowardHinnant/rippled that referenced this issue Jul 7, 2020
* Fixes XRPLF#3486
* load factor computation normalized by load_base.
* last validated ledger age set to -1 while syncing.
* Return status changed:
*    healthy  -> ok
*    warning  -> service_unavailable
*    critical -> internal_server_error