RPC tooBusy response now has 503 HTTP status code: #4143

scottschurr · 2022-04-11T18:49:18Z

High Level Overview of Change

Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. This commit also adds non-OK HTTP responses to a few additional RPC error responses.

Context of Change

Issue #4005 notes that returning an HTTP 200 (OK) status code when the server is overloaded is misleading for load balancers. This proposed change allows each RPC error response to be associated with its own HTTP status code.

If an explicit HTTP status code is not specified, then 200 (OK) is assumed. This preserves much of the status quo since prior to this change a 200 (OK) status code was returned in all cases.

The pull request includes a best guess at other cases where a non-200 status code should be returned. Reviewers are requested to look those over carefully and recommend changes. Thanks.

Type of Change

Potentially breaking change (fix or feature that could cause existing functionality to not work as expected).

The change is potentially breaking in cases where servers are expecting to get a 200 (OK) status code back since they will no longer get that status code in certain error cases. It's unknown whether that problem would occur in practice.

There is no amendment associated with the change, since it does not change transaction results. However the change does affect the API. So care is justified.

Before / After

I hacked the server_info command to return an rpc_TOO_BUSY error, then I used the following curl command:

#!/bin/bash
# Use curl to make a server_info request
curl -i -d '{ "method": "server_info", "params": [ {"ripplerpc": "1.0"} ] }' -H 'Content-Type: application/json' -X POST http://127.0.0.1:5005/

Without the change I get the following response:

HTTP/1.1 200 OK
Date: Mon, 11 Apr 2022 17:53:55 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

However with the change I get the following response (note the changed status code):

HTTP/1.1 503 Server is overloaded
Date: Mon, 11 Apr 2022 18:03:41 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

Test Plan

The rippled testing infrastructure currently ignores the returned HTTP status code. I could figure out how to fix that. But I'd like to suggest that we content ourselves with integration tests. Potentially the integration tests could use curl commands similar to what I used for manual testing. However we'd need to actually overload the server, rather than hacking a command to return an inaccurate result.

More thoughts are appreciated regarding testing.

Reviewers

I'd appreciate review and comments from @WietseWind who wrote the initial issue.

WietseWind · 2022-04-11T23:30:20Z

Thank you @scottschurr! Great solution, being able to configure alternative error codes per error.

I see that you already added error codes for the most important errors/codes. Going through the list I feel there are many more eligible for (mostly) 400 / 401 / 403 / 500 HTTP status codes (e.g. rpcFORBIDDEN, rpcNOT_READY, rpcNO_NETWORK, rpcJSON_RPC, rpcNO_PERMISSION).

Question is where to draw the line between something the client asked for (response still OK, 200, to be expected) and where it's a response that should explicitly inform the client that something is off.

What originally caused me to post #4005 is the load balancer auto-failover use case. I feel that's where we should draw the line: if a query fails in the 4xx/5xx range and the exact same query is likely to have a positive (200) outcome on another node, it should return a non-200 status code.

If you agree I'll go through the list and add some more 4xx/5xx status codes. But other than adding some status codes: perfect. Thanks again!

scottschurr · 2022-04-12T01:43:46Z

@WietseWind, thanks for your comment. I'm not really a load balancer kind of guy. I mostly keep my nose in the C++. So I'm happy to use your suggestions for which status codes should be returned. Then I'll rely on @cjcobb23 and @intelliot to verify that they look right from their perspectives.

So, yes, please go through the list and add some more 4xx/5xx status codes. I'll also add a comment regarding how the line was drawn between 200 and other. Thanks!

intelliot

I believe this change should be accompanied by an addition to the release notes and docs

scottschurr · 2022-05-11T17:15:21Z

@WietseWind, I'm hoping you'll have time to indicate where you'd like additional HTTP status codes added. Is there anything you need from me to push that forward? Thanks.

scottschurr · 2022-05-18T20:27:14Z

Hey @WietseWind, on April 11th you noted:

If you agree I'll go through the list and add some more 4xx/5xx status codes.

That's great and would be appreciated. But you've had five weeks now to do that. I know you're really busy. If you can get back to me within the next week to at least let me know when you can get around to identifying those additional status codes I'd really appreciate it.

If I don't hear back from you within the week I'll assume you're too busy to make the assessment right now and we can commit the HTTP status code changes that are currently in the pull request. If that should happen, then you can make a later pull request to put in the additional status codes.

Thanks for the help.

WietseWind · 2022-05-19T11:39:08Z

Hi @scottschurr, sorry for that. Will do right now, lost track of this thread. Thanks for the reminder.

I propose the codes below, where a non-standard HTTP code is only returned if one can expect (e.g. when submitting against a pool) another node to respond differently (first call: error, second call, other machine: OK) (except for calls needing auth).

https://gist.github.com/WietseWind/661776d0b4c849bfa5f237101104e02a

scottschurr · 2022-05-20T03:47:56Z

Thanks for the help @WietseWind. I've added a commit with the extra status codes and a bit more code that was required to make the new status codes work. I think the status codes are in their final state for this pull request. It would be great if the reviewers would look it over and let me know if there are other things that need fixing. Thanks.

WietseWind · 2022-05-20T07:43:08Z

@scottschurr Thanks, appreciate it :D

scottschurr · 2022-06-15T22:53:17Z

Ping @cjcobb23 and @mDuo13. How do you folks feel about this pull request in its current state? Do you want any changes? Thanks.

mDuo13

I'm supportive of the general idea, but I think this change is fairly likely to break clients, so we should not do it in a patch release and ideally it would be on a version switch or something.

There are languages & libraries that raise an error (potentially crashing) if they get a non-200 error code. This change means that some clients that previously correctly handled error messages may crash instead.

Also, there are a decent number of error codes that are more or less specific cases of invalidParams but that code uses 400 Bad Request and several of the other codes are still on 200 OK. That's…less than ideal from an API design standpoint. It wouldn't be as bad if it was limited to just 503 Service Unavailable when the server isn't fully operational, but returning 400 Bad Request based on user input is definitely a big risk in terms of breaking people's existing code.

#3418 is an example of a PR where we made a smaller change but put it on an API v2 switch. We should consider doing that with at least some of these changes.

WietseWind · 2022-07-01T19:34:32Z

Makes sense @mDuo13, thanks for chiming in.

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

mDuo13

I suggested appropriate HTTP status codes for all the other error codes, but I still think these breaking changes should not be implemented in the v1 API and we probably want to consolidate a lot of these errors into fewer for API v2 anyway

mDuo13 · 2022-07-01T19:20:42Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcACT_MALFORMED,          "actMalformed",         "Account malformed."},
+    {rpcACT_NOT_FOUND,          "actNotFound",          "Account not found."},
+    {rpcALREADY_MULTISIG,       "alreadyMultisig",      "Already multisigned."},
+    {rpcALREADY_SINGLE_SIG,     "alreadySingleSig",     "Already single-signed."},


I think all four of these sound like cases that should be 400 Bad Request

src/ripple/protocol/impl/ErrorCodes.cpp

mDuo13 · 2022-07-01T20:41:22Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcBAD_KEY_TYPE,           "badKeyType",           "Bad key type."},
+    {rpcBAD_FEATURE,            "badFeature",           "Feature unknown or invalid.", 500},
+    {rpcBAD_ISSUER,             "badIssuer",            "Issuer account malformed."},
+    {rpcBAD_MARKET,             "badMarket",            "No such market."},


Either 400 or 404. Might need to dig more into when this comes up

I went with 404, Not Found. But I'm happy to change that.

src/ripple/protocol/impl/ErrorCodes.cpp

intelliot · 2022-07-02T00:15:06Z

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

I recommend having HTTP status code changes be the the only change in "api_version": 3.

Using a special HTTP request header isn't worth the hassle in code, documentation, or increased number of special cases. Version numbers are free* and we will not run out. Let's use them.

I have updated XLS-22d API Versioning to note that changing status codes is a breaking change.

*(Thanks to @thejohnfreeman for this link.)

intelliot · 2022-07-02T00:21:54Z

Note, I'm not saying that this is a perfect solution or avoids all potential problems. We are just trying to make the best trade-off given our goals and constraints. Using "api_version": 3 will require clients to adapt to the change that was made for "api_version": 2:

#3770 - this is also an example of how a change is gated behind api_version / apiVersion in code

I think clients will have to adapt anyway, because clients may have existing code that expects certain status codes. When they are ready to gain the benefits of automatic failover at the reverse proxy layer, then they should audit / test their code to ensure that they comply with the latest API, including both the new status codes and the signer_list location.

intelliot

I agree with @mDuo13 's request except I think it should be "api_version": 3 or later

MarkusTeufelberger · 2022-07-03T20:46:44Z

I'm not too sure if HTTP return codes should reflect the underlying JSON-RPC calls, but the spec there is surprisingly sparse...

scottschurr

I looked into using the "api_version" to gate this feature, but I ran into some awkwardness. The "api_version" is not readily available where I need it.

How would y'all feel about using the "ripplerpc" version to gate the feature? I have easy access to that field where I need it. "ripplerpc" currently supports versions "1.0" and "2.0". I could bump that to "2.1" or "3.0". I wrote a quick test case using "ripplerpc" as the gate and it worked.

Thoughts?

scottschurr · 2022-07-07T22:03:09Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcBAD_KEY_TYPE,           "badKeyType",           "Bad key type."},
+    {rpcBAD_FEATURE,            "badFeature",           "Feature unknown or invalid.", 500},
+    {rpcBAD_ISSUER,             "badIssuer",            "Issuer account malformed."},
+    {rpcBAD_MARKET,             "badMarket",            "No such market."},


I went with 404, Not Found. But I'm happy to change that.

scottschurr · 2022-07-07T22:17:31Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcCHANNEL_MALFORMED,      "channelMalformed",     "Payment channel is malformed."},
+    {rpcCHANNEL_AMT_MALFORMED,  "channelAmtMalformed",  "Payment channel amount is malformed."},
+    {rpcCOMMAND_MISSING,        "commandMissing",       "Missing command entry."},
+    {rpcDB_DESERIALIZATION,     "dbDeserialization",    "Database deserialization error."},


I went with 502, since a 503 is usually a temporary problem. A problem deserializing from raw SQL is unlikely to be temporary.

scottschurr · 2022-07-07T22:27:26Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcFAILED_TO_FORWARD,      "failedToForward",      "Failed to forward request to p2p node", 503},
+    {rpcHIGH_FEE,               "highFee",              "Current transaction fee exceeds your limit."},
+    {rpcINTERNAL,               "internal",             "Internal error.", 500},
+    {rpcINVALID_LGR_RANGE,      "invalidLgrRange",      "Ledger range is invalid."},


I looked. Currently this response is only caused by invalid user input. So 400 it is.

scottschurr · 2022-07-07T23:03:38Z

src/ripple/protocol/impl/ErrorCodes.cpp

+    {rpcLGR_IDXS_INVALID,       "lgrIdxsInvalid",       "Ledger indexes invalid."},
+    {rpcLGR_IDX_MALFORMED,      "lgrIdxMalformed",      "Ledger index malformed."},
+    {rpcLGR_NOT_FOUND,          "lgrNotFound",          "Ledger not found."},
+    {rpcLGR_NOT_VALIDATED,      "lgrNotValidated",      "Ledger not validated.", 202},


According to Google...

202 Accepted response status code indicates that the request has been accepted for processing, but the processing has not been completed; in fact, processing may not have started yet.

Yeah, I'm not completely sold on 202 myself. If you get back lgrNotValidated, then the request has been rejected. You could try the same request later, but it would be handled as a brand new request.

scottschurr · 2022-07-08T22:53:31Z

I've pushed a commit that adds the HTTP status codes suggested by @mDuo13. It also gates returning this status codes based on the "ripplerpc" field of the original request being "3.0" or greater.

Let me know what y'all think.

intelliot · 2022-07-08T23:01:20Z

I think it's fine to have this use "ripplerpc": "3.0". In fact, that might actually be preferable; I hadn't thought of it as an option before. ripplerpc affects HTTP, but not WebSockets. And that's good since HTTP status codes don't apply to WebSockets anyway.

intelliot · 2022-07-08T23:01:58Z

I would wait for confirmation from @WietseWind before merging, though :)

intelliot · 2022-07-08T23:03:37Z

Also, it's important that this gets documented on xrpl.org. We should also document the difference between ripplerpc 1.0 and 2.0, if we haven't already.

scottschurr · 2022-08-04T23:10:55Z

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

WietseWind · 2022-08-05T23:28:30Z

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

Very happy with it :) Thank you!

mDuo13

I am on board with applying these changes in the case that users supply "3.0", yes.

I also still need to research and fully understand the differences in "2.0" so I can document those.

scottschurr · 2022-08-22T17:51:44Z

Thanks @mDuo13. If you think these changes are ready to be committed, could you please mark the pull request as approved? But, since you want more time to research the "2.0" case, then you may want to withhold approval until that research is complete. That's fine also.

@cjcobb23, would you like to weigh in on these changes as well? They may have some consequences in Clio.

scottschurr · 2022-11-18T21:57:25Z

Thanks @mDuo13! @cjcobb23, it's now your call.

cjcobb23

LGTM

Fixes XRPLF#4005 Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. There are quite a number of RPC responses in addition to tooBusy that now have non-OK HTTP status codes. The new return HTTP return codes are only enabled by including "ripplerpc": "3.0" or higher in the original request. Otherwise the historical value, 200, continues to be returned.

scottschurr · 2022-11-29T19:42:10Z

Thanks for the review @cjcobb23! I've squashed and rebased the commit and marked it as passed.

…F#4143) Fixes XRPLF#4005 Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. There are quite a number of RPC responses in addition to tooBusy that now have non-OK HTTP status codes. The new return HTTP return codes are only enabled by including "ripplerpc": "3.0" or higher in the original request. Otherwise the historical value, 200, continues to be returned. This ensures that this is not a breaking change.

scottschurr requested review from intelliot and cjcobb23 April 11, 2022 18:49

scottschurr assigned intelliot and cjcobb23 Apr 11, 2022

intelliot approved these changes May 10, 2022

View reviewed changes

intelliot requested a review from mDuo13 May 10, 2022 22:06

scottschurr force-pushed the too-busy-503 branch from 86b7938 to d93a0ae Compare May 20, 2022 03:43

scottschurr assigned mDuo13 May 20, 2022

mDuo13 requested changes Jul 1, 2022

View reviewed changes

mDuo13 reviewed Jul 1, 2022

View reviewed changes

intelliot reviewed Jul 2, 2022

View reviewed changes

scottschurr commented Jul 8, 2022

View reviewed changes

scottschurr force-pushed the too-busy-503 branch from cec4656 to 98a0ad3 Compare August 4, 2022 22:54

intelliot self-requested a review August 5, 2022 23:30

intelliot approved these changes Aug 5, 2022

View reviewed changes

intelliot requested review from WietseWind and mDuo13 August 5, 2022 23:31

intelliot removed their assignment Aug 5, 2022

intelliot mentioned this pull request Aug 5, 2022

Document "ripplerpc": "3.0" option XRPLF/xrpl-dev-portal#1458

Open

mDuo13 reviewed Aug 22, 2022

View reviewed changes

mDuo13 approved these changes Nov 18, 2022

View reviewed changes

cjcobb23 approved these changes Nov 29, 2022

View reviewed changes

scottschurr force-pushed the too-busy-503 branch from 98a0ad3 to f08c1c1 Compare November 29, 2022 19:39

scottschurr added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Nov 29, 2022

intelliot unassigned mDuo13 and cjcobb23 Dec 23, 2022

intelliot added API Change Testable Will Need Documentation labels Dec 23, 2022

intelliot merged commit 6f87503 into XRPLF:develop Jan 3, 2023

scottschurr deleted the too-busy-503 branch January 13, 2023 01:39

RPC tooBusy response now has 503 HTTP status code: #4143

RPC tooBusy response now has 503 HTTP status code: #4143

Conversation

scottschurr commented Apr 11, 2022

High Level Overview of Change

Context of Change

Type of Change

Before / After

Test Plan

Reviewers

WietseWind commented Apr 11, 2022 • edited Loading

scottschurr commented Apr 12, 2022

intelliot left a comment

Choose a reason for hiding this comment

scottschurr commented May 11, 2022

scottschurr commented May 18, 2022

WietseWind commented May 19, 2022

scottschurr commented May 20, 2022

WietseWind commented May 20, 2022

scottschurr commented Jun 15, 2022

mDuo13 left a comment

Choose a reason for hiding this comment

WietseWind commented Jul 1, 2022

mDuo13 left a comment

Choose a reason for hiding this comment

mDuo13 Jul 1, 2022

Choose a reason for hiding this comment

mDuo13 Jul 1, 2022

Choose a reason for hiding this comment

scottschurr Jul 7, 2022

Choose a reason for hiding this comment

intelliot commented Jul 2, 2022 • edited Loading

intelliot commented Jul 2, 2022 • edited Loading

intelliot left a comment

Choose a reason for hiding this comment

MarkusTeufelberger commented Jul 3, 2022

scottschurr left a comment

Choose a reason for hiding this comment

scottschurr Jul 7, 2022

Choose a reason for hiding this comment

scottschurr Jul 7, 2022

Choose a reason for hiding this comment

scottschurr Jul 7, 2022

Choose a reason for hiding this comment

scottschurr Jul 7, 2022

Choose a reason for hiding this comment

scottschurr commented Jul 8, 2022

intelliot commented Jul 8, 2022

intelliot commented Jul 8, 2022

intelliot commented Jul 8, 2022

scottschurr commented Aug 4, 2022

WietseWind commented Aug 5, 2022

mDuo13 left a comment

Choose a reason for hiding this comment

scottschurr commented Aug 22, 2022

scottschurr commented Nov 18, 2022

cjcobb23 left a comment

Choose a reason for hiding this comment

scottschurr commented Nov 29, 2022

WietseWind commented Apr 11, 2022 •

edited

Loading

intelliot commented Jul 2, 2022 •

edited

Loading

intelliot commented Jul 2, 2022 •

edited

Loading