Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC tooBusy response now has 503 HTTP status code: #4143

Merged
merged 1 commit into from
Jan 3, 2023

Conversation

scottschurr
Copy link
Collaborator

High Level Overview of Change

Fixes #4005

Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. This commit also adds non-OK HTTP responses to a few additional RPC error responses.

Context of Change

Issue #4005 notes that returning an HTTP 200 (OK) status code when the server is overloaded is misleading for load balancers. This proposed change allows each RPC error response to be associated with its own HTTP status code.

If an explicit HTTP status code is not specified, then 200 (OK) is assumed. This preserves much of the status quo since prior to this change a 200 (OK) status code was returned in all cases.

The pull request includes a best guess at other cases where a non-200 status code should be returned. Reviewers are requested to look those over carefully and recommend changes. Thanks.

Type of Change

  • Potentially breaking change (fix or feature that could cause existing functionality to not work as expected).

The change is potentially breaking in cases where servers are expecting to get a 200 (OK) status code back since they will no longer get that status code in certain error cases. It's unknown whether that problem would occur in practice.

There is no amendment associated with the change, since it does not change transaction results. However the change does affect the API. So care is justified.

Before / After

I hacked the server_info command to return an rpc_TOO_BUSY error, then I used the following curl command:

#!/bin/bash
# Use curl to make a server_info request
curl -i -d '{ "method": "server_info", "params": [ {"ripplerpc": "1.0"} ] }' -H 'Content-Type: application/json' -X POST http://127.0.0.1:5005/

Without the change I get the following response:

HTTP/1.1 200 OK
Date: Mon, 11 Apr 2022 17:53:55 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

However with the change I get the following response (note the changed status code):

HTTP/1.1 503 Server is overloaded
Date: Mon, 11 Apr 2022 18:03:41 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

Test Plan

The rippled testing infrastructure currently ignores the returned HTTP status code. I could figure out how to fix that. But I'd like to suggest that we content ourselves with integration tests. Potentially the integration tests could use curl commands similar to what I used for manual testing. However we'd need to actually overload the server, rather than hacking a command to return an inaccurate result.

More thoughts are appreciated regarding testing.

Reviewers

I'd appreciate review and comments from @WietseWind who wrote the initial issue.

@WietseWind
Copy link
Member

WietseWind commented Apr 11, 2022

Thank you @scottschurr! Great solution, being able to configure alternative error codes per error.

I see that you already added error codes for the most important errors/codes. Going through the list I feel there are many more eligible for (mostly) 400 / 401 / 403 / 500 HTTP status codes (e.g. rpcFORBIDDEN, rpcNOT_READY, rpcNO_NETWORK, rpcJSON_RPC, rpcNO_PERMISSION).

Question is where to draw the line between something the client asked for (response still OK, 200, to be expected) and where it's a response that should explicitly inform the client that something is off.

What originally caused me to post #4005 is the load balancer auto-failover use case. I feel that's where we should draw the line: if a query fails in the 4xx/5xx range and the exact same query is likely to have a positive (200) outcome on another node, it should return a non-200 status code.

If you agree I'll go through the list and add some more 4xx/5xx status codes. But other than adding some status codes: perfect. Thanks again!

@scottschurr
Copy link
Collaborator Author

@WietseWind, thanks for your comment. I'm not really a load balancer kind of guy. I mostly keep my nose in the C++. So I'm happy to use your suggestions for which status codes should be returned. Then I'll rely on @cjcobb23 and @intelliot to verify that they look right from their perspectives.

So, yes, please go through the list and add some more 4xx/5xx status codes. I'll also add a comment regarding how the line was drawn between 200 and other. Thanks!

Copy link
Collaborator

@intelliot intelliot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this change should be accompanied by an addition to the release notes and docs

@intelliot intelliot requested a review from mDuo13 May 10, 2022 22:06
@scottschurr
Copy link
Collaborator Author

@WietseWind, I'm hoping you'll have time to indicate where you'd like additional HTTP status codes added. Is there anything you need from me to push that forward? Thanks.

@scottschurr
Copy link
Collaborator Author

Hey @WietseWind, on April 11th you noted:

If you agree I'll go through the list and add some more 4xx/5xx status codes.

That's great and would be appreciated. But you've had five weeks now to do that. I know you're really busy. If you can get back to me within the next week to at least let me know when you can get around to identifying those additional status codes I'd really appreciate it.

If I don't hear back from you within the week I'll assume you're too busy to make the assessment right now and we can commit the HTTP status code changes that are currently in the pull request. If that should happen, then you can make a later pull request to put in the additional status codes.

Thanks for the help.

@WietseWind
Copy link
Member

Hi @scottschurr, sorry for that. Will do right now, lost track of this thread. Thanks for the reminder.

I propose the codes below, where a non-standard HTTP code is only returned if one can expect (e.g. when submitting against a pool) another node to respond differently (first call: error, second call, other machine: OK) (except for calls needing auth).

https://gist.github.com/WietseWind/661776d0b4c849bfa5f237101104e02a

@scottschurr
Copy link
Collaborator Author

Thanks for the help @WietseWind. I've added a commit with the extra status codes and a bit more code that was required to make the new status codes work. I think the status codes are in their final state for this pull request. It would be great if the reviewers would look it over and let me know if there are other things that need fixing. Thanks.

@WietseWind
Copy link
Member

@scottschurr Thanks, appreciate it :D

@scottschurr
Copy link
Collaborator Author

Ping @cjcobb23 and @mDuo13. How do you folks feel about this pull request in its current state? Do you want any changes? Thanks.

Copy link
Collaborator

@mDuo13 mDuo13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm supportive of the general idea, but I think this change is fairly likely to break clients, so we should not do it in a patch release and ideally it would be on a version switch or something.

There are languages & libraries that raise an error (potentially crashing) if they get a non-200 error code. This change means that some clients that previously correctly handled error messages may crash instead.

Also, there are a decent number of error codes that are more or less specific cases of invalidParams but that code uses 400 Bad Request and several of the other codes are still on 200 OK. That's…less than ideal from an API design standpoint. It wouldn't be as bad if it was limited to just 503 Service Unavailable when the server isn't fully operational, but returning 400 Bad Request based on user input is definitely a big risk in terms of breaking people's existing code.

#3418 is an example of a PR where we made a smaller change but put it on an API v2 switch. We should consider doing that with at least some of these changes.

@WietseWind
Copy link
Member

Makes sense @mDuo13, thanks for chiming in.

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

Copy link
Collaborator

@mDuo13 mDuo13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested appropriate HTTP status codes for all the other error codes, but I still think these breaking changes should not be implemented in the v1 API and we probably want to consolidate a lot of these errors into fewer for API v2 anyway

Comment on lines +48 to +51
{rpcACT_MALFORMED, "actMalformed", "Account malformed."},
{rpcACT_NOT_FOUND, "actNotFound", "Account not found."},
{rpcALREADY_MULTISIG, "alreadyMultisig", "Already multisigned."},
{rpcALREADY_SINGLE_SIG, "alreadySingleSig", "Already single-signed."},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all four of these sound like cases that should be 400 Bad Request

src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
{rpcBAD_KEY_TYPE, "badKeyType", "Bad key type."},
{rpcBAD_FEATURE, "badFeature", "Feature unknown or invalid.", 500},
{rpcBAD_ISSUER, "badIssuer", "Issuer account malformed."},
{rpcBAD_MARKET, "badMarket", "No such market."},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either 400 or 404. Might need to dig more into when this comes up

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with 404, Not Found. But I'm happy to change that.

src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
src/ripple/protocol/impl/ErrorCodes.cpp Outdated Show resolved Hide resolved
@intelliot
Copy link
Collaborator

intelliot commented Jul 2, 2022

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

I recommend having HTTP status code changes be the the only change in "api_version": 3.

Using a special HTTP request header isn't worth the hassle in code, documentation, or increased number of special cases. Version numbers are free* and we will not run out. Let's use them.

I have updated XLS-22d API Versioning to note that changing status codes is a breaking change.

*(Thanks to @thejohnfreeman for this link.)

@intelliot
Copy link
Collaborator

intelliot commented Jul 2, 2022

Note, I'm not saying that this is a perfect solution or avoids all potential problems. We are just trying to make the best trade-off given our goals and constraints. Using "api_version": 3 will require clients to adapt to the change that was made for "api_version": 2:

#3770 - this is also an example of how a change is gated behind api_version / apiVersion in code

I think clients will have to adapt anyway, because clients may have existing code that expects certain status codes. When they are ready to gain the benefits of automatic failover at the reverse proxy layer, then they should audit / test their code to ensure that they comply with the latest API, including both the new status codes and the signer_list location.

Copy link
Collaborator

@intelliot intelliot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mDuo13 's request except I think it should be "api_version": 3 or later

@MarkusTeufelberger
Copy link
Collaborator

I'm not too sure if HTTP return codes should reflect the underlying JSON-RPC calls, but the spec there is surprisingly sparse...

Copy link
Collaborator Author

@scottschurr scottschurr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into using the "api_version" to gate this feature, but I ran into some awkwardness. The "api_version" is not readily available where I need it.

How would y'all feel about using the "ripplerpc" version to gate the feature? I have easy access to that field where I need it. "ripplerpc" currently supports versions "1.0" and "2.0". I could bump that to "2.1" or "3.0". I wrote a quick test case using "ripplerpc" as the gate and it worked.

Thoughts?

{rpcBAD_KEY_TYPE, "badKeyType", "Bad key type."},
{rpcBAD_FEATURE, "badFeature", "Feature unknown or invalid.", 500},
{rpcBAD_ISSUER, "badIssuer", "Issuer account malformed."},
{rpcBAD_MARKET, "badMarket", "No such market."},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with 404, Not Found. But I'm happy to change that.

{rpcCHANNEL_MALFORMED, "channelMalformed", "Payment channel is malformed."},
{rpcCHANNEL_AMT_MALFORMED, "channelAmtMalformed", "Payment channel amount is malformed."},
{rpcCOMMAND_MISSING, "commandMissing", "Missing command entry."},
{rpcDB_DESERIALIZATION, "dbDeserialization", "Database deserialization error."},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with 502, since a 503 is usually a temporary problem. A problem deserializing from raw SQL is unlikely to be temporary.

{rpcFAILED_TO_FORWARD, "failedToForward", "Failed to forward request to p2p node", 503},
{rpcHIGH_FEE, "highFee", "Current transaction fee exceeds your limit."},
{rpcINTERNAL, "internal", "Internal error.", 500},
{rpcINVALID_LGR_RANGE, "invalidLgrRange", "Ledger range is invalid."},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked. Currently this response is only caused by invalid user input. So 400 it is.

{rpcLGR_IDXS_INVALID, "lgrIdxsInvalid", "Ledger indexes invalid."},
{rpcLGR_IDX_MALFORMED, "lgrIdxMalformed", "Ledger index malformed."},
{rpcLGR_NOT_FOUND, "lgrNotFound", "Ledger not found."},
{rpcLGR_NOT_VALIDATED, "lgrNotValidated", "Ledger not validated.", 202},
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to Google...

202 Accepted response status code indicates that the request has been accepted for processing, but the processing has not been completed; in fact, processing may not have started yet.

Yeah, I'm not completely sold on 202 myself. If you get back lgrNotValidated, then the request has been rejected. You could try the same request later, but it would be handled as a brand new request.

@scottschurr
Copy link
Collaborator Author

I've pushed a commit that adds the HTTP status codes suggested by @mDuo13. It also gates returning this status codes based on the "ripplerpc" field of the original request being "3.0" or greater.

Let me know what y'all think.

@intelliot
Copy link
Collaborator

I think it's fine to have this use "ripplerpc": "3.0". In fact, that might actually be preferable; I hadn't thought of it as an option before. ripplerpc affects HTTP, but not WebSockets. And that's good since HTTP status codes don't apply to WebSockets anyway.

@intelliot
Copy link
Collaborator

I would wait for confirmation from @WietseWind before merging, though :)

@intelliot
Copy link
Collaborator

Also, it's important that this gets documented on xrpl.org. We should also document the difference between ripplerpc 1.0 and 2.0, if we haven't already.

@scottschurr
Copy link
Collaborator Author

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

@WietseWind
Copy link
Member

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

Very happy with it :) Thank you!

@intelliot intelliot self-requested a review August 5, 2022 23:30
Copy link
Collaborator

@mDuo13 mDuo13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on board with applying these changes in the case that users supply "3.0", yes.

I also still need to research and fully understand the differences in "2.0" so I can document those.

@scottschurr
Copy link
Collaborator Author

Thanks @mDuo13. If you think these changes are ready to be committed, could you please mark the pull request as approved? But, since you want more time to research the "2.0" case, then you may want to withhold approval until that research is complete. That's fine also.

@cjcobb23, would you like to weigh in on these changes as well? They may have some consequences in Clio.

@scottschurr
Copy link
Collaborator Author

Thanks @mDuo13! @cjcobb23, it's now your call.

Copy link
Contributor

@cjcobb23 cjcobb23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Fixes XRPLF#4005

Makes it possible for internal RPC Error Codes to associate
themselves with a non-OK (200) HTTP status code.  There are
quite a number of RPC responses in addition to tooBusy that
now have non-OK HTTP status codes.

The new return HTTP return codes are only enabled by including
"ripplerpc": "3.0" or higher in the original request.
Otherwise the historical value, 200, continues to be returned.
@scottschurr scottschurr added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Nov 29, 2022
@scottschurr
Copy link
Collaborator Author

Thanks for the review @cjcobb23! I've squashed and rebased the commit and marked it as passed.

@intelliot intelliot merged commit 6f87503 into XRPLF:develop Jan 3, 2023
@scottschurr scottschurr deleted the too-busy-503 branch January 13, 2023 01:39
dangell7 pushed a commit to Transia-RnD/rippled that referenced this pull request Jan 24, 2023
…F#4143)

Fixes XRPLF#4005

Makes it possible for internal RPC Error Codes to associate
themselves with a non-OK (200) HTTP status code.  There are
quite a number of RPC responses in addition to tooBusy that
now have non-OK HTTP status codes.

The new return HTTP return codes are only enabled by including
"ripplerpc": "3.0" or higher in the original request.
Otherwise the historical value, 200, continues to be returned.
This ensures that this is not a breaking change.
dangell7 pushed a commit to Transia-RnD/rippled that referenced this pull request Mar 5, 2023
…F#4143)

Fixes XRPLF#4005

Makes it possible for internal RPC Error Codes to associate
themselves with a non-OK (200) HTTP status code.  There are
quite a number of RPC responses in addition to tooBusy that
now have non-OK HTTP status codes.

The new return HTTP return codes are only enabled by including
"ripplerpc": "3.0" or higher in the original request.
Otherwise the historical value, 200, continues to be returned.
This ensures that this is not a breaking change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Change Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. Testable Will Need Documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

RPC 'tooBusy' response should have 50x (503?) status code
6 participants