-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RPC tooBusy response now has 503 HTTP status code: #4143
Conversation
Thank you @scottschurr! Great solution, being able to configure alternative error codes per error. I see that you already added error codes for the most important errors/codes. Going through the list I feel there are many more eligible for (mostly) 400 / 401 / 403 / 500 HTTP status codes (e.g. Question is where to draw the line between something the client asked for (response still OK, 200, to be expected) and where it's a response that should explicitly inform the client that something is off. What originally caused me to post #4005 is the load balancer auto-failover use case. I feel that's where we should draw the line: if a query fails in the 4xx/5xx range and the exact same query is likely to have a positive (200) outcome on another node, it should return a non-200 status code. If you agree I'll go through the list and add some more 4xx/5xx status codes. But other than adding some status codes: perfect. Thanks again! |
@WietseWind, thanks for your comment. I'm not really a load balancer kind of guy. I mostly keep my nose in the C++. So I'm happy to use your suggestions for which status codes should be returned. Then I'll rely on @cjcobb23 and @intelliot to verify that they look right from their perspectives. So, yes, please go through the list and add some more 4xx/5xx status codes. I'll also add a comment regarding how the line was drawn between 200 and other. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this change should be accompanied by an addition to the release notes and docs
@WietseWind, I'm hoping you'll have time to indicate where you'd like additional HTTP status codes added. Is there anything you need from me to push that forward? Thanks. |
Hey @WietseWind, on April 11th you noted:
That's great and would be appreciated. But you've had five weeks now to do that. I know you're really busy. If you can get back to me within the next week to at least let me know when you can get around to identifying those additional status codes I'd really appreciate it. If I don't hear back from you within the week I'll assume you're too busy to make the assessment right now and we can commit the HTTP status code changes that are currently in the pull request. If that should happen, then you can make a later pull request to put in the additional status codes. Thanks for the help. |
Hi @scottschurr, sorry for that. Will do right now, lost track of this thread. Thanks for the reminder. I propose the codes below, where a non-standard HTTP code is only returned if one can expect (e.g. when submitting against a pool) another node to respond differently (first call: error, second call, other machine: OK) (except for calls needing auth). https://gist.github.com/WietseWind/661776d0b4c849bfa5f237101104e02a |
Thanks for the help @WietseWind. I've added a commit with the extra status codes and a bit more code that was required to make the new status codes work. I think the status codes are in their final state for this pull request. It would be great if the reviewers would look it over and let me know if there are other things that need fixing. Thanks. |
@scottschurr Thanks, appreciate it :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm supportive of the general idea, but I think this change is fairly likely to break clients, so we should not do it in a patch release and ideally it would be on a version switch or something.
There are languages & libraries that raise an error (potentially crashing) if they get a non-200 error code. This change means that some clients that previously correctly handled error messages may crash instead.
Also, there are a decent number of error codes that are more or less specific cases of invalidParams
but that code uses 400 Bad Request and several of the other codes are still on 200 OK. That's…less than ideal from an API design standpoint. It wouldn't be as bad if it was limited to just 503 Service Unavailable when the server isn't fully operational, but returning 400 Bad Request based on user input is definitely a big risk in terms of breaking people's existing code.
#3418 is an example of a PR where we made a smaller change but put it on an API v2 switch. We should consider doing that with at least some of these changes.
Makes sense @mDuo13, thanks for chiming in. Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested appropriate HTTP status codes for all the other error codes, but I still think these breaking changes should not be implemented in the v1 API and we probably want to consolidate a lot of these errors into fewer for API v2 anyway
{rpcACT_MALFORMED, "actMalformed", "Account malformed."}, | ||
{rpcACT_NOT_FOUND, "actNotFound", "Account not found."}, | ||
{rpcALREADY_MULTISIG, "alreadyMultisig", "Already multisigned."}, | ||
{rpcALREADY_SINGLE_SIG, "alreadySingleSig", "Already single-signed."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all four of these sound like cases that should be 400 Bad Request
{rpcBAD_KEY_TYPE, "badKeyType", "Bad key type."}, | ||
{rpcBAD_FEATURE, "badFeature", "Feature unknown or invalid.", 500}, | ||
{rpcBAD_ISSUER, "badIssuer", "Issuer account malformed."}, | ||
{rpcBAD_MARKET, "badMarket", "No such market."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either 400 or 404. Might need to dig more into when this comes up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with 404, Not Found. But I'm happy to change that.
I recommend having HTTP status code changes be the the only change in Using a special HTTP request header isn't worth the hassle in code, documentation, or increased number of special cases. Version numbers are free* and we will not run out. Let's use them. I have updated XLS-22d API Versioning to note that changing status codes is a breaking change. *(Thanks to @thejohnfreeman for this link.) |
Note, I'm not saying that this is a perfect solution or avoids all potential problems. We are just trying to make the best trade-off given our goals and constraints. Using #3770 - this is also an example of how a change is gated behind I think clients will have to adapt anyway, because clients may have existing code that expects certain status codes. When they are ready to gain the benefits of automatic failover at the reverse proxy layer, then they should audit / test their code to ensure that they comply with the latest API, including both the new status codes and the signer_list location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mDuo13 's request except I think it should be "api_version": 3
or later
I'm not too sure if HTTP return codes should reflect the underlying JSON-RPC calls, but the spec there is surprisingly sparse... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into using the "api_version"
to gate this feature, but I ran into some awkwardness. The "api_version"
is not readily available where I need it.
How would y'all feel about using the "ripplerpc"
version to gate the feature? I have easy access to that field where I need it. "ripplerpc
" currently supports versions "1.0" and "2.0". I could bump that to "2.1" or "3.0". I wrote a quick test case using "ripplerpc"
as the gate and it worked.
Thoughts?
{rpcBAD_KEY_TYPE, "badKeyType", "Bad key type."}, | ||
{rpcBAD_FEATURE, "badFeature", "Feature unknown or invalid.", 500}, | ||
{rpcBAD_ISSUER, "badIssuer", "Issuer account malformed."}, | ||
{rpcBAD_MARKET, "badMarket", "No such market."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with 404, Not Found. But I'm happy to change that.
{rpcCHANNEL_MALFORMED, "channelMalformed", "Payment channel is malformed."}, | ||
{rpcCHANNEL_AMT_MALFORMED, "channelAmtMalformed", "Payment channel amount is malformed."}, | ||
{rpcCOMMAND_MISSING, "commandMissing", "Missing command entry."}, | ||
{rpcDB_DESERIALIZATION, "dbDeserialization", "Database deserialization error."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with 502, since a 503 is usually a temporary problem. A problem deserializing from raw SQL is unlikely to be temporary.
{rpcFAILED_TO_FORWARD, "failedToForward", "Failed to forward request to p2p node", 503}, | ||
{rpcHIGH_FEE, "highFee", "Current transaction fee exceeds your limit."}, | ||
{rpcINTERNAL, "internal", "Internal error.", 500}, | ||
{rpcINVALID_LGR_RANGE, "invalidLgrRange", "Ledger range is invalid."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked. Currently this response is only caused by invalid user input. So 400 it is.
{rpcLGR_IDXS_INVALID, "lgrIdxsInvalid", "Ledger indexes invalid."}, | ||
{rpcLGR_IDX_MALFORMED, "lgrIdxMalformed", "Ledger index malformed."}, | ||
{rpcLGR_NOT_FOUND, "lgrNotFound", "Ledger not found."}, | ||
{rpcLGR_NOT_VALIDATED, "lgrNotValidated", "Ledger not validated.", 202}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to Google...
202 Accepted response status code indicates that the request has been accepted for processing, but the processing has not been completed; in fact, processing may not have started yet.
Yeah, I'm not completely sold on 202 myself. If you get back lgrNotValidated
, then the request has been rejected. You could try the same request later, but it would be handled as a brand new request.
I've pushed a commit that adds the HTTP status codes suggested by @mDuo13. It also gates returning this status codes based on the Let me know what y'all think. |
I think it's fine to have this use |
I would wait for confirmation from @WietseWind before merging, though :) |
Also, it's important that this gets documented on xrpl.org. We should also document the difference between ripplerpc 1.0 and 2.0, if we haven't already. |
cec4656
to
98a0ad3
Compare
Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks. |
Very happy with it :) Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am on board with applying these changes in the case that users supply "3.0", yes.
I also still need to research and fully understand the differences in "2.0" so I can document those.
Thanks @mDuo13. If you think these changes are ready to be committed, could you please mark the pull request as approved? But, since you want more time to research the "2.0" case, then you may want to withhold approval until that research is complete. That's fine also. @cjcobb23, would you like to weigh in on these changes as well? They may have some consequences in Clio. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes XRPLF#4005 Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. There are quite a number of RPC responses in addition to tooBusy that now have non-OK HTTP status codes. The new return HTTP return codes are only enabled by including "ripplerpc": "3.0" or higher in the original request. Otherwise the historical value, 200, continues to be returned.
98a0ad3
to
f08c1c1
Compare
Thanks for the review @cjcobb23! I've squashed and rebased the commit and marked it as passed. |
…F#4143) Fixes XRPLF#4005 Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. There are quite a number of RPC responses in addition to tooBusy that now have non-OK HTTP status codes. The new return HTTP return codes are only enabled by including "ripplerpc": "3.0" or higher in the original request. Otherwise the historical value, 200, continues to be returned. This ensures that this is not a breaking change.
…F#4143) Fixes XRPLF#4005 Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. There are quite a number of RPC responses in addition to tooBusy that now have non-OK HTTP status codes. The new return HTTP return codes are only enabled by including "ripplerpc": "3.0" or higher in the original request. Otherwise the historical value, 200, continues to be returned. This ensures that this is not a breaking change.
High Level Overview of Change
Fixes #4005
Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. This commit also adds non-OK HTTP responses to a few additional RPC error responses.
Context of Change
Issue #4005 notes that returning an HTTP 200 (OK) status code when the server is overloaded is misleading for load balancers. This proposed change allows each RPC error response to be associated with its own HTTP status code.
If an explicit HTTP status code is not specified, then 200 (OK) is assumed. This preserves much of the status quo since prior to this change a 200 (OK) status code was returned in all cases.
The pull request includes a best guess at other cases where a non-200 status code should be returned. Reviewers are requested to look those over carefully and recommend changes. Thanks.
Type of Change
The change is potentially breaking in cases where servers are expecting to get a 200 (OK) status code back since they will no longer get that status code in certain error cases. It's unknown whether that problem would occur in practice.
There is no amendment associated with the change, since it does not change transaction results. However the change does affect the API. So care is justified.
Before / After
I hacked the
server_info
command to return anrpc_TOO_BUSY
error, then I used the followingcurl
command:Without the change I get the following response:
However with the change I get the following response (note the changed status code):
Test Plan
The rippled testing infrastructure currently ignores the returned HTTP status code. I could figure out how to fix that. But I'd like to suggest that we content ourselves with integration tests. Potentially the integration tests could use
curl
commands similar to what I used for manual testing. However we'd need to actually overload the server, rather than hacking a command to return an inaccurate result.More thoughts are appreciated regarding testing.
Reviewers
I'd appreciate review and comments from @WietseWind who wrote the initial issue.