[SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer #42908
Conversation
```diff
@@ -139,7 +139,7 @@ object Connect {
         "With any value greater than 0, the last sent response will always be buffered.")
       .version("3.5.0")
       .bytesConf(ByteUnit.BYTE)
-      .createWithDefaultString("1m")
+      .createWithDefaultString("10m")
```
Since the purpose is to deflake ReattachableExecuteSuite and increase the retry buffer, shall we increase this only in ReattachableExecuteSuite instead of touching the default value? A suite-local override could look roughly like the sketch below.
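For illustration only, a rough sketch of that suite-local alternative. It assumes SparkConnectServerTest inherits SharedSparkSession's overridable `sparkConf` hook (an assumption here, not something shown in this PR); the config constant name is taken from the PR description:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.connect.config.Connect

// Hypothetical sketch: raise the retry buffer only for this suite and
// leave the global default untouched. Assumes the SharedSparkSession
// sparkConf override hook is available in this test hierarchy.
class ReattachableExecuteSuite extends SparkConnectServerTest {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set(
      Connect.CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE.key,
      "10m")
  // ... tests exercising disconnect/reattach ...
}
```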
I explained in the PR description that I think increasing this default is a genuine improvement that will make reconnects more robust in case of actual network issues, while not increasing memory pressure in the normal scenario (where the client controls the flow with ReleaseExecute).
Since Spark 3.5 was released before that suite was added, making this change now is low risk, and it will have good baking-in time before the next release.
I would rather not change it only in the suite, because that suite is supposed to stress test how the actual client behaves when faced with disconnects. Changing it in the suite would sweep under the carpet what the experiments performed now suggest: that the old value was a bit too small for retry robustness.
This is not a major issue, and in practice it only applies when Connect faces real intermittent connectivity issues, where, before this mechanism was implemented, execution would simply fail.
```diff
-class SparkConnectServerTest extends SharedSparkSession {
+trait SparkConnectServerTest extends SharedSparkSession {
```
Having it as a class made it execute as a suite with no tests, while still running beforeAll / afterAll.
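For readers less familiar with ScalaTest discovery, a minimal standalone illustration of the difference (the names here are invented; this is not the Spark code):

```scala
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// A concrete base `class` like this would itself be discovered and run by
// ScalaTest as a suite with zero tests, yet its beforeAll/afterAll (e.g.
// server start/stop) would still execute, wasting time. As a `trait`, it
// only runs when mixed into a real suite.
trait ServerFixture extends AnyFunSuite with BeforeAndAfterAll {
  override def beforeAll(): Unit = {
    super.beforeAll()
    // start the shared test server here
  }
  override def afterAll(): Unit = {
    // stop the shared test server here
    super.afterAll()
  }
}

class ExampleSuite extends ServerFixture {
  test("uses the shared fixture") {
    assert(1 + 1 == 2)
  }
}
```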
Just to confirm, will the case mentioned by #42560 (comment) also be fixed in this PR?
@dongjoon-hyun I don't think the SparkConnectSessionHolderSuite failures are related, and I don't know what's going on there. It looks to me like some (intermittent?) environment issue.
@LuciferYang I tried looking at #42560 (comment) but have not reproduced it yet. If you have more instances of CI runs where it failed with that stack overflow, that would be useful.
+1, LGTM. Thank you, @juliuszsompolski and all. Given the current situation, I believe we can proceed further after merging this.
Merged to master.
It seems that this issue is relatively easy to reproduce on GitHub Actions. In the past three days, daily tests on Scala 2.13 have all experienced this failure.
@juliuszsompolski When I run the above command in a local test, it is easier to reproduce.
Although it seems quite magical, after #42981 eliminated an ambiguous reference in …
@dongjoon-hyun @juliuszsompolski I think branch-3.5 also needs this PR, since #42560 has also been merged into branch-3.5.
Let me backport it to 3.5.
[SPARK-44872][CONNECT][FOLLOWUP] Deflake ReattachableExecuteSuite and increase retry buffer

Closes #42908 from juliuszsompolski/SPARK-44872-followup.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
I have cherry-picked this to 3.5.
Thanks @hvanhovell
What changes were proposed in this pull request?
Deflake tests in ReattachableExecuteSuite and increase CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE.
Why are the changes needed?
Two tests could be flaky with INVALID_CURSOR.POSITION_NOT_AVAILABLE errors.

This happens when the server releases a response once it falls more than CONNECT_EXECUTE_REATTACHABLE_OBSERVER_RETRY_BUFFER_SIZE behind the latest response it sent. However, because of HTTP/2 flow control, the responses could still be in transit. In the test suite, we explicitly disconnected the iterators and later reconnected; in some cases they could not reconnect, because the response they had last seen had fallen too far behind.
This changes not only the suite, but also the default config, which potentially makes reconnecting more robust. In a normal situation, it should not lead to increased memory pressure, because clients release responses using ReleaseExecute as soon as they are received. Normally, buffered responses are freed by ReleaseExecute, and this retry buffer is only a fallback mechanism. Therefore, it is safe to increase the default.
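To make the two release paths concrete, here is a simplified sketch under the assumptions above; the class and method names are invented for illustration, and this is not the actual ExecuteResponseObserver implementation:

```scala
import scala.collection.mutable

// Invented illustration of the buffering behavior described above:
// responses normally leave the buffer via ReleaseExecute from the client;
// the size-bounded retry buffer is only a fallback trimming mechanism.
class ResponseRetryBuffer(retryBufferSize: Long) {
  private case class Sent(index: Long, sizeBytes: Long)
  private val buffered = mutable.Queue[Sent]()
  private var bufferedBytes = 0L

  // Called after each response is sent to the client.
  def onResponseSent(index: Long, sizeBytes: Long): Unit = {
    buffered.enqueue(Sent(index, sizeBytes))
    bufferedBytes += sizeBytes
    // Fallback path: trim responses that fell more than retryBufferSize
    // bytes behind the latest sent response (the latest is always kept).
    // With a small buffer, a response still in transit due to HTTP/2 flow
    // control can be trimmed before the client ever saw it, so a later
    // reattach fails with INVALID_CURSOR.POSITION_NOT_AVAILABLE.
    while (buffered.size > 1 &&
        bufferedBytes - buffered.head.sizeBytes > retryBufferSize) {
      bufferedBytes -= buffered.dequeue().sizeBytes
    }
  }

  // Normal path: the client acknowledges received responses with
  // ReleaseExecute, so they are freed promptly and the retry buffer size
  // rarely affects steady-state memory pressure.
  def onReleaseExecute(untilIndex: Long): Unit = {
    while (buffered.nonEmpty && buffered.head.index <= untilIndex) {
      bufferedBytes -= buffered.dequeue().sizeBytes
    }
  }
}
```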
In practice, this only has an effect when there are actual network errors, and the increased buffer size should make reconnects more robust in those cases.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
ReattachableExecuteSuite.
Ran more manual experiments to see how far the response last seen by the client can fall behind the latest response sent by the server (because of the HTTP/2 flow control window).
Was this patch authored or co-authored using generative AI tooling?
No.