-
-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Be more graceful when terminating keep-alive connections #277
Be more graceful when terminating keep-alive connections #277
Conversation
This comment has been minimized.
This comment has been minimized.
@webknjaz Can you work to clean up the linter error and get the unrelated failures to pass? |
@jaraco sure, on it. |
@jaraco that test you wrote seems to fail: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576115030#step:14:352 |
/me also asked @tobiashenkel to check it this fixes his issue |
I wonder whether the test fails under certain circumstances, because some of the connections end up being expired by the server timeout setting. If you increased that (or maybe even disabled it altogether), then you could probably see if that's the case. |
cheroot/test/test_wsgi.py
Outdated
|
||
host = '::' | ||
addr = host, port | ||
server = wsgi.Server(addr, app) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
server = wsgi.Server(addr, app) | |
server = wsgi.Server(addr, app, timeout=0, accepted_queue_timeout=0) |
@the-allanc like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConnectionManager.expire
looks at server.timeout
to work out whether the socket should be closed - so I think that's the only setting that we should try changing (and limit the effect to just this one test).
I'd probably be more inclined to increase the default timeout to a high number, rather than zero (just because I'm not sure what that will do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, changing this revealed some bug: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576581037#step:14:413. It now returns HTTP 500.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And it's now 465 failures, not 10 like before: https://travis-ci.com/github/cherrypy/cheroot/jobs/318105925#L441
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, changing this revealed some bug: #277 (checks). It now returns HTTP 500.
Oh, so docs for io.BufferedReader.read()
don't mention the possibility of returning None
but https://bugs.python.org/issue35869 suggests that this may happen when in non-blocking mode (hence timeout=0
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me waves at Jason
Ideally, it shouldn't be necessary to update this test as it passes on cheroot < 8.1
To clarify - the test passes for me locally, but under some setups, it seems to fail (mostly on MacOS machines, according to that test run). Are we sure that the test you added will work on the same machines using cheroot 8.1?
My expectation would be that the tests would be likely to fail in the same way - my theory is that the connections / sockets are being closed anyway, as it then exceeds the socket timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the test works reliably. I only tested it a few times, probably Python 3.8 and 2.7 on macOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try cherry-picking that test onto 8.0.x so we can establish a baseline for the test and validate it against the suite of environments in CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaraco I'm not sure you ran exactly the same test: this one uses concurrent.futures
which is not a part of Python 2. Did you have some other version of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not, and I suspect I did not run the test against Python 2. But now we have evidence that the tests pass on Python 2.7-3.6.
@the-allanc tests seem to fail in "slow" envs and the timeout of 20 seconds doesn't seem to be helpful. |
Any updates on this? Addressing it would be really awesome. |
@ssbarnea I think the main problem here is figuring out how to test the change. You may want to contribute to that by verifying if Zuul's case is solved. |
@webknjaz Sure, I can run zuul tests with code from this change once it does pass its tests, mainly because this would be a manual testing process and I doubt it makes sense to test with broken tests. For the moment I made a small change in zuul to also limit both https://review.opendev.org/#/c/723855/ |
@ssbarnea it may be that the fix is ok but the tests are not. And I have no idea how much time it'll take to figure out the tests here. But at least Zuul-side confirmation would be some sort of indication whether the fix is ok or not. |
Hi I'd like to help to land that change. I'm packaging Zuul for Fedora but the pinning of cheroot in Zuul requirements.txt due to #263 prevent to move forward. I've run the reproducer proposed by Tobias on the last Fedora Rawhide (cherrypy-18.4.0 and cheroot-8.2.1) and I start to get some ConnectionResetError starting around 11/12 concurrent connexions. (ThreadPoolExecutor(max_workers=(12)). 10 or less pass w/o errors. I've tested the reproducer successfully with cheroot 8.0.0. With 8.1.0 and last cheroot master the errors appear. Let me know how could I help, I'll have a look to the tests. |
@morucci thanks for the feedback. To merge this we have to at least make the CIs pass (mostly GHA+Travis at least, beware that Circle CI is often for unrelated reasons and there's some other flaky jobs). To me, it looks like additional instability is introduced when the workers are slow (which is often the case for macOS). |
Hi @webknjaz, I did some more testing with this PR, and tried to play with the timeout and the way to close the socket but w/o success on MacOS. Maybe someone with a better knowledge of the code base could help, (I was able to deploy a macos and reproduce the issue quite easily using this howto how-to-run-macos-on-kvm-qemu). The PR #287 I used to experiment pass the CI (a copy of this one + some other commits) mostly pass the CI: https://github.com/cherrypy/cheroot/pull/287/checks?check_run_id=738251920
I've not included it in the PR but it seems adding a max_retries to the HTTPAdapter in test_connection_keepalive make the test pass on MacOS, but not sure that's an acceptable solution ? Something I noticed in the code is that we set the socket timeout to 1 whatever the value configured with HTTPServer.timeout https://github.com/cherrypy/cheroot/blob/master/cheroot/server.py#L1765. I'm not going to continue this investigation, I'm out of clues and ideas + I'm not familiar enough with the cheroot codebase. But It is clear that this PR improve a lot the situation where cheroot fail to pass the test_connection_keepalive on all platform on master. The PR #287 shown it pass the CI GHA, with sometime (I re-triggered it multiple times) 1 connection error for test_connection_keepalive on a windows node. I'll like to known what's your point of view to move forward and merge that PR even if it does not fix the issue completely and on all platform, but it improves the current master. |
@morucci thanks for the investigation! I'm not opposed to merging things that improve master stability. The only concern is that I want to be careful about it and watch out for any possible regressions. |
The previous behaviour was that when we exceed our threshold of permitted keep-alive connections, we would evict the least recently used connection by forcibly shutting down the socket. This would cause problems with clients which wouldn't be expecting socket errors. Now, if we have a connection that we would usually keep-alive, but we have already reached our limit of allowed keep-alive connections, then we close the connection gracefully by sending a "Connection: Close" header for HTTP/1.1 or omitting the "Connection" header for HTTP/1.0. This has the downside of having cheroot holding on to connections which are less recently used, rather than the most recent ones - but it does ensure that we make the decision about whether to keep or drop a connection at the time we are writing the headers, which allows the client to have a better expectation about the state of the socket once the response has been read. However, we still forcibly close sockets for idle keep-alive connections once the server timeout has been exceeded.
β What kind of change does this PR introduce?
π What is the related issue number (starting with
#
)Fixes #263
β What is the current behavior? (You can also link to an open issue here)
Current mechanism for managing keep-alive connections will forcibly close the least recently-used connection to ensure that we keep to the threshold of the maximum number of allowed keep-alive connections.
β What is the new behavior (if this is a feature change)?
Examine if we have reached the threshold of keep-alive connections before writing response headers - if we have, then write the appropriate response headers to close the connection, rather than keeping the connection alive (even if the client has requested it).
π Other information:
π Checklist:
and description in grammatically correct, complete sentences
This change isβ