Be more graceful when terminating keep-alive connections #277

Merged: 9 commits merged on Jul 13, 2020

Conversation

@the-allanc (Contributor) commented Apr 9, 2020

❓ What kind of change does this PR introduce?

  • 🐞 bug fix
  • 🐣 feature
  • 📋 docs update
  • 📋 tests/coverage improvement
  • 📋 refactoring
  • 💥 other

📋 What is the related issue number (starting with #)

Fixes #263

❓ What is the current behavior? (You can also link to an open issue here)

The current mechanism for managing keep-alive connections forcibly closes the least recently used connection to ensure that we stay within the maximum number of allowed keep-alive connections.

❓ What is the new behavior (if this is a feature change)?

Examine whether we have reached the threshold of keep-alive connections before writing response headers; if we have, write the appropriate response headers to close the connection rather than keeping it alive (even if the client has requested it).
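
For illustration, a minimal sketch of that decision (hypothetical names such as wants_keep_alive, open_keepalive_count, and keep_alive_limit; not the actual cheroot code):

# Sketch only: illustrates the header decision described above.
# All names here are hypothetical, not cheroot's real API.
def connection_header(request, open_keepalive_count, keep_alive_limit):
    """Return the Connection header value to send, or None to omit it."""
    at_limit = open_keepalive_count >= keep_alive_limit
    if not request.wants_keep_alive or at_limit:
        # HTTP/1.1 defaults to keep-alive, so an explicit "close" is needed;
        # HTTP/1.0 defaults to close, so omitting the header is enough.
        return 'close' if request.http_version == (1, 1) else None
    # Below the threshold and the client asked for keep-alive: honour it.
    # HTTP/1.0 needs an explicit "keep-alive"; HTTP/1.1 keeps alive by default.
    return 'keep-alive' if request.http_version == (1, 0) else None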

📋 Other information:

📋 Checklist:

  • I think the code is well written
  • I wrote good commit messages
  • I have squashed related commits together after the changes have been approved
  • Unit tests for the changes exist
  • [-] Integration tests for the changes exist (if applicable)
  • I used the same coding conventions as the rest of the project
  • The new code doesn't generate linter offenses
  • [-] Documentation reflects the changes
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences


@lgtm-com: This comment has been minimized.

@jaraco (Member) commented Apr 10, 2020

@webknjaz Can you work to clean up the linter error and get the unrelated failures to pass?

cheroot/connections.py: outdated review comment (resolved)
@webknjaz (Member)

@jaraco sure, on it.

@webknjaz (Member)

/me also asked @tobiashenkel to check if this fixes his issue

@the-allanc (Contributor, Author)

@jaraco that test you wrote seems to fail: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576115030#step:14:352

I wonder whether the test fails under certain circumstances, because some of the connections end up being expired by the server timeout setting. If you increased that (or maybe even disabled it altogether), then you could probably see if that's the case.


host = '::'
addr = host, port
server = wsgi.Server(addr, app)
(Member)

Suggested change
server = wsgi.Server(addr, app)
server = wsgi.Server(addr, app, timeout=0, accepted_queue_timeout=0)

@the-allanc like this?

@the-allanc (Contributor, Author)

ConnectionManager.expire looks at server.timeout to work out whether the socket should be closed - so I think that's the only setting that we should try changing (and limit the effect to just this one test).

I'd probably be more inclined to increase the default timeout to a high number, rather than zero (just because I'm not sure what that will do).
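
Roughly, the expiry behaviour being described (an illustrative sketch, not ConnectionManager.expire itself; it assumes each connection tracks a last_used timestamp):

import time

def expire_idle(connections, server_timeout):
    """Forcibly close keep-alive connections idle longer than the server timeout."""
    now = time.time()
    for conn in list(connections):
        if now - conn.last_used > server_timeout:
            conn.close()            # drop the idle socket
            connections.remove(conn)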

(Member)

Interesting, changing this revealed some bug: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576581037#step:14:413. It now returns HTTP 500.

(Member)

And it's now 465 failures, not 10 like before: https://travis-ci.com/github/cherrypy/cheroot/jobs/318105925#L441

(Member)

Interesting, changing this revealed some bug: #277 (checks). It now returns HTTP 500.

Oh, so the docs for io.BufferedReader.read() don't mention the possibility of returning None, but https://bugs.python.org/issue35869 suggests that this may happen in non-blocking mode (hence timeout=0).
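
A small illustration of that edge case: with a socket in non-blocking mode, the buffered reader can hand back None instead of bytes, so callers have to treat it as "no data yet" rather than EOF (sketch only; the endpoint is hypothetical and exact behaviour depends on Python version and platform):

import socket

sock = socket.create_connection(('localhost', 8080))  # hypothetical endpoint
reader = sock.makefile('rb')  # an io.BufferedReader over the socket
sock.settimeout(0)            # non-blocking mode, as with timeout=0 above

chunk = reader.read(4096)
if chunk is None:
    # No data available right now; not EOF (which would be b'').
    pass
elif chunk == b'':
    # Peer closed the connection.
    pass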

@the-allanc (Contributor, Author)

me waves at Jason

Ideally, it shouldn't be necessary to update this test as it passes on cheroot < 8.1

To clarify: the test passes for me locally, but under some setups it seems to fail (mostly on macOS machines, according to that test run). Are we sure that the test you added will work on the same machines using cheroot 8.1?

My expectation would be that the tests would be likely to fail in the same way - my theory is that the connections / sockets are being closed anyway, as it then exceeds the socket timeout.

(Member)

I'm not sure the test works reliably. I only tested it a few times, probably Python 3.8 and 2.7 on macOS.

(Member)

I'll try cherry-picking that test onto 8.0.x so we can establish a baseline for the test and validate it against the suite of environments in CI.

(Member)

@jaraco I'm not sure you ran exactly the same test: this one uses concurrent.futures which is not a part of Python 2. Did you have some other version of it?

(Member)

I did not, and I suspect I did not run the test against Python 2. But now we have evidence that the tests pass on Python 2.7-3.6.

cheroot/test/test_wsgi.py: outdated review comments (resolved)
setup.cfg: outdated review comment (resolved)
@webknjaz (Member)

@the-allanc tests seem to fail in "slow" envs and the timeout of 20 seconds doesn't seem to be helpful.

@ssbarnea

Any updates on this? Addressing it would be really awesome.

@webknjaz (Member)

@ssbarnea I think the main problem here is figuring out how to test the change. You may want to contribute to that by verifying if Zuul's case is solved.

@ssbarnea

@webknjaz Sure, I can run Zuul tests with code from this change once it passes its tests, mainly because this would be a manual testing process and I doubt it makes sense to test with broken tests. For the moment I made a small change in Zuul to also limit both: https://review.opendev.org/#/c/723855/

@webknjaz (Member)

@ssbarnea it may be that the fix is OK but the tests are not. And I have no idea how much time it'll take to figure out the tests here. But at least Zuul-side confirmation would be some indication of whether the fix is OK or not.

@morucci commented May 29, 2020

Hi, I'd like to help land this change. I'm packaging Zuul for Fedora, but the pinning of cheroot in Zuul's requirements.txt due to #263 prevents moving forward.

I've run the reproducer proposed by Tobias on the latest Fedora Rawhide (cherrypy-18.4.0 and cheroot-8.2.1), and I start to get some ConnectionResetError exceptions at around 11/12 concurrent connections (ThreadPoolExecutor(max_workers=12)); 10 or fewer pass without errors.

I've tested the reproducer successfully with cheroot 8.0.0. With 8.1.0 and the latest cheroot master the errors appear.
With this patch on top of master the reproducer does not raise any error.

Let me know how I could help; I'll have a look at the tests.

@webknjaz (Member)

@morucci thanks for the feedback. To merge this we have to at least make the CIs pass (mostly GHA + Travis; beware that Circle CI often fails for unrelated reasons, and there are some other flaky jobs). To me, it looks like additional instability is introduced when the workers are slow (which is often the case for macOS).

@morucci commented Jun 4, 2020

Hi @webknjaz, I did some more testing with this PR and tried to play with the timeout and the way the socket is closed, but without success on macOS. Maybe someone with better knowledge of the code base could help (I was able to deploy a macOS guest and reproduce the issue quite easily using this howto: how-to-run-macos-on-kvm-qemu).

PR #287, which I used for experimenting (a copy of this one + some other commits), mostly passes the CI: https://github.com/cherrypy/cheroot/pull/287/checks?check_run_id=738251920

  • I added an xfail for test_connection_keepalive only on darwin
  • I added a commit to close the connection as soon as possible when we reach the limit: 2103191. I sometimes saw the test_keepalive_conn_management test fail, and it looks like this commit fixed it.

I've not included it in the PR, but it seems that adding a max_retries to the HTTPAdapter in test_connection_keepalive makes the test pass on macOS; I'm not sure that's an acceptable solution?
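
For reference, the retry workaround mentioned here would look roughly like this with requests (a sketch with a hypothetical test URL, not the actual test change):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry a few times if the server resets the connection mid-test.
adapter = HTTPAdapter(max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)

resp = session.get('http://127.0.0.1:8080/')  # hypothetical test endpoint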

Something I noticed in the code is that we set the socket timeout to 1 regardless of the value configured with HTTPServer.timeout: https://github.com/cherrypy/cheroot/blob/master/cheroot/server.py#L1765.

I'm not going to continue this investigation; I'm out of clues and ideas, and I'm not familiar enough with the cheroot codebase.

But it is clear that this PR greatly improves the situation: on master, cheroot fails to pass test_connection_keepalive on all platforms. PR #287 showed it passing the GHA CI, with occasionally (I re-triggered it multiple times) 1 connection error for test_connection_keepalive on a Windows node.

I'd like to know your point of view on moving forward and merging this PR even if it does not completely fix the issue on all platforms, since it still improves the current master.

@webknjaz (Member) commented Jun 6, 2020

@morucci thanks for the investigation! I'm not opposed to merging things that improve master stability. The only concern is that I want to be careful about it and watch out for any possible regressions.
Increasing retries on macOS is something that I'd like to consider since in my experience those workers are slow by default and often when we spawn one thing and wait for it to be up in some other component, it times out simply because we don't wait long enough.

the-allanc and others added 5 commits July 13, 2020 16:12
The previous behaviour was that when we exceed our threshold of permitted
keep-alive connections, we would evict the least recently used connection
by forcibly shutting down the socket. This would cause problems with clients
which wouldn't be expecting socket errors.

Now, if we have a connection that we would usually keep alive, but we have
already reached our limit of allowed keep-alive connections, then we close
the connection gracefully by sending a "Connection: Close" header for
HTTP/1.1 or omitting the "Connection" header for HTTP/1.0.

This has the downside of having cheroot hold on to connections which are
less recently used, rather than the most recent ones - but it does ensure that
we make the decision about whether to keep or drop a connection at the time we
are writing the headers, which allows the client to have a better expectation
about the state of the socket once the response has been read.

However, we still forcibly close sockets for idle keep-alive connections once
the server timeout has been exceeded.

Successfully merging this pull request may close these issues.

[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests
5 participants