Be more graceful when terminating keep-alive connections #277

Merged: 9 commits merged on Jul 13, 2020

Conversation

@the-allanc (Contributor) commented Apr 9, 2020

❓ What kind of change does this PR introduce?

  • 🐞 bug fix
  • 🐣 feature
  • 📋 docs update
  • 📋 tests/coverage improvement
  • 📋 refactoring
  • 💥 other

📋 What is the related issue number (starting with #)

Fixes #263

❓ What is the current behavior? (You can also link to an open issue here)

The current mechanism for managing keep-alive connections forcibly closes the least recently used connection to ensure that we stay within the maximum number of allowed keep-alive connections.

❓ What is the new behavior (if this is a feature change)?

Examine whether we have reached the threshold of keep-alive connections before writing response headers; if we have, write the appropriate response headers to close the connection rather than keeping it alive (even if the client has requested it).
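
For illustration, a minimal sketch of that decision (hypothetical names such as wants_keep_alive, open_keepalive_count, and keep_alive_limit; not the actual cheroot code):

# Sketch only: illustrates the header decision described above.
# All names here are hypothetical, not cheroot's real API.
def connection_header(request, open_keepalive_count, keep_alive_limit):
    """Return the Connection header value to send, or None to omit it."""
    at_limit = open_keepalive_count >= keep_alive_limit
    if not request.wants_keep_alive or at_limit:
        # HTTP/1.1 defaults to keep-alive, so an explicit "close" is needed;
        # HTTP/1.0 defaults to close, so omitting the header is enough.
        return 'close' if request.http_version == (1, 1) else None
    # Below the threshold and the client asked for keep-alive: honour it.
    # HTTP/1.0 needs an explicit "keep-alive"; HTTP/1.1 keeps alive by default.
    return 'keep-alive' if request.http_version == (1, 0) else None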

📋 Other information:

📋 Checklist:

  • I think the code is well written
  • I wrote good commit messages
  • I have squashed related commits together after the changes have been approved
  • Unit tests for the changes exist
  • [-] Integration tests for the changes exist (if applicable)
  • I used the same coding conventions as the rest of the project
  • The new code doesn't generate linter offenses
  • [-] Documentation reflects the changes
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences


@lgtm-com: This comment has been minimized.

@jaraco (Member) commented Apr 10, 2020

@webknjaz Can you work to clean up the linter error and get the unrelated failures to pass?

cheroot/connections.py: outdated review comment (resolved)
@webknjaz (Member)

@jaraco sure, on it.

@webknjaz (Member)

/me also asked @tobiashenkel to check if this fixes his issue

@the-allanc (Contributor, Author)

@jaraco that test you wrote seems to fail: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576115030#step:14:352

I wonder whether the test fails under certain circumstances, because some of the connections end up being expired by the server timeout setting. If you increased that (or maybe even disabled it altogether), then you could probably see if that's the case.


host = '::'
addr = host, port
server = wsgi.Server(addr, app)
(Member)

Suggested change
server = wsgi.Server(addr, app)
server = wsgi.Server(addr, app, timeout=0, accepted_queue_timeout=0)

@the-allanc like this?

@the-allanc (Contributor, Author)

ConnectionManager.expire looks at server.timeout to work out whether the socket should be closed - so I think that's the only setting that we should try changing (and limit the effect to just this one test).

I'd probably be more inclined to increase the default timeout to a high number, rather than zero (just because I'm not sure what that will do).
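
Roughly, the expiry behaviour being described (an illustrative sketch, not ConnectionManager.expire itself; it assumes each connection tracks a last_used timestamp):

import time

def expire_idle(connections, server_timeout):
    """Forcibly close keep-alive connections idle longer than the server timeout."""
    now = time.time()
    for conn in list(connections):
        if now - conn.last_used > server_timeout:
            conn.close()            # drop the idle socket
            connections.remove(conn)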

(Member)

Interesting, changing this revealed some bug: https://github.com/cherrypy/cheroot/pull/277/checks?check_run_id=576581037#step:14:413. It now returns HTTP 500.

(Member)

And it's now 465 failures, not 10 like before: https://travis-ci.com/github/cherrypy/cheroot/jobs/318105925#L441

(Member)

Interesting, changing this revealed some bug: #277 (checks). It now returns HTTP 500.

Oh, so the docs for io.BufferedReader.read() don't mention the possibility of returning None, but https://bugs.python.org/issue35869 suggests that this may happen in non-blocking mode (hence timeout=0).
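
A small illustration of that edge case: with a socket in non-blocking mode, the buffered reader can hand back None instead of bytes, so callers have to treat it as "no data yet" rather than EOF (sketch only; the endpoint is hypothetical and exact behaviour depends on Python version and platform):

import socket

sock = socket.create_connection(('localhost', 8080))  # hypothetical endpoint
reader = sock.makefile('rb')  # an io.BufferedReader over the socket
sock.settimeout(0)            # non-blocking mode, as with timeout=0 above

chunk = reader.read(4096)
if chunk is None:
    # No data available right now; not EOF (which would be b'').
    pass
elif chunk == b'':
    # Peer closed the connection.
    pass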

@the-allanc (Contributor, Author)

me waves at Jason

Ideally, it shouldn't be necessary to update this test as it passes on cheroot < 8.1

To clarify: the test passes for me locally, but under some setups it seems to fail (mostly on macOS machines, according to that test run). Are we sure that the test you added will work on the same machines using cheroot 8.1?

My expectation would be that the tests would be likely to fail in the same way - my theory is that the connections / sockets are being closed anyway, as it then exceeds the socket timeout.

(Member)

I'm not sure the test works reliably. I only tested it a few times, probably Python 3.8 and 2.7 on macOS.

(Member)

I'll try cherry-picking that test onto 8.0.x so we can establish a baseline for the test and validate it against the suite of environments in CI.

(Member)

@jaraco I'm not sure you ran exactly the same test: this one uses concurrent.futures which is not a part of Python 2. Did you have some other version of it?

(Member)

I did not, and I suspect I did not run the test against Python 2. But now we have evidence that the tests pass on Python 2.7-3.6.

cheroot/test/test_wsgi.py: outdated review comments (resolved)
setup.cfg: outdated review comment (resolved)
@webknjaz (Member)

@the-allanc tests seem to fail in "slow" envs and the timeout of 20 seconds doesn't seem to be helpful.

@ssbarnea

Any updates on this? Addressing it would be really awesome.

@webknjaz (Member)

@ssbarnea I think the main problem here is figuring out how to test the change. You may want to contribute to that by verifying if Zuul's case is solved.

@ssbarnea

@webknjaz Sure, I can run Zuul tests with code from this change once it passes its tests, mainly because this would be a manual testing process and I doubt it makes sense to test with broken tests. For the moment I made a small change in Zuul to also limit both: https://review.opendev.org/#/c/723855/

@webknjaz (Member)

@ssbarnea it may be that the fix is OK but the tests are not. And I have no idea how much time it'll take to figure out the tests here. But at least Zuul-side confirmation would be some indication of whether the fix is OK or not.

@morucci commented May 29, 2020

Hi, I'd like to help land this change. I'm packaging Zuul for Fedora, but the pinning of cheroot in Zuul's requirements.txt due to #263 prevents moving forward.

I've run the reproducer proposed by Tobias on the latest Fedora Rawhide (cherrypy-18.4.0 and cheroot-8.2.1), and I start to get some ConnectionResetError exceptions at around 11/12 concurrent connections (ThreadPoolExecutor(max_workers=12)); 10 or fewer pass without errors.

I've tested the reproducer successfully with cheroot 8.0.0. With 8.1.0 and the latest cheroot master the errors appear.
With this patch on top of master the reproducer does not raise any error.

Let me know how I could help; I'll have a look at the tests.

@webknjaz (Member)

@morucci thanks for the feedback. To merge this we have to at least make the CIs pass (mostly GHA + Travis; beware that Circle CI often fails for unrelated reasons, and there are some other flaky jobs). To me, it looks like additional instability is introduced when the workers are slow (which is often the case for macOS).

@morucci commented Jun 4, 2020

Hi @webknjaz, I did some more testing with this PR and tried to play with the timeout and the way the socket is closed, but without success on macOS. Maybe someone with better knowledge of the code base could help (I was able to deploy a macOS guest and reproduce the issue quite easily using this howto: how-to-run-macos-on-kvm-qemu).

PR #287, which I used for experimenting (a copy of this one + some other commits), mostly passes the CI: https://github.com/cherrypy/cheroot/pull/287/checks?check_run_id=738251920

  • I added an xfail for test_connection_keepalive only on darwin
  • I added a commit to close the connection as soon as possible when we reach the limit: 2103191. I sometimes saw the test_keepalive_conn_management test fail, and it looks like this commit fixed it.

I've not included it in the PR, but it seems that adding a max_retries to the HTTPAdapter in test_connection_keepalive makes the test pass on macOS; I'm not sure that's an acceptable solution?
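
For reference, the retry workaround mentioned here would look roughly like this with requests (a sketch with a hypothetical test URL, not the actual test change):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry a few times if the server resets the connection mid-test.
adapter = HTTPAdapter(max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)

resp = session.get('http://127.0.0.1:8080/')  # hypothetical test endpoint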

Something I noticed in the code is that we set the socket timeout to 1 regardless of the value configured with HTTPServer.timeout: https://github.com/cherrypy/cheroot/blob/master/cheroot/server.py#L1765.

I'm not going to continue this investigation; I'm out of clues and ideas, and I'm not familiar enough with the cheroot codebase.

But it is clear that this PR greatly improves the situation: on master, cheroot fails to pass test_connection_keepalive on all platforms. PR #287 showed it passing the GHA CI, with occasionally (I re-triggered it multiple times) 1 connection error for test_connection_keepalive on a Windows node.

I'd like to know your point of view on moving forward and merging this PR even if it does not completely fix the issue on all platforms, since it still improves the current master.

@webknjaz (Member) commented Jun 6, 2020

@morucci thanks for the investigation! I'm not opposed to merging things that improve master stability. The only concern is that I want to be careful about it and watch out for any possible regressions.
Increasing retries on macOS is something that I'd like to consider since in my experience those workers are slow by default and often when we spawn one thing and wait for it to be up in some other component, it times out simply because we don't wait long enough.

the-allanc and others added 5 commits July 13, 2020 16:12
The previous behaviour was that when we exceed our threshold of permitted
keep-alive connections, we would evict the least recently used connection
by forcibly shutting down the socket. This would cause problems with clients
which wouldn't be expecting socket errors.

Now, if we have a connection that we would usually keep alive, but we have
already reached our limit of allowed keep-alive connections, then we close
the connection gracefully by sending a "Connection: Close" header for
HTTP/1.1 or omitting the "Connection" header for HTTP/1.0.

This has the downside of having cheroot hold on to connections which are
less recently used, rather than the most recent ones - but it does ensure that
we make the decision about whether to keep or drop a connection at the time we
are writing the headers, which allows the client to have a better expectation
about the state of the socket once the response has been read.

However, we still forcibly close sockets for idle keep-alive connections once
the server timeout has been exceeded.

Successfully merging this pull request may close these issues.

[cheroot==8.1.0 regression] Occasional connection resets with concurrent requests
5 participants