This repository has been archived by the owner on Dec 18, 2018. It is now read-only.

Deadlocks (?) on Kestrel. #271

Closed
Bartmax opened this issue Oct 21, 2015 · 30 comments

Comments

@Bartmax

Bartmax commented Oct 21, 2015

I tried to build a repro but failed miserably; the behavior is too inconsistent.
When it happens, it is at the start of a request: in debug mode, no breakpoint is reached. If you pause and continue, the request does not proceed until you make another request, and then both requests fire together (but not always).

If you let it time out, the error is:

HTTP Error 502.3 - Bad Gateway
The specified CGI application encountered an error and the server terminated the process.

It happens very often, but I couldn't find a way to reproduce it consistently.
Projects that use EntityFramework and database access seem to fail more often.

This happens with Kestrel, with and without HttpPlatformHandler.
This happens with RC.
This does not happen with WebListener.

Sorry, maybe this is not much help, but I get constant stalls on the applications I'm working on, so if you need some specific details, just let me know, I'm more than happy to help.

@Flavien

Flavien commented Oct 23, 2015

Just a theory here, but libuv is single-threaded. If you're using blocking I/O in your code, you might end up with a deadlock somewhere.

@davidfowl
Member

@Flavien We don't run user code on the UV loop.

@Flavien

Flavien commented Oct 23, 2015

@davidfowl Ok, that makes sense.

@qin-nz

qin-nz commented Nov 25, 2015

I got a 502 Bad Gateway on Linux.
But it's fine on Windows (web & kestrel).

@introsuit

We are getting Bad Gateway on Linux as well. After the RC1 update, our ASP.NET app hangs on the first request until Bad Gateway is returned to the user. The same happens with a scaffolded basic web app from "yo aspnet". However, it is fine on OS X. So it seems this is solely an issue on the Linux platform?

@davidfowl
Member

To make any sort of progress on this bug we're going to need more specifics:

  • Repro steps
  • OS
  • A sample application

@qin-nz

qin-nz commented Nov 26, 2015

I noticed that when I run a container, the COMMAND shown by docker ps is "dnx -p project.json ".
But what I wrote in the Dockerfile is ENTRYPOINT ["dnx", "-p", "project.json", "kestrel"]

@introsuit

OS: Ubuntu 14.04 64bit
dnx: 1.0.0-rc1-final @ mono (our mono version is mono-4.2.1.102)

The bug can be reproduced (at least on our machine) using "yo aspnet" generator and choosing Web Application Basic (other templates like the one with Identity and MusicStore app from github seem to function ok)

Repro steps:
build the generated app
launch with dnx web
Then, when trying to load the page, the page stalls and eventually 502 Bad Gateway is returned. At this point Kestrel has hung completely: Ctrl+C gets no response, and the process has to be killed manually. The same thing happens for any page, Index.cshtml or any other.

Last thing that was printed from kestrel in the console was:
info: Microsoft.AspNet.Mvc.Controllers.ControllerActionInvoker[1] Executing action method webbasic.Controllers.HomeController.Index with arguments () - ModelState is Valid'

When using Verbose logging, the last print before "death" was:
verb: Microsoft.AspNet.Mvc.Controllers.ControllerActionInvoker[2] Executed action method webbasic.Controllers.HomeController.About, returned result Microsoft.AspNet.Mvc.ViewResult.'

Our application fails the same way.

@qin-nz

qin-nz commented Nov 28, 2015

Maybe I know the reason.
The default value of server.urls is "http://localhost:5000", which means the website can only be accessed locally.

To access it from elsewhere, set server.urls to the URLs you want to use to reach the website, separated by ;.
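
To illustrate the binding point in this comment, here is a minimal sketch. It assumes the ASP.NET Core 1.0+ hosting API (WebHostBuilder and UseUrls) rather than the RC1 DNX setup used in this thread, where the same setting is the server.urls configuration value. The port and the inline "ok" response are placeholders, and the sketch is not meant as a diagnosis of the hang.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public static class Program
{
    public static void Main(string[] args)
    {
        var host = new WebHostBuilder()
            .UseKestrel()
            // "localhost" would accept only local connections;
            // "*" asks Kestrel to listen on all network interfaces.
            // Several URLs can be passed as separate arguments.
            .UseUrls("http://*:5000")
            // Placeholder pipeline: every request gets a fixed response.
            .Configure(app => app.Run(context => context.Response.WriteAsync("ok")))
            .Build();

        host.Run();
    }
}
```

With a binding like this, a request from another machine to port 5000 should reach Kestrel directly, which helps separate "not reachable" from "reachable but hanging".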

@davidfowl
Member

Then, when trying to load the page, the page stalls and eventually 502 Bad Gateway is returned. At this point Kestrel has hung completely: Ctrl+C gets no response, and the process has to be killed manually. The same thing happens for any page, Index.cshtml or any other.

Can you be more specific? What are you using to generate load? Are you just hitting f5 in the browser?

When using Verbose logging, the last print before "death" was:
verb: Microsoft.AspNet.Mvc.Controllers.ControllerActionInvoker[2] Executed action method webbasic.Controllers.HomeController.About, returned result Microsoft.AspNet.Mvc.ViewResult.'

This is unrelated to the hang right?

@introsuit

Can you be more specific? What are you using to generate load? Are you just hitting f5 in the browser?

Yes, load in the browser.

This is unrelated to the hang right?

It just marks the point at which it hangs, based on the output.

@muratg
Contributor

muratg commented Dec 21, 2015

Is this Mono only or does it repro on CoreCLR as well?

@muratg
Contributor

muratg commented Dec 29, 2015

Pinging the issue. Did you guys see similar issues with CoreCLR?

@introsuit

On our setup, Kestrel does not hang with CoreCLR.
However, some resources fail to be delivered and the browser times out, even though the page appears to be displayed properly. This is probably an unrelated issue, though.

@muratg
Contributor

muratg commented Jan 4, 2016

@introsuit Yup, that sounds like a separate issue. If you can provide repro steps, please feel free to file another ticket for that issue.

@muratg
Contributor

muratg commented Jan 14, 2016

@introsuit Are you reproing this with current RC2 bits?

@muratg muratg added this to the Backlog milestone Jan 22, 2016
@muratg muratg closed this as completed Jul 25, 2016
@txchen

txchen commented Aug 5, 2016

@muratg I probably hit the same issue on RTM. It is rare; I have noticed a similar symptom twice. Here is some information about it:

We have a service on ubuntu 16.06 x64, running RTM aspnet core. It is a typical asp.net app, serving some pages via MVC.
We have a task inside the app that uses RecurrentTask.
When the issue happens, all HTTP requests time out; the server never returns anything, and nothing is logged for them. But from the log I can see that RecurrentTask is still running, which may mean the dotnet core process itself is still in a good state. All the DB operations inside the recurrent task fail as well.
We are not doing any blocking I/O such as writing files or accessing the network; we only use EF Core.

The EF error is something like: The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max
BTW, we close the dbContext right after using it; everywhere we use using (var db = new XXXContext()), so I don't think a connection leak should happen.

I'm not sure whether it is a Kestrel bug or a dotnet core bug. I didn't know how to dig deeper and, for the sake of our service, I had to restart it.

Any advice on what I should do the next time I see this happen? Thanks!

@cesarblum
Contributor

@txchen Do you have any middleware that's replacing streams? We've recently helped someone who had an app that was timing out some requests and they were hitting #940. Please check if that's the case in your app.

@txchen

txchen commented Aug 5, 2016

@CesarBS
I have some custom middlewares but none of them are replacing streams.

Other than the Microsoft middlewares, my own ones are doing something like:

try
{
    await next.Invoke();
}
catch (System.Exception ex)
{
    // handle the exception manually, don't let it go to error page
    context.Response.StatusCode = 500;
    await context.Response.WriteAsync(ex.ToString());
}

or

context.Response.StatusCode = 401;
await context.Response.WriteAsync("Unauthorized to call API");
return;

Basically they just write the response. Could async/await be related?

@txchen

txchen commented Aug 5, 2016

@CesarBS and btw, in my case it is not just some requests timing out. Once it happens, the entire app is effectively dead and every request times out.

@Tratcher
Member

Tratcher commented Aug 5, 2016

@txchen check context.Response.HasStarted before you attempt to modify or write to the Response after calling next.
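
A minimal sketch of what that check could look like in an exception-handling middleware, assuming the ASP.NET Core 1.0+ Microsoft.AspNetCore.Http package. The class name SafeErrorMiddleware is hypothetical and not from this thread, and the sketch only illustrates the HasStarted guard; it is not offered as a fix for the hang.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

// Hypothetical middleware name, used only for this illustration.
public class SafeErrorMiddleware
{
    private readonly RequestDelegate _next;

    public SafeErrorMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task Invoke(HttpContext context)
    {
        try
        {
            await _next(context);
        }
        catch (Exception ex)
        {
            // Once the response has started, the status line and headers
            // are already on their way to the client, so the status code
            // can no longer be changed. Rethrow and let the server deal
            // with the failed request.
            if (context.Response.HasStarted)
            {
                throw;
            }

            // Nothing has been sent yet: safe to write our own error.
            context.Response.StatusCode = 500;
            await context.Response.WriteAsync(ex.ToString());
        }
    }
}
```

It would be registered with app.UseMiddleware<SafeErrorMiddleware>() ahead of the components whose exceptions it should catch. Returning ex.ToString() to the client mirrors the snippet earlier in the thread and is probably not something to keep in production.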

@txchen

txchen commented Aug 5, 2016

@Tratcher thanks a lot for the tip. I will check that.

But would that cause the entire app to die, or would it just fail the current request? I just want to raise the concern that there may still be an unknown deadlock bug.

@cesarblum
Contributor

@txchen How large is your app? Are you able to provide us a minimal repro of what you're seeing?

@txchen

txchen commented Aug 5, 2016

@CesarBS not a very large one; the MVC part has about 10 pages. Inside the app we also have some APIs, and RPS is about 50 - 100. Usually it is quite stable, but I have seen this dead state twice in these 3 weeks.

I really want to find the repro steps as well, but I still cannot.

The interesting part, as I said, is that the RecurrentTask is still working, but inside the task every EF operation throws an exception complaining that it cannot get a connection from the pool. I have double-checked my code; it should not leak connections, since I use using (context) everywhere.

Even if something is wrong in EF, it should not affect rendering my home page, which has nothing to do with the DB. So I assume the root cause is in either dotnet or Kestrel.

I don't know how to take a dump or similar on Linux for dotnet core. If you can shed some light, I can try to capture something the next time I see it.

@Tratcher
Member

Tratcher commented Aug 5, 2016

@txchen It's probably not related to your current issue.

@muratg muratg reopened this Aug 8, 2016
@muratg muratg removed this from the Backlog milestone Aug 8, 2016
@muratg
Contributor

muratg commented Aug 17, 2016

Looks like no repro. If you're still seeing issues, please file another bug.

Re: taking a dump on Linux, I think you can use gdb. Not sure how easy that would be. If you can run on Windows and hit the same issue, you can use procdump.

@muratg muratg closed this as completed Aug 17, 2016
@hheexx

hheexx commented Apr 8, 2017

@Bartmax Hi Bartmax, did you fix the issue?

I am having similar symptoms.

It looks like all outbound connections from the .NET process stop, including the connection to the DB using EF (same error as yours) and the connection to Solr.
Kestrel also stops serving.

Ubuntu 16.04, behind nginx.
Node services attached for Angular prerendering + web API.
I tried splitting the prerendering and API traffic across 2 instances, and from time to time they both hang.

@Bartmax
Author

Bartmax commented Apr 8, 2017

@hheexx this was opened in 2015. A lot has changed since then and no, I haven't had this problem in ages.

@hheexx

hheexx commented Apr 8, 2017

I saw the date but I have very similar problems and no idea what it could be.
It did not show up in testing, only in production under load.

Very strange.

Thanks!

@Bartmax
Author

Bartmax commented Apr 8, 2017

Are you sure the request isn't just timing out?
