Process never exits due to hang in shutdown #2284
This shouldn't happen and looks like a bug. I wasn't able to reproduce it with a simple example. Could you please share a reproducible example with all OTel dependencies listed?
Thanks, @owais. I was able to reproduce the issue with this stand-alone test:

```python
import time

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import DEPLOYMENT_ENVIRONMENT
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.resources import SERVICE_NAME
from opentelemetry.sdk.resources import SERVICE_NAMESPACE
from opentelemetry.sdk.resources import SERVICE_VERSION
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.trace import set_tracer_provider
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def main():
    print("Initializing OpenTelemetry...")
    provider = TracerProvider(
        sampler=TraceIdRatioBased(0.1),
        resource=Resource.create(
            {
                DEPLOYMENT_ENVIRONMENT: "production",
                "platform": "chorus",
                SERVICE_NAME: "mike",
                SERVICE_NAMESPACE: "chorus",
                SERVICE_VERSION: "1.0",
            }
        ),
    )
    set_tracer_provider(provider)
    span_processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel.chorus.internal:4317")
    )
    provider.add_span_processor(span_processor)

    print("Pretending to do work...")
    with tracer.start_as_current_span("foo"):
        time.sleep(0.5)
    print("Done! (Python should exit now.)")


if __name__ == "__main__":
    main()
```

It took 10-20 tries, but it eventually exhibited the "hung" behavior reported above. After it hung, I hit Ctrl+C to get the process to exit. Here's the console output of that run:
And here are all of the OT packages I have installed:
FYI - when I replace …
Restoring the original …
To help illuminate what's going on during …
I can provide my monkey-patching if needed but, hopefully, this is enough to figure out what's going on.
There is an exponential backoff and jitter retry mechanism (64 secs) for …
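For illustration only, here is a minimal sketch of the kind of exponential-backoff-with-jitter retry loop being described; the function name and the delay schedule (1, 2, 4, … seconds, roughly 63 s of nominal waiting in total) are assumptions for the example, not the exporter's actual internals:

```python
import random
import time


def export_with_backoff(send, max_total_seconds=64):
    """Retry `send` (a zero-argument callable returning True on success)
    with exponential backoff and full jitter, giving up once the nominal
    delays would exceed roughly a minute."""
    delay = 1
    waited = 0
    while True:
        if send():
            return True
        if waited + delay > max_total_seconds:
            return False  # give up after ~64 seconds of retrying
        # full jitter: sleep a random fraction of the current delay
        time.sleep(random.uniform(0, delay))
        waited += delay
        delay *= 2
```

With a schedule like this, an export that never succeeds keeps the worker thread busy for about a minute, which lines up with the 64-second wait reported above.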
Thanks for the detailed explanation @flashgorman-chorus. This is a great report! @lonewolf3739 Shouldn't Ctrl+C cancel all in-flight requests and exit immediately? That's what I would expect as an end-user.
The author was expecting it to exit without having to hit Ctrl+C, which makes sense. It does exit after a minute, but nobody waits that long, so it is perceived as a hung process.
Ah, that makes sense.
I think this is working as expected in that case and can probably be closed. It would be great if processors could watch for SIGKILL, log a helpful message, and exit instead of crashing, but it's not a big deal, especially if the implementation is not going to be straightforward.
I agree with @owais; what would really "fix" this issue is a log message that lets the user know the process is waiting.
Thanks, all. I would like clarification: are you saying there is nothing we can do to influence this "process shutdown" behavior? For example, is there a timeout configuration we can provide if we want the shutdown behavior to not wait? Ideally, we would like a property that says "how long should …
@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout, which we could make configurable, but I'm not sure if setting it to … I think what we should do is detect SIGKILL and make one last attempt to flush everything in memory without retrying.
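For what it's worth, a rough sketch of the kind of "last attempt to flush" a user could wire up today is shown below. Strictly speaking SIGKILL cannot be intercepted by a handler, so this hypothetical example listens for SIGTERM and SIGINT instead; it assumes the SDK TracerProvider has been installed (as in the repro script above), and the 5-second flush budget is an arbitrary choice:

```python
import signal
import sys

from opentelemetry import trace


def _flush_and_exit(signum, frame):
    # One final, time-bounded attempt to push whatever is still queued,
    # then shut down and exit instead of waiting out the retry cycle.
    provider = trace.get_tracer_provider()  # assumes the SDK provider was set
    provider.force_flush(timeout_millis=5000)  # 5 s budget is an assumption
    provider.shutdown()
    sys.exit(0)


signal.signal(signal.SIGTERM, _flush_and_exit)
signal.signal(signal.SIGINT, _flush_and_exit)
```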
I think the tracer provider should be responsible for detecting this and calling …
@ocelotl @lonewolf3739 thoughts about this? ^
Thanks. Will await your thoughtful decision. But, just to be clear, we are talking about a process exiting naturally, with no human or Ctrl+C involved. I don't know if SIGKILL is involved under "natural exit", but if that only happens under Ctrl+C or …
@flashgorman-chorus I'm not sure what we can do to "automate" that case. The tracer provider does have a shutdown method, and for "natural" exit cases you could call:

```python
from opentelemetry.trace import get_tracer_provider

if __name__ == "__main__":
    main()
    provider = get_tracer_provider()
    provider.force_flush()
    provider.shutdown()
```

See if this works for you ^ @flashgorman-chorus
That, unfortunately, will not work because we have at least 100 different scripts, so there's no single point of entry and, thus, no single point of exit where we could call this. Moreover, we see shutdown working more like this:

```python
def shutdown(self, force_flush: bool = True):
    if force_flush:
        self.force_flush()
    # proceed with shutdown...
```
I think shutdown should force flush implicitly if it doesn't already.
I tested this with an explicit call to … I next attempted to add a call to … I did verify that if I simply wait 64 seconds, the process will exit. What I don't understand is: what is it trying to do for 64 seconds? It can't be "that's the time it takes to 'flush' all of the queued-up spans," because I only create one span in this test.
It is mostly waiting. We have to do this; it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the exporter to wait 0 seconds, we would be making it possible for this to happen, and it would be against the spec. In my opinion we should not allow the user to introduce another delay algorithm, as the spec requires this one.

If you have specific purposes, testing or otherwise, I would recommend making a custom exporter; it should not be very hard: … Changing …

I'm sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree). Please let me know if you need help implementing this, @flashgorman-chorus; you can contact me directly via Slack at the CNCF workspace.
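As an illustration of the "custom exporter" suggestion, here is a minimal sketch that wraps the real OTLP exporter and simply bounds how long a single export() call may take. The class name, the 10-second budget, and the thread-pool approach are assumptions for the example, not part of the SDK:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Sequence

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class BoundedOTLPSpanExporter(SpanExporter):
    """Hypothetical wrapper: report failure instead of retrying for ~64 s."""

    def __init__(self, endpoint: str, timeout_seconds: float = 10.0):
        self._inner = OTLPSpanExporter(endpoint=endpoint)
        self._timeout = timeout_seconds
        self._pool = ThreadPoolExecutor(max_workers=1)

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        future = self._pool.submit(self._inner.export, spans)
        try:
            return future.result(timeout=self._timeout)
        except TimeoutError:
            # The inner exporter may still be retrying in the worker thread;
            # we report failure so the batch processor can move on.
            return SpanExportResult.FAILURE

    def shutdown(self) -> None:
        self._inner.shutdown()
        self._pool.shutdown(wait=False)
```

You would then hand BoundedOTLPSpanExporter(...) to BatchSpanProcessor in place of OTLPSpanExporter; whether dropping data this way is acceptable is exactly the spec trade-off discussed above.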
Alright! Thanks for the thoughtful response. We'll put our heads together here and decide how we want to proceed (or if we need any help). Everyone's desire to do the right thing has been very refreshing. Thank you.
@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during normal operation of a service. That should be enough to comply with the spec. Once an application starts to shut down, I think it might be more reasonable to make an immediate and final attempt (or maybe a few, without exponential back-off) at exporting everything in memory and quit after that.
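To make that proposal concrete, here is a purely illustrative sketch of a sender that retries with backoff during normal operation but makes a single, immediate final attempt once shutdown has begun; the names, retry counts, and structure are assumptions, not the SDK's internals:

```python
import threading
import time


class ShutdownAwareSender:
    """Illustrative only: back off while running, one last try on shutdown."""

    def __init__(self, send, max_retries=6):
        self._send = send            # zero-arg callable returning True on success
        self._max_retries = max_retries
        self._shutting_down = threading.Event()

    def export(self) -> bool:
        if self._shutting_down.is_set():
            # Shutdown already in progress: one final attempt, no backoff.
            return self._send()
        delay = 1
        for _ in range(self._max_retries):
            if self._send():
                return True
            if self._shutting_down.is_set():
                break                # stop backing off as soon as shutdown starts
            time.sleep(delay)
            delay *= 2
        if self._shutting_down.is_set():
            return self._send()      # final, immediate attempt during shutdown
        return False

    def shutdown(self) -> None:
        self._shutting_down.set()
```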
Hi guys, we are facing the same challenge with getting stuck in …
I'm occasionally seeing a hang similar to this. My app seems to hang indefinitely (> 1h), which doesn't match up with the 64s timeout described above. The app is using Python multiprocessing. I think I'm seeing the same stack trace when I eventually ^C the program:
I'm using:
I'll look out for what conditions trigger it / if I can come up with a repro case that hangs for many minutes, but here's at least the initial information.
Hello colleagues. My setup:
The problem: … The reason is exactly as discussed above: the thread that sends the data never exits, as it doesn't give up after a number of failed retries. The question is: are there any plans to handle this issue?
I am still seeing this problem. Is there a workaround? Error:
I wonder if this could be caused by this deadlock issue in …
We have a process that sometimes fails to exit until we hit Ctrl+C. The traceback at that time shows the process is hung, attempting to "join" a worker thread:

In this case, the process completed in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OT collector, which may or may not have been reachable at the time, but we expect the "reachability" of the collector not to interfere with process execution. We gave it about 30 seconds to complete before terminating the process.
This, of course, interferes with auto-scaling and our ability to manage processes.
Please advise.