Process never exits due to hang in shutdown #2284

Open
flashgorman-chorus opened this issue Nov 18, 2021 · 30 comments
Labels
bug Something isn't working

Comments

@flashgorman-chorus

We have a process that sometimes fails to exit until we hit Ctrl-C. The traceback at that time shows the process is hung, attempting to "join" a worker thread:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

In this case, the process completed its work in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OT collector, which may or may not have been reachable at the time - but we don't expect the reachability of the collector to interfere with process execution. We gave it about 30 seconds to exit before terminating the process.

This, of course, interferes with auto-scaling and our ability to manage processes.

Please advise.

@owais
Contributor

owais commented Nov 18, 2021

This shouldn't happen and looks like a bug. I wasn't able to reproduce it with a simple example. Could you please share a reproducible example with all otel dependencies listed?

@owais owais added the bug Something isn't working label Nov 18, 2021
@flashgorman-chorus
Author

Thanks, @owais. I was able to reproduce the issue with this stand-alone test:

import time

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import DEPLOYMENT_ENVIRONMENT
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.resources import SERVICE_NAME
from opentelemetry.sdk.resources import SERVICE_NAMESPACE
from opentelemetry.sdk.resources import SERVICE_VERSION
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.trace import set_tracer_provider
from opentelemetry import trace


tracer = trace.get_tracer(__name__)


def main():
    print("Initializing OpenTelemetry...")
    provider = TracerProvider(
        sampler=TraceIdRatioBased(0.1),
        resource=Resource.create(
            {
                DEPLOYMENT_ENVIRONMENT: "production",
                "platform": "chorus",
                SERVICE_NAME: "mike",
                SERVICE_NAMESPACE: "chorus",
                SERVICE_VERSION: "1.0",
            }
        ),
    )
    set_tracer_provider(provider)
    span_processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel.chorus.internal:4317")
    )
    provider.add_span_processor(span_processor)
    print("Pretending to do work...")

    with tracer.start_as_current_span("foo"):
        time.sleep(0.5)

    print("Done!  (Python should exit now.)")


if __name__ == "__main__":
    main()

It took 10-20 tries but eventually exhibited the "hung" behavior reported above. After it hung, I hit Ctrl+C to get the process to exit. Here's the console output of that run:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

And here are all of the OT packages I have installed:

opentelemetry-api                           1.6.2
opentelemetry-exporter-otlp                 1.6.2
opentelemetry-exporter-otlp-proto-grpc      1.6.2
opentelemetry-exporter-otlp-proto-http      1.6.2
opentelemetry-instrumentation               0.25b2
opentelemetry-instrumentation-boto          0.25b2
opentelemetry-instrumentation-botocore      0.25b2
opentelemetry-instrumentation-celery        0.25b2
opentelemetry-instrumentation-elasticsearch 0.25b2
opentelemetry-instrumentation-pymongo       0.25b2
opentelemetry-instrumentation-redis         0.25b2
opentelemetry-instrumentation-requests      0.25b2
opentelemetry-instrumentation-sqlalchemy    0.25b2
opentelemetry-instrumentation-tornado       0.25b2
opentelemetry-proto                         1.6.2
opentelemetry-sdk                           1.6.2
opentelemetry-semantic-conventions          0.25b2
opentelemetry-util-http                     0.25b2

@flashgorman-chorus
Author

flashgorman-chorus commented Nov 18, 2021

FYI - when I replace OTLPSpanExporter with ConsoleSpanExporter, I am not able to reproduce the issue (after 70 tests).

@flashgorman-chorus
Author

With the original OTLPSpanExporter restored but the sampler disabled, the issue now happens on every execution. So the reason it previously took "10 or 20 tries" is that runs where the sampler dropped the span were fine; when every span is processed (because the sampler is commented out), the Python process hangs every time.

@flashgorman-chorus
Author

To help illuminate what's going on during shutdown, I monkey-patched the BatchSpanProcessor.shutdown method to print a traceback for each thread - to see what each thread is doing - before dispatching to the "real" shutdown method. The results were, indeed, illuminating:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
in BatchSpanProcessor.shutdown!
These threads exist:
<_MainThread(MainThread, stopped 4441382400)>
<Thread(OtelBatchSpanProcessor, started daemon 123145478844416)>
BatchSpanProcessor's worker_thread is <Thread(OtelBatchSpanProcessor, started daemon 123145478844416)> (123145478844416)
Here's what all threads are doing as we speak:

# ThreadID: 123145478844416
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
  self.run()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 237, in worker
  self.condition.wait(timeout)
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 300, in wait
  gotit = waiter.acquire(True, timeout)

# ThreadID: 4441382400
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
  self._active_span_processor.shutdown()
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
  sp.shutdown()
File: "/Users/mgorman/projects/web-server/ot_test.py", line 39, in __my_shutdown
  for filename, lineno, name, line in traceback.extract_stack(stack):
Forwarding the call to shutdown...

I can provide my monkey-patching if needed but, hopefully, this is enough to figure out what's going on.

@flashgorman-chorus
Author

It appears the worker thread is hung here:
(screenshot attached in the original issue)

@srikanthccv
Member

srikanthccv commented Nov 19, 2021

There is an exponential backoff-and-jitter retry mechanism (up to 64 seconds) in OTLPSpanExporter. This same issue has come up a few times because people are not aware of it. The spec is vague and doesn't recommend a timeout. The last time we talked about it we agreed on this value (earlier it was 900 seconds, taken from the Go implementation) and didn't want to make it configurable (via constructor/env) so as not to introduce breaking changes. I think in this kind of scenario it should finish up as much as possible, cancel the ongoing work, and exit.
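
A rough sketch of the retry schedule being described (illustrative only, not the SDK's actual code); the sleeps alone add up to about a minute, which matches the ~64-second wait reported above:

# Illustrative only: exponential backoff that gives up once the next delay
# would reach the 64-second cap. Jitter is omitted for clarity.
def retry_delays(cap: float = 64.0):
    delay = 1.0
    while delay < cap:
        yield delay
        delay *= 2


if __name__ == "__main__":
    delays = list(retry_delays())
    print(delays)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
    print(sum(delays))  # 63.0 -> roughly a minute of waiting before giving up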

@owais
Contributor

owais commented Nov 19, 2021

Thanks for the detailed explanation @flashgorman-chorus. This is a great report!

@lonewolf3739 Shouldn't ctrl-c cancel all in-flight requests and exit immediately? That's what I would expect as an end-user.

@srikanthccv
Member

srikanthccv commented Nov 19, 2021

The author was expecting it to exit without having to hit Ctrl-C, which makes sense. It does exit after a minute, but nobody waits that long, so it is perceived as a hung process.

@owais
Contributor

owais commented Nov 19, 2021

Ah, that makes sense.

@owais
Contributor

owais commented Nov 19, 2021

I think this is working as expected in that case and can probably be closed. It would be great if processors could watch for SIGKILL, log a helpful message, and exit instead of crashing, but that's not a big deal, especially if the implementation is not going to be straightforward.

@ocelotl
Contributor

ocelotl commented Nov 19, 2021

I agree with @owais: what would really "fix" this issue is a log message that lets the user know the process is waiting.

@flashgorman-chorus
Author

Thanks, all. I would like clarification: are you saying there is nothing we can do to influence this "process shutdown" behavior? For example, is there a timeout configuration we can provide if we want shutdown not to wait?

Ideally, we would like a property that says "how long should OTLPSpanExporter wait to flush spans (which I assume is what it is waiting on) during shutdown?" The default can be 64 seconds, as it is today, but if your priority is process management and you don't care about lost spans, you might set it to 0 seconds, which says "never delay process shutdown for any reason."
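
A minimal sketch of the kind of knob being asked for (hypothetical - not an existing SDK option): a processor that runs the stock shutdown on a daemon thread and stops waiting after a bound, so a still-retrying export cannot delay process exit. Spans still in flight when the bound expires would be lost.

import threading

from opentelemetry.sdk.trace.export import BatchSpanProcessor


class BoundedShutdownBatchSpanProcessor(BatchSpanProcessor):
    # Hypothetical sketch only; the class name and shutdown_timeout_s
    # parameter are made up for illustration.
    def __init__(self, *args, shutdown_timeout_s: float = 5.0, **kwargs):
        super().__init__(*args, **kwargs)
        self._shutdown_timeout_s = shutdown_timeout_s

    def shutdown(self) -> None:
        # The normal shutdown flushes and joins the worker thread; run it
        # on a daemon thread so we can stop waiting if it exceeds the bound.
        worker = threading.Thread(target=super().shutdown, daemon=True)
        worker.start()
        worker.join(timeout=self._shutdown_timeout_s)

It would be used in place of BatchSpanProcessor in the example above.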

@owais
Contributor

owais commented Nov 22, 2021

@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout, which we could make configurable, but I'm not sure setting it to 0 would be a good solution to this problem.

I think what we should do is detect SIGKILL and make one last attempt to flush everything in memory without retry.

@owais
Contributor

owais commented Nov 22, 2021

I think the tracer provider should be responsible for detecting this and calling shutdown() on itself, resulting in all processors shutting down and flushing ASAP.
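
A rough sketch of that idea (hypothetical code, not SDK behaviour; the SignalAwareTracerProvider name is made up): a provider that registers termination handlers and shuts itself down when the process is asked to stop. Note that shutdown() can still block while the exporter retries, so this alone does not bound the wait.

import signal

from opentelemetry.sdk.trace import TracerProvider


class SignalAwareTracerProvider(TracerProvider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Chain onto any existing handlers rather than replacing them.
        for sig in (signal.SIGTERM, signal.SIGINT):
            previous = signal.getsignal(sig)
            signal.signal(sig, self._make_handler(previous))

    def _make_handler(self, previous):
        def _handler(signum, frame):
            self.shutdown()  # flush processors as soon as we are told to stop
            if callable(previous):
                previous(signum, frame)
        return _handler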

@owais
Contributor

owais commented Nov 22, 2021

@ocelotl @lonewolf3739 thoughts about this? ^

@flashgorman-chorus
Author

Thanks. Will await your thoughtful decision. But, just to be clear, we are talking about a process exiting naturally - no human or Ctrl+C involved. I don't know whether SIGKILL is involved in a "natural exit", but if it only comes into play under Ctrl+C or kill <pid> scenarios, I want to make it clear that we are hoping for natural process exit without human intervention and without delay.

@owais
Contributor

owais commented Nov 22, 2021

@flashgorman-chorus I'm not sure what we can do to "automate" that case. Tracer provider does have a shutdown method and for "natural" exit cases, you could call provider.shutdown() explicitly in your code and that should have the same effect.

@owais
Contributor

owais commented Nov 22, 2021

from opentelemetry.trace import get_tracer_provider  # needed for this snippet

if __name__ == "__main__":
    main()
    provider = get_tracer_provider()
    provider.force_flush()
    provider.shutdown()

See if this works for you ^ @flashgorman-chorus
Not sure if you need to call force_flush as well, but you can try and see.

@flashgorman-chorus
Author

flashgorman-chorus commented Nov 22, 2021

That, unfortunately, will not work because we have at least 100 different scripts so there's no single point of entry and, thus, no single point of exit where we could call this.

Moreover, we see shutdown being called - so that is not the problem. It is what's going on inside shutdown that causes the process to hang. If force_flush would do the trick, then what would be nice is if shutdown looked like this:

def shutdown(self, force_flush: bool = True):
    if force_flush:
        self.force_flush()

    # proceed with shutdown...

@owais
Contributor

owais commented Nov 22, 2021

I think shutdown should force flush implicitly if it doesn't already.

@flashgorman-chorus
Author

I tested this with an explicit call to provider.force_flush() right before I expect the process to exit and all that did was add a 30-second delay.

I next attempted to add a call to self.force_flush() inside my monkey-patched version of shutdown (along the lines I suggested above) with the same results - an extra 30 seconds.

I did verify that if I simply wait 64 seconds, the process will exit. What I don't understand is what it is trying to do for 64 seconds. It can't be "that's the time it takes to flush all of the queued-up spans", because I only create one span in this test.

@ocelotl
Contributor

ocelotl commented Nov 22, 2021

What I don't understand is what it is trying to do for 64 seconds.

It is mostly waiting. We have to do this: it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the export to wait 0 seconds we would make that possible, and it would be against the spec. In my opinion we should not allow the user to substitute another delay algorithm, as the spec requires this one.

If you have specific purposes, testing or otherwise, I would recommend making a custom exporter; it should not be very hard:

https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py#L310

Changing delay to 0 in that line should do the trick.

I'm sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree).

Please let me know if you need help implementing this, @flashgorman-chorus, you can contact me directly via Slack at the CNCF workspace.
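
As an alternative to changing the delay inside a vendored copy of the exporter, here is a rough sketch of a wrapper that gives each export attempt a hard deadline and abandons it afterwards, at the cost of possibly losing those spans. The DeadlineSpanExporter name and deadline_s parameter are made up for illustration; this is not an existing API.

import threading
from typing import Sequence

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class DeadlineSpanExporter(SpanExporter):
    def __init__(self, delegate: SpanExporter, deadline_s: float = 5.0):
        self._delegate = delegate
        self._deadline_s = deadline_s

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        result = [SpanExportResult.FAILURE]

        def _run():
            result[0] = self._delegate.export(spans)

        # Daemon thread: if the delegate is still in its retry loop when the
        # deadline passes, it no longer blocks the caller or process exit.
        t = threading.Thread(target=_run, daemon=True)
        t.start()
        t.join(timeout=self._deadline_s)
        return result[0]

    def shutdown(self) -> None:
        self._delegate.shutdown()

It would wrap the stock exporter, e.g. BatchSpanProcessor(DeadlineSpanExporter(OTLPSpanExporter(endpoint="otel.chorus.internal:4317"))).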

@flashgorman-chorus
Author

Alright! Thanks for the thoughtful response. We'll put our heads together here and decide how we want to proceed (or if we need any help). Everyone's desire to do the right thing has been very refreshing. Thank you.

@owais
Contributor

owais commented Nov 22, 2021

@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during normal operation of a service. That should be enough to comply with the spec. Once an application starts to shut down, I think it would be more reasonable to make an immediate and final attempt (or maybe a few, without exponential back-off) at exporting everything in memory and quit after that.

@Shihao-Zhong

Hi guys,

We are facing the same challenge with the hang in the _wait_for_tstate_lock function; can we apply some special logic for this when the application is ending?

@markfickett

I'm occasionally seeing a hang similar to this. My app seems to hang indefinitely (> 1h), which doesn't match up with the 64s timeout described above. The app is using Python multiprocessing.

I think I'm seeing the same stack trace when I eventually ^C the program:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 1169, in shutdown
    self._active_span_processor.shutdown()
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 171, in shutdown
    sp.shutdown()
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 402, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.9/threading.py", line 1033, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt
Aborted!

I'm using:

opentelemetry-api==1.11.1
opentelemetry-sdk==1.11.1
opentelemetry-exporter-otlp-proto-grpc==1.11.1
opentelemetry-instrumentation-requests==0.30b1
opentelemetry-instrumentation-logging==0.30b1
opentelemetry-instrumentation-sqlalchemy==0.30b1

I'll look out for what conditions trigger it / if I can come up with a repro case that hangs for many minutes, but here's at least the initial information.

@alkurbatov

alkurbatov commented Nov 28, 2022

Hello colleagues.
I experience the same issue :(

My setup:

  • jaeger-all-in-one (including the collector) Docker container, with gRPC used to deliver tracing data
  • my asynchronous service in a Docker container, with proper handling of signals (SIGINT, SIGTERM)

The problem
If for some reason the all-in-one container goes down before my service shuts down, the service fails to send tracing data and goes into an endless retry loop with increasing timeouts. In this case the only way I can stop it is to send SIGKILL.

The reason is exactly as discussed above: the thread that sends the data never exits because it doesn't give up after some number of failed retries.

The question is: are there any plans to handle this issue?
For me it would be enough to have some kind of cancellation API to forcibly finish the task during service shutdown, or even a way to specify a maximum number of retries before giving up.

@cazador481

I am still seeing this problem. Is there a workaround?

Error:

^CException ignored in atexit callback: <bound method MeterProvider.shutdown of <opentelemetry.sdk.metrics._internal.MeterProvider object at 0x7fffe6ca54e0>>
Traceback (most recent call last):
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/__init__.py", line 440, in shutdown
    metric_reader.shutdown(
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 543, in shutdown
    self._daemon_thread.join(
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1100, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

@clumsy

clumsy commented Mar 11, 2024

I wonder if this could be caused by this deadlock issue in ProcessPoolExecutor. In that case the workaround is switching to multiprocessing.Pool or to the latest Python version.
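
A tiny sketch of that swap (illustrative only; work is a made-up placeholder function), replacing concurrent.futures.ProcessPoolExecutor with multiprocessing.Pool:

from multiprocessing import Pool


def work(x):
    return x * x


if __name__ == "__main__":
    # Equivalent of ProcessPoolExecutor(max_workers=4) + executor.map(...)
    with Pool(processes=4) as pool:
        print(pool.map(work, range(10)))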
