Due to missing Python GIL management, Python clients constructed with a non-default Logger fail to clean up their threads and often segfault when Python garbage collects them #16527
Comments
My theory as to why this is happening: The problem is the GIL. When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount-bump are few and not terribly shared.
However, decrementing a refcount without the GIL is extremely dangerous; the act of decrementing a refcount can trigger object destruction, which can then trigger more object destruction, and so on: decrementing a refcount triggers an arbitrary number of user functions (destructors) to run in Python, and can trigger wide-ranging changes (including system calls, memory allocation/deallocation--basically anything) across the interpreter's internal state. Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things.
In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, preventing thread reaping from running internally. In some cases, the racy Python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" errors (e.g. random objects changing type) are raised in Python itself, though you may have to run the above snippet for a long time to see one of those.
Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL.
To see if this theory's correct, I'll put up a PR in a few hours which GIL-protects those segments and see if that makes the issues go away.
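To illustrate the claim that a single refcount decrement can run an arbitrary amount of user code, here is a minimal pure-Python sketch. It assumes CPython's reference-counting behaviour; the `Node` class and the `destroyed` list are hypothetical names invented for this illustration and have nothing to do with the Pulsar client code.

```python
# Sketch only: shows that dropping ONE reference can cascade into many
# destructor calls. `Node` and `destroyed` are hypothetical illustration names.
destroyed = []


class Node:
    """Object whose destructor records that it ran."""

    def __init__(self, name, child=None):
        self.name = name
        self.child = child  # holds the only reference to the next node

    def __del__(self):
        # Arbitrary user code runs here, triggered purely by a refcount
        # reaching zero.
        destroyed.append(self.name)


# Build a chain a -> b -> c; only `head` keeps the whole chain alive.
head = Node("a", Node("b", Node("c")))

# A single decrement: removing the last reference to "a" frees it, which
# drops the last reference to "b", which in turn frees "c".
del head

print(destroyed)
```

In the scenario described above, the equivalent of `del head` is a refcount decrement performed by compiled code on a thread that does not hold the GIL, so all of that destructor work would run concurrently with whatever the interpreter is doing on other threads.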
The issue had no activity for 30 days, mark with Stale label.
… C++-owned objects (#16535) Fixes apache/pulsar#16527
Describe the bug
If I build a Pulsar client object in Python and supply a logger= value that is not the default result of logging.getLogger(), e.g. logging.getLogger("foobar"), and if I interact with that Pulsar object from a thread in Python while another thread is running, two things happen:
Symptoms: if you interact with Pulsar Client objects in threads, disconnecting/reconnecting leaks threads and other resources, and can segfault Python.
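As a rough sketch of the shape of the scenario (not the issue's actual reproduction script), the following shows what "a non-default logger" means and how a thread-count check after a join might look. `StandInClient` is a hypothetical placeholder and is not the Pulsar API; note also that `threading.active_count()` only sees Python-level threads, so threads leaked by native code would need an OS-level tool to observe.

```python
import logging
import threading

# A named logger is a different object from the root logger returned by
# logging.getLogger() with no arguments; this is the "non-default" case.
default_logger = logging.getLogger()
custom_logger = logging.getLogger("foobar")


class StandInClient:
    """Hypothetical placeholder for a client object; NOT the Pulsar API."""

    def __init__(self, logger):
        self.logger = logger

    def close(self):
        self.logger.debug("closing stand-in client")


def worker(results):
    # Interact with the client from a non-main thread, as in the report.
    client = StandInClient(logger=custom_logger)
    client.close()
    results.append("done")


baseline = threading.active_count()  # threads alive before the worker starts

results = []
thread = threading.Thread(target=worker, args=(results,))
thread.start()
thread.join()

# Only Python-level threads are counted; native threads are invisible here.
final_threadcount = threading.active_count()
leaked = final_threadcount - baseline
print("final threadcount:", final_threadcount, "leaked:", leaked)
```

With a well-behaved client, the count after `join()` returns to the baseline, which is the invariant the report expects to hold.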
To Reproduce
TOPIC_NAME, run the below snippet:
while python repro.py; echo iterated; done), observe that the code eventually segfaults and crashes.
Expected behavior
.joined, no side effects from its runtime should unexpectedly exist. The "final threadcount" value printed above should always be 1.
Desktop (please complete the following information):