Due to missing Python GIL management, Python clients constructed with a non-default Logger fail to clean up their threads and often segfault when Python garbage collects them #16527

Closed · zbentley opened this issue on Jul 11, 2022 · 2 comments · Fixed by #16535
Labels: Stale, type/bug

Comments

@zbentley (Contributor)

Describe the bug
If I build a Pulsar Client object in Python and supply a logger= value that is not the default result of logging.getLogger() (e.g. logging.getLogger("foobar")), and I then interact with that client from one Python thread while another thread is running, two things happen:

  • When that Client object and its thread are GC'd by Python, the internal threads the Pulsar driver used continue to exist, resulting in a thread leak (and a file descriptor leak; I suspect this is the same pathology as #14588, "Pulsar Python client leaks filehandles after destruction if a consumer is created on a partitioned topic").
  • When that Client object is GC'd by Python, there is a chance of a segmentation fault if one or more other Python interpreter threads are in the midst of some fairly common interpreter operations (also related to GC).

Symptoms: if you interact with Pulsar Client objects in threads, disconnecting/reconnecting leaks threads and other resources, and can segfault Python.

To Reproduce

  1. With an existing topic in TOPIC_NAME, run the snippet below:

```python
import logging
import threading
from pulsar import Client


TOPIC_NAME = "persistent://chariot1/chariot_ns_sre--heartbeat/chariot_topic_heartbeat"


def _do_connect():
    logger = logging.getLogger(str(threading.current_thread().ident))
    logger.setLevel(logging.INFO)
    c = Client(service_url="pulsar://localhost:6650",
               io_threads=4,
               message_listener_threads=4,
               operation_timeout_seconds=1,
               log_conf_file_path=None,
               authentication=None,
               logger=logger,
    )
    c.get_topic_partitions(TOPIC_NAME)
    c.close()


def test_leaks():
    threads = []
    for i in range(10):
        threads.append(threading.Thread(target=_do_connect))
        threads[-1].start()
    print("Joining")
    for thread in threads:
        thread.join()
    print("Final threadcount", threading.active_count())


if __name__ == '__main__':
    test_leaks()
```
  2. Observe that the "final threadcount" value emitted by that code is >1 (a small diagnostic helper for inspecting the surviving threads is sketched after this list).
  3. On repeat runs (I used `while python repro.py; do echo iterated; done`), observe that the code eventually segfaults and crashes.
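
As a quick way to see which threads survive the joins, here is a hedged diagnostic helper (the function name and its use are mine, not part of the original repro) that can be called at the end of test_leaks():

```python
import threading


def report_leaked_threads():
    # Hypothetical diagnostic helper: after every worker thread has been
    # joined, only the main thread should remain. Anything else listed here
    # is a thread the Pulsar client failed to reap.
    leaked = [t for t in threading.enumerate() if t is not threading.main_thread()]
    for t in leaked:
        print("leaked thread:", t.name, "daemon:", t.daemon)
    return leaked
```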

Expected behavior

  1. After a thread running the Pulsar client is join()ed, no side effects from its runtime should persist. The "final threadcount" value printed above should always be 1.
  2. Segfaults should not occur while using Pulsar clients normally.

Desktop (please complete the following information):

  • OS: macOS 12.4 x86 (segfaults also observed on Amazon Linux and on Ubuntu-in-Docker on macOS).
  • Pulsar client: 2.10.1.
@zbentley added the type/bug label on Jul 11, 2022
@zbentley (Contributor, Author)

My theory as to why this is happening:

The problem is the GIL. When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the interpreter data structures altered by a refcount bump are few and not heavily shared. However, decrementing a refcount without the GIL is extremely dangerous: the decrement can trigger object destruction, which can trigger more object destruction, and so on. A single decrement can therefore run an arbitrary number of user functions (destructors) in Python, and can cause wide-ranging changes (system calls, memory allocation and deallocation, basically anything) across the interpreter's internal state.
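
A pure-Python sketch of why the decrement side is the dangerous one (the Noisy class is illustrative only, not Pulsar code): the final reference drop runs arbitrary user code in __del__, so a decref performed without the GIL can re-enter the interpreter anywhere:

```python
import sys


class Noisy:
    def __del__(self):
        # Arbitrary user code runs when the refcount hits zero: it can
        # allocate, do I/O, or destroy further objects, exactly the
        # wide-ranging interpreter activity that must not race with
        # other threads.
        print("destructor fired by a plain refcount decrement")


obj = Noisy()
print("refcount:", sys.getrefcount(obj))  # getrefcount itself holds a temporary ref
del obj  # the final Py_DECREF under the hood runs __del__ immediately
```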

Running such operations in truly parallel threads in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, which prevents internal thread reaping from running. In some cases, the racy Python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" errors (e.g. random objects changing type) are raised in Python itself, though you may have to run the above snippet for a long time to see one.

Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition, since they are invoked only directly from calling Python code, which already holds the GIL.

To see if this theory is correct, I'll put up a PR in a few hours that GIL-protects those segments, and see whether that makes the issues go away.
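
In the meantime, a possible mitigation, assuming (per the analysis above) that the custom logger= object is what hands a Python refcount to the C++ threads, is to construct the client without one:

```python
from pulsar import Client

# Any existing topic works here; this name is a placeholder.
TOPIC_NAME = "persistent://public/default/my-topic"

# Mitigation sketch, not a verified fix: with no logger= argument the client
# logs through its default (C++-side) logger, so the driver threads never
# touch a Python object's refcount. This rests on the analysis above.
client = Client("pulsar://localhost:6650")
client.get_topic_partitions(TOPIC_NAME)
client.close()
```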

@github-actions (bot)

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions bot added the Stale label on Aug 11, 2022
BewareMyPower pushed a commit that referenced this issue on Sep 22, 2022
merlimat pushed a commit to apache/pulsar-client-cpp that referenced this issue on Sep 29, 2022
merlimat pushed a commit to apache/pulsar-client-python that referenced this issue on Sep 29, 2022
BewareMyPower pushed a commit that referenced this issue on Nov 15, 2022
congbobo184 pushed a commit that referenced this issue on Dec 7, 2022:

… C++-owned objects (#16535)

Fixes #16527

(cherry picked from commit a8b265d)