Due to missing Python GIL management, Python clients constructed with a non-default Logger fail to clean up their threads and often segfault when Python garbage collects them #16527

Closed · zbentley opened this issue on Jul 11, 2022 · 2 comments · Fixed by #16535
Labels: Stale, type/bug

Comments

@zbentley (Contributor)

Describe the bug
If I build a Pulsar Client object in Python and supply a logger= value that is not the default result of logging.getLogger() (e.g. logging.getLogger("foobar")), and I then interact with that client from one Python thread while another thread is running, two things happen:

  • When that Client object and its thread are GC'd by Python, the internal threads the Pulsar driver used continue to exist, resulting in a thread leak (and a file descriptor leak; I suspect this is the same pathology as #14588, "Pulsar Python client leaks filehandles after destruction if a consumer is created on a partitioned topic").
  • When that Client object is GC'd by Python, there is a chance of a segmentation fault if one or more other Python interpreter threads are in the midst of some fairly common interpreter operations (also related to GC).

Symptoms: if you interact with Pulsar Client objects in threads, disconnecting/reconnecting leaks threads and other resources, and can segfault Python.

To Reproduce

  1. With an existing topic in TOPIC_NAME, run the snippet below:

```python
import logging
import threading
from pulsar import Client


TOPIC_NAME = "persistent://chariot1/chariot_ns_sre--heartbeat/chariot_topic_heartbeat"


def _do_connect():
    logger = logging.getLogger(str(threading.current_thread().ident))
    logger.setLevel(logging.INFO)
    c = Client(service_url="pulsar://localhost:6650",
               io_threads=4,
               message_listener_threads=4,
               operation_timeout_seconds=1,
               log_conf_file_path=None,
               authentication=None,
               logger=logger,
    )
    c.get_topic_partitions(TOPIC_NAME)
    c.close()


def test_leaks():
    threads = []
    for i in range(10):
        threads.append(threading.Thread(target=_do_connect))
        threads[-1].start()
    print("Joining")
    for thread in threads:
        thread.join()
    print("Final threadcount", threading.active_count())


if __name__ == '__main__':
    test_leaks()
```
  2. Observe that the "final threadcount" value emitted by that code is >1 (a small diagnostic helper for inspecting the surviving threads is sketched after this list).
  3. On repeat runs (I used `while python repro.py; do echo iterated; done`), observe that the code eventually segfaults and crashes.
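
As a quick way to see which threads survive the joins, here is a hedged diagnostic helper (the function name and its use are mine, not part of the original repro) that can be called at the end of test_leaks():

```python
import threading


def report_leaked_threads():
    # Hypothetical diagnostic helper: after every worker thread has been
    # joined, only the main thread should remain. Anything else listed here
    # is a thread the Pulsar client failed to reap.
    leaked = [t for t in threading.enumerate() if t is not threading.main_thread()]
    for t in leaked:
        print("leaked thread:", t.name, "daemon:", t.daemon)
    return leaked
```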

Expected behavior

  1. After a thread running the Pulsar client is join()ed, no side effects from its runtime should persist. The "final threadcount" value printed above should always be 1.
  2. Segfaults should not occur while using Pulsar clients normally.

Desktop (please complete the following information):

  • OS: macOS 12.4 x86 (segfaults also observed on Amazon Linux and on Ubuntu-in-Docker on macOS).
  • Pulsar client: 2.10.1.
@zbentley added the type/bug label on Jul 11, 2022
@zbentley (Contributor, Author)

My theory as to why this is happening:

The problem is the GIL. When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the interpreter data structures altered by a refcount bump are few and not heavily shared. However, decrementing a refcount without the GIL is extremely dangerous: the decrement can trigger object destruction, which can trigger more object destruction, and so on. A single decrement can therefore run an arbitrary number of user functions (destructors) in Python, and can cause wide-ranging changes (system calls, memory allocation and deallocation, basically anything) across the interpreter's internal state.
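
A pure-Python sketch of why the decrement side is the dangerous one (the Noisy class is illustrative only, not Pulsar code): the final reference drop runs arbitrary user code in __del__, so a decref performed without the GIL can re-enter the interpreter anywhere:

```python
import sys


class Noisy:
    def __del__(self):
        # Arbitrary user code runs when the refcount hits zero: it can
        # allocate, do I/O, or destroy further objects, exactly the
        # wide-ranging interpreter activity that must not race with
        # other threads.
        print("destructor fired by a plain refcount decrement")


obj = Noisy()
print("refcount:", sys.getrefcount(obj))  # getrefcount itself holds a temporary ref
del obj  # the final Py_DECREF under the hood runs __del__ immediately
```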

Running such operations in truly parallel threads in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, which prevents internal thread reaping from running. In some cases, the racy Python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" errors (e.g. random objects changing type) are raised in Python itself, though you may have to run the above snippet for a long time to see one.

Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition, since they are invoked only directly from calling Python code, which already holds the GIL.

To see if this theory is correct, I'll put up a PR in a few hours that GIL-protects those segments, and see whether that makes the issues go away.
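
In the meantime, a possible mitigation, assuming (per the analysis above) that the custom logger= object is what hands a Python refcount to the C++ threads, is to construct the client without one:

```python
from pulsar import Client

# Any existing topic works here; this name is a placeholder.
TOPIC_NAME = "persistent://public/default/my-topic"

# Mitigation sketch, not a verified fix: with no logger= argument the client
# logs through its default (C++-side) logger, so the driver threads never
# touch a Python object's refcount. This rests on the analysis above.
client = Client("pulsar://localhost:6650")
client.get_topic_partitions(TOPIC_NAME)
client.close()
```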

@github-actions (bot)

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions bot added the Stale label on Aug 11, 2022
BewareMyPower pushed a commit that referenced this issue on Sep 22, 2022
merlimat pushed a commit to apache/pulsar-client-cpp that referenced this issue on Sep 29, 2022
merlimat pushed a commit to apache/pulsar-client-python that referenced this issue on Sep 29, 2022
BewareMyPower pushed a commit that referenced this issue on Nov 15, 2022
congbobo184 pushed a commit that referenced this issue on Dec 7, 2022:

… C++-owned objects (#16535)

Fixes #16527

(cherry picked from commit a8b265d)