__traverse__ should not alter any refcounts, at least in CPython #3165
Comments
I think so too. What I did not yet understand in your exposition is whether it is actually necessary to change What is also not clear to me is how the implementation of |
@davidhewitt I am currently thinking how to get a grip on One option would be to provide any API similar to

    // Some long-running process like a webserver, which never releases the GIL.
    loop {
        // Create a new pool, so that PyO3 can clear memory at the end of the loop.
        let pool = unsafe { py.new_pool() };
        // It is recommended to *always* immediately set py to the pool's Python, to help
        // avoid creating references with invalid lifetimes.
        let py = pool.python();
        // do stuff...
    }

could be replaced by

    // Some long-running process like a webserver, which never releases the GIL.
    loop {
        Python::with_pool(|py| {
            // do stuff...
        });
    }

which creates the pool behind the scenes and just exposes its GIL token. I mainly wonder whether this would be just too much breakage for 0.19 together with the EDIT: I think this could also significantly simplify the implementation of |
You are right that it is not necessary. That's about all though, it doesn't give you any additional optimization opportunity. I just added
I believe you can still borrow
I didn't think too much about making anything |
I think we should push that as a
The problem is that |
Ah, but we can still expose |
@lifthrasiir Could you try whether #3168 would also work for you? |
Yeah, the only
I haven't checked yet. I need access to a company machine for the original code, so I was making an independent test for the PR. I believe your draft still creates and drops |
The latest version of my draft enforces the signature fn(&T, PyVisit<'_>) -> Result<(), PyTraverseError>, instead of just calling |
Now that I have access to the company machine again, I have confirmed that #3168 fixes my crash. Great! @adamreichold I've added more tests (#3175) that I was originally planning to use for my PR, feel free to use them. |
Bug Description
Back when we got support for the GC protocol, it was assumed that running any Python code inside __traverse__ is not harmful, even though it is unusual for __traverse__ to do anything more than call visit.call. As a result, the current __traverse__ trampoline is no more than an ordinary trampoline with a slightly different signature.

It turns out that not only is this assumption wrong, but any currently generated
tp_traverse implementation is prone to crash. CPython has long reused PyGC_Head fields to store intermediate refcounts, so tp_traverse should never be able to read or write any PyGC_Head fields.

In particular the
__traverse__ trampoline would first allocate a new GILPool and apply deferred refcount updates at once, possibly deallocating some objects in the pool and in turn somehow touching the temporarily corrupted PyGC_Head (e.g. via PyObject_GC_UnTrack or PyObject_GC_Del). For this reason I believe this is the root cause of #1623 and #3064; both suggest that PyGC_Head is corrupted, and threading seems to be relevant here (it is much easier to defer refcount updates in non-interpreter threads).

Reproduction
Too lengthy to open by default
I was able to reliably trigger SIGSEGV on Python 3.8 through 3.11 on x86-64 Linux. Given the following src/lib.rs and a typical Cargo.toml (which has been omitted):

Running the following Python code will crash inside gc.collect():

Here
Ref is holding an optional reference to another Ref, possibly itself. Setting clear_on_traverse will clear the reference during the traversal, simulating an operation that would otherwise be totally safe.

In
crash1, the refcounts for a and b will be 2 and 1 respectively at the time of GC. GC will overwrite PyGC_Head._gc_prev for both a and b (effectively converting the list to a singly-linked one), and then __traverse__ will be called for a, creating a new GILPool and triggering ReferencePool::update_counts. As both values are registered for Py_DECREF, the refcount of b will now be 0, and this will eventually call PyObject_GC_Del, which assumes that the list is still doubly-linked and crashes.

The situation is similar for
crash2: a2 gets dropped with the GIL still acquired, so the refcounts for a1, a2 and b are 2, 1 and 1 respectively. Now assume that GILPool::new somehow recognizes __traverse__ and doesn't call update_counts. Even in this case a1 will now remove a reference to a2 during __traverse__, falsely believing Py_DECREF is safe to call as the GIL is acquired, so the same thing will happen for a2.

Possible Solution
It seems evident to me that __traverse__ should be handled as if no GIL is (and can be) acquired. This implies that Python::with_gil should probably fail and gil_is_acquired should return false inside __traverse__.

I believe it is also possible to have multiple nested
__traverse__ calls. The GC process drains the linked list of objects while updating PyGC_Head, and it sets gcstate->collecting to 1 during the process anyway, so __traverse__ cannot be called twice for the same object and nested calls should be rare, but there are at least three possible paths (as of the main branch of CPython):

- PyObject_GC_Track, which implicitly calls tp_traverse in debug builds.
- gc.get_referrers and gc.get_referents use tp_traverse to collect objects, and are not affected by gcstate->collecting. They can be independently nested if __traverse__ somehow calls those functions.
- While gc.collect() checks for gcstate->collecting and returns immediately, the automatic GC scheduling may result in a double GC. Specifically there is an interval between the scheduling (_Py_ScheduleGC) and the actual GC (_Py_RunGC), and some opcode may end up calling gc.collect() in between. This is extremely rare though, as this interval can be at most a single opcode, and is otherwise safe.

Probably it is sufficient to add a second "traversal" counter for the nested
__traverse__ calls in addition to GIL_COUNT. If the traversal counter is non-zero, the GIL cannot be acquired and gil_is_acquired should return false regardless of GIL_COUNT. GIL_COUNT is still updated as usual, but it is effectively ignored until the traversal counter drops to zero, at which point update_counts should be considered (but only when GIL_COUNT was non-zero). But I'm not very confident about this solution for multiple reasons, including that there is no actual guarantee that we can call any other Python API from tp_traverse, and someday CPython may touch PyGC_Head from unexpected functions. I mean, even the gcstate->collecting check is not complete, so who knows? If possible, a fully static solution to prevent anything but PyVisit would be much more desirable.

Suggestions
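For reference, the traversal counter described under Possible Solution could be sketched roughly as follows. This is a minimal standalone model using only the standard library, not PyO3's actual internals: TRAVERSE_COUNT, TraverseGuard, acquire_gil, release_gil, apply_deferred_updates and updates_applied are hypothetical names, and apply_deferred_updates only counts its calls where the real code would run ReferencePool::update_counts.

```rust
use std::cell::Cell;

thread_local! {
    // Nesting depth of GIL acquisitions on this thread (models GIL_COUNT).
    static GIL_COUNT: Cell<usize> = Cell::new(0);
    // Hypothetical: nesting depth of in-progress __traverse__ calls.
    static TRAVERSE_COUNT: Cell<usize> = Cell::new(0);
    // Hypothetical: how many times deferred updates were applied (for illustration).
    static UPDATES_APPLIED: Cell<usize> = Cell::new(0);
}

// Stand-ins for acquiring and releasing the GIL; they only track the count.
fn acquire_gil() {
    GIL_COUNT.with(|g| g.set(g.get() + 1));
}

fn release_gil() {
    GIL_COUNT.with(|g| g.set(g.get() - 1));
}

// Reports the GIL as unavailable while any traversal is in progress,
// regardless of GIL_COUNT.
fn gil_is_acquired() -> bool {
    TRAVERSE_COUNT.with(|t| t.get() == 0) && GIL_COUNT.with(|g| g.get() > 0)
}

// Placeholder for ReferencePool::update_counts; here it only counts calls.
fn apply_deferred_updates() {
    UPDATES_APPLIED.with(|u| u.set(u.get() + 1));
}

fn updates_applied() -> usize {
    UPDATES_APPLIED.with(|u| u.get())
}

// Hypothetical RAII guard that a __traverse__ trampoline would hold.
struct TraverseGuard;

impl TraverseGuard {
    fn enter() -> Self {
        TRAVERSE_COUNT.with(|t| t.set(t.get() + 1));
        TraverseGuard
    }
}

impl Drop for TraverseGuard {
    fn drop(&mut self) {
        let remaining = TRAVERSE_COUNT.with(|t| {
            let n = t.get() - 1;
            t.set(n);
            n
        });
        // Deferred updates are considered only once the outermost traversal
        // has finished, and only if GIL_COUNT is non-zero.
        if remaining == 0 && GIL_COUNT.with(|g| g.get() > 0) {
            apply_deferred_updates();
        }
    }
}
```

In this model, a nested TraverseGuard keeps gil_is_acquired returning false until the outermost guard is dropped, and deferred updates run at most once at that point. It does not address the concern that other Python APIs may be unsafe to call from tp_traverse.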
Given that a full solution needs much testing at least, I looked for possible workarounds without a local PyO3 update. Unfortunately the GILPool::new call is mandatory for the trampoline, and the only indirect approach would be temporarily untracking objects via PyObject_GC_UnTrack, but that pretty much contradicts why we have tp_traverse in the first place.

So it seems that PyO3 needs a minimal update to ensure that you remain safe if you don't do weird things inside __traverse__. No promises here, but I hope to file a small PR to disable update_counts from the trampoline soon. In my local testing this fixes crash1 but not crash2.

Metadata
Your operating system and version
x86-64 Linux (at least)

Your Python version (python --version)
3.8.10, 3.9.16, 3.10.11, 3.11.13

Your Rust version (rustc --version)
1.69.0

Your PyO3 version
0.18.3

How did you install python? Did you use a virtualenv?
deadsnakes PPA, the bundled version of venv, pip and the most recent version of maturin (0.15.2).