[CI][Python] wheel-manylinux2014-* sometimes crashed on pytest exit #15054

Closed
kou opened this issue Dec 20, 2022 · 32 comments · Fixed by #33858
Comments

kou (Member) commented Dec 20, 2022

Describe the bug, including details regarding any error messages, version, and platform.

wheel-manylinux2014-cp38-amd64 2022-12-20 nightly:

https://github.com/ursacomputing/crossbow/actions/runs/3738813751/jobs/6345300403#step:9:775

= 4294 passed, 382 skipped, 18 xfailed, 2 xpassed, 4 warnings in 522.05s (0:08:42) =
/arrow/ci/scripts/python_wheel_unix_test.sh: line 95:    79 Segmentation fault      (core dumped) python -m pytest -r s --pyargs pyarrow

wheel-manylinux2014-cp39-arm64 2022-12-20 nightly:

https://app.travis-ci.com/github/ursacomputing/crossbow/builds/259026563#L5708

= 4324 passed, 352 skipped, 18 xfailed, 2 xpassed, 1 warning in 141.20s (0:02:21) =
/arrow/ci/scripts/python_wheel_unix_test.sh: line 95:    15 Segmentation fault      (core dumped) python -m pytest -r s --pyargs pyarrow

Component(s)

Continuous Integration, Packaging, Python

raulcd (Member) commented Jan 9, 2023

This is still happening; there were core dumps on these wheel jobs yesterday:
wheel-manylinux2014-cp310-arm64
wheel-manylinux2014-cp311-amd64
wheel-manylinux2014-cp37-amd64
wheel-manylinux2014-cp39-amd64
wheel-manylinux2014-cp39-arm64
I am marking this as a release blocker.

lidavidm (Member) commented Jan 13, 2023

I can sort of reproduce it; you have to go through the release verification process. Still trying to get GDB attached to the process so I can get a backtrace (the test that SIGINTs itself is giving me a bit of trouble). I can't yet reproduce it with a 22.04 container directly; trying to make sure the setup is exactly the same.

…and it doesn't crash under gdb, hmm.

lidavidm (Member) commented:

And now I can no longer get the crash to reproduce either, so I'm not sure I can make much headway on this.

assignUser (Member) commented:

This is now the last non-Archery blocker. Given the difficulty of reproducing it, should it still stay a blocker?

raulcd (Member) commented Jan 17, 2023

It does seem to have happened on the latest test-fedora-35-python-3 on Azure too: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=42753&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=5623

===== 6762 passed, 356 skipped, 18 xfailed, 2 xpassed in 295.11s (0:04:55) =====
/arrow/ci/scripts/python_test.sh: line 60: 22828 Segmentation fault      (core dumped) pytest -r s ${PYTEST_ARGS} --pyargs pyarrow
139

I am not entirely sure this is a blocker for the release at this point. It's quite difficult to reproduce, it's flaky (it doesn't always happen), and it only seems to happen on test exit (all the tests themselves pass). I would probably stop treating it as a release blocker and keep investigating. @kou @lidavidm what are your thoughts about moving it out of the release?

lidavidm (Member) commented Jan 17, 2023

I am OK with this, though we will have to note it at verification time. (And actually, I think it will make binary verification quite painful...)

Can we take a more systematic approach to solving this? Would it be possible, for instance, to backtrace failing tests automatically, or to upload a core dump? In #33720 I'm going to try enabling both PYTHONFAULTHANDLER and catchsegv to see if we can get more information about what's happening.

lidavidm (Member) commented:

I think it's an AWS SDK issue again!

I enabled catchsegv here: https://github.com/ursacomputing/crossbow/actions/runs/3939847223/jobs/6740188523

The crashing instruction is at 0x7fed49842bb7:

Backtrace:
/usr/local/lib/python3.9/site-packages/pyarrow/libarrow.so.1100(+0x17b5bb7)[0x7fed49842bb7]

libarrow.so is mapped to 0x7fed4808d000:

7fed4808d000-7fed484d7000 r--p 00000000 00:2c 2138445 /usr/local/lib/python3.9/site-packages/pyarrow/libarrow.so.1100

Subtracting the two, we're looking for 0x17b5bb7 in libarrow.so:

$ objdump --demangle -d -T libarrow.so.1100
...
00000000017b5b90 <Aws::Http::CurlHandleContainer::~CurlHandleContainer()>:
 17b5b90:	41 57                	push   %r15
 17b5b92:	49 89 ff             	mov    %rdi,%r15
 17b5b95:	41 56                	push   %r14
 17b5b97:	41 55                	push   %r13
 17b5b99:	41 54                	push   %r12
 17b5b9b:	55                   	push   %rbp
 17b5b9c:	53                   	push   %rbx
 17b5b9d:	48 81 ec 08 02 00 00 	sub    $0x208,%rsp
 17b5ba4:	e8 37 b3 f9 ff       	callq  1750ee0 <Aws::Utils::Logging::GetLogSystem()>
 17b5ba9:	48 85 c0             	test   %rax,%rax
 17b5bac:	74 15                	je     17b5bc3 <Aws::Http::CurlHandleContainer::~CurlHandleContainer()+0x33>
 17b5bae:	48 89 c5             	mov    %rax,%rbp
 17b5bb1:	48 8b 00             	mov    (%rax),%rax
 17b5bb4:	48 89 ef             	mov    %rbp,%rdi
 17b5bb7:	ff 50 10             	callq  *0x10(%rax)
...
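
For reference: 0x7fed49842bb7 - 0x7fed4808d000 = 0x17b5bb7, which is 0x27 bytes into ~CurlHandleContainer(), i.e. the callq *0x10(%rax) instruction above: an indirect (virtual) call through the object returned by GetLogSystem(). So the crash happens while the destructor is trying to log through a log-system pointer that is presumably no longer valid.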

lidavidm (Member) commented:

Reconstructed backtrace:

Backtrace:
libarrow.so.1100(+0x17b5bb7)[0x7fed49842bb7] Aws::Http::CurlHandleContainer::~CurlHandleContainer()
libarrow.so.1100(+0x17a1d83)[0x7fed4982ed83] Aws::Http::CurlHttpClient::~CurlHttpClient()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
libarrow.so.1100(+0x16d882f)[0x7fed4976582f] Aws::S3::S3Client::~S3Client()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
libarrow.so.1100(+0x12d5a94)[0x7fed49362a94] std::_Sp_counted_ptr<arrow::fs::(anonymous namespace)::RegionResolver*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
libarrow.so.1100(+0x591c6a)[0x7fed4861ec6a]  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()

…maybe we could just leak the S3 client as a workaround?
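
Something like the following rough sketch, perhaps (hypothetical, not actual Arrow code; it assumes we control how the process-wide client is held):

// Rough sketch of "leak the client": give the process-wide S3 client a no-op
// deleter so ~S3Client (and with it ~CurlHttpClient / ~CurlHandleContainer)
// never runs during static destruction, after the SDK's logging state is gone.
#include <memory>

#include <aws/s3/S3Client.h>

std::shared_ptr<Aws::S3::S3Client> MakeLeakedClient() {
  return std::shared_ptr<Aws::S3::S3Client>(
      new Aws::S3::S3Client(),
      [](Aws::S3::S3Client*) { /* intentionally leaked at shutdown */ });
}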

kou (Member, Author) commented Jan 17, 2023

aws/aws-sdk-cpp#1833 ?

kou (Member, Author) commented Jan 17, 2023

I'm OK with removing the blocker label from this.

raulcd (Member) commented Jan 17, 2023

I am not too familiar with this, but as far as I understand this is not something we introduced in this release, right? I can't see any change in how we deal with the S3Client, and we have not changed the version of aws-sdk-cpp yet: #20272

kou (Member, Author) commented Jan 17, 2023

Right.

@raulcd raulcd added the Priority: Critical label and removed the Priority: Blocker label Jan 18, 2023
@raulcd raulcd modified the milestones: 11.0.0, 12.0.0 Jan 18, 2023
raulcd (Member) commented Jan 19, 2023

As suggested by @lidavidm, this is making release verification (#33751) a bit painful.
I had to retry the wheel-manylinux2014-cp39-amd64 job 11 times for it to succeed (one of the early failures was due to a failure uploading the artifact): https://github.com/ursacomputing/crossbow/actions/runs/3949372445/jobs/6760493478
And currently verify-rc-binaries-wheels-linux-conda-latest-amd64 is still failing with the crash on pytest exit after several retries: https://github.com/ursacomputing/crossbow/actions/runs/3954649758/jobs/6772203240
Those are the only jobs that have had the issue as far as I am aware, but it has been quite consistent on them. I wanted to raise this so it is taken into account in the release vote.

pitrou (Member) commented Jan 19, 2023

…maybe we could just leak the S3 client as a workaround?

We can probably do that indeed. We just have to make sure to disable S3 on ASAN and Valgrind CI jobs.

paleolimbot (Member) commented:

FWIW this looks very similar to (and may be the same as) #15189, which the R package fixed by skipping on platforms with a very old SSL runtime (macOS 10.13 was the culprit for us).

pitrou (Member) commented Jan 19, 2023

Judging by the reconstructed stack trace (thanks @lidavidm!), this has nothing to do with OpenSSL, but rather with calling a logging method:

CurlHandleContainer::~CurlHandleContainer()
{
    AWS_LOGSTREAM_INFO(CURL_HANDLE_CONTAINER_TAG, "Cleaning up CurlHandleContainer.");
    for (CURL* handle : m_handleContainer.ShutdownAndWait(m_poolSize))
    {
        AWS_LOGSTREAM_DEBUG(CURL_HANDLE_CONTAINER_TAG, "Cleaning up " << handle);
        curl_easy_cleanup(handle);
    }
}

The definition of AWS_LOGSTREAM_INFO is:

    #define AWS_LOGSTREAM_INFO(tag, streamExpression) \
        { \
            Aws::Utils::Logging::LogSystemInterface* logSystem = Aws::Utils::Logging::GetLogSystem(); \
            if ( logSystem && logSystem->GetLogLevel() >= Aws::Utils::Logging::LogLevel::Info ) \
            { \
                Aws::OStringStream logStream; \
                logStream << streamExpression; \
                logSystem->LogStream( Aws::Utils::Logging::LogLevel::Info, tag, logStream ); \
            } \
        }
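
In other words this looks like a teardown-ordering problem: by the time the destructor runs, the log system it calls into may already be gone. A minimal standalone sketch of that pattern (made-up names, not AWS or Arrow code):

// Standalone illustration of the hazard: a destructor that runs late in
// process teardown makes a virtual call through a pointer whose target has
// already been destroyed.
#include <cstdio>

struct LogSystemInterface {
  virtual ~LogSystemInterface() = default;
  virtual void Log(const char* msg) = 0;
};

struct ConsoleLog : LogSystemInterface {
  void Log(const char* msg) override { std::puts(msg); }
};

LogSystemInterface* g_log = nullptr;  // installed at init, torn down at shutdown

LogSystemInterface* GetLogSystem() { return g_log; }

struct HandleContainer {
  ~HandleContainer() {
    // Mirrors AWS_LOGSTREAM_INFO: null check, then a virtual call. If the
    // object behind g_log was already deleted but the pointer not cleared,
    // this is a use-after-free, typically crashing in the virtual dispatch.
    if (LogSystemInterface* log = GetLogSystem()) {
      log->Log("Cleaning up HandleContainer.");
    }
  }
};

int main() {
  g_log = new ConsoleLog();
  { HandleContainer ok; }    // fine: the log system is still alive
  delete g_log;              // logging system torn down first...
  HandleContainer too_late;  // ...then a late destructor logs: undefined behavior
  return 0;
}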

westonpace (Member) commented:

Looks like there is a global static RegionResolver that has an S3Client and that S3Client is being destroyed after S3 has already been destroyed?

pitrou (Member) commented Jan 19, 2023

Yes, it's probably that. Though "S3 has already been destroyed" is a bit vague (is it after DLL unload?).

westonpace (Member) commented:

Ah! No, a C atexit hook would probably not work. I'm talking about a Python atexit hook. However, there is the opposite issue that it might be called too early (potentially letting Arrow threads make calls into an already-finalized AWS SDK).

Perhaps it is time then for an arrow::Shutdown method?

pitrou (Member) commented Jan 19, 2023

Perhaps it is time then for an arrow::Shutdown method?

That wouldn't really solve the issue I mentioned, would it? :-)

westonpace (Member) commented:

That wouldn't really solve the issue I mentioned, would it? :-)

A Python atexit hook could call arrow::EnsureShutdown. In arrow::Shutdown we would first shut down the thread pools, and then shut down S3.
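
Roughly like this (a sketch only; EnsureShutdown is not an existing API and ShutdownThreadPools is a made-up placeholder):

// Sketch of the proposed shutdown ordering: stop Arrow's own threads first,
// and only then finalize S3 / the AWS SDK.
#include <arrow/filesystem/s3fs.h>
#include <arrow/status.h>

namespace arrow {

Status EnsureShutdown() {
  // 1. Join the CPU and I/O thread pools so no Arrow thread can still be
  //    making S3 calls past this point. (ShutdownThreadPools is hypothetical.)
  ARROW_RETURN_NOT_OK(internal::ShutdownThreadPools());

  // 2. Now it is safe to finalize S3 (which calls Aws::ShutdownAPI).
  ARROW_RETURN_NOT_OK(fs::FinalizeS3());

  return Status::OK();
}

}  // namespace arrow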

westonpace (Member) commented:

@pitrou @lidavidm

I prototyped a shutdown mechanism this morning (9d62b2c) and I think it could work, but I realized there might be another way to solve this problem, since we ran into a similar problem with OpenTelemetry.

We added the concept of Executor::KeepAlive, a method that registers a resource with our thread pool instance and keeps that resource alive until all threads have exited. Our OpenTelemetry code then uses this to ensure that OpenTelemetry is not shut down until all threads have exited.

We can extend this mechanism to S3. The main drawback is that a call to FinalizeS3 will not actually call Aws::ShutdownAPI until process exit (after the thread pool has shut down). Would this be an acceptable workaround? It would allow us to safely use a Python atexit hook to call FinalizeS3.

lidavidm (Member) commented:

How would that work in this case? Something which keeps the AWS SDK alive until after all S3 clients have been destroyed?

westonpace (Member) commented:

How would that work in this case? Something which keeps the AWS SDK alive until after all S3 clients have been destroyed?

  • S3's finalization would be guarded by a shared_ptr (or some kind of reference count). One copy would live with the thread pools (as a resource); well, one copy with each thread pool. One copy would live with FinalizeS3.
  • Python's atexit hook would call FinalizeS3, which would drop that copy (decrementing the reference count by one).
  • As the process exits, the thread pool is shut down as part of the normal destruction of static state. Once it finishes, it clears its "resources", which would include the S3 finalizer.
  • The reference count would then drop to 0, and it should be safe to finalize S3 (presumably the user is done, because we have hit Python atexit, and presumably Arrow is done, because the thread pools are shut down).
  • We would then call Aws::ShutdownAPI from the shared_ptr's destructor (see the sketch below).
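
A sketch of that lifecycle (illustrative names and signatures only; the real Executor::KeepAlive API may differ):

// Illustrative sketch of reference-counted S3 finalization; names, signatures
// and the KeepAlive calls are hypothetical, not the actual Arrow internals.
#include <memory>

#include <aws/core/Aws.h>

namespace {

Aws::SDKOptions g_aws_options;

// The guard whose destructor performs the real SDK shutdown. It only runs
// when the last copy of the shared_ptr is released.
struct S3FinalizeGuard {
  ~S3FinalizeGuard() { Aws::ShutdownAPI(g_aws_options); }
};

// The copy held on behalf of FinalizeS3 (released by the Python atexit hook).
std::shared_ptr<S3FinalizeGuard> g_finalize_copy;

}  // namespace

void InitializeS3Sketch() {
  Aws::InitAPI(g_aws_options);
  g_finalize_copy = std::make_shared<S3FinalizeGuard>();
  // Hand one copy to each thread pool as a "keep alive" resource; each pool
  // drops its copy only after all of its threads have exited.
  // cpu_pool->KeepAlive(g_finalize_copy);   // hypothetical call
  // io_pool->KeepAlive(g_finalize_copy);    // hypothetical call
}

void FinalizeS3Sketch() {
  // Called from the Python atexit hook: drop our copy. Aws::ShutdownAPI only
  // runs once the thread pools have also dropped theirs, i.e. after all
  // CPU & I/O threads are finished.
  g_finalize_copy.reset();
}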

pitrou (Member) commented Jan 24, 2023

It sounds like we can give this a try, @westonpace.

@westonpace westonpace added the Priority: Blocker label Apr 11, 2023
westonpace added a commit that referenced this issue Apr 15, 2023
…finished, add pyarrow exit hook (#33858)

CRITICAL FIX: When statically linking with AWS it was possible to have a crash on shutdown/exit. Now that should no longer be possible.

BREAKING CHANGE: S3 can only be initialized and finalized once.

BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.
* Closes: #15054

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
@westonpace westonpace modified the milestones: 12.0.0, 13.0.0 Apr 15, 2023
raulcd pushed a commit that referenced this issue Apr 17, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
mattw-nws added a commit to mattw-nws/ngen that referenced this issue May 17, 2023
mattw-nws added a commit to NOAA-OWP/ngen that referenced this issue May 17, 2023