Rare segmentation fault - node v10.13 - CentOS (ip 0000000000efb532, node[400000+1e8c000]) #24955

assaf-xm · 2018-12-11T11:53:27Z

node -v
v10.13.0
Installed using: node-v10.13.0-linux-x64.tar.xz

uname -a
Linux <> 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

--------------- Bug description -------------

After upgrading from node v10.7.0 to node v10.13.0, we started to see crashes in the node process due to segmentation faults.
Crashes happen every day or every few days, so the problem is hard to reproduce on demand, but it keeps occurring.

The dmesg logs are consistent: the instruction pointer is the same in every crash (ip 0000000000efb532):

[Mon Dec 3 21:10:54 2018] node[25407]: segfault at 3ff0c07598f0 ip 0000000000efb532 sp 00007f1a1a830c40 error 4 in node[400000+1e8c000]
[Tue Dec 4 13:28:47 2018] node[11340]: segfault at 3aea57a0a9c0 ip 0000000000efb532 sp 00007f0ff238dc40 error 4 in node[400000+1e8c000]
[Wed Dec 5 16:13:54 2018] node[13359]: segfault at 329fc81dd2f0 ip 0000000000efb532 sp 00007fc21effcc40 error 4 in node[400000+1e8c000]
[Fri Dec 7 19:36:45 2018] node[29239]: segfault at 13604b37a558 ip 0000000000efb532 sp 00007f57ebffec40 error 4 in node[400000+1e8c000]
[Sat Dec 8 18:53:54 2018] node[30821]: segfault at 204d978e2e50 ip 0000000000efb532 sp 00007f212de0cc40 error 4 in node[400000+1e8c000]
[Sun Dec 9 18:05:08 2018] node[10990]: segfault at 9888ab3a790 ip 0000000000efb532 sp 00007f26261c2c40 error 4 in node[400000+1e8c000]
[Mon Dec 10 20:21:07 2018] node[14981]: segfault at 3cbdf8b02340 ip 0000000000efb532 sp 00007f38a894fc40 error 4 in node[400000+1e8c000]

The failing PID is not the exact PID of the node process, but usually close to it.
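As a rough aid for reading the dmesg lines above, here is a minimal Python sketch that splits one such line into its fields. It assumes the x86 kernel log layout shown in this report; the function and field names are made up for illustration. The number in brackets after the process name is the ID of the faulting task, which for a non-main thread is a thread ID, and the error value is the x86 page-fault error code (bit 0: page was present, bit 1: write access, bit 2: user mode).

```python
import re

# Layout of the kernel's x86 segfault message as it appears in this report.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return the fields of one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"])
    return {
        "tid": int(m["pid"]),              # ID of the faulting task (thread)
        "fault_addr": int(m["addr"], 16),  # address that could not be accessed
        "ip": ip,
        "offset_in_mapping": ip - base,    # ip relative to the mapping start
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


line = ("[Mon Dec 3 21:10:54 2018] node[25407]: segfault at 3ff0c07598f0 "
        "ip 0000000000efb532 sp 00007f1a1a830c40 error 4 in node[400000+1e8c000]")
info = parse_segfault(line)
print(hex(info["offset_in_mapping"]))  # 0xafb532
```

For these logs, error 4 decodes to a user-mode read of a page that was not mapped.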

Looking up the symbols around ip 0000000000efb532 points to the Process function of the RememberedSetUpdatingItem class:
0000000000efaf50 W v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process()
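The lookup described above amounts to finding the last symbol whose start address is at or below the instruction pointer. A minimal sketch of that idea in Python follows; the symbol list contains only the single entry quoted in this report, and the helper name is hypothetical. Without the start of the following symbol (or the symbol's size) it cannot confirm that the address really lies inside the function, so treat the result as a hint.

```python
import bisect


def nearest_symbol(symbols, address):
    """symbols: list of (start_address, name) tuples sorted by address.

    Returns (name, offset_from_start) for the closest symbol at or below
    address, or None if address is below every symbol.
    """
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, address - start


# Only the entry quoted in the report; a real table would hold every symbol.
symbols = [
    (0xEFAF50,
     "v8::internal::RememberedSetUpdatingItem"
     "<v8::internal::MajorNonAtomicMarkingState>::Process()"),
]

name, offset = nearest_symbol(symbols, 0xEFB532)
print(f"{name}+{offset:#x}")  # ...::Process()+0x5e2
```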

I wasn't able to capture a coredump yet.

bnoordhuis · 2018-12-11T12:29:12Z

That's part of the GC. If the PID isn't that of the process, it means it's happening on a thread that isn't the main thread (because the main thread has PID == TID.)
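The PID == TID relationship can be observed from any multi-threaded program. The sketch below uses Python purely for illustration (it is not how node or V8 expose thread IDs) and assumes Python 3.8 or later on Linux, where threading.get_native_id() returns the kernel thread ID.

```python
import os
import threading


def native_ids():
    """Return (pid, tid of the calling thread, tid of a freshly started thread)."""
    caller_tid = threading.get_native_id()
    seen = {}

    def record():
        # Runs in the worker thread, so this is the worker's kernel thread ID.
        seen["tid"] = threading.get_native_id()

    worker = threading.Thread(target=record)
    worker.start()
    worker.join()
    return os.getpid(), caller_tid, seen["tid"]


pid, main_tid, worker_tid = native_ids()
print("pid:", pid, "main tid:", main_tid, "worker tid:", worker_tid)
```

When this runs on the main thread under Linux, the first two numbers match and the third differs, which is the same pattern as a dmesg entry whose ID is close to, but not equal to, the process PID.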

Unfortunately GC crashes are hard to debug because nine times out of ten the real bug is elsewhere; e.g., memory corruption that doesn't manifest until the GC runs.

But let's try anyway. Does find node_modules/ -name \*.node print anything? Does node --predictable app.js work better?

assaf-xm · 2018-12-11T12:42:58Z

Thanks for the quick response!

I'll try running the system with the '--predictable' flag for a while.

Regarding the native modules:

find node_modules/ -name \*.node
node_modules/uws/uws_linux_51.node
node_modules/uws/uws_darwin_48.node
node_modules/uws/uws_win32_51.node
node_modules/uws/uws_linux_48.node
node_modules/uws/uws_linux_47.node
node_modules/uws/uws_darwin_47.node
node_modules/uws/uws_darwin_46.node
node_modules/uws/uws_darwin_51.node
node_modules/uws/uws_win32_48.node
node_modules/uws/uws_linux_46.node
node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node
node_modules/ref/build/Release/obj.target/binding.node
node_modules/ref/build/Release/binding.node
node_modules/modern-syslog/build/Release/obj.target/core.node
node_modules/modern-syslog/build/Release/core.node
node_modules/heapdump/build/Release/obj.target/addon.node
node_modules/heapdump/build/Release/addon.node
node_modules/sleep/build/Release/obj.target/node_sleep.node
node_modules/sleep/build/Release/node_sleep.node
node_modules/ffi/build/Release/obj.target/ffi_bindings.node
node_modules/ffi/build/Release/ffi_bindings.node
node_modules/diskusage/build/Release/obj.target/diskusage.node
node_modules/diskusage/build/Release/diskusage.node

Thanks,
Assaf

bnoordhuis · 2018-12-11T14:56:59Z

Right, that's quite a few native modules, any one of which might be the culprit. ref and ffi are the most likely, but it could be any one of them.¹ Try excluding them and see if the crashes go away.

¹ I'm reasonably sure it's not heapdump (I'm its author) because it doesn't do anything unless activated, but still.
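To see at a glance which packages would need to be excluded, the find output can be grouped by top-level package. This is a small sketch with a hypothetical helper name; it assumes paths of the form node_modules/<package>/... as in the listing above and does not handle scoped (@scope/name) or nested node_modules directories.

```python
from collections import defaultdict


def addons_by_package(find_output):
    """Group `find node_modules/ -name \\*.node` output by top-level package."""
    packages = defaultdict(list)
    for raw in find_output.splitlines():
        parts = raw.strip().split("/")
        # Expect node_modules/<package>/.../<file>.node; skip anything else.
        if len(parts) < 3 or parts[0] != "node_modules":
            continue
        if not parts[-1].endswith(".node"):
            continue
        packages[parts[1]].append("/".join(parts[2:]))
    return dict(packages)


sample = """\
node_modules/uws/uws_linux_51.node
node_modules/ref/build/Release/obj.target/binding.node
node_modules/ref/build/Release/binding.node
node_modules/ffi/build/Release/ffi_bindings.node
"""

for package, files in sorted(addons_by_package(sample).items()):
    print(package, len(files))
```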

bnoordhuis · 2018-12-28T07:32:43Z

I'm going to close this out for lack of follow-up. Let me know if you still want to pursue this and I'll reopen.

assaf-xm · 2018-12-28T16:05:30Z

Thanks, it's still relevant, but we'll try other Node.js versions (10.14 / 10.15) and reopen if we manage to reproduce it more frequently. Currently it's really hard to narrow down.

bnoordhuis added the memory (Issues and PRs related to the memory management or memory footprint) and v10.x labels Dec 11, 2018

bnoordhuis closed this as completed Dec 28, 2018

jvictor85 mentioned this issue Apr 29, 2019

Segfault on Node 10.x doesn't occur in 6.x #27107

Closed
