Segfault on Node 10.x doesn't occur in 6.x #27107
Have you collected core files? They can be used to get more information about the crash. Can you add https://github.com/nodejs/node-report to your app? It will give you loggable text output on a crash, which will help a lot. It's built into node now, but for 10.x you'll have to add it to your environment yourself; it's not hard: https://github.com/nodejs/node-report#usage |
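For reference, wiring node-report into a 10.x app is typically just a module install plus a preload flag (the entry point name below is only an example):
$ npm install node-report              # add the module to the app's dependencies
$ node -r node-report server.js        # preload it so a report is written when the process crashes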
Also, do you use native addons, and if so, which? To me, that makes it seem like an issue where a core dump would be more important than node-report output, although even the former might be hard to work with, because it's likely to be some sort of memory corruption issue where it's hard to figure out the cause from the crash site. |
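The find command referred to in the next reply was cut off in this thread; a common way to list compiled native addons in a project, which is likely what was meant, is:
$ find node_modules -name '*.node'     # native addons are the compiled .node binaries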
Thanks for the help! For now I tried enabling core dumps on one of the 10.x servers with ulimit... assuming I did it right, it could take days or weeks to see the crash though. I forgot to mention that I use PM2 to restart the server process whenever it goes down; hopefully that doesn't matter. As for native addons I'm not sure... I ran that find command and it just shows: That module isn't loaded in by default though, so I'm assuming it isn't related to the crash. I have to manually load it at run-time, and have only used it in some rare situations where I needed to profile for CPU bottlenecks. The basic modules I utilize other than ws and mongodb are http, https, os, and fs. I'm not familiar with memory corruption issues, but let me know if there's anything else I can do that might help. I wondered if it might be something in the garbage collector... I noticed that the GC in 10.x will have a rare delay spike of 2000-3000ms out of nowhere. On 6.x the sweep usually only takes 300ms or so, but sweeps happen more often. The current flags used (on all node versions): |
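As a side note on the ulimit step (the exact commands weren't shown, so this is only a sketch): core dumps are usually enabled per shell, which matters when the process is managed by PM2.
$ ulimit -c unlimited                    # allow core files of any size in this shell and its children
$ cat /proc/sys/kernel/core_pattern      # check where the kernel will write core files
The setting only applies to processes started from that shell, so it needs to be in effect wherever PM2 launches node, or made persistent (for example via /etc/security/limits.conf).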
I got one pretty quick! A core file from this crash:
It's about 1 GB in size (90 MB compressed). I'd love for someone to take a look at it, though I'm not sure of the best way to send it, and I'd like to avoid posting it publicly if possible. I added an email to my profile if someone can get in touch :) |
@jvictor85 Can you get the stack trace? |
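The exact gdb invocation isn't shown in the thread, but pulling a backtrace out of a core file generally looks like this (paths are illustrative; the binary must be the exact one that crashed):
$ gdb /usr/local/bin/node /path/to/core    # open the core against the crashed binary
(gdb) bt                                   # backtrace of the crashing thread
(gdb) bt full                              # same, with local variables where symbols allow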
Thanks! I tried that and it said it loaded symbols from node, though it just shows these addresses:
|
Apparently I had some core files from before and didn't even realize it. Here is the stack trace from the Feb 12th segfault above; it seems to be one of two different types of crashes I get - this one at addr
This EU server is running v10.15.0. |
Okay, so it's a crash during GC. That probably means memory corruption, but unfortunately the real bug is usually not anywhere near the site of the crash, in either time or space. What does |
Sorry, I'm not familiar with gdb, but when I run that it says If it's memory corruption during GC, I'm guessing this isn't anything I can fix in my own code then... If it helps, I remember trying to isolate this last year and that removing the check |
Sorry, I was thinking of Aside: |
Thanks, I'll phase out that flag going forward. I tracked down some more core files and collected the backtraces below, along with the node version and date. For the SEA server I wasn't sure of the exact version used, so I included backtraces for both possibilities. SA1 (Node v10.2.1) May 27 2018
USW1 (Node v10.4.0) June 9 2018
USE2 (Node v10.15.0) March 16 2019
USW2 (Node v10.13.0) November 18 2018
SEA1 (Node v10.15.?) February 17 2019 Node v10.15.0:
Node v10.15.1:
UST1 (Node v10.15.0) February 28 2019
Hopefully someone can make sense of these or find a pattern that leads to something. We would love to finally be able to switch to node 10 for good; performance is critical for us. Let me know if there's anything else I can do. Thanks! |
Also just got a new one today: USE1 (Node v10.15.2) April 12 2019
Many of them seem to point to this same function, though on this server and a couple others it's at the |
The one thing worth mentioning is the presence of To confirm, could you please dump the stack of every thread? (gdb) info thr, then switch to each thread by |
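For completeness, the per-thread inspection being requested usually goes something like this (the thread number is only an example):
(gdb) info threads          # list every thread in the core
(gdb) thread 2              # switch to a particular thread
(gdb) bt                    # backtrace for that thread
(gdb) thread apply all bt   # or dump all threads' stacks in one go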
If the main thread is seen in the exit route with some advancement into cleanup actions, then we are looking at what I suspect. |
Sure thing, here are all 11 threads for the latest core dump from above:
Pretty neat. I wonder if there is anything I can change about my code or environment that would encourage this crash, or avoid it. Or maybe this is just something in v8, like the |
Thanks for the quick reply! We don't see the main thread in the list! Except for thread 9, whose bottom stack is truncated (probably due to JS code), everything else is a helper thread. In summary, this is not conclusive of what I was suspecting. |
Looking for ways to get hold of the core dump. Can you send it to my email directly? Or upload it to a private gist? (Not sure whether the mailbox / gist allows 90 MB or not.) |
Ah, too bad it's not what you suspected... but yes, I'm happy to email it. I'll just upload it to my site and send the link shortly. |
@jvictor85 - while gdb's interpretation of some instructions, missing runtime libraries, etc. are posing challenges in interpreting the failing context, I see this sequence in my local build: (gdb) x/30i 0xf4c900
0xf4c900 <v8::internal::RememberedSetUpdatingItem::Process+1456>: add BYTE PTR [rdi],cl
0xf4c902 <v8::internal::RememberedSetUpdatingItem::Process+1458>: test DWORD PTR [rcx-0x2],ecx
0xf4c905 <v8::internal::RememberedSetUpdatingItem::Process+1461>: (bad)
0xf4c906 <v8::internal::RememberedSetUpdatingItem::Process+1462>: jmp <internal disassembler error>
0xf4c908 <v8::internal::RememberedSetUpdatingItem::Process+1464>: cdq
0xf4c909 <v8::internal::RememberedSetUpdatingItem::Process+1465>: (bad)
0xf4c90a <v8::internal::RememberedSetUpdatingItem::Process+1466>: (bad)
0xf4c90b <v8::internal::RememberedSetUpdatingItem::Process+1467>: jmp <internal disassembler error>
0xf4c90d <v8::internal::RememberedSetUpdatingItem::Process+1469>: (bad)
0xf4c90e <v8::internal::RememberedSetUpdatingItem::Process+1470>: fldcw WORD PTR [rsi]
0xf4c910 <v8::internal::RememberedSetUpdatingItem::Process+1472>: add BYTE PTR [rax-0x75],cl
0xf4c913 <v8::internal::RememberedSetUpdatingItem::Process+1475>: adc BYTE PTR [rcx-0x44],al
0xf4c916 <v8::internal::RememberedSetUpdatingItem::Process+1478>: pop rsp
0xf4c917 <v8::internal::RememberedSetUpdatingItem::Process+1479>: in al,dx
0xf4c918 <v8::internal::RememberedSetUpdatingItem::Process+1480>: push rcx
0xf4c919 <v8::internal::RememberedSetUpdatingItem::Process+1481>: add cl,BYTE PTR [rax-0x75]
0xf4c91c <v8::internal::RememberedSetUpdatingItem::Process+1484>: push rdx
0xf4c91d <v8::internal::RememberedSetUpdatingItem::Process+1485>: adc BYTE PTR [rax-0x7f],cl
0xf4c920 <v8::internal::RememberedSetUpdatingItem::Process+1488>: cli
0xf4c921 <v8::internal::RememberedSetUpdatingItem::Process+1489>: push rax
0xf4c922 <v8::internal::RememberedSetUpdatingItem::Process+1490>: or BYTE PTR [rsi-0x41f28c00],ch
0xf4c928 <v8::internal::RememberedSetUpdatingItem::Process+1496>: nop
0xf4c929 <v8::internal::RememberedSetUpdatingItem::Process+1497>: pushf
0xf4c92a <v8::internal::RememberedSetUpdatingItem::Process+1498>: mov ch,0x1
0xf4c92c <v8::internal::RememberedSetUpdatingItem::Process+1500>: mov rdi,rax
0xf4c92f <v8::internal::RememberedSetUpdatingItem::Process+1503>: call rdx
0xf4c931 <v8::internal::RememberedSetUpdatingItem::Process+1505>: mov r12,rax
the call to that has a fix in v10.15.3
So while I debug further, can you please try with v10.15.3 and see if these crashes still occur? |
Alright that sounds promising! I'll be pleasantly surprised if it's such an easy fix. Thanks so much for spending the time - I'll start transitioning all the servers over today and let you know if I experience any further crashes within the next week or two. |
Got another crash, for v10.15.3
Unfortunately it didn't write the core file... I thought I set ulimit to do that, but maybe it was a temporary setting. I'll try to get that sorted out and deliver the next one as soon as possible. |
Another avenue you could try is a debug build. Debug builds have a lot more checks, so there's a pretty good chance they'll catch the bug closer to the source. Caveat emptor though: they're also a fair bit slower than release builds. To compile a debug build, run |
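The build command was cut off above; compiling a debug build of Node from source generally looks like this (the -j value and entry point are just examples):
$ ./configure --debug          # configure a Debug build alongside the Release one
$ make -j4                     # build; the debug binary ends up at out/Debug/node
$ out/Debug/node server.js     # run the app against the debug binary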
Thanks, I'll keep the debug build in mind, though I don't believe we can afford the performance hit. A couple of crashes later and I've finally grabbed a core file again. Here is the log and the trace; I'll send the file shortly. This is from today on 10.15.3:
|
It is surprising to note that the crash is in the same method; it makes sense to believe that the upgrade to v10.15.3 did not make any difference. |
@jvictor85 sent in another core file with so it looks like the best way is to try with a debug build.
This is only to perform problem isolation and determination; I'm definitely not recommending it for production. |
Thanks for your help!
I can try to get a debug build going on a test server; it's just that in my experience this crash tends to show up under load with many real clients connecting. It has happened during testing, but it can take several months to see it occur once, while on production it happens once or twice a week (across many servers). I haven't had much luck simulating load to reproduce it more often. Though it does seem to occur more on particular servers in certain conditions, so I'll try grabbing database backups from before a crash and see if I can get something to happen. I also haven't tested v12 yet, so I'll tinker with that to see if there is a difference. I remember seeing some issues similar to mine (like #24955) but it seems to be rare... not sure if that helps narrow things down. |
Just to follow up - I've had a couple servers on v12.1.0 for over a week, and recently started moving a few others over as well and so far they haven't experienced this crash at all. The servers on v10.x still continue to crash. So I'm wondering if something in the v8 code in version 10 (or it could be anything between versions 6 and 12) is causing this. I'll post an update if the crash ever happens on 12, and will be keeping an eye on updates to 10. I would prefer to use that when possible since I've noticed a bit better GC performance with 10 compared to 12 so far. |
@jvictor85 - thanks for the update. Given that we don't have any further clues and debug build dump is unavailable, your action plan looks good to me. |
Tested with 10.16.0, and it's still crashing in the same manner. In version 12 it hasn't crashed in a month, so it seems to be fixed there. Unfortunately performance is noticeably worse in 12 (same issue with 6), so I try to stick with 10 when possible so the users don't complain - aside from the crashes.
I was hoping this was a known issue in v8 that was fixed and could be backported to 10 but maybe it was resolved unintentionally through some of the newer changes. If I can find the time I'll see if I can reproduce this on a test server again, but I'm not too hopeful that I can simulate the right conditions. Maybe someone familiar with the v8 version changes can identify where the fix may have happened? |
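One low-effort way to narrow that down, if time ever permits, is to step through intermediate releases with a version switcher such as n (which the reporter mentions using) and see where the crash disappears; the version numbers below are only examples:
$ npm install -g n       # the n version switcher
$ n 10.15.3              # install and switch to a specific release
$ node -v                # confirm the active version before re-running the workload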
Alright, maybe I found something - I accidentally stumbled on the right conditions for a crash and was able to reproduce it consistently. I reduced the code down to something that can be worked with. I'm not sure if this is the exact crash as on my production servers because the syslog reports it differently, but it behaves in a similar way: it quits without an error and only occurs on node 10 and 11. I tested this with node 8 and 12 and don't see the crash. Here's the code:
It looks a little odd because it's a remnant of several thousand lines of code, but I left in the lines that cause a quick crash. If you comment out either of the distance calls the program will run fine, even though they don't really do anything. Tested on Windows and Linux. I'm hoping this is at least related to the production crash I'm experiencing, but let me know if I'm missing something obvious here... |
ping @jvictor85 - is this still an outstanding issue for you? |
Yes, unfortunately this remaining crash is still very much an issue for me and my users. Performance in version 12 is still poor enough that I end up bouncing between versions 10 and 12 depending on whether I want performance or stability. Many of my servers that have lower user populations almost never experience the crash - maybe every few weeks or months - so I keep them on version 10. More specifically, v10.16.x still has the crash; I haven't tested with 10.17 but I expect it is probably the same. I have 2 higher-population servers in particular that experience the crash every couple of days, so I try to keep them on v12. Sorry I wasn't able to dedicate the time required to bisecting on a production server. I understand this is a rare crash that not many people experience, so currently I'm waiting on new versions of 12 (or 13) to solve some of the performance regressions so I can fully switch to that. I'll post when that time comes, and let me know if there is any other information I can provide. It's up to you if you want to close this for now or keep it open. Thanks for all the help up to now! |
is this still happening? should this remain open? |
My performance issues were improved enough by version 12.16.2 (and all versions of 14) that I moved every server over to those. Due to that I never got the chance to test on later versions of 10, but the crash was still happening when I switched over. I'm not sure if anyone else is experiencing this crash or can reproduce it, but I shouldn't be seeing it anymore so I'll close this for now. Thanks again! |
Since last year I've been having issues switching to 10.x from 6.x because of rare crashes that give no error. I have several production servers with hundreds of users, and it's been too difficult to reproduce with a local server so far because it seems to only happen under load and it's fairly rare.
There's a large amount of code that I can't really reduce for testing without affecting my users. Since I haven't had the time or ability to debug this properly I was hoping someone could point me in the right direction, or that this would be helpful to the developers if this is a bug in 10.x or v8.
There are no errors to report, but here are some of the recent segfaults in various syslogs that only happen when a server is on node 10:
Feb 12 21:56:29 EU1 kernel: [22262355.042532] node[20512]: segfault at 3d77ac406468 ip 0000000000efade2 sp 00007f263e259c10 error 4 in node[400000+1e90000]
Mar 13 19:39:09 SEA1 kernel: [2908566.872204] node[4145]: segfault at 23e4891a2fb0 ip 0000000000efa272 sp 00007fbbf4dd0c10 error 4 in node[400000+1e8f000]
Mar 16 08:59:55 SEA1 kernel: [3129393.630360] node[14805]: segfault at 1ac3c9db79c8 ip 0000000000efa272 sp 00007f10629f0c10 error 4 in node[400000+1e8f000]
Mar 16 20:25:29 USW1 kernel: [3173535.851715] node[31823]: segfault at 13aa4aac9a78 ip 0000000000efa272 sp 00007fdb85380c10 error 4 in node[400000+1e8f000]
Mar 17 00:26:56 USE2 kernel: [25489067.874929] node[17011]: segfault at 93788ef0108 ip 0000000000efade2 sp 00007fc14bffec10 error 4 in node[400000+1e90000]
Mar 19 22:10:11 USW1 kernel: [3438995.257871] node[11791]: segfault at 7b0370d05c8 ip 0000000000efa272 sp 00007f88d3403c10 error 4 in node[400000+1e8f000]
Mar 21 11:46:28 USW1 kernel: [3574361.032453] node[18756]: segfault at 10f0170f9b8 ip 0000000000efa272 sp 00007fdb9e281c10 error 4 in node (deleted)[400000+1e8f000]
Mar 27 18:55:30 USE2 kernel: [26419545.476970] node[21011]: segfault at 706d2e8f0b8 ip 0000000000efade2 sp 00007f2b59fadc10 error 4 in node[400000+1e90000]
Apr 1 20:39:24 SEA1 kernel: [319450.383166] node[8710]: segfault at 16f9cfd832b0 ip 0000000000efa272 sp 00007f7850aa2c10 error 4 in node[400000+1e8f000]
Apr 1 20:23:53 USE1 kernel: [26742046.491931] node[4466]: segfault at 3253e97d4310 ip 0000000000efa272 sp 00007f1b2fffec10 error 4 in node[400000+1e8f000]
Apr 2 14:58:52 EU1 kernel: [26470558.192840] node[27273]: segfault at 1bbcf96ba358 ip 0000000000efade2 sp 00007f1e95a03c10 error 4 in node[400000+1e90000]
The module versions have been different throughout the last year and on different servers but here is what is currently being used on USW1 as an example:
I haven't done much testing in node 8.x due to some major garbage collection delay, which is another issue altogether. Similarly, I haven't had success on version 11.x. I was able to reproduce the crash on my local Windows machine once upon a time after days of experimentation on node 10, but not reliably, and I can't seem to get it to occur anymore. It's possible some newer changes have reduced the frequency of this crash, because I remember it happening more often in 2018.
I've been mostly using the N version switcher to test different versions. Let me know if there is any other info I can provide or some way to look up these segfault addresses to narrow things down. Thanks!
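A typical way to look up those segfault addresses, since the syslog lines record the faulting instruction pointer (ip) and show node mapped at its usual non-PIE base of 0x400000, is to resolve the ip against the same node binary (paths are illustrative, and the binary on disk has to match the one that crashed):
$ addr2line -f -C -e /usr/local/bin/node 0xefade2              # map the ip from the Feb 12 line to a function name
$ gdb -batch -ex 'info symbol 0xefade2' /usr/local/bin/node    # alternative lookup with gdb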