
Increase stack size at startup #2313

Closed
wants to merge 3 commits into from

Conversation

Fraccaman
Member

Describe your changes

Try to increase stack size at ledger startup

Indicate which release or other PRs this topic is based on

Checklist before merging to draft

  • I have added a changelog
  • Git history is in acceptable state

@Fraccaman requested review from tzemanovic and sug0 on December 19, 2023 at 12:50
sug0 previously approved these changes Dec 19, 2023
Collaborator

@sug0 left a comment

so, as discussed in our call, I think this is fine as a hotfix candidate. if we find that it does indeed work, we should refactor this code a bit, and move the process resource limit changes out of rocksdb code to another layer.

another possible fix is setting the stack limit via link args:
https://www.dsfcode.com/posts/increase-rust-stack-size-on-macos/

for now, approving this pr
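
For illustration, moving the limit change out to "another layer" could look roughly like the following at process startup. This is a minimal sketch assuming a Unix target and the `libc` crate; the helper name and the 128 MiB value are illustrative, not this PR's actual code:

```rust
use std::io;

/// Illustrative target size, not taken from this PR.
const WANTED_STACK_BYTES: libc::rlim_t = 128 * 1024 * 1024;

/// Raise the soft RLIMIT_STACK if it is below the wanted size and the
/// hard limit allows it. On Linux the main thread's stack may then grow
/// further; worker threads usually still need an explicit stack size,
/// since glibc captures the default thread stack size at program start.
fn raise_stack_limit() -> io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_STACK, &mut lim) != 0 {
            return Err(io::Error::last_os_error());
        }
        if lim.rlim_cur < WANTED_STACK_BYTES && lim.rlim_max >= WANTED_STACK_BYTES {
            lim.rlim_cur = WANTED_STACK_BYTES;
            if libc::setrlimit(libc::RLIMIT_STACK, &lim) != 0 {
                return Err(io::Error::last_os_error());
            }
        }
        Ok(())
    }
}
```

The link-args route from the blog post instead fixes the main thread's stack size at link time (on macOS, something like `RUSTFLAGS="-C link-args=-Wl,-stack_size,0x8000000"`), trading runtime flexibility for not needing any syscalls.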

@sug0
Collaborator

sug0 commented Dec 19, 2023

don't forget to make the changes proposed by clippy

@Krewedk0
Contributor

Krewedk0 commented Dec 19, 2023

tl;dr: this fix solves the stack overflow on restart, but does not solve the overflow around block 9582.

Did a bit of testing on our node with LimitSTACK set to infinity. While Tendermint is in fast sync mode (blocksync), the node can be gracefully stopped and then restarted with roughly an 80% success rate. No more stack overflow errors there.

Sometimes it crashes on shutdown (logs: https://pastebin.com/reuwzrsP) and then Tendermint won't start again, probably because of corrupted state. But that is most likely not caused by stack limit issues.
For reference, these are the logs of a successful shutdown: https://pastebin.com/vxDQtv5V

There is one more stack overflow issue, though: around block 9582 some nodes crash with the following logs (https://pastebin.com/N2E1msmq), and this fix does not resolve it.
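
For context, LimitSTACK is systemd's per-service counterpart to `ulimit -s`. A hypothetical drop-in (path and unit name inferred from the logs later in this thread) that applies the infinite limit looks like:

```ini
# /etc/systemd/system/namada.service.d/override.conf
# Hypothetical drop-in; the unit name is taken from the logs below.
[Service]
LimitSTACK=infinity
```

followed by `systemctl daemon-reload` and a restart of the service.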

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Alright, so I tried to test this as well, like @Krewedk0 did. The only difference is that I didn't use LimitSTACK = infinity; I started from the default value of 8192 and kept doubling it until the stack overflow disappeared. That happened at 134217728, i.e. 128 MiB in bytes (this only relates to the restart problem).

I have to see how it performs around the infamous blocks 9500 - 10000 with it set to an exact number. Will update in a bit; currently at height 900.

Though, this all may hint at a potential leak (recursion, memory, a call stack not properly cleaned up)? I dunno, my knowledge of Rust is fairly limited.

@johnflexgang

I agree with @Krewedk0. I'm still seeing the same error on my side with LimitSTACK set to infinity.
I also tried syncing again from zero with LimitSTACK set to infinity, suspecting corrupted state, but it showed the same error.

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Checking the code now. Perhaps setting the DEFAULT_STACK_LIMIT const to this value (134217728) could solve it? Though I still think this postpones the problem rather than fixing it.
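
Since the panics below name a 'ledger-shell' thread, another angle (independent of any process-wide default) is to give that thread an explicit stack size when it is spawned. A minimal sketch, with a hypothetical helper name and the 128 MiB figure found by doubling above:

```rust
use std::thread;

/// Hypothetical helper, not this PR's code: spawn the shell on a thread
/// with an explicitly sized stack instead of the process default.
fn spawn_ledger_shell<F>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::Builder::new()
        .name("ledger-shell".into())
        // 134217728 bytes (128 MiB), the value that worked above.
        .stack_size(128 * 1024 * 1024)
        .spawn(f)
        .expect("failed to spawn ledger-shell thread")
}
```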

@Fraccaman
Member Author

> tl;dr: this fix solves the stack overflow on restart, but does not solve the overflow around block 9582.
>
> Did a bit of testing on our node with LimitSTACK set to infinity. While Tendermint is in fast sync mode (blocksync), the node can be gracefully stopped and then restarted with roughly an 80% success rate. No more stack overflow errors there.
>
> Sometimes it crashes on shutdown (logs: https://pastebin.com/reuwzrsP) and then Tendermint won't start again, probably because of corrupted state. But that is most likely not caused by stack limit issues. For reference, these are the logs of a successful shutdown: https://pastebin.com/vxDQtv5V
>
> There is one more stack overflow issue, though: around block 9582 some nodes crash with the following logs (https://pastebin.com/N2E1msmq), and this fix does not resolve it.

Can't access the pastebins, would be nice if you could re-upload them :) thanks for the comment!

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Okay, so no: I restarted around block 5000 and got the stack overflow again.

This means the required stack size keeps growing, so there's definitely something up.
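
If the root cause really is unbounded or very deep recursion, raising the limit only postpones the crash, as observed here. One common Rust-side mitigation (not something this PR attempts) is the `stacker` crate, which allocates a fresh stack segment when a recursive call runs low on space. A sketch over a hypothetical recursive structure:

```rust
/// Hypothetical recursive structure, for illustration only.
struct Node {
    children: Vec<Node>,
}

fn walk(node: &Node) {
    // Grow the stack by 32 MiB whenever less than 64 KiB remains,
    // instead of overflowing on deep recursion. Requires the
    // `stacker` crate as a dependency.
    stacker::maybe_grow(64 * 1024, 32 * 1024 * 1024, || {
        for child in &node.children {
            walk(child);
        }
    });
}
```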

@Krewedk0
Contributor

> Can't access the pastebins, would be nice if you could re-upload them :) thanks for the comment!

For some reason the first two keep getting removed, so I'll just leave them here.
Successful shutdown:

2023-12-19T18:58:36.045686Z  INFO namada_sdk::control_flow: Received interrupt signal, exiting...
2023-12-19T18:58:36.045727Z  INFO namada_apps::node::ledger::tendermint_node: Shutting down Tendermint node...
Stopping namada...
2023-12-19T18:58:36.053530Z  INFO namada_apps::node::ledger: Tendermint node is no longer running.
2023-12-19T18:58:36.053544Z  INFO namada_apps::node::ledger::broadcaster: Shutting down broadcaster...
2023-12-19T18:58:36.053547Z  INFO namada_apps::node::ledger: Broadcaster is no longer running.
2023-12-19T18:58:36.053596Z  INFO namada_apps::node::ledger: Shutting down ABCI server...
2023-12-19T18:58:36.055989Z  INFO namada_apps::node::ledger::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:8545"
2023-12-19T18:58:36.056074Z  INFO namada_apps::node::ledger: Namada ledger node has shut down.
namada.service: Deactivated successfully.
Stopped namada.

Unsuccessful shutdown:

2023-12-19T18:59:08.292786Z  INFO namada_sdk::control_flow: Received interrupt signal, exiting...
2023-12-19T18:59:08.292825Z  INFO namada_apps::node::ledger::tendermint_node: Shutting down Tendermint node...
Stopping namada...
2023-12-19T18:59:08.301781Z  INFO namada_apps::node::ledger: Tendermint node is no longer running.
2023-12-19T18:59:08.301827Z  INFO namada_apps::node::ledger::broadcaster: Shutting down broadcaster...
2023-12-19T18:59:08.301834Z  INFO namada_apps::node::ledger: Broadcaster is no longer running.
2023-12-19T18:59:08.301887Z  INFO namada_apps::node::ledger: Shutting down ABCI server...
2023-12-19T18:59:08.308782Z  INFO namada_apps::node::ledger::shell: Committed block hash: b89a52f0d253d7340d928fc5764b4f5cbae8e66b28f13936c2310aa42d6b45ad, height: 257
2023-12-19T18:59:08.308792Z  INFO namada_apps::node::ledger::shell: Ethereum oracle has not yet fully processed any Ethereum blocks
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
Location: /home/namada/.cargo/git/checkouts/tower-abci-0d01b039e0b7a0c9/a41ab4b/src/v037/server.rs:180
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2023-12-19T18:59:08.312253Z  INFO namada_apps::node::ledger::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:8545"
2023-12-19T18:59:08.312314Z  INFO namada_apps::node::ledger: Namada ledger node has shut down.
namada.service: Deactivated successfully.
Stopped namada.

crash around block 9582:

2023-12-19T18:49:10.712960Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 9582, epoch: 21, is new epoch: false.
thread 'ledger-shell' has overflowed its stack
fatal runtime error: stack overflow
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=query
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=query msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=snapshot
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=snapshot msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=mempool
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=mempool msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] query connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-12-19|18:49:10.834] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
namada.service: Main process exited, code=killed, status=6/ABRT
namada.service: Failed with result 'signal'.
namada.service: Unit process 2600964 (cometbft) remains running after unit stopped.
namada.service: Consumed 22min 34.567s CPU time.
panic: Failed to process committed block (9582:7857B68FAAA690E88D8E9AA0D146EA8F4D926A4D5FAB0D14DD430E5DB4B8E965): read message: EOF
goroutine 102 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0002b2fc0, 0x0)
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:392 +0x139f
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:109 +0x7a

@AlexToTheMoon

AlexToTheMoon commented Dec 19, 2023

Crashed at block 9595, using LimitSTACK=infinity, v0.28.1-5-g679002d:

I[2023-12-19|21:43:46.220] indexed block exents                         module=txindex height=9594
2023-12-19T21:43:46.297260Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 9595, epoch: 21, is new epoch: false.
thread 'ledger-shell' has overflowed its stack
fatal runtime error: stack overflow
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|21:43:46.459] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=snapshot
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=query
E[2023-12-19|21:43:46.459] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=query msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] signal trapped                               module=main msg="captured terminated, exiting..."
I[2023-12-19|21:43:46.459] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2023-12-19|21:43:46.459] Stopping Node                                module=main
I[2023-12-19|21:43:46.459] service stop                                 module=events msg="Stopping EventBus service" impl=EventBus
I[2023-12-19|21:43:46.459] service stop                                 module=pubsub msg="Stopping PubSub service" impl=PubSub
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=mempool
I[2023-12-19|21:43:46.459] service stop                                 module=txindex msg="Stopping IndexerService service" impl=IndexerService
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=mempool msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] service stop                                 module=p2p msg="Stopping P2P Switch service" impl="P2P Switch"
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=snapshot msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] service stop                                 module=p2p [email protected]:26656 msg="Stopping MConnection service" impl=MConn{136.243.40.38:26656}
namadad.service: Main process exited, code=killed, status=6/ABRT
I[2023-12-19|21:43:46.459] service stop                                 module=p2p [email protected]:26656 msg="Stopping Peer service" impl="Peer{MConn{136.243.40.38:26656} 3d6b5aa8cf30fddf887bda6adb989c82c4982111 out}"
panic: Failed to process committed block (9595:AAD200B077FE191B75B4610C020C468FF9E62A602EB2BAD5A03C2BCA430C4CBC): read message: EOF
goroutine 72 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0002b2fc0, 0x0)
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:392 +0x139f
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:109 +0x7a
namadad.service: Failed with result 'signal'.
namadad.service: Consumed 25min 4.818s CPU time.
namadad.service: Scheduled restart job, restart counter is at 1.
Stopped Namada Node.

@Fraccaman
Member Author

closing in favor of #2325

@Fraccaman closed this Dec 21, 2023
@tzemanovic mentioned this pull request Jan 16, 2024