
Increase stack size at startup #2313

Closed
wants to merge 3 commits into from

Conversation

Fraccaman
Member

Describe your changes

Try to increase stack size at ledger startup

Indicate which release or other PRs this topic is based on

Checklist before merging to draft

  • I have added a changelog
  • Git history is in acceptable state

@Fraccaman requested review from tzemanovic and sug0 on December 19, 2023 at 12:50
sug0 previously approved these changes Dec 19, 2023
Collaborator

@sug0 left a comment

so, as discussed in our call, I think this is fine as a hotfix candidate. if we find that it does indeed work, we should refactor this code a bit, and move the process resource limit changes out of rocksdb code to another layer.

another possible fix is setting the stack limit via link args:
https://www.dsfcode.com/posts/increase-rust-stack-size-on-macos/

for now, approving this pr
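
For illustration, moving the limit change out to "another layer" could look roughly like the following at process startup. This is a minimal sketch assuming a Unix target and the `libc` crate; the helper name and the 128 MiB value are illustrative, not this PR's actual code:

```rust
use std::io;

/// Illustrative target size, not taken from this PR.
const WANTED_STACK_BYTES: libc::rlim_t = 128 * 1024 * 1024;

/// Raise the soft RLIMIT_STACK if it is below the wanted size and the
/// hard limit allows it. On Linux the main thread's stack may then grow
/// further; worker threads usually still need an explicit stack size,
/// since glibc captures the default thread stack size at program start.
fn raise_stack_limit() -> io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_STACK, &mut lim) != 0 {
            return Err(io::Error::last_os_error());
        }
        if lim.rlim_cur < WANTED_STACK_BYTES && lim.rlim_max >= WANTED_STACK_BYTES {
            lim.rlim_cur = WANTED_STACK_BYTES;
            if libc::setrlimit(libc::RLIMIT_STACK, &lim) != 0 {
                return Err(io::Error::last_os_error());
            }
        }
        Ok(())
    }
}
```

The link-args route from the blog post instead fixes the main thread's stack size at link time (on macOS, something like `RUSTFLAGS="-C link-args=-Wl,-stack_size,0x8000000"`), trading runtime flexibility for not needing any syscalls.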

@sug0
Collaborator

sug0 commented Dec 19, 2023

don't forget to make the changes proposed by clippy

@Krewedk0
Contributor

Krewedk0 commented Dec 19, 2023

tl;dr: this fix solves the stack overflow on restart, but does not solve the overflow around block 9582.

Did a bit of testing on our node with LimitSTACK set to infinity. While Tendermint is in fast sync mode (blocksync), the node can be gracefully stopped and then restarted with roughly an 80% success rate. No more stack overflow errors there.

Sometimes it crashes on shutdown (logs: https://pastebin.com/reuwzrsP) and then Tendermint won't start again, probably because of corrupted state. But that is most likely not caused by stack limit issues.
For reference, these are the logs of a successful shutdown: https://pastebin.com/vxDQtv5V

There is one more stack overflow issue, though: around block 9582 some nodes crash with the following logs (https://pastebin.com/N2E1msmq), and this fix does not resolve it.
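
For context, LimitSTACK is systemd's per-service counterpart to `ulimit -s`. A hypothetical drop-in (path and unit name inferred from the logs later in this thread) that applies the infinite limit looks like:

```ini
# /etc/systemd/system/namada.service.d/override.conf
# Hypothetical drop-in; the unit name is taken from the logs below.
[Service]
LimitSTACK=infinity
```

followed by `systemctl daemon-reload` and a restart of the service.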

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Alright, so I tried to test this as well, like @Krewedk0 did. The only difference is that I didn't use LimitSTACK = infinity; I started from the default value of 8192 and kept doubling it until the stack overflow disappeared. That happened at 134217728, i.e. 128 MiB in bytes (this only relates to the restart problem).

I have to see how it performs around the infamous blocks 9500 - 10000 with it set to an exact number. Will update in a bit; currently at height 900.

Though, this all may hint at a potential leak (recursion, memory, a call stack not properly cleaned up)? I dunno, my knowledge of Rust is fairly limited.

@johnflexgang

I agree with @Krewedk0. I'm still seeing the same error on my side with LimitSTACK set to infinity.
I also tried syncing again from zero with LimitSTACK set to infinity, suspecting corrupted state, but it showed the same error.

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Checking the code now. Perhaps setting the DEFAULT_STACK_LIMIT const to this value (134217728) could solve it? Though I still think this postpones the problem rather than fixing it.
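
Since the panics below name a 'ledger-shell' thread, another angle (independent of any process-wide default) is to give that thread an explicit stack size when it is spawned. A minimal sketch, with a hypothetical helper name and the 128 MiB figure found by doubling above:

```rust
use std::thread;

/// Hypothetical helper, not this PR's code: spawn the shell on a thread
/// with an explicitly sized stack instead of the process default.
fn spawn_ledger_shell<F>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::Builder::new()
        .name("ledger-shell".into())
        // 134217728 bytes (128 MiB), the value that worked above.
        .stack_size(128 * 1024 * 1024)
        .spawn(f)
        .expect("failed to spawn ledger-shell thread")
}
```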

@Fraccaman
Member Author

> tl;dr: this fix solves the stack overflow on restart, but does not solve the overflow around block 9582.
>
> Did a bit of testing on our node with LimitSTACK set to infinity. While Tendermint is in fast sync mode (blocksync), the node can be gracefully stopped and then restarted with roughly an 80% success rate. No more stack overflow errors there.
>
> Sometimes it crashes on shutdown (logs: https://pastebin.com/reuwzrsP) and then Tendermint won't start again, probably because of corrupted state. But that is most likely not caused by stack limit issues. For reference, these are the logs of a successful shutdown: https://pastebin.com/vxDQtv5V
>
> There is one more stack overflow issue, though: around block 9582 some nodes crash with the following logs (https://pastebin.com/N2E1msmq), and this fix does not resolve it.

Can't access the pastebins, would be nice if you could re-upload them :) thanks for the comment!

@zenodeapp
Contributor

zenodeapp commented Dec 19, 2023

Okay, so no: I restarted around block 5000 and got the stack overflow again.

This means the required stack size keeps growing, so there's definitely something up.
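
If the root cause really is unbounded or very deep recursion, raising the limit only postpones the crash, as observed here. One common Rust-side mitigation (not something this PR attempts) is the `stacker` crate, which allocates a fresh stack segment when a recursive call runs low on space. A sketch over a hypothetical recursive structure:

```rust
/// Hypothetical recursive structure, for illustration only.
struct Node {
    children: Vec<Node>,
}

fn walk(node: &Node) {
    // Grow the stack by 32 MiB whenever less than 64 KiB remains,
    // instead of overflowing on deep recursion. Requires the
    // `stacker` crate as a dependency.
    stacker::maybe_grow(64 * 1024, 32 * 1024 * 1024, || {
        for child in &node.children {
            walk(child);
        }
    });
}
```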

@Krewedk0
Contributor

> Can't access the pastebins, would be nice if you could re-upload them :) thanks for the comment!

For some reason the first two keep getting removed, so I'll just leave them here.
Successful shutdown:

2023-12-19T18:58:36.045686Z  INFO namada_sdk::control_flow: Received interrupt signal, exiting...
2023-12-19T18:58:36.045727Z  INFO namada_apps::node::ledger::tendermint_node: Shutting down Tendermint node...
Stopping namada...
2023-12-19T18:58:36.053530Z  INFO namada_apps::node::ledger: Tendermint node is no longer running.
2023-12-19T18:58:36.053544Z  INFO namada_apps::node::ledger::broadcaster: Shutting down broadcaster...
2023-12-19T18:58:36.053547Z  INFO namada_apps::node::ledger: Broadcaster is no longer running.
2023-12-19T18:58:36.053596Z  INFO namada_apps::node::ledger: Shutting down ABCI server...
2023-12-19T18:58:36.055989Z  INFO namada_apps::node::ledger::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:8545"
2023-12-19T18:58:36.056074Z  INFO namada_apps::node::ledger: Namada ledger node has shut down.
namada.service: Deactivated successfully.
Stopped namada.

Unsuccessful shutdown:

2023-12-19T18:59:08.292786Z  INFO namada_sdk::control_flow: Received interrupt signal, exiting...
2023-12-19T18:59:08.292825Z  INFO namada_apps::node::ledger::tendermint_node: Shutting down Tendermint node...
Stopping namada...
2023-12-19T18:59:08.301781Z  INFO namada_apps::node::ledger: Tendermint node is no longer running.
2023-12-19T18:59:08.301827Z  INFO namada_apps::node::ledger::broadcaster: Shutting down broadcaster...
2023-12-19T18:59:08.301834Z  INFO namada_apps::node::ledger: Broadcaster is no longer running.
2023-12-19T18:59:08.301887Z  INFO namada_apps::node::ledger: Shutting down ABCI server...
2023-12-19T18:59:08.308782Z  INFO namada_apps::node::ledger::shell: Committed block hash: b89a52f0d253d7340d928fc5764b4f5cbae8e66b28f13936c2310aa42d6b45ad, height: 257
2023-12-19T18:59:08.308792Z  INFO namada_apps::node::ledger::shell: Ethereum oracle has not yet fully processed any Ethereum blocks
The application panicked (crashed).
Message:  called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }
Location: /home/namada/.cargo/git/checkouts/tower-abci-0d01b039e0b7a0c9/a41ab4b/src/v037/server.rs:180
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
2023-12-19T18:59:08.312253Z  INFO namada_apps::node::ledger::ethereum_oracle: Ethereum event oracle is no longer running url="http://127.0.0.1:8545"
2023-12-19T18:59:08.312314Z  INFO namada_apps::node::ledger: Namada ledger node has shut down.
namada.service: Deactivated successfully.
Stopped namada.

crash around block 9582:

2023-12-19T18:49:10.712960Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 9582, epoch: 21, is new epoch: false.
thread 'ledger-shell' has overflowed its stack
fatal runtime error: stack overflow
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=query
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=query msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=snapshot
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=snapshot msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=mempool
E[2023-12-19|18:49:10.834] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=mempool msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|18:49:10.834] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|18:49:10.834] query connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-12-19|18:49:10.834] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
namada.service: Main process exited, code=killed, status=6/ABRT
namada.service: Failed with result 'signal'.
namada.service: Unit process 2600964 (cometbft) remains running after unit stopped.
namada.service: Consumed 22min 34.567s CPU time.
panic: Failed to process committed block (9582:7857B68FAAA690E88D8E9AA0D146EA8F4D926A4D5FAB0D14DD430E5DB4B8E965): read message: EOF
goroutine 102 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0002b2fc0, 0x0)
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:392 +0x139f
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:109 +0x7a

@AlexToTheMoon

AlexToTheMoon commented Dec 19, 2023

Crashed at block 9595, using LimitSTACK=infinity, v0.28.1-5-g679002d:

I[2023-12-19|21:43:46.220] indexed block exents                         module=txindex height=9594
2023-12-19T21:43:46.297260Z  INFO namada_apps::node::ledger::shell::finalize_block: Block height: 9595, epoch: 21, is new epoch: false.
thread 'ledger-shell' has overflowed its stack
fatal runtime error: stack overflow
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2023-12-19|21:43:46.459] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=snapshot
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=query
E[2023-12-19|21:43:46.459] error in proxyAppConn.EndBlock               module=state err="read message: EOF"
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=query msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] signal trapped                               module=main msg="captured terminated, exiting..."
I[2023-12-19|21:43:46.459] service stop                                 module=main msg="Stopping Node service" impl=Node
I[2023-12-19|21:43:46.459] Stopping Node                                module=main
I[2023-12-19|21:43:46.459] service stop                                 module=events msg="Stopping EventBus service" impl=EventBus
I[2023-12-19|21:43:46.459] service stop                                 module=pubsub msg="Stopping PubSub service" impl=PubSub
E[2023-12-19|21:43:46.459] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=mempool
I[2023-12-19|21:43:46.459] service stop                                 module=txindex msg="Stopping IndexerService service" impl=IndexerService
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=mempool msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] service stop                                 module=p2p msg="Stopping P2P Switch service" impl="P2P Switch"
I[2023-12-19|21:43:46.459] service stop                                 module=abci-client connection=snapshot msg="Stopping socketClient service" impl=socketClient
I[2023-12-19|21:43:46.459] service stop                                 module=p2p [email protected]:26656 msg="Stopping MConnection service" impl=MConn{136.243.40.38:26656}
namadad.service: Main process exited, code=killed, status=6/ABRT
I[2023-12-19|21:43:46.459] service stop                                 module=p2p [email protected]:26656 msg="Stopping Peer service" impl="Peer{MConn{136.243.40.38:26656} 3d6b5aa8cf30fddf887bda6adb989c82c4982111 out}"
panic: Failed to process committed block (9595:AAD200B077FE191B75B4610C020C468FF9E62A602EB2BAD5A03C2BCA430C4CBC): read message: EOF
goroutine 72 [running]:
github.com/cometbft/cometbft/blocksync.(*Reactor).poolRoutine(0xc0002b2fc0, 0x0)
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:392 +0x139f
created by github.com/cometbft/cometbft/blocksync.(*Reactor).OnStart
        /home/runner/work/cometbft/cometbft/blocksync/reactor.go:109 +0x7a
namadad.service: Failed with result 'signal'.
namadad.service: Consumed 25min 4.818s CPU time.
namadad.service: Scheduled restart job, restart counter is at 1.
Stopped Namada Node.

@Fraccaman
Member Author

closing in favor of #2325

@Fraccaman closed this Dec 21, 2023
@tzemanovic mentioned this pull request Jan 16, 2024