panic: runtime error: invalid memory address or nil pointer dereference #1308
Comments
@frankie-lim-sweeho can you try something out for us? In your container configuration can you add the following:
I suspect this will give the process enough time to finish writing to the db. |
@antonydenyer will test. Thanks |
@antonydenyer I have tested and I can still reproduce the issue
|
@antonydenyer and @baptiste-b-pegasys, with version 2 of the fix (image: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-2) I got a different panic error, but at least the version 2 panic does not get stuck in a crash loop forever; it recovers after a restart. So it must be a code change between version 2 (image: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-2) and version 3 (image: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-3) that causes the infinite/unrecoverable crash. @baptiste-b-pegasys mentioned that version 3 will reset the storage when the panic happens. Is it possible that this storage reset causes the unrecoverable crash? Please check the code. The version 2 (image: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-2) crash log is attached in full.
|
@baptiste-b-pegasys @antonydenyer, I was finally able to capture the actual Quorum crash log while sending high-TPS transactions to Quorum using image: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-3, which has the fix for the original 'fatal error: concurrent map iteration and map write' issue.
Full log |
Can I have the tools you used, so that I can reproduce this? Thank you |
You can try two images: |
@baptiste-b-pegasys thanks! Will test. I will also document in as much detail as possible how to compile and set up Hyperledger Caliper for the Quorum SUT. |
Test Result:
|
Test Result:
|
Sorry, my mistake, I forgot one lock in the Copy function: ghcr.io/baptiste-b-pegasys/quorum:fix21.10-6 |
cool. Let me try baptiste-b-pegasys/quorum:fix21.10-6 now! |
@baptiste-b-pegasys the fix baptiste-b-pegasys/quorum:fix21.10-6 failed immediately on the first attempt with a new error, "fatal error: sync: unlock of unlocked mutex". Full log attached.
|
ok inner call |
ghcr.io/baptiste-b-pegasys/quorum:fix21.10-7 failed and crashed on the first attempt. Full log attached.
|
Hello, another try. It would be nice to have the tool for reproducing this. |
@baptiste-b-pegasys thanks! will try. |
ghcr.io/baptiste-b-pegasys/quorum:fix21.10-8 failed and crashed on the first attempt. Full log attached.
|
OK I am waiting for your tool. |
Please get the Caliper version that supports the Quorum SUT here. I have configured it to use WebSocket instead of RPC so that we don't hit the TCP connection limit. The step-by-step instructions are here. This is still very experimental, so there are a few manual steps. Hopefully you can reproduce the issue. Meanwhile, is there anything else I can test? |
No, unfortunately. The "-8" version looks like this:
func (s *StateDB) Copy() *StateDB {
    // Copy all the basic fields, initialize the memory ones
    state := &StateDB{
        db:   s.db,
        trie: s.db.CopyTrie(s.trie),
        // Quorum - Privacy Enhancements
        accountExtraDataTrie: s.db.CopyTrie(s.accountExtraDataTrie),
        stateObjects:         make(map[common.Address]*stateObject, len(s.journal.dirties)),
        stateObjectsPending:  make(map[common.Address]struct{}, len(s.stateObjectsPending)),
        stateObjectsDirty:    make(map[common.Address]struct{}, len(s.journal.dirties)),
        refund:               s.refund,
        logs:                 make(map[common.Hash][]*types.Log, len(s.logs)),
        logSize:              s.logSize,
        preimages:            make(map[common.Hash][]byte, len(s.preimages)),
        journal:              newJournal(),
    }
    s.journal.mutex.Lock()
    addresses := make([]common.Address, 0, len(s.journal.dirties))
    for addr := range s.journal.dirties {
        addresses = append(addresses, addr)
    }
    s.journal.mutex.Unlock() // HERE IS WHERE IT FAILS
    // Copy the dirty states, logs, and preimages
    for _, addr := range addresses {
How |
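For context on the pattern above: the snippet snapshots the keys of the journal's dirties map under a lock so that Copy() does not race with a concurrent writer. Below is a minimal, self-contained sketch of that idea; the journal type, the field names, and the choice of sync.RWMutex with a deferred unlock are illustrative assumptions, not GoQuorum's actual fix.

package main

import (
    "fmt"
    "sync"
)

// journal is a stand-in for the state journal guarded by a mutex (hypothetical names).
type journal struct {
    mutex   sync.RWMutex
    dirties map[string]int
}

// snapshotDirties copies the keys of the dirties map while holding a read lock,
// so a concurrent writer cannot trigger "concurrent map iteration and map write".
// The deferred RUnlock releases the lock exactly once, even if the loop panics,
// which avoids the "sync: unlock of unlocked mutex" failure mode from mismatched
// Lock/Unlock pairs across nested calls.
func (j *journal) snapshotDirties() []string {
    j.mutex.RLock()
    defer j.mutex.RUnlock()

    addresses := make([]string, 0, len(j.dirties))
    for addr := range j.dirties {
        addresses = append(addresses, addr)
    }
    return addresses
}

func main() {
    j := &journal{dirties: map[string]int{"0xabc": 1, "0xdef": 2}}

    var wg sync.WaitGroup
    wg.Add(2)
    // One goroutine keeps writing while another copies, mimicking Copy() running
    // concurrently with block processing.
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            j.mutex.Lock()
            j.dirties[fmt.Sprintf("0x%04d", i)] = i
            j.mutex.Unlock()
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            _ = j.snapshotDirties()
        }
    }()
    wg.Wait()
    fmt.Println("copied without data races:", len(j.snapshotDirties()), "entries")
}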
@baptiste-b-pegasys are you able to deploy Caliper based on the steps I have provided? Are you able to reproduce the issue? |
I think we've identified part of the issue. What do you have as your entrypoint? Is it geth, or are you running geth wrapped in a shell script? My concern is that GoQuorum isn't receiving the SIGTERM signal from Docker because it's going to the shell script. Consequently, GoQuorum has no chance to shut down gracefully.
|
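To illustrate why signal delivery matters here: a node process can only flush and close its databases cleanly if it actually receives SIGTERM when the container is stopped. The sketch below is a generic Go illustration of that pattern, not GoQuorum's actual shutdown code; if the entrypoint is a shell script that does not exec geth, PID 1 is the shell, the handler below never runs, and Docker eventually sends SIGKILL, which can leave the state database corrupted.

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Register for the signals Docker sends on "docker stop" / pod termination.
    sigc := make(chan os.Signal, 1)
    signal.Notify(sigc, syscall.SIGTERM, syscall.SIGINT)

    // ... start the node, open the state database, etc. (omitted in this sketch) ...

    // Block until a termination signal arrives. If PID 1 is a shell script that
    // does not forward signals to the child process, this line is never reached
    // and the process is killed after the grace period, skipping the flush below.
    sig := <-sigc
    fmt.Println("received", sig, "- flushing and closing the state database")

    // ... graceful shutdown: stop accepting work, commit/flush, close the DB ...
}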
Have you tried the latest version, 21.10.2? |
@baptiste-b-pegasys I have not. I can try. Do you think the changes in this version fixed the issue? |
@baptiste-b-pegasys with the official version 21.10.2, "fatal error: concurrent map iteration and map write" is back after just two test attempts. Full log attached.
|
The concurrent map modification is on a different map; I added a lock on it: ghcr.io/baptiste-b-pegasys/quorum:fix21.10.2 |
thanks. testing. |
Test: ghcr.io/baptiste-b-pegasys/quorum:fix21.10.2. Result: issue reproducible - fatal error: concurrent map iteration and map write. Full log attached
|
@baptiste-b-pegasys any luck? Were you able to reproduce with the Hyperledger Caliper setup steps I provided? I am able to reproduce easily when I run private transactions. |
I ran Caliper with 1 node and with 4 nodes; I am not able to reproduce yet. I think I am not reaching the point where it happens. To be sure, can I have the arguments and the genesis file? I made another version of the fix you can try: ghcr.io/baptiste-b-pegasys/quorum:lock |
command
genesis file
|
Test: ghcr.io/baptiste-b-pegasys/quorum:lock. Panic with a different error now. Full log attached
|
Hi, still trying to reproduce the issue; I think I need more compute power for this test. Meanwhile, can you try this version: ghcr.io/baptiste-b-pegasys/quorum:lock-2 |
Test image: ghcr.io/baptiste-b-pegasys/quorum:lock-2. @baptiste-b-pegasys I think you are nearly there with the fix!! With this test build it improved drastically: I am able to run 15 consecutive stress tests on the private contract before it crashes. Before this, I could not even repeat a second run without crashing. The latest log is attached. Hopefully you can nail down this final one :-)
|
Can you check what the value of miner.threads is? I see 0 from the args you gave me, which means mining is disabled. Can you test this version, where a new lock is added for the dirties map: ghcr.io/baptiste-b-pegasys/quorum:lock-3 |
Thanks, will test the latest build. On the miner threads: the default is 0 according to the documentation. In any case, no miner is running on the nodes that are crashing; I have the validator nodes separated from the transaction nodes. |
@baptiste-b-pegasys Another significant improvement! With the same stress test case, I only managed to reproduce the crash after 25 runs. Full logs attached. Hopefully a final push will fix the residual issue :-) Test image: ghcr.io/baptiste-b-pegasys/quorum:lock-3
|
Hello, here is a new version of the lock. Please tell me if it is OK for you. |
@baptiste-b-pegasys Well done!! I am calling it! I have sent 300,000 private transactions at a send rate of >2,500 TPS, in batches of 20,000, with no crashes seen! I think we are good! ghcr.io/baptiste-b-pegasys/quorum:lock-4 is good! Hopefully this will get into an official release soon. Thanks again! |
System information
Geth version: geth version
Geth
Version: 1.9.25-stable
Git Commit: 919800f
Quorum Version: 21.10.1
Architecture: amd64
Protocol Versions: [65 64 63]
Network Id: 1337
Go Version: go1.15.6
Operating System: linux
GOPATH=
GOROOT=go
OS & Version: Linux (Alpine), running on Google Cloud Platform GKE Kubernetes
/ # cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.14.2
PRETTY_NAME="Alpine Linux v3.14"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://bugs.alpinelinux.org/"
Expected behaviour
GoQuorum should gracefully shut down and restart.
Actual behaviour
Private state db is corrupted
Steps to reproduce the behaviour
Under heavy load with private contracts, restart the server.
The issue was highlighted by the performance investigations in #1287.
Thanks to @frankie-lim-sweeho