Upgrade from 1.9 to 1.10 makes Geth crash in loop #22502
Comments
I tried to fix it by restarting geth with --snapshot=false and then restarting it again with --snapshot=true. After that the endless crashing loop stops ... and resumes after 10 minutes:
I have no issue with --snapshot=false
We're having similar issues; we upgraded from geth 1.9.25-stable to 1.10.1-stable using Docker on AWS. Wondering what to do next? Start from scratch? Try to roll back? Thoughts? We have around 1 GiB of error messages in our logs, until the server gave up:
@tomw1808 does it work if you set --snapshot=false? That's not a solution to fix the snapshot process, but at least geth runs with it for me.
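For anyone wanting to try the same workaround, it amounts to roughly the following (a sketch only, assuming geth runs under systemd with its flags in the unit file; the flags besides --snapshot=false are illustrative):
systemctl stop geth
# add --snapshot=false to the geth command line in the unit file, for example:
#   ExecStart=/usr/bin/geth --http --cache=1024 --snapshot=false
systemctl daemon-reload
systemctl start geth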
I mean, I started geth 1.10.1 from scratch now, without --snapshot=false, and it's running and syncing. It's just a bummer, because the old chaindata got corrupted, which is something I did not expect to happen.
Same issue:
@mael4875215 Some data in your database is corrupted. The snapshotter just reads through all the state and hits the corrupted trie node. Without the snapshotter it's a time bomb, waiting for something to hit it and produce a bad block. I can't really say how you managed to end up with a database key with missing data though. Probably a resync would be the safest, since your data is corrupted and will blow eventually either way. @tomw1808 We'd need the full crash dump, or more importantly the beginning of it. Yours is a random excerpt from the middle which is not particularly useful. @jeanhackpy That's an odd one I haven't seen yet, will need to take a closer look.
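For reference, a resync from scratch can look roughly like this (a sketch, assuming a systemd-managed geth and the default data directory; geth removedb asks for confirmation before deleting anything and should leave the keystore alone, but back it up regardless):
systemctl stop geth
geth removedb          # drop the corrupted chain and state databases
systemctl start geth   # geth then syncs from scratch on the next start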
@jeanhackpy Your crash is originating from a database with missing data. The interesting question is how you ended up with that in the first place. Unfortunately that I cannot say. Even before the crash, the bloom indexer says some data that the chain head says is present, is actually missing. Do you have the logs from your original run when you started to sync?
Thank you for your advice @karalabe, I'll do a resync to be safe. My geth node had been running since November, always powered on since then, and I only did 2 upgrades of geth.
@karalabe Let me know how I can fix it and how I can send you the logs from the original run. Thank you
So, the old geth instance was stopped normally (geth 1.9.25); the end of the log is this:
To my understanding that looks just fine. Then I switched the Docker image to geth 1.10.1. That was the first time I started up geth 1.10.1. Below, that's when the server stopped reacting at all.
That's all there is. After about 1h I decided to terminate the instance, because it became absolutely unresponsive and I could not detect anything that was happening at all. The next one is after the restart:
There I had not detected any problems, but suddenly, without any notice, the geth process died, with nothing in the logs. ECS naturally started the process again, so here comes the third log. Yes, I should've scrolled to the top, and would've seen an "out of memory" exception. Which in itself is not super strange, but still, I haven't seen geth go much beyond 10 GiB. The instance has 16 GiB. What's the advised size?
I'll leave that here, it might help someone else. The chaindata on the EBS volume is definitely destroyed, a rollback isn't possible, and I am wondering why it did not just continue in the first place, but instead went back ~150k blocks and started syncing again.
How do you maintain your mental strength, please?
This. Also, reporting the same issue as stated above. Proving to be a real nuisance.
Your shutdown log does not look right:
It never said
Regarding the out of memory -- what is the RAM on the machine?
@holiman the RAM is 16 GiB; about 1 GiB is used by another process, so that leaves around 15 GiB of usable RAM. It's indeed running as a daemon (an ECS service, for that matter). I didn't pay too much attention to the shutdown time. I was under the impression that if it received SIGTERM it would stop the blockchain-relevant services, write the last block to disk, and that's it. According to https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_StopTask.html that's 30s on ECS. So, would setting the shutdown time to 120s suffice? I'm still confused about what needs to be written for so long that it corrupts 150k blocks. I mean, the last block or the last few blocks (even the last 128 blocks, ok), but 150k blocks - can you eventually point me towards the theory behind this? I do understand that GitHub issues are maybe not the right place for it, so feel free to skip it entirely; I'd be grateful for an explanation though, to better estimate the shutdown time...
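For reference, with plain Docker the grace period between SIGTERM and SIGKILL is raised via --stop-timeout; on ECS the corresponding knob is the container definition's stopTimeout. A rough sketch only - the image tag, data volume and geth flags are illustrative:
docker run --stop-timeout 120 -v /data/ethereum:/root/.ethereum ethereum/client-go:v1.10.1 --http --cache=1024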
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.
@tomw1808 did you resolve your issue? I am getting the same thing.
@mael4875215 did disabling the snapshot and resyncing help?
@jayboy-mabushi I did sort it out. Threw away everything, started syncing from scratch and used Infura in the meantime. Learned a few things along the way:
Since then it seems to work fine; there was only one minor outage a few days ago, which we also mitigated by using Infura for a short while. Good luck!
@tomw1808 Thanks. Do you have a guide on how to set up and run an Ethereum full node?
System information
Upgrade from: Geth/v1.9.25-stable-e7872729/linux-amd64/go1.15.6
to Geth version: Geth/v1.10.1-stable-c2d2f4ed/linux-amd64/go1.16
OS & Version: Ubuntu 20.04.2 LTS
ExecStart=/usr/bin/geth --http --cache=1024
Expected behaviour
As it's the first run on 1.10, it should take 1-4 days to build the first snapshot.
Actual behaviour
After the upgrade from 1.9.25 to 1.10.1, and after 12 minutes of geth trying to build the snapshot, it crashes (geth.service: Main process exited, code=exited, status=2/INVALIDARGUMENT). Then it starts again, crashes again, starts again, crashes again - for hours, roughly every minute.
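(The start-crash-start pattern is just systemd restarting the failed service; a unit with restart settings roughly like the sketch below would produce exactly this loop. Only the ExecStart line is taken from the actual setup; the restart values are illustrative.)
[Service]
ExecStart=/usr/bin/geth --http --cache=1024
Restart=on-failure
RestartSec=30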
Steps to reproduce the behaviour
Upgrade from Geth/v1.9.25-stable-e7872729/linux-amd64/go1.15.6 to Geth/v1.10.1-stable-c2d2f4ed/linux-amd64/go1.16 by doing:
systemctl stop geth
apt-get update
apt-get upgrade
systemctl start geth
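(For what it's worth, a quick way to confirm the upgrade actually took effect and to follow what happens next - commands illustrative, assuming geth runs as the geth systemd service:)
geth version          # should now report version 1.10.1-stable
journalctl -fu geth   # follow the service log to watch the snapshot generation and the crashes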
Backtrace
systemctl stop geth
apt-get update
apt-get upgrade
systemctl start geth
[...]
And finally it crashes for the first time after 12 minutes:
Then geth starts again automatically and crashes again immediately:
This repeats in a loop.