Deploy dedicated Geth instances for all Prater Nimbus nodes #125

Closed
jakubgs opened this issue Aug 30, 2022 · 23 comments

jakubgs commented Aug 30, 2022

This has been neglected due to other priorities, like Mainnet nodes, but it's time to do a proper setup of one Geth node for each Nimbus node, as God intended. This will require quite a lot of hardware, as the Prater fleet involves 11 hosts and 33 nodes in total.

Since the current size of a snap-synced Geth node is about 160 GB:

[email protected]:~ % sudo du -hs /data/nimbus-goerli/node/data 
161G	/data/nimbus-goerli/node/data

We'll need at least 200 GB per node, with about 4 nodes on each host, so a 1 TB NVMe should be sufficient for a while.
The most likely candidate is a Hetzner AX51-NVMe host with 2x1 TB NVMes.
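
As a quick sanity check on that sizing (my own back-of-the-envelope arithmetic):

% echo "$(( 4 * 200 )) GB per host"
800 GB per host

So four nodes per host fit on a single 1 TB NVMe with roughly 200 GB of headroom.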

@jakubgs jakubgs self-assigned this Aug 30, 2022
jakubgs commented Aug 30, 2022

Another possible option would be to replace the existing hosts with bigger ones instead of adding separate hosts for Geth.

We could use Hetzner AX61-NVMe hosts, which have 2 x 1.92 TB NVMe drives. That would be enough to run both Geth and Nimbus nodes on the same host, which would simplify setup, management, and debugging.

jakubgs commented Aug 30, 2022

Based on a conversation with @zah, I'm purchasing six AX61-NVMe hosts:

(screenshot)

After the migration, the leftover AX41-NVMe hosts will be reused for macOS and Windows Geth nodes, as well as for CI.

jakubgs added a commit that referenced this issue Aug 30, 2022
Part of deploying dedicated Geth nodes.
#125

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs commented Aug 30, 2022

I've provisioned the hosts: 1bdcf1ca

linux-01.he-eu-hel1.nimbus.prater hostname=linux-01.he-eu-hel1.nimbus.prater ansible_host=95.217.198.113 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-01.he-eu-hel1.nimbus.prater.statusim.net
linux-02.he-eu-hel1.nimbus.prater hostname=linux-02.he-eu-hel1.nimbus.prater ansible_host=95.217.230.20 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-02.he-eu-hel1.nimbus.prater.statusim.net
linux-03.he-eu-hel1.nimbus.prater hostname=linux-03.he-eu-hel1.nimbus.prater ansible_host=65.108.132.230 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-03.he-eu-hel1.nimbus.prater.statusim.net
linux-04.he-eu-hel1.nimbus.prater hostname=linux-04.he-eu-hel1.nimbus.prater ansible_host=135.181.20.36 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-04.he-eu-hel1.nimbus.prater.statusim.net
linux-05.he-eu-hel1.nimbus.prater hostname=linux-05.he-eu-hel1.nimbus.prater ansible_host=95.217.224.92 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-05.he-eu-hel1.nimbus.prater.statusim.net
linux-06.he-eu-hel1.nimbus.prater hostname=linux-06.he-eu-hel1.nimbus.prater ansible_host=95.217.204.216 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-06.he-eu-hel1.nimbus.prater.statusim.net

And started deploying Geth nodes on them.

jakubgs commented Aug 30, 2022

So far they have not started syncing:

[email protected]:/docker/geth-goerli-02 % d 
CONTAINER ID   NAMES                 IMAGE                         CREATED          STATUS
1ac7ddb275f1   geth-goerli-04-node   ethereum/client-go:v1.10.23   15 minutes ago   Up 15 minutes
28c7a83695b7   geth-goerli-03-node   ethereum/client-go:v1.10.23   16 minutes ago   Up 16 minutes
2f2aa5925bfb   geth-goerli-02-node   ethereum/client-go:v1.10.23   18 minutes ago   Up 18 minutes
5970e7efbbab   geth-goerli-01-node   ethereum/client-go:v1.10.23   19 minutes ago   Up 19 minutes

[email protected]:/docker/geth-goerli-02 % /docker/geth-goerli-02/rpc.sh eth_syncing                      
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": false
}

[email protected]:/docker/geth-goerli-02 % /docker/geth-goerli-01/rpc.sh admin_peers | jq '.result[].name'
"Geth/v1.10.21-stable/linux-amd64/go1.18.4"
"Geth/v1.10.23-stable-d901d853/linux-amd64/go1.18.5"
"Geth/v1.10.21-stable-67109427/linux-amd64/go1.18.5"
"erigon/v2022.99.99-dev-18f9313c/linux-amd64/go1.19"
"Geth/v1.10.23-stable-d901d853/linux-amd64/go1.18.5"

But that's probably because of low peer numbers.
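
For context, the rpc.sh helper used above is just a thin wrapper around Geth's JSON-RPC HTTP endpoint. A minimal sketch of what it does, assuming the HTTP API listens on localhost:8545 (the deployed script may differ):

#!/usr/bin/env bash
# rpc.sh <method> [params-json] - POST a JSON-RPC call to the local Geth node
METHOD="${1:?usage: rpc.sh <method> [params-json]}"
PARAMS="${2:-[]}"
curl -s -H 'Content-Type: application/json' \
  -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"${METHOD}\",\"params\":${PARAMS}}" \
  http://localhost:8545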

jakubgs commented Aug 31, 2022

I don't get it. The nodes are not syncing at all:

[email protected]:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}

Despite having 50 peers each:

[email protected]:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh admin_peers | jq '.result | length'; done
50
50
50
50

But nothing has been synced:

[email protected]:~ % sudo du -hsc /docker/geth-goerli-0?/node
3.2M	/docker/geth-goerli-01/node
3.0M	/docker/geth-goerli-02/node
7.2M	/docker/geth-goerli-03/node
3.4M	/docker/geth-goerli-04/node
17M	total

What the fuck is going on...

jakubgs commented Aug 31, 2022

The startup logs show we are correctly using the Goerli network:

INFO [08-31|08:33:31.510] Starting Geth on Görli testnet... 
...
INFO [08-31|08:33:31.580] Chain ID:  5 (goerli) 
INFO [08-31|08:33:31.580] Consensus: Beacon (proof-of-stake), merged from Clique (proof-of-authority) 
...
INFO [08-31|08:33:31.581] Initialising Ethereum protocol           network=5 dbversion=8
INFO [08-31|08:33:31.587] Loaded most recent local header          number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local full block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local fast block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.589] Loaded local transaction journal         transactions=0 dropped=0
INFO [08-31|08:33:31.589] Regenerated local transaction journal    transactions=0 accounts=0
INFO [08-31|08:33:31.589] Chain post-merge, sync via beacon client 
INFO [08-31|08:33:31.589] Gasprice oracle is ignoring threshold set threshold=2
INFO [08-31|08:33:31.589] Allocated cache and file handles         database=/data/geth/les.server              cache=16.00MiB handles=16
INFO [08-31|08:33:31.593] Configured checkpoint oracle             address=0x18CA0E045F0D772a851BC7e48357Bcaab0a0795D signers=5 threshold=2

I don't get why it's not syncing.

jakubgs commented Aug 31, 2022

There are a lot of 'Snapshot extension registration failed' messages in the logs:

[email protected]:~ % zcat /var/log/docker/geth-goerli-02-node/docker.*.gz | grep 'Snapshot extension registration failed' | wc -l
1370

[email protected]:~ % cat /var/log/docker/geth-goerli-02-node/docker.log | grep 'Snapshot extension registration failed' | wc -l 
346

Not sure if that's relevant though.

jakubgs commented Aug 31, 2022

Oooh, ok, now I see it:

WARN [08-31|08:39:06.628] Post-merge network, but no beacon client seen. Please launch one to follow the chain! 

We NEED a consensus layer node to tell Geth what the current head of the chain is, so the execution node can start syncing.
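
In practice that means exposing Geth's authenticated Engine API and pointing a beacon node at it with a shared JWT secret, roughly like this (a sketch using the standard flags; the addresses, port, and secret path are placeholders, not our actual config):

geth --goerli \
  --authrpc.addr=0.0.0.0 \
  --authrpc.port=8551 \
  --authrpc.vhosts='*' \
  --authrpc.jwtsecret=/data/jwt.hex

nimbus_beacon_node --network=prater \
  --web3-url=http://<geth-host>:8551 \
  --jwt-secret=/data/jwt.hex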

jakubgs commented Aug 31, 2022

And now we are finally syncing:

[email protected]:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x2e488d","highestBlock":"0x727aee"}
{"currentBlock":"0x30f93e","highestBlock":"0x727af1"}
{"currentBlock":"0x25a9b8","highestBlock":"0x727af8"}
{"currentBlock":"0xc2b6f","highestBlock":"0x727b1f"}

jakubgs commented Aug 31, 2022

Looks like for some reason some Geth nodes have fucked up ancient data:

INFO [08-31|15:08:52.349] Allocated cache and file handles         database=/data/geth/chaindata cache=15.71GiB handles=524,288
INFO [08-31|15:08:52.953] Opened ancient database                  database=/data/geth/chaindata/ancient/chain readonly=false
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path

This issue suggests removing chaindata/ancient.
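
In our layout, that cleanup would look roughly like this (a sketch only; I'm assuming the standard Geth datadir layout under each node's /docker/geth-goerli-0X/node/data mount):

% sudo docker stop geth-goerli-01-node
% sudo rm -rf /docker/geth-goerli-01/node/data/geth/chaindata/ancient
% sudo docker start geth-goerli-01-node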

jakubgs commented Aug 31, 2022

But that didn't help, and I had to remove all of chaindata to get the nodes to start syncing again:

[email protected]:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x4a26f5","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x728011","startingBlock":"0x0","syncedAccountBytes":"0x625e0c71","syncedAccounts":"0x61c1f9","syncedBytecodeBytes":"0x51f24ccd","syncedBytecodes":"0x2c626","syncedStorage":"0x3ffa08d","syncedStorageBytes":"0x3623d0241"}}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x376d9c","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x72801a","startingBlock":"0x0","syncedAccountBytes":"0x53392659","syncedAccounts":"0x512202","syncedBytecodeBytes":"0x47d7abc6","syncedBytecodes":"0x272c1","syncedStorage":"0x37a6c54","syncedStorageBytes":"0x2efa02ecb"}}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x44fab1","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x728050","startingBlock":"0x44fab1","syncedAccountBytes":"0x54e431b6","syncedAccounts":"0x561382","syncedBytecodeBytes":"0x48d78cb8","syncedBytecodes":"0x279a7","syncedStorage":"0x391d59b","syncedStorageBytes":"0x304293e9d"}}
{"jsonrpc":"2.0","id":1,"result":false}

jakubgs commented Aug 31, 2022

I have stopped the nodes on the old metal hosts and migrated the validators.

  • b3ba3211 - nimbus.prater: drop old metal linux hosts
  • 87663366 - nimbus.prater: deploy Geth nodes on new hosts

We're already seeing attestations and proposals:

(screenshot)

jakubgs commented Aug 31, 2022

And we are seeing nimbus hosts proposing: https://prater.beaconcha.in/blocks?q=Nimbus%2Fv

(screenshot)

jakubgs added a commit that referenced this issue Aug 31, 2022
jakubgs added a commit that referenced this issue Aug 31, 2022
jakubgs commented Aug 31, 2022

Tomorrow I will reuse 3 of the 6 leftover old Prater hosts to run Geth nodes for the AWS/macOS/Windows hosts.

The remaining 3 hosts will be used for CI or decommissioned.

jakubgs added a commit that referenced this issue Sep 6, 2022
jakubgs commented Sep 6, 2022

I configured a dedicated set of Geth nodes for Windows:

  • 0d7e29b8 - add geth-windows-01.he-eu-hel1.nimbus.prater host
  • cb448d64 - nimbus-prater-windows: deploy dedicated Geth nodes

windows-goerli-01.he-eu-hel1.nimbus.geth hostname=windows-goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.196.47 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=windows-goerli-01.he-eu-hel1.nimbus.geth.statusim.net

jakubgs commented Sep 6, 2022

The sync progress is good; the nodes might be working by tomorrow:

(screenshot)

jakubgs commented Sep 7, 2022

We finished syncing:

(screenshot)

[email protected]:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@windows-01 MINGW64 ~                                                                                                                        
$ for port in $(seq 9300 9302); do curl -sS "localhost:$port/eth/v1/node/syncing" | jq -c; done
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

So the Windows host is done.

jakubgs commented Sep 7, 2022

Deployed a host for macOS Prater nodes:

  • 2dd9350f - add macos-goerli-01.he-eu-hel1.nimbus.geth host

macos-goerli-01.he-eu-hel1.nimbus.geth hostname=macos-goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.196.48 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=macos-goerli-01.he-eu-hel1.nimbus.geth.statusim.net

jakubgs commented Sep 7, 2022

Decided to rename the hosts while adding a third one, to simplify the setup:

  • 08a744da - rename Goerli geth nodes to be part of one fleet

goerli-01.he-eu-hel1.nimbus.geth hostname=goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.73.183 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-01.he-eu-hel1.nimbus.geth.statusim.net
goerli-02.he-eu-hel1.nimbus.geth hostname=goerli-02.he-eu-hel1.nimbus.geth ansible_host=65.21.196.48 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-02.he-eu-hel1.nimbus.geth.statusim.net
goerli-03.he-eu-hel1.nimbus.geth hostname=goerli-03.he-eu-hel1.nimbus.geth ansible_host=65.21.196.47 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-03.he-eu-hel1.nimbus.geth.statusim.net

jakubgs added a commit that referenced this issue Sep 7, 2022
jakubgs commented Sep 7, 2022

Configured the existing AWS and macOS nodes to use the new Goerli Geth nodes:

  • e80d5943 - nimbus.prater: use EL clients from new Geth hosts
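
The net effect on each beacon node is roughly the following (a sketch only; the real change lives in the Ansible role, and the hostname, port, and JWT path here are placeholders):

nimbus_beacon_node --network=prater \
  --web3-url=http://goerli-01.he-eu-hel1.nimbus.geth.statusim.net:8551 \
  --jwt-secret=/path/to/shared/jwt.hex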

Currently syncing:

(screenshot)

jakubgs added a commit that referenced this issue Sep 7, 2022
jakubgs commented Sep 8, 2022

I think this is done:

 > a nimbus.prater -a 'for port in $(seq 9300 9305); do curl -s 0:$port/eth/v1/node/syncing | jq -c; done' 
stable-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3836702","sync_distance":"6822","is_syncing":true,"is_optimistic":true}}
unstable-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3839794","sync_distance":"3730","is_syncing":true,"is_optimistic":true}}
testing-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":true}}
macos-01.ms-eu-dublin.nimbus.prater | FAILED | rc=127 >>
{"data":{"head_slot":"3843847","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"3843838","sync_distance":"9","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"3843836","sync_distance":"11","is_syncing":false,"is_optimistic":true}}
windows-01.he-eu-hel1.nimbus.prater | FAILED! => {
{"data":{"head_slot":"3836646","sync_distance":"6880","is_syncing":true,"is_optimistic":true}}                                                    
{"data":{"head_slot":"3836671","sync_distance":"6855","is_syncing":true,"is_optimistic":true}}                                                    
{"data":{"head_slot":"3836785","sync_distance":"6741","is_syncing":true,"is_optimistic":true}}
linux-01.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-02.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-03.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
linux-04.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-05.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-06.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

All are using the new Geth nodes. We can decommission the old AWS one.

jakubgs added a commit that referenced this issue Sep 8, 2022
No longer necessary after dedicated metal hosts were deployed:
#125

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs commented Sep 8, 2022

Got rid of the old AWS Geth Goerli node:

  • 94816223 - drop goerli-01.aws-eu-central-1a.nimbus.geth host

jakubgs commented Sep 8, 2022

I consider this done.

@jakubgs jakubgs closed this as completed Sep 8, 2022