Deploy validator clients #122

Closed
narimiran opened this issue Aug 3, 2022 · 23 comments

This is a continuation of #111 ("(...) if the setup proves to be stable enough, we might deploy some validators to the consensus node.")

This can either be a new role or an expansion of a current role.
This should be done for all testnets.
We should start with a relatively small number of validators, e.g. on the nodes that currently run 60 validators.

This can be done in one of these two ways:

  • A new boolean flag use_validator_client, indicating that all validators on that host should be attached to the validator client.
  • Alternatively: allow hosts to have validators attached both to the beacon node and to the validator client, with a new property number_of_validators giving the number of validators attached to a validator client.

More information about running the validator client is available here.

jakubgs self-assigned this Aug 26, 2022
jakubgs (Member) commented Aug 26, 2022

I assume, since you want this enabled on nodes with 60 validators, that would be metal-05.he-eu-hel1.nimbus.prater:

'metal-05.he-eu-hel1.nimbus.prater': # 60 each
- { branch: 'stable', start: 20164, end: 20224, build_freq: '*-*-* 11:00:00' }
- { branch: 'testing', start: 20284, end: 20344, build_freq: '*-*-* 15:00:00', nim_commit: 'version-1-6' }
- { branch: 'unstable', start: 20224, end: 20284, build_freq: '*-*-* 13:00:00', payload_builder: true, open_libp2p_ports: false }
- { branch: 'libp2p', start: 20344, end: 20404, build_freq: '*-*-* 17:00:00', nim_commit: 'version-1-6', nim_flags: '-d:json_rpc_websocket_package=websock' }

On all branches or just some?

jakubgs (Member) commented Aug 30, 2022

Before this issue can be resolved we first need a proper dedicated EL client node setup, as right now all beacon nodes are using the same Geth node running on goerli-01.aws-eu-central-1a.nimbus.geth.

jakubgs (Member) commented Sep 9, 2022

I started some work on this. Or at least I did some thinking:

  • I will set this up as part of the existing infra-role-beacon-node-* roles to avoid too much boilerplate.
  • I'm going to make use of the --secrets-dir and --validators-dir flags and point them at /var/empty or similar.
  • I will leave the secrets and validators in the same place as they already are, except they will be served by the client.

I will probably get a version working on Monday.
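
To make the plan concrete, here's a minimal sketch of the split (the flags are the standard Nimbus ones; the paths and REST port are illustrative, not the exact values templated by the Ansible roles):

```sh
# Beacon node: point both key directories at /var/empty so it no longer loads validators itself,
# and expose the REST API for the validator client to use.
nimbus_beacon_node \
  --validators-dir=/var/empty \
  --secrets-dir=/var/empty \
  --rest --rest-port=9301

# Validator client: keeps using the existing key material and attests via the node's REST API.
nimbus_validator_client \
  --validators-dir=/data/beacon-node/data/validators \
  --secrets-dir=/data/beacon-node/data/secrets \
  --beacon-node=http://127.0.0.1:9301
```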

jakubgs added a commit to status-im/infra-role-beacon-node-windows that referenced this issue Sep 12, 2022
Necessary to later provide `/var/empty` as the path for both directories in order to use
the validator client service instead of loading validators directly.

status-im/infra-nimbus#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs added a commit to status-im/infra-role-beacon-node-macos that referenced this issue Sep 12, 2022
Necessary to later provide `/var/empty` as the path for both directories in order to use
the validator client service instead of loading validators directly.

status-im/infra-nimbus#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs added a commit to status-im/infra-role-beacon-node-linux that referenced this issue Sep 12, 2022
Necessary to later provide `/var/empty` as the path for both directories in order to use
the validator client service instead of loading validators directly.

status-im/infra-nimbus#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs added a commit to status-im/infra-role-dist-validators that referenced this issue Sep 12, 2022
This is necessary since the `--secrets-dir` and `--validators-dir` flags
can also be provided separately to a beacon node.

This also allows for setting these paths to `/var/empty` when a
validator client is being used instead of providing the files to the node.

status-im/infra-nimbus#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 12, 2022

Changes necessary to manage the locations of the secrets and validators folders separately:

Now we can point those at /var/empty when using a validator client.

arnetheduck (Member) commented Sep 12, 2022

> I'm going to make use of the --secrets-dir and --validators-dir flags and point them at /var/empty or similar.
> I will leave the secrets and validators in the same place as they already are, except they will be served by the client.

eh this feels risky - one restart where the flag is misspelled or some other shit reason, and the validators are gone

jakubgs (Member) commented Sep 12, 2022

> eh this feels risky - one restart where the flag is misspelled or some other shit reason, and the validators are gone

That's the point. They are supposed to be gone from the beacon node.

jakubgs (Member) commented Sep 20, 2022

Based on discussion with @arnetheduck and reading of the main issue:

It seems creating a separate Ansible role makes the most sense, since we might want to do 1-N setups.

jakubgs (Member) commented Sep 21, 2022

I've created a repo for the separate ansible role: https://github.com/status-im/infra-repos/commit/764a8225

https://github.com/status-im/infra-role-validator-client

I will be basing it mostly on infra-role-beacon-node-linux.

jakubgs (Member) commented Sep 21, 2022

Some initial work:

jakubgs (Member) commented Sep 22, 2022

More changes to get this going:

jakubgs (Member) commented Sep 23, 2022

I started by deploying changes to the nimbus.ropsten host, but this is interesting:

{"lvl":"NOT","ts":"2022-09-23 08:28:15.955+00:00","msg":"Starting REST HTTP server","url":"http://127.0.0.1:5053/"}
{"lvl":"INF","ts":"2022-09-23 08:28:15.956+00:00","msg":"Beacon node has been identified","agent":"Nimbus/v22.9.1-72e6b2-stateofus","service":"fallback_service","endpoint":"127.0.0.1:9301"}
{"lvl":"INF","ts":"2022-09-23 08:28:15.956+00:00","msg":"Beacon node has compatible configuration","service":"fallback_service","endpoint":"127.0.0.1:9301 [Nimbus/v22.9.1-72e6b2-stateofus]"}
{"lvl":"INF","ts":"2022-09-23 08:28:15.956+00:00","msg":"Beacon node is in sync","sync_distance":0,"head_slot":833241,"is_opimistic":"false","service":"fallback_service","endpoint":"127.0.0.1:9301 [Nimbus/v22.9.1-72e6b2-stateofus]"}
{"lvl":"NOT","ts":"2022-09-23 08:28:15.957+00:00","msg":"Fork schedule updated","fork_schedule":[{"previous_version":"0x80000069","current_version":"0x80000069","epoch":0},{"previous_version":"0x80000069","current_version":"0x80000070","epoch":500},{"previous_version":"0x80000070","current_version":"0x80000071","epoch":750}],"service":"fork_service"}
{"lvl":"ERR","ts":"2022-09-23 08:28:27.959+00:00","msg":"Unable to get head state's validator information","service":"duties_service"}
{"lvl":"NOT","ts":"2022-09-23 08:28:27.961+00:00","msg":"REST service started","address":"127.0.0.1:5053"}
{"lvl":"INF","ts":"2022-09-23 08:28:27.961+00:00","msg":"Scheduling first slot action","startTime":"16w3d17h28m27s961ms398us840ns","nextSlot":833243,"timeToNextSlot":"8s38ms601us160ns"}
{"lvl":"INF","ts":"2022-09-23 08:28:27.962+00:00","msg":"Beacon node has been identified","agent":"Nimbus/v22.9.1-72e6b2-stateofus","service":"fallback_service","endpoint":"127.0.0.1:9301 [Nimbus/v22.9.1-72e6b2-stateofus]"}
{"lvl":"INF","ts":"2022-09-23 08:28:27.963+00:00","msg":"Beacon node has compatible configuration","service":"fallback_service","endpoint":"127.0.0.1:9301 [Nimbus/v22.9.1-72e6b2-stateofus]"}
{"lvl":"INF","ts":"2022-09-23 08:28:27.964+00:00","msg":"Beacon node is in sync","sync_distance":1,"head_slot":833241,"is_opimistic":"false","service":"fallback_service","endpoint":"127.0.0.1:9301 [Nimbus/v22.9.1-72e6b2-stateofus]"}
{"lvl":"WRN","ts":"2022-09-23 08:28:29.985+00:00","msg":"Connection with beacon node(s) has been lost","online_nodes":0,"unusable_nodes":1,"total_nodes":1,"service":"fallback_service"}
{"lvl":"WRN","ts":"2022-09-23 08:28:31.970+00:00","msg":"No suitable beacon nodes available","online_nodes":0,"offline_nodes":1,"uninitalized_nodes":0,"incompatible_nodes":0,"nonsynced_nodes":0,"total_nodes":1,"service":"fallback_service"}

It seems to be failing in a loop of:

  1. Beacon node has been identified
  2. Beacon node has compatible configuration
  3. Beacon node is in sync
  4. Connection with beacon node(s) has been lost
  5. No suitable beacon nodes available

Over and over again. @narimiran any idea?
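
For debugging, the beacon node's REST endpoint can be probed directly from the host (a quick sketch; 127.0.0.1:9301 is the endpoint the VC is configured against, and these are standard Beacon API routes):

```sh
# Check that the node identifies itself and reports as synced on the same endpoint the VC uses.
curl -s http://127.0.0.1:9301/eth/v1/node/version
curl -s http://127.0.0.1:9301/eth/v1/node/syncing
# /eth/v1/node/health returns 200 when synced, 206 while syncing, 503 on error.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9301/eth/v1/node/health
```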

jakubgs (Member) commented Sep 23, 2022

I can also see that the endpoint we usually use for the Consul health check isn't available in the Keymanager API:

[email protected]:~ % c http://localhost:5053/eth/v1/node/version
curl: (22) The requested URL returned error: 404 Not Found

So for now I'm going to use a TCP healthcheck, but it would be nice to have a route without auth for this.

API: https://ethereum.github.io/keymanager-APIs/#/Remote%20Key%20Manager
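
For reference, a plain TCP check can be registered with the local Consul agent roughly like this (a sketch against Consul's agent HTTP API; in our setup the check is templated by the Ansible role, and the check name here is illustrative):

```sh
# Register a TCP health check against the validator client's REST port (5053),
# since the Keymanager API offers no unauthenticated route like /eth/v1/node/version.
curl --request PUT \
  --data '{"Name": "validator-client-tcp", "TCP": "127.0.0.1:5053", "Interval": "10s", "Timeout": "1s"}' \
  http://127.0.0.1:8500/v1/agent/check/register
```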

jakubgs added a commit to status-im/infra-role-beacon-node-linux that referenced this issue Sep 23, 2022
Necessary due to the large size of headers when the validator client
has a large number of validators attached.

status-im/infra-nimbus#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 23, 2022

Some more changes:

And a fix for massive headers sent to the beacon node:

Which was causing this:

{
  "lvl": "ERR",
  "ts": "2022-09-23 11:12:24.002+00:00",
  "msg": "Unable to get head state's validator information",
  "service": "duties_service"
}

jakubgs added a commit that referenced this issue Sep 23, 2022
jakubgs (Member) commented Sep 23, 2022

And here's the setup on nimbus.ropsten: 7d05abad

jakubgs added a commit that referenced this issue Sep 23, 2022
We want to test with lower numbers of validators first.

#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 23, 2022

Lowered the number of validators for VC nodes as requested by @cheatfate:

  • 23c07e3d - nimbus.ropsten: lower geth memory limits
  • 89f04d8a - nimbus.ropsten: use less validators on VC nodes

'metal-01.he-eu-hel1.nimbus.ropsten':
- { start: 0, end: 500, validator_client: true } # 500
- { start: 500, end: 1500, validator_client: true } # 1000
- { start: 1500, end: 3500, validator_client: false } # 2000
- { start: 3500, end: 10000, validator_client: false } # 6500

jakubgs added a commit that referenced this issue Sep 23, 2022
It seems to be hogging far too much memory.

#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs added a commit that referenced this issue Sep 23, 2022
We want to test with lower numbers of validators first.

#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 26, 2022

I forgot to note on Friday that the beacon nodes with validator clients connected are using obscene amounts of memory:

[screenshot: beacon node memory usage]

The ones without validator clients are fine though.

jakubgs added a commit that referenced this issue Sep 28, 2022
For now only for the first node.
#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 28, 2022

I also deployed a validator client for the first node on the Sepolia host: 7da6edf1

'linux-01.he-eu-hel1.nimbus.sepolia':
- { start: 0, end: 25, validator_client: true }
- { start: 25, end: 50, validator_client: false, nim_commit: 'version-1-6', payload_builder: true }
- { start: 50, end: 75, validator_client: false, nim_commit: 'version-1-6' }
- { start: 75, end: 100, validator_client: false, nim_flags: '-d:json_rpc_websocket_package=websock' }

I also implemented disabling the service and its checks:

jakubgs added a commit that referenced this issue Sep 28, 2022
For now only for the first node.
#122

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs (Member) commented Sep 28, 2022

And interestingly enough on Sepolia no such memory issues appear:

[screenshots: memory usage on the Sepolia host]

Although that might be a function of the number of validators attached, or it might be network-specific.

jakubgs added a commit that referenced this issue Oct 11, 2022
jakubgs (Member) commented Oct 11, 2022

I also deployed a validator client for the stable node on the linux-04 host on prater: 269a76a2

'linux-04.he-eu-hel1.nimbus.prater': # 30 each
- { branch: 'stable', start: 20044, end: 20074, build_freq: '*-*-* 11:00:00', validator_client: true }
- { branch: 'testing', start: 20104, end: 20134, build_freq: '*-*-* 15:00:00', nim_commit: 'version-1-6' }
- { branch: 'unstable', start: 20074, end: 20104, build_freq: '*-*-* 13:00:00', payload_builder: true }
- { branch: 'libp2p', start: 20134, end: 20164, build_freq: '*-*-* 17:00:00', nim_flags: '-d:json_rpc_websocket_package=websock' }

And it appears to run fine and without memory issues so far, but the doppelganger detection is taking an awfully long time:

[email protected]:~ % grep 'Attestation has not been served' /data/validator-client-prater-stable-01/logs/service.log | head -n1
{"lvl":"INF","ts":"2022-10-11 11:36:28.002+00:00","msg":"Attestation has not been served (doppelganger check still active)","slot":4081682,"validator":"94cab382","validator_index":307765,"service":"attestation_service"}

[email protected]:~ % grep 'Attestation has not been served' /data/validator-client-prater-stable-01/logs/service.log | tail -n1
{"lvl":"INF","ts":"2022-10-11 12:20:04.002+00:00","msg":"Attestation has not been served (doppelganger check still active)","slot":4081900,"validator":"94cad9d2","validator_index":310801,"service":"attestation_service"}

Already 45 minutes and still going. Appears to be a bug.

jakubgs added a commit that referenced this issue Oct 11, 2022
jakubgs (Member) commented Oct 11, 2022

For now I've temporarily disabled doppelganger detection for the validator client: 5de20671 (https://github.com/status-im/infra-nimbus/blob/5de206719e12e7b3e29364cc8474f46625a7cb1e/ansible/group_vars/nimbus.prater.yml#L85).
Once the fix is merged I'll undo that.
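
For reference, the workaround amounts to starting the validator client without the doppelganger check (a sketch, assuming the VC accepts the same --doppelganger-detection flag as the beacon node; the real change is the Ansible variable in the group_vars file linked above, and the endpoint is illustrative):

```sh
# Temporarily skip doppelganger detection so the attached validators start attesting right away.
nimbus_validator_client \
  --doppelganger-detection=false \
  --beacon-node=http://127.0.0.1:9301
```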

jakubgs (Member) commented Oct 11, 2022

Looks like we are in business:

[email protected]:~ % grep 'Attestation published' /data/validator-client-prater-stable-01/logs/service.log | tail -n4
{"lvl":"NOT","ts":"2022-10-11 12:28:16.005+00:00","msg":"Attestation published","attestation":{"aggregation_bits":"0x000000000000000000000000000000000000000000080020","data":{"slot":4081941,"index":62,"beacon_block_root":"84c6728e","source":"127559:5e90d514","target":"127560:2208b6af"},"signature":"abef638f"},"validator":"94cad9d2","validator_index":310801,"delay":"5ms15us884ns","service":"attestation_service"}
{"lvl":"NOT","ts":"2022-10-11 12:28:28.057+00:00","msg":"Attestation published","attestation":{"aggregation_bits":"0x000000000000000800000000000000000000000000000020","data":{"slot":4081942,"index":43,"beacon_block_root":"97f79d1a","source":"127559:5e90d514","target":"127560:2208b6af"},"signature":"b52d9f94"},"validator":"94cc88e5","validator_index":306447,"delay":"57ms822us424ns","service":"attestation_service"}
{"lvl":"NOT","ts":"2022-10-11 12:29:52.019+00:00","msg":"Attestation published","attestation":{"aggregation_bits":"0x000000000000000000000000000000000000000040000020","data":{"slot":4081949,"index":29,"beacon_block_root":"665b69cd","source":"127559:5e90d514","target":"127560:2208b6af"},"signature":"a3764bad"},"validator":"94cab382","validator_index":307765,"delay":"19ms516us82ns","service":"attestation_service"}
{"lvl":"NOT","ts":"2022-10-11 12:31:04.006+00:00","msg":"Attestation published","attestation":{"aggregation_bits":"0x004000000000000000000000000000000000000000000020","data":{"slot":4081955,"index":60,"beacon_block_root":"39e0176f","source":"127560:2208b6af","target":"127561:79b5ab0a"},"signature":"ad20a7b6"},"validator":"94c79ff7","validator_index":304582,"delay":"6ms122us158ns","service":"attestation_service"}

But one thing that makes me wonder is why the validator client message is "Attestation published" while the beacon node message is "Attestation sent". Seems like an unnecessary divergence that can just cause confusion.

jakubgs (Member) commented Oct 12, 2022

This is currently working on all 3 testnets: sepolia, ropsten, and prater

'linux-01.he-eu-hel1.nimbus.sepolia':
- { start: 0, end: 25, validator_client: true }

'metal-01.he-eu-hel1.nimbus.ropsten':
- { start: 0, end: 500, validator_client: true } # 500
- { start: 500, end: 1500, validator_client: true } # 1000

'linux-04.he-eu-hel1.nimbus.prater': # 30 each
- { branch: 'stable', start: 20044, end: 20074, build_freq: '*-*-* 11:00:00', validator_client: true }

I consider this done. Reopen if there's something missing.

jakubgs closed this as completed Oct 12, 2022
jakubgs added a commit that referenced this issue Oct 24, 2022
jakubgs (Member) commented Oct 24, 2022

Based on a suggestion from @arnetheduck, I've also deployed a VC for unstable on linux-03: cf8bab14
