
Heavy HTTP requests halt the client #3536

Closed · ghost opened this issue Sep 2, 2022 · 7 comments

@ghost commented Sep 2, 2022

Description

If I make consecutive HTTP queries to the node, it either halts or restarts multiple times in a row.

Endpoints being called:

  1. eth/v1/beacon/states/finalized/validators/BLS_KEY
  2. eth/v1/beacon/states/finalized/finality_checkpoints

Approximate request rate is 3 requests / second.

I experimentally discovered that the reason behind the halt was the high usage of endpoints 1 and 2. Once I disable the --http flag and restart the client, the sync starts to catch up.
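
Roughly, the query pattern is equivalent to the sketch below (a minimal illustration only, assuming Python's requests library, the HTTP API address 127.0.0.1:5052 shown in the logs, and a placeholder BLS key; it is not the actual client code):

```python
# Minimal sketch of the query pattern (assumptions: Python requests library,
# beacon API at 127.0.0.1:5052 as in the logs, placeholder BLS key).
import time
import requests

BASE = "http://127.0.0.1:5052"
BLS_KEY = "0x..."  # placeholder validator public key

ENDPOINTS = [
    f"/eth/v1/beacon/states/finalized/validators/{BLS_KEY}",
    "/eth/v1/beacon/states/finalized/finality_checkpoints",
]

while True:
    for path in ENDPOINTS:
        resp = requests.get(BASE + path, timeout=30)
        print(path, resp.status_code, f"{resp.elapsed.total_seconds():.2f}s")
        time.sleep(1 / 3)  # roughly 3 requests per second overall
```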

Version

v3.0.0 non-portable Linux

Present Behaviour

I can see in the logs that the slot distance keeps increasing.

First set of logs:

-- Logs begin at Tue 2022-08-30 15:44:25 UTC. --
Sep 02 14:19:51 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:19:51.852 CRIT Beacon block processing error           error: ValidatorPubkeyCacheLockTimeout, service: beacon
Sep 02 14:19:51 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:19:51.853 WARN BlockProcessingFailure                  outcome: ValidatorPubkeyCacheLockTimeout, msg: unexpected condition in processing block.
Sep 02 14:19:54 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:19:54.005 INFO Syncing                                 est_time: --, distance: 34 slots (6 mins), peers: 21, service: slot_notifier
Sep 02 14:19:54 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:19:54.005 WARN Syncing deposit contract block cache    est_blocks_remaining: 4300, service: slot_notifier
Sep 02 14:20:06 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:06.101 INFO Syncing                                 est_time: --, distance: 35 slots (7 mins), peers: 16, service: slot_notifier
Sep 02 14:20:06 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:06.125 WARN Syncing deposit contract block cache    est_blocks_remaining: 3521, service: slot_notifier
Sep 02 14:20:18 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:18.005 INFO Syncing                                 est_time: 3 mins, speed: 0.14 slots/sec, distance: 31 slots (6 mins), peers: 22, service: slot_notifier
Sep 02 14:20:18 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:18.005 WARN Syncing deposit contract block cache    est_blocks_remaining: 2833, service: slot_notifier
Sep 02 14:20:30 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:30.002 INFO Syncing                                 est_time: 5 mins, speed: 0.11 slots/sec, distance: 32 slots (6 mins), peers: 26, service: slot_notifier
Sep 02 14:20:30 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:30.002 WARN Syncing deposit contract block cache    est_blocks_remaining: 2040, service: slot_notifier
Sep 02 14:20:34 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:34.611 INFO Sync state updated                      new_state: Synced, old_state: Syncing Head Chain, service: sync
Sep 02 14:20:34 ip-172-31-4-210 lighthouse[44527]: Sep 02 14:20:34.614 INFO Subscribed to topics                    topics: ["/eth2/c2ce3aa8/beacon_block/ssz_snappy", "/eth2/c2ce3aa8/beacon_aggregate_and_proof/ssz_snappy", "/eth2/c2ce3aa8/voluntary_exit/ssz_snappy", "/eth2/c2ce3aa8/proposer_slashing/ssz_snappy", "/eth2/c2ce3aa8/attester_slashing/ssz_snappy", "/eth2/c2ce3aa8/sync_committee_contribution_and_proof/ssz_snappy"], service: network
Sep 02 14:20:41 ip-172-31-4-210 systemd[1]: consensus_client.service: Main process exited, code=killed, status=9/KILL
Sep 02 14:20:41 ip-172-31-4-210 systemd[1]: consensus_client.service: Failed with result 'signal'.
Sep 02 14:20:52 ip-172-31-4-210 systemd[1]: consensus_client.service: Scheduled restart job, restart counter is at 1.
Sep 02 14:20:52 ip-172-31-4-210 systemd[1]: Stopped Lighthouse Consensus client.
Sep 02 14:20:52 ip-172-31-4-210 systemd[1]: Started Lighthouse Consensus client.
Sep 02 14:20:52 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:52.930 INFO Logging to file                         path: "/home/ubuntu/consensus/beacon/logs/beacon.log"
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:52.948 INFO Lighthouse started                      version: Lighthouse/v3.0.0-18c61a5
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:52.949 INFO Configured for network                  name: goerli
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:52.967 INFO Data directory initialised              datadir: /home/ubuntu/consensus
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:52.977 INFO Deposit contract                        address: 0xff50ed3d0ec03ac01d4c79aad74928bff48a7b2b, deploy_block: 4367322
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:53.626 INFO Hot-Cold DB initialized                 split_state: 0xcb0999c801f90f55361838cc12a84b3a0972d39abe1596665452766bea458665, split_slot: 3801600, service: freezer_db
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:53.733 INFO Refusing to checkpoint sync             msg: database already exists, use --purge-db to force checkpoint sync, service: beacon
Sep 02 14:20:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:20:53.734 INFO Starting beacon chain                   method: resume, service: beacon
Sep 02 14:21:48 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:48.358 INFO Block production enabled                method: json rpc via http, endpoints: Auth { endpoint: "http://localhost:8551/", jwt_path: "/home/ubuntu/execution/geth/jwtsecret", jwt_id: None, jwt_version: None }
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.350 INFO Beacon chain initialized                head_slot: 3801664, head_block: 0xd315…7160, head_state: 0x1c01…5bbf, service: beacon
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.350 INFO Timer service started                   service: node_timer
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.358 INFO UPnP Attempting to initialise routes    service: UPnP
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.382 INFO Libp2p Starting                         bandwidth_config: 3-Average, peer_id: 16Uiu2HAmFdSH4dQimCrzswetJGuNbD9HSnFM53eadWxKJyGP2K5q, service: libp2p
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.391 INFO ENR Initialised                         tcp: Some(9000), udp: Some(9000), ip: Some(3.71.250.111), id: 0x6d0a..8b26, seq: 3, enr: enr:-Ly4QNWW3pMeus1LYyifTz8AINPXl44KKHy2MLhqMmnAnAgMZ04Q1ueefPN70gae5JIZK87uLAxo-dtdhEr_SZYoItQDh2F0dG5ldHOIAAAAAAAAAACEZXRoMpDCzjqoAgAQIP__________gmlkgnY0gmlwhANH-m-Jc2VjcDI1NmsxoQMsLTcOH0dR-EUsJJiIOJLpaDprJ12AXqeCjcwu2wMICIhzeW5jbmV0cwCDdGNwgiMog3VkcIIjKA, service: libp2p
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.414 INFO Listening established                   address: /ip4/0.0.0.0/tcp/9000/p2p/16Uiu2HAmFdSH4dQimCrzswetJGuNbD9HSnFM53eadWxKJyGP2K5q, service: libp2p
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.484 INFO HTTP API started                        listen_address: 127.0.0.1:5052
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.542 INFO Execution engine online                 service: exec
Sep 02 14:21:51 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:51.543 INFO Issuing forkchoiceUpdated               forkchoice_state: ForkChoiceState { head_block_hash: 0x050e25d6b6a9ef0e2bbf1f88dc151991fae0b21ebbed8763c3ffb8e7df9fc930, safe_block_hash: 0x1efc8e7fe5e1cb5a5ffe5e5a52eec9ff00823794909fbe40fe3e1a99b218f010, finalized_block_hash: 0x31e7adca5f80369338f6bee0ce65ba5d6c42992f79e1003c2328f503ccbf4052 }, service: exec
Sep 02 14:21:52 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:52.104 INFO Sync state updated                      new_state: Syncing Head Chain, old_state: Stalled, service: sync
Sep 02 14:21:53 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:21:53.209 INFO Imported deposit log(s)                 new: 822, total: 178915, latest_block: 7514568, service: deposit_contract_rpc
Sep 02 14:22:01 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:01.427 INFO UPnP not available                      error: IO error: Resource temporarily unavailable (os error 11), service: UPnP
Sep 02 14:22:06 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:06.001 INFO Syncing                                 est_time: --, distance: 46 slots (9 mins), peers: 10, service: slot_notifier
Sep 02 14:22:06 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:06.002 WARN Syncing deposit contract block cache    est_blocks_remaining: 5048, service: slot_notifier

Second set of logs:

Sep 02 14:22:06 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:06.001 INFO Syncing                                 est_time: --, distance: 46 slots (9 mins), peers: 10, service: slot_notifier
Sep 02 14:22:06 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:06.002 WARN Syncing deposit contract block cache    est_blocks_remaining: 5048, service: slot_notifier
Sep 02 14:22:18 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:18.012 INFO Syncing                                 est_time: --, distance: 47 slots (9 mins), peers: 9, service: slot_notifier
Sep 02 14:22:18 ip-172-31-4-210 lighthouse[44582]: Sep 02 14:22:18.014 WARN Syncing deposit contract block cache    est_blocks_remaining: 5048, service: slot_notifier

Expected Behaviour

The sync distance should be decreasing, or at least an error should be reported to indicate the problem.

Steps to resolve

Suggestions:

  • Investigating why such a low request rate crashes the node
  • Adding error messages
@winksaville (Contributor)

I wouldn't call 3 req/sec heavy. Have you tried 100 or 1000/sec?

Are you making the requests from another computer on the same network, or are you making the requests from the same computer?

I'd like to duplicate the issue. If feasible, please create a repo with the code that reproduces the issue.

@michaelsproul (Member)

When you get a ValidatorPubkeyCacheLockTimeout it will set block processing back quite substantially, because it will have to re-assess the chain it's on and keep re-processing the failed blocks until they succeed. IMO the lock timeout is now more trouble than it's worth, but we haven't had the impetus to disable it by default. You can opt out with the --disable-lock-timeouts flag (more info here: #2714).

Other things to try:

  • Set MALLOC_ARENA_MAX=1 (or 2 or 4) as an environment variable, to reduce heap fragmentation. On Linux we currently use GNU malloc which creates one heap per CPU core and is prone to fragmentation. It looks like your process got SIGKILLed – probably by the OOM-killer – so I suspect you've got a large number of cores and a lot of fragmentation. We've found that reducing the arena count doesn't impact performance substantially but decreases fragmentation a lot.
  • Try modifying your requests to use the head state rather than the finalized state. Presently Lighthouse only stores a handful of states (<5) near the head in memory and has to load the finalized state from disk for requests like /eth/v1/beacon/states/finalized/finality_checkpoints. So every time you hit that endpoint it's forced to do an expensive state read. The head state will also have more up-to-date finality information than the finalized state. For the validators endpoint you may indeed want the finalized balance, in which case you'll need to keep using the finalized tag. But you could test using head and see if performance is substantially improved (a quick timing sketch follows this list).
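
One quick way to test the head-vs-finalized suggestion is to time the finality_checkpoints request for both state IDs (a rough sketch, assuming Python's requests library, the API address from the logs, and the standard beacon API response shape):

```python
# Rough timing comparison of head vs finalized state queries (assumptions:
# Python requests library, beacon API at 127.0.0.1:5052 as in the logs).
import requests

BASE = "http://127.0.0.1:5052"

for state_id in ("head", "finalized"):
    url = f"{BASE}/eth/v1/beacon/states/{state_id}/finality_checkpoints"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    epoch = resp.json()["data"]["finalized"]["epoch"]
    print(f"{state_id:>9}: {resp.elapsed.total_seconds():.3f}s, finalized epoch {epoch}")
```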

Longer-term remedies (Lighthouse code changes):

Thank you for looking into this @winksaville. If you manage to reproduce the issue, maybe you could take #2381 on as a project? I think you'd have to resolve the merge conflicts and test the performance.

@ghost (Author) commented Sep 3, 2022

> Are you making the requests from another computer on the same network, or are you making the requests from the same computer?

Thank you for your reply.

The query problem is quite weird, to be honest. I can query eth/v1/beacon/states/finalized/validators/BLS_KEY at high rates with no problems, but if I specify an endpoint eth/v1/beacon/states/SLOT_NUMBER/validators/BLS_KEY that is further from the head slot (as far as 10 slots away), it stalls the server and the requests take a while (7-10 seconds).

And to answer your question: I'm running a Lighthouse node on an AWS EC2 instance and making the requests from another EC2 instance, so it's two different machines.

I will spin up an example repo/script in the following days. Thanks.

@michaelsproul (Member)

@JustinZal The reason queries far from the head take so long (potentially multiple minutes) is that they involve replaying thousands of blocks in the default configuration, which is optimised for space efficiency. You can tweak the database to prioritise speed at the expense of using more space; see https://lighthouse-book.sigmaprime.io/advanced_database.html
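
As a rough illustration of that replay cost (an estimate only, assuming the default of 2048 slots per restore point described in the linked docs; skipped slots have no block to replay, so this is an upper bound):

```python
# Back-of-the-envelope upper bound on blocks replayed to reconstruct a
# historical state, assuming the default of 2048 slots per restore point
# (see the advanced database docs linked above).
def blocks_replayed(slot: int, slots_per_restore_point: int = 2048) -> int:
    # Full states are only stored at restore-point slots; a state in between
    # is rebuilt by replaying blocks from the previous restore point.
    return slot % slots_per_restore_point

for slot in (3_801_600, 3_801_664, 3_803_000):
    print(f"slot {slot}: up to {blocks_replayed(slot)} blocks to replay")
```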

@ghost (Author) commented Sep 3, 2022

> @JustinZal The reason queries far from the head take so long (potentially multiple minutes) is that they involve replaying thousands of blocks in the default configuration, which is optimised for space efficiency. You can tweak the database to prioritise speed at the expense of using more space; see https://lighthouse-book.sigmaprime.io/advanced_database.html

Thank you very much! This section is extremely useful. For some reason I hadn't come across it before.

@michaelsproul (Member)

Awesome! If you're running an explorer or similar, it's a good idea to run with a lower --slots-per-restore-point.

I'm curious whether you've had a chance to try any of my suggestions from here: #3536 (comment)? I think they should help quite a bit, particularly the memory arena one.

@paulhauner (Member)

I'm closing this one since it seems inactive and it doesn't have anything actionable that isn't already tracked. Thanks everyone!
