
[Merged by Bors] - Optimise HTTP validator lookups #3559

Closed

Conversation

michaelsproul
Member

@michaelsproul michaelsproul commented Sep 9, 2022

Issue Addressed

While digging around in some logs I noticed that queries for validators by pubkey were taking 10ms+, which seemed too long. This was due to a loop through the entire validator registry for each lookup.

Proposed Changes

Rather than using a loop through the registry, this PR utilises the pubkey cache, which is usually initialised at the head*. In case the cache isn't built, we fall back to the previous loop logic. In the vast majority of cases I expect the cache will be built, as the validator client queries at the head where all caches should be built.
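
For illustration only, here's a minimal Rust sketch of the lookup strategy described above. The type and function names (`Validator`, `validator_index_by_pubkey`, the `HashMap`-backed cache) are hypothetical stand-ins, not the actual Lighthouse types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real registry types.
type PublicKeyBytes = [u8; 48];

struct Validator {
    pubkey: PublicKeyBytes,
    // ...other registry fields elided
}

/// Resolve a validator index from a pubkey, preferring an O(1) cache lookup
/// and only falling back to a linear scan when the cache isn't built.
fn validator_index_by_pubkey(
    pubkey_cache: Option<&HashMap<PublicKeyBytes, usize>>,
    validators: &[Validator],
    pubkey: &PublicKeyBytes,
) -> Option<usize> {
    match pubkey_cache {
        // Fast path: the cache is usually built for the head state.
        Some(cache) => cache.get(pubkey).copied(),
        // Slow path: previous behaviour, scan the whole registry.
        None => validators.iter().position(|v| &v.pubkey == pubkey),
    }
}
```

With thousands of inactive keys, the scan in the slow path is what produced the 10ms+ lookups mentioned above.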

Additional Info

*I had to modify the cache build that runs after fork choice to build the pubkey cache. I think it had been optimised out, perhaps accidentally. I think it's preferable to have the exit cache and the pubkey cache built on the head state, as they are required for verifying exits and deposits respectively, and we may as well build them off the hot path of block processing. Previously they'd get built the first time a deposit or exit needed to be verified.
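
As a rough, hedged sketch of what building these caches off the hot path might look like (the names `HeadState`, `build_exit_cache`, `update_pubkey_cache` and `prime_head_caches` are illustrative assumptions, not the actual Lighthouse API):

```rust
// Illustrative only: prime the caches on the new head state after fork choice,
// so later exit/deposit verification and HTTP lookups don't pay the build cost.
#[derive(Default)]
struct HeadState {
    exit_cache_built: bool,
    pubkey_cache_built: bool,
}

impl HeadState {
    fn build_exit_cache(&mut self) {
        if !self.exit_cache_built {
            // ...scan the registry and record exit churn per epoch
            self.exit_cache_built = true;
        }
    }

    fn update_pubkey_cache(&mut self) {
        if !self.pubkey_cache_built {
            // ...map each validator pubkey to its registry index
            self.pubkey_cache_built = true;
        }
    }
}

/// Called once per head update, after fork choice has run.
fn prime_head_caches(head: &mut HeadState) {
    head.build_exit_cache();
    head.update_pubkey_cache();
}
```

If the caches are already built this is effectively a no-op, which is why doing it eagerly on the head should be cheap in the common case.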

I've deleted the unused map_state function which was obsoleted by map_state_and_execution_optimistic.

@michaelsproul michaelsproul added work-in-progress PR is a work-in-progress optimization Something to make Lighthouse run more efficiently. HTTP-API labels Sep 9, 2022
@michaelsproul
Member Author

michaelsproul commented Sep 9, 2022

Here's some performance info from some of our Goerli nodes with thousands of inactive validator keys:

[image: speedup]

This is the time required for the VC to update all its validator indices from the BN, which drops from 110s to 83s on one VC (-25%) and from 29s to 14s on the other (-51%).

We can see that the underlying time the BN spends on each request is reduced by an even larger factor. Before this PR the times are in the 2.5-6ms range, and afterwards they are in the 50-250µs range. Before:

Sep 09 06:48:21.113 DEBG Processed HTTP API request method: GET, path: /eth/v1/beacon/states/head/validators/0xa9dfac8705467d09bdf4db8e68cf6ef54e45294c432b180796634777829704429444d0b28eeed4a66a770bf3f1a104d2, status: 404 Not Found, elapsed: 2.298924ms

After:

Sep 09 06:52:33.637 DEBG Processed HTTP API request method: GET, path: /eth/v1/beacon/states/head/validators/0xa9dfac8705467d09bdf4db8e68cf6ef54e45294c432b180796634777829704429444d0b28eeed4a66a770bf3f1a104d2, status: 404 Not Found, elapsed: 58.289µs

That's a 10-50x reduction (at least -90%).

@michaelsproul michaelsproul added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Sep 9, 2022
@paulhauner
Member

cheers

@deauna, this review looks a lot like engagement farming to me and I will ban your account from this repository if it persists. I apologise sincerely if I've read this wrong.


@paulhauner paulhauner left a comment


Nice one, I've also been giving this issue the side-eye when poking around Grafana.

I have one comment, but it's not a blocker.

snapshot
    .beacon_state
    .build_all_committee_caches(&self.spec)
Member


It may be a little heavy-handed to build the exit cache here as well, but I don't feel too strongly about that. IIRC we don't always need the exit cache to process a block.

Member Author


I figure it's quasi-equivalent to building it on-demand when it is eventually required. If we process a block and it becomes head, chances are its descendants will remain canonical and the exit cache will be required eventually, at which point we've paid once to build it (when the first ancestor was processed). The only case where I think it'll hurt is if there are lots of re-orgs that miss the snapshot cache and load a fresh state without any caches. We hope that these are rare. These re-orgs and these sorts of caching issues are also mitigated by tree-states, where all the states share common CoW caches rooted at the finalized state.

Member


My (potentially shoddy) understanding is that the exit cache is only required when we're processing exits, which happens relatively infrequently. However, given that:

  • The cache we're building is on the head state.
  • The head state usually maintains its caches (i.e., they aren't dropped and rebuilt often).
  • Trying to build an already-built exit cache is cheap.

I'm happy with this. I'll slap another approval!

@paulhauner paulhauner added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Sep 28, 2022
@michaelsproul michaelsproul added ready-for-review The code is ready for review and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Oct 5, 2022

@paulhauner paulhauner left a comment


Approved! I understand this PR went back to "ready-for-review" since you were waiting for my response to a comment.

@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Oct 15, 2022
@michaelsproul
Member Author

Sweet!

bors r+

bors bot pushed a commit that referenced this pull request Oct 15, 2022
@bors bors bot changed the title Optimise HTTP validator lookups [Merged by Bors] - Optimise HTTP validator lookups Oct 16, 2022
@bors bors bot closed this Oct 16, 2022
@michaelsproul michaelsproul deleted the opt-validator-lookup branch October 16, 2022 01:05
bors bot pushed a commit that referenced this pull request Oct 26, 2022
## Issue Addressed

This reverts commit ca9dc8e (PR #3559) with some modifications.

## Proposed Changes

Unfortunately that PR introduced a performance regression in fork choice. The optimisation _intended_ to build the exit and pubkey caches on the head state _only if_ they were not already built. However, due to the head state always being cloned without these caches, we ended up building them every time the head changed, leading to a ~70ms+ penalty on mainnet.

https://github.com/sigp/lighthouse/blob/fcfd02aeec435203269b03865e3ccc23e5f51e6d/beacon_node/beacon_chain/src/canonical_head.rs#L633-L636

I believe this is a severe enough regression to justify immediately releasing v3.2.1 with this change.

## Additional Info

I didn't fully revert #3559, because there were some unrelated deletions of dead code in that PR which I figured we may as well keep.

An alternative would be to clone the extra caches, but this likely still imposes some cost, so in the interest of applying a conservative fix quickly, I think reversion is the best approach. The optimisation from #3559 was not even optimising a particularly significant path, it was mostly for VCs running larger numbers of inactive keys. We can re-do it in the `tree-states` world where cache clones are cheap.
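
To make the failure mode concrete, here is a hedged sketch (hypothetical names, not the actual Lighthouse code; the real code is at the permalink above). The "build only if missing" step never becomes a no-op because the cloned head state never carries these caches in the first place:

```rust
// Illustrative only. The clone keeps some caches (e.g. committee caches) but
// drops the exit and pubkey caches, so the "build if missing" step below runs
// in full on every head change instead of being a cheap no-op.
#[derive(Clone, Default)]
struct SnapshotState {
    exit_cache_built: bool,
    pubkey_cache_built: bool,
}

impl SnapshotState {
    // Stand-in for cloning the snapshot state without the extra caches.
    fn clone_without_extra_caches(&self) -> Self {
        SnapshotState::default()
    }

    fn build_missing_caches(&mut self) {
        if !self.exit_cache_built {
            // Full rebuild over the validator registry: the ~70ms+ penalty on mainnet.
            self.exit_cache_built = true;
        }
        if !self.pubkey_cache_built {
            self.pubkey_cache_built = true;
        }
    }
}

fn on_new_head(snapshot: &SnapshotState) {
    // Because the clone discards the caches, this rebuild cost is paid every time.
    let mut head_state = snapshot.clone_without_extra_caches();
    head_state.build_missing_caches();
}
```
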
macladson pushed a commit to macladson/lighthouse that referenced this pull request Jan 5, 2023
macladson pushed a commit to macladson/lighthouse that referenced this pull request Jan 5, 2023
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024