Unsaturate dial negotiation queue #6711

Merged 1 commit on Dec 17, 2024

Conversation

AgeManning
Member

@AgeManning AgeManning commented Dec 17, 2024

Issue Addressed

It seems to me there is a bug: if dials fail with IO errors, we retry the connection a number of times before failing the RPC request. However, each retry adds to the dial_negotiated count again.

I suspect that, after a number of failures, the dial_negotiated count can saturate at our max of 8 concurrent dials. In that case, we will no longer perform any further dials. Our dial queue will just grow and grow and nothing will happen, because the RPC handler thinks we already have too many dials in progress.

We've witnessed logs of this form. I haven't tested this, but I think my logic here is sound.

I think this resolves #6703.

@AgeManning
Member Author

For reviewers.

The condition for us to send a message is here:
https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/rpc/handler.rs#L781

if !self.dial_queue.is_empty() && self.dial_negotiated < self.max_dial_negotiated {
    self.dial_negotiated += 1;
    // ...

We increment this counter on every dial. But notice that a dial only starts while the counter is less than self.max_dial_negotiated, which is some constant like 8.

Then when this dial fails:
https://github.com/sigp/lighthouse/blob/stable/beacon_node/lighthouse_network/src/rpc/handler.rs#L962

        let (id, req) = request_info;

        // map the error
        let error = match error {
            StreamUpgradeError::Timeout => RPCError::NegotiationTimeout,
            StreamUpgradeError::Apply(RPCError::IoError(e)) => {
                self.outbound_io_error_retries += 1;
                if self.outbound_io_error_retries < IO_ERROR_RETRIES {
                    self.send_request(id, req);
                    return;
                }
                RPCError::IoError(e)
            }
            StreamUpgradeError::NegotiationFailed => RPCError::UnsupportedProtocol,
            StreamUpgradeError::Io(io_err) => {
                self.outbound_io_error_retries += 1;
                if self.outbound_io_error_retries < IO_ERROR_RETRIES {
                    self.send_request(id, req);
                    return;
                }
                RPCError::IoError(io_err.to_string())
            }
            StreamUpgradeError::Apply(other) => other,
        };

        // This dialing is now considered failed
        self.dial_negotiated -= 1;

        self.outbound_io_error_retries = 0;
        self.events_out
            .push(HandlerEvent::Err(HandlerErr::Outbound {
                error,
                proto: req.versioned_protocol().protocol(),
                id,
            }));
    }
}

When we retry, we return early and never decrement this counter. I think this means the retried dial increments it again, so we can artificially inflate the count up to our limit and get the dial_queue stuck.
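To make the suspected leak concrete, here is a minimal toy model of the counter bookkeeping. It is a sketch, not the Lighthouse handler: `ToyHandler`, `poll_dial`, `on_io_error`, `run`, and the `decrement_before_retry` flag are hypothetical names, the value 3 for `IO_ERROR_RETRIES` is an assumption, and `send_request` is assumed to simply push onto the dial queue. Releasing the slot before the retry is one possible fix, shown only for comparison.

```rust
use std::collections::VecDeque;

// Assumed values for illustration; the real constants live in the handler.
const MAX_DIAL_NEGOTIATED: usize = 8;
const IO_ERROR_RETRIES: u8 = 3;

/// Toy model of the dial bookkeeping (hypothetical, not the Lighthouse type).
struct ToyHandler {
    dial_queue: VecDeque<u32>,
    dial_negotiated: usize,
    outbound_io_error_retries: u8,
    /// When true, release the dial slot before deciding whether to retry.
    decrement_before_retry: bool,
}

impl ToyHandler {
    fn new(decrement_before_retry: bool) -> Self {
        ToyHandler {
            dial_queue: VecDeque::new(),
            dial_negotiated: 0,
            outbound_io_error_retries: 0,
            decrement_before_retry,
        }
    }

    /// Assumption: sending a request just queues it for dialing.
    fn send_request(&mut self, id: u32) {
        self.dial_queue.push_back(id);
    }

    /// Mirrors the gate quoted above: dial only while below the limit.
    fn poll_dial(&mut self) -> Option<u32> {
        if !self.dial_queue.is_empty() && self.dial_negotiated < MAX_DIAL_NEGOTIATED {
            self.dial_negotiated += 1;
            return self.dial_queue.pop_front();
        }
        None
    }

    /// Models a dial that failed with an IO error.
    fn on_io_error(&mut self, id: u32) {
        if self.decrement_before_retry {
            self.dial_negotiated -= 1;
        }
        self.outbound_io_error_retries += 1;
        if self.outbound_io_error_retries < IO_ERROR_RETRIES {
            self.send_request(id);
            // Early return: in the leaky variant the slot is never released.
            return;
        }
        if !self.decrement_before_retry {
            self.dial_negotiated -= 1;
        }
        self.outbound_io_error_retries = 0;
    }
}

/// Fails every dial; returns (dial_negotiated, requests still queued).
fn run(decrement_before_retry: bool, requests: u32) -> (usize, usize) {
    let mut handler = ToyHandler::new(decrement_before_retry);
    for id in 0..requests {
        handler.send_request(id);
    }
    while let Some(id) = handler.poll_dial() {
        handler.on_io_error(id);
    }
    (handler.dial_negotiated, handler.dial_queue.len())
}

fn main() {
    println!("leaky: {:?}", run(false, 20));
    println!("fixed: {:?}", run(true, 20));
}
```

With these assumed constants, the leaky variant stops with the counter pinned at 8 and 17 requests still queued, while the variant that releases the slot first drains the queue and returns the counter to 0.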

Member

@jimmygchen jimmygchen left a comment

Great find!

@jimmygchen jimmygchen added the ready-for-merge label and removed the ready-for-review label Dec 17, 2024
Collaborator

@dapplion dapplion left a comment

Epic find

@jimmygchen
Member

@mergify queue

mergify bot commented Dec 17, 2024

queue

🛑 The pull request has been removed from the queue default

The merge conditions cannot be satisfied due to failing checks.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

mergify bot added a commit that referenced this pull request Dec 17, 2024
Member

@jxs jxs left a comment

LGTM age!

mergify bot commented Dec 17, 2024

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

@jimmygchen
Member

@mergify requeue

mergify bot commented Dec 17, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

mergify bot commented Dec 17, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 1315c94

mergify bot added a commit that referenced this pull request Dec 17, 2024
@mergify mergify bot merged commit 1315c94 into sigp:unstable Dec 17, 2024
29 checks passed
Labels: Networking, ready-for-merge
4 participants