-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unsaturate dial negotiation queue #6711
Conversation
For reviewers. The condition for us to send a message is here: if !self.dial_queue.is_empty() && self.dial_negotiated < self.max_dial_negotiated {
self.dial_negotiated += 1; We add to this number. But notice this number must be less than self.max_dial_negotiated, which is some constant like 8. Then when this dial fails: let (id, req) = request_info;
// map the error
let error = match error {
StreamUpgradeError::Timeout => RPCError::NegotiationTimeout,
StreamUpgradeError::Apply(RPCError::IoError(e)) => {
self.outbound_io_error_retries += 1;
if self.outbound_io_error_retries < IO_ERROR_RETRIES {
self.send_request(id, req);
return;
}
RPCError::IoError(e)
}
StreamUpgradeError::NegotiationFailed => RPCError::UnsupportedProtocol,
StreamUpgradeError::Io(io_err) => {
self.outbound_io_error_retries += 1;
if self.outbound_io_error_retries < IO_ERROR_RETRIES {
self.send_request(id, req);
return;
}
RPCError::IoError(io_err.to_string())
}
StreamUpgradeError::Apply(other) => other,
};
// This dialing is now considered failed
self.dial_negotiated -= 1;
self.outbound_io_error_retries = 0;
self.events_out
.push(HandlerEvent::Err(HandlerErr::Outbound {
error,
proto: req.versioned_protocol().protocol(),
id,
}));
}
} We can sometimes not decrement this constant, because we re-try. I think this means we can artificially increase this number, over our limit and get the dial_queue stuck. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Epic find
@mergify queue |
🛑 The pull request has been removed from the queue
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM age!
This pull request has been removed from the queue for the following reason: The merge conditions cannot be satisfied due to failing checks: You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it. If you want to requeue this pull request, you need to post a comment with the text: |
@mergify requeue |
✅ This pull request will be re-embarked automaticallyThe followup |
✅ The pull request has been merged automaticallyThe pull request has been merged automatically at 1315c94 |
Issue Addressed
It seems to me there is a bug, where if we fail dials via io errors, we attempt a re-connection for a number of times before failing the RPC request. However, in the re-attempts we keep adding to the
dial_negotiation
count.I suspect, after a number of failures, the
dial_negotiation
count can saturate and become larger than our max of 8 concurrent dials. In which case, we will no longer perform any further dials. Our dial queue will just grow and grow and nothing will happen because the RPC thinks we have already too many dials.We've witnessed logs of this form. I haven't tested this, but I think my logic here is sound.
I think resolves #6703