Auto retry message on connection failure #1856
Allow SetWriteTime for sync messages as well
…design is solidified
# Conflicts:
#   src/StackExchange.Redis/PhysicalBridge.cs
support for string based IRetry configuration option
{
    startProcessor = _queue.Count > 0;
}
if (startProcessor)
Can simplify to:
if(_queue.Count == 0)
{
return;
}
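For context, a minimal sketch of how that early-return shape could look (only `_queue` comes from the diff; the method name, `_lock`, and `StartRetryQueueProcessor` are illustrative assumptions):

```csharp
// Sketch only: hypothetical kickoff method using the suggested early return.
private void TryStartProcessor()
{
    lock (_lock)
    {
        // Nothing queued, so there is nothing to process.
        if (_queue.Count == 0)
        {
            return;
        }
    }

    // Only reached when at least one message is waiting to be retried.
    StartRetryQueueProcessor();
}
```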
private readonly int? _maxRetryQueueLength;
private readonly bool _runRetryLoopAsync;

internal MessageRetryQueue(IMessageRetryHelper messageRetryHelper, int? maxRetryQueueLength = null, bool runRetryLoopAsync = true)
@deepakverma Trying to figure out the scenario in which `runRetryLoopAsync` would be false (triggering a `.Wait()` below), can you shed more light on that scenario/path?
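For reference, the shape being asked about is roughly the following (a sketch assuming the flag and method names visible in the diff, not verified against the actual implementation):

```csharp
// Sketch only: the two kickoff paths implied by the runRetryLoopAsync option.
private void StartRetryQueueProcessor()
{
    if (_runRetryLoopAsync)
    {
        // Async path: drain the retry queue on a thread-pool thread.
        var task = Task.Run(ProcessRetryQueueAsync);
        if (task.IsFaulted)
            throw task.Exception;
    }
    else
    {
        // Sync path: block the caller until the queue is drained.
        // This is the .Wait() scenario the question above refers to.
        ProcessRetryQueueAsync().Wait();
    }
}
```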
{
    var task = Task.Run(ProcessRetryQueueAsync);
    if (task.IsFaulted)
        throw task.Exception;
This needs to be a thread kickoff, and we should probably track state internally.
Overall, we probably need a state here (like the message backlog processor) that we interlock to change. This can help us know whether an existing thread is started/running (so we don't start multiple), and we can add that state to exception messages (as we do with the message backlog today). This means we'll need to expose the status on the interface in some way.
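A rough sketch of that kind of interlocked state (all names here are illustrative, loosely modeled on the idea above rather than on the actual backlog processor):

```csharp
// Sketch only: hypothetical processor state tracked via Interlocked so we never
// start more than one retry loop and can report status in exception messages.
private enum RetryProcessorStatus
{
    Inactive = 0,
    Started = 1,
    ProcessingQueue = 2,
}

private int _processorStatus; // holds a RetryProcessorStatus value

private void StartRetryQueueProcessor()
{
    // Bail if another processor is already started/running.
    if (Interlocked.CompareExchange(ref _processorStatus,
            (int)RetryProcessorStatus.Started,
            (int)RetryProcessorStatus.Inactive) != (int)RetryProcessorStatus.Inactive)
    {
        return;
    }

    Task.Run(async () =>
    {
        try
        {
            Interlocked.Exchange(ref _processorStatus, (int)RetryProcessorStatus.ProcessingQueue);
            await ProcessRetryQueueAsync().ConfigureAwait(false);
        }
        finally
        {
            // Drop back to Inactive so a later enqueue can start a new processor.
            Interlocked.Exchange(ref _processorStatus, (int)RetryProcessorStatus.Inactive);
        }
    });
}
```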
/// <summary>
/// Returns the current length of the retry queue.
/// </summary>
public abstract int CurrentQueueLength { get; }
For exception messages, we probably want a `CurrentlyProcessing` (bool) as well.
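For example (sketch only, building on the hypothetical `_processorStatus` field from the earlier sketch):

```csharp
/// <summary>
/// Whether a retry processor is currently running; useful for exception messages.
/// Sketch only: assumes the hypothetical _processorStatus field shown earlier.
/// </summary>
public bool CurrentlyProcessing =>
    Volatile.Read(ref _processorStatus) != (int)RetryProcessorStatus.Inactive;
```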
Had a sync call on this just now; wanted to summarize and get agreement, in no particular order:
For number 1, if @TimLovellSmith agrees with the observations here, we should revert that change from the 2.x line and get a release out there with the few other misc fixes in queue. Can others please take a look at the above and provide thoughts? If we're agreed I'll start hacking at this on days off - if we're agreed on the regression we could get a new cut out there quickly and move the intended behavior into the 3.x line.
This is interesting. Rather than agreeing we completely revert the out-of-order fix, I have a couple of other options I am wondering about:
A) Why not have a way to bypass the backlog ordering constraint completely, useful mainly for commands that are run internally to re-establish the connection and maintain connection state, which don't need to care about ordering relative to end-user commands? This is much like the workaround I had in my original PR for the out-of-order fix, to address race conditions in connection establishment that were related to these internal commands and were causing test failures. You all might have to remind me what the main objections to that special-casing logic were...
B) Or we could instead call it by design that unauthenticated commands can fail, for the sake of 'consistency' and 'security', when it seems that AUTH was sent AFTER the other command. (More of a 'least surprise' approach than real security perhaps, but I see possible benefits in avoiding things which LOOK like security bugs. Imagine hearing "my command was sent before auth and yet it succeeded, how?" all the time, and then having to distinguish that from real security issues...)
Sounds like you prefer A rather than B.
A follow-on thought... it's sort of like how auto-retry is not really the right retry strategy for everything. Like what? Well, we already have this 'CanRetry' concept, which is an acknowledgement that some commands aren't suitable for auto-retrying, especially internal connection state management commands.
You might or might not believe that internal management commands would like to have their order preserved relative to each other... I wonder which? Does this call for separate backlogs?
@TimLovellSmith I think if we remove the message backlog at the physical bridge layer altogether, the handshake bits simply won't be subject to the retry policy, since they're lower and writing to the pipe further down. I think this will be okay because they're in-order and we're now awaiting each command as it's written down (whereas before we had the races). So instead of separate backlogs, connection handshake stuff just isn't backlogged at all; it goes direct. When it completes, the …
@NickCraver I wonder if I am understanding it. Is the proposal to remove the backlog altogether? Or is it just to bypass the backlog and retry stuff for handshakes? My understanding is that a fairly important function of the backlog is also to reduce lock contention for writers, by ensuring that async writes don't have to block to acquire the lock on the socket - i.e. a performance optimization.
Yep, this!
Hmmm, I wouldn't say that's a critical function of the backlog - it was originally created for ordering. What order things come from callers down to the socket when there is a retry queue, though: that's a question - either we disregard ordering completely and let them have it out on the lock, or we buffer into the queue similar to the backlog today. Thoughts?
The problem I'm thinking of is indeed with apps that run very hot. Normally they would be fine (looks good enough to put into prod...), but a couple of times in the past I have seen clues in dumps that when usage suddenly spikes, lock contention on the write lock starts to be one of their big problems/contributing factors.
PS I am not really sure whether AutoRetry is intended to break guarantees on command ordering... Guarantees might be achievable, but only by saying there's one backlog for everything, everything stays in it until successfully retried, and no commands can get sent until all commands in front of them are already known to have succeeded, which is probably going to do terrible things to throughput. So I kinda assume AutoRetry breaks ordering guarantees.
If we think that's an issue (agreeing it is if you're seeing it in those logs), we could perhaps provide a similar mechanism for the retry policy to take an exclusive lock during the flush. Do you think that'd give us the best of both worlds?
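Roughly something like this (sketch only; `_flushLock`, `_retryQueue`, and `ResendAsync` are stand-ins for the real write lock, queue, and re-send path):

```csharp
// Sketch only: flush the retry queue while holding an exclusive lock so ordinary
// writers aren't contending for the write lock message-by-message.
private readonly SemaphoreSlim _flushLock = new SemaphoreSlim(1, 1);                  // stand-in for the real write lock
private readonly ConcurrentQueue<object> _retryQueue = new ConcurrentQueue<object>(); // stand-in queue

private async Task FlushRetryQueueAsync()
{
    await _flushLock.WaitAsync().ConfigureAwait(false);
    try
    {
        while (_retryQueue.TryDequeue(out var message))
        {
            await ResendAsync(message).ConfigureAwait(false); // hypothetical re-send path
        }
    }
    finally
    {
        _flushLock.Release();
    }
}

private Task ResendAsync(object message) => Task.CompletedTask; // placeholder for the real write
```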
I believe the part which is the beneficial optimization is the part where lock contention is avoided for ordinary writes, and I don't know any good way to keep that AND have async APIs AND have no reordering bugs... except a backlog. So maybe we can reconsider the problem from scratch? Is the problem that AUTH commands get queued behind other new writes that are incoming while we are reconnecting? If that is it, it seems like it was probably always a problem with the backlog, regardless of autoretry. Another way of explaining the same problem: we actually DON'T want consistent ordering of messages, or autoretry, for things like AUTH commands while we are reconnecting. We want these commands to be special and sent immediately. The proposed AutoRetry already treats such commands specially, by which I mean it doesn't automatically retry them. My question is therefore: why not have the PhysicalBridge treat these special commands specially too? Just have them be completely opted out of consistent delivery order and bypass the backlog, and the problem is solved, isn't it?
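To illustrate the shape of that special-casing (purely a sketch; the flag and both helpers are invented names, not actual PhysicalBridge members):

```csharp
// Sketch only: let internal handshake commands (AUTH, SELECT, etc.) skip the
// backlog and write directly, while everything else keeps ordered delivery.
private Task WriteMessageAsync(object message, bool isInternalHandshake)
{
    if (isInternalHandshake)
    {
        // Bypass ordering entirely so reconnect handshakes can't queue
        // behind end-user commands (and are never auto-retried).
        return WriteDirectAsync(message);
    }

    // Everything else keeps the consistent-ordering path through the backlog.
    EnqueueToBacklog(message);
    return Task.CompletedTask;
}

private Task WriteDirectAsync(object message) => Task.CompletedTask; // placeholder
private void EnqueueToBacklog(object message) { /* placeholder */ }
```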
I think this is where we may be talking past each other. The idea in my mind is that the …
See #1856 (comment) for a discussion here. Note: this DOES NOT remove the PhysicalBridge backlog queue...because there are more issues discovered there. Will write up on GitHub.
@TimLovellSmith @mgravell I'm getting changes for the above in #1857, but note I have NOT removed the …
See #1856 (comment) for a discussion here. Note: this DOES NOT remove the `PhysicalBridge` backlog queue...because there are more issues discovered there. Will write up on the main PR.
Closing this in prep for the #1864 approach of a triple queue.
A move of #1755 into the repo so we can collab better (as discussed!).
Here we're getting command retries (at first for reconnects) into the primary library for users that want to enable this.