
Auto retry message on connection failure #1856

Closed · wants to merge 83 commits

Conversation

NickCraver (Collaborator)

A move of #1755 into the repo so we can collab better (as discussed!).

Here we're getting command retries (at first for reconnects) into the primary library for users that want to enable this.
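
For illustration only, a rough sketch of what opting in might look like from user code; the CommandRetryPolicy option and the AlwaysRetryOnConnectionException policy are assumptions based on this PR's direction, not a shipped StackExchange.Redis API:

    // Hypothetical opt-in sketch; CommandRetryPolicy and AlwaysRetryOnConnectionException
    // are assumptions based on this PR's direction, not shipped StackExchange.Redis API.
    using StackExchange.Redis;

    var options = ConfigurationOptions.Parse("localhost:6379");
    options.CommandRetryPolicy = CommandRetryPolicy.AlwaysRetryOnConnectionException; // assumed option
    using var muxer = await ConnectionMultiplexer.ConnectAsync(options);

    // From here, commands that fail because the connection dropped would be queued
    // and re-issued by the policy once the connection is re-established.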

deepakverma and others added 30 commits starting June 7, 2021, including:
• Allow SetWriteTime for sync messages as well
• Support for string-based IRetry configuration option
    {
        startProcessor = _queue.Count > 0;
    }
    if (startProcessor)
NickCraver (Collaborator, Author):
Can simplify to:

    if (_queue.Count == 0)
    {
        return;
    }

    private readonly int? _maxRetryQueueLength;
    private readonly bool _runRetryLoopAsync;

    internal MessageRetryQueue(IMessageRetryHelper messageRetryHelper, int? maxRetryQueueLength = null, bool runRetryLoopAsync = true)
NickCraver (Collaborator, Author):

@deepakverma Trying to figure out the scenario in which runRetryLoopAsync would be false (triggering a .Wait() below), can you shed more light on that scenario/path?

    {
        var task = Task.Run(ProcessRetryQueueAsync);
        if (task.IsFaulted)
            throw task.Exception;
NickCraver (Collaborator, Author), Sep 7, 2021:
This needs to be a thread kickoff, and we should probably track state internally.

Overall, we probably need state here (like the message backlog processor) that we interlock to change. This can help us know whether an existing thread is started/running (so we don't start multiple), and we can add that state to exception messages (like we do for the message backlog today). This means we'll need to expose the status on the interface in some way.
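
As a rough illustration of the interlocked-state + thread kickoff being described here, a minimal sketch (the class, names, and the ProcessRetryQueue stub are illustrative, not the actual implementation):

    using System.Threading;

    // Minimal sketch of a single-worker thread kickoff guarded by interlocked state.
    internal sealed class RetryQueueWorkerSketch
    {
        private const int NotRunning = 0, Running = 1;
        private int _state = NotRunning;

        // Could be exposed so exception messages can report whether a flush is in flight.
        internal bool IsProcessing => Volatile.Read(ref _state) == Running;

        internal void StartRetryWorkerIfNeeded()
        {
            // Atomically flip NotRunning -> Running so at most one worker thread starts.
            if (Interlocked.CompareExchange(ref _state, Running, NotRunning) != NotRunning)
                return; // a worker is already draining the queue

            var thread = new Thread(() =>
            {
                try { ProcessRetryQueue(); }
                finally { Interlocked.Exchange(ref _state, NotRunning); }
            })
            {
                IsBackground = true,
                Name = "Redis.MessageRetryQueue", // illustrative name
            };
            thread.Start();
        }

        private void ProcessRetryQueue() { /* drain and re-issue queued commands */ }
    }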

    /// <summary>
    /// Returns the current length of the retry queue.
    /// </summary>
    public abstract int CurrentQueueLength { get; }
NickCraver (Collaborator, Author):
For exception messages, we probably also want a CurrentlyProcessing (bool) here.
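
For instance, a minimal sketch of how that status might be attached to exception data (illustrative only; CurrentlyProcessing is the proposed bool and the interface shape is an assumption):

    using System;

    // Illustrative only: attach retry-queue status to an exception being built, similar
    // to how the message backlog count is surfaced today. The IRetryQueueStatus shape
    // and CurrentlyProcessing are assumptions, not existing members.
    internal interface IRetryQueueStatus
    {
        int CurrentQueueLength { get; }
        bool CurrentlyProcessing { get; }
    }

    internal static class RetryQueueExceptionHelper
    {
        internal static void AddRetryQueueStatus(Exception ex, IRetryQueueStatus queue)
        {
            ex.Data["retry-queue-length"] = queue.CurrentQueueLength;
            ex.Data["retry-queue-processing"] = queue.CurrentlyProcessing;
        }
    }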

NickCraver (Collaborator, Author) commented Sep 7, 2021

Had a sync call on this just now; wanted to summarize and get agreement, in no particular order:

  1. I think the changes in #1779 ("messages get sent out-of-order when a backlog is created by another thread") create an issue we failed to realize: the handshake itself will get queued behind other messages in the backlog. This means, especially in authenticated scenarios, we'll cycle failures by issuing commands before the AUTH is reached in the pipe.
  2. The Task-based retry flush is subject to the same problems as #1847 ("Message Backlog: Task pool exhaustion scenario with synchronous calls"), so we should switch to a Thread there.
  3. There's a big interaction question between how the retry policy works (at the multiplexer level) and how the per-bridge retry (via the backlog) works at the bridge level. As it stands, upon failure we'd retry all commands via the PhysicalBridge backlog upon reconnect, but importantly those come back with a status of successful. This means they won't be touched by the retry policy, which thinks they succeeded.
  4. The above is complicated, but ultimately the reason the backlog exists was AUTH command ordering in the handshake phase of a connection, i.e. to keep the configuration and handshake phases of a connection sequence in order. Instead, what if we removed the PhysicalBridge backlog completely?
    • This needs investigation into whether we can, but if all the retry mechanics lived only in the CommandRetryPolicy (and our built-in policy is effectively "always"), then there'd be no backlog at the bridge layer. Commands issued directly down there (e.g. handshakes) would fail, like we want, with no retry (also like we want).
    • Message would need to grow a reference to the Server it was queued for, if it was queued to a specific server (most things are not). If a Server is present, we'd want to retry on that specific server (with something like a `REPLICAOF` or `KEYS` command you really don't want to arbitrarily redirect); see the sketch after this list.
    • Overall, it'd be a huge simplification: system commands issued beneath the multiplexer would not be subject to retry or any of the related mechanics the user specified; e.g. WriteDirectOrQueueFireAndForgetAsync would not be subject to the policy or any backlogging at all.
      • However, I'm not sure this works reliably; it's possible we subtly depend on the backlog today to complete connections in ways we didn't fully reason about on the call. Need to go heads-down on this for a while and try things.
  5. The retry queue needs a bit more status exposed and added to the exception messages. Right now we wouldn't have any idea that there's a ton backlogged and waiting. To this end, we likely need at least a boolean status of whether it's trying to flush at the moment, and to add the existing pending count to the exception data.
  6. This is a big change, and if the above is correct, we've got a regression in the 2.x line. Since we know this is a big change to the default retry mechanics, the proposal is to start on the 3.x branch line and aim this PR at that, allowing us to do alpha releases alongside 2.x, and get out of the critical path. We'll want to hammer these new approaches well and I think their appropriate status is pre-release anyhow.
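
As a rough illustration of the server-affinity idea in point 4 (the sketch referenced above); everything here is a stand-in shape for discussion, not real StackExchange.Redis members:

    using System.Threading.Tasks;

    // Hypothetical sketch of point 4: a failed command remembers the server it was
    // originally queued for, so server-targeted commands (e.g. REPLICAOF, KEYS) are not
    // arbitrarily redirected on retry. All members here are stand-ins for illustration.
    internal sealed class FailedCommandSketch
    {
        private readonly object _message;        // stand-in for the library's Message
        private readonly object? _targetServer;  // stand-in for a server reference; null => any appropriate server

        public FailedCommandSketch(object message, object? targetServer = null)
        {
            _message = message;
            _targetServer = targetServer;
        }

        // On retry, pass along the original server (if any) so the reissuing code can
        // pin the command to it instead of picking a server arbitrarily.
        public Task RetryAsync(IReissuer reissuer) => reissuer.ReissueAsync(_message, _targetServer);
    }

    internal interface IReissuer
    {
        Task ReissueAsync(object message, object? server);
    }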

For number 1, if @TimLovellSmith agrees with the observations here, we should revert that change from the 2.x line and get a release out there with the few other misc fixes in queue. Can others please take a look at the above and provide thoughts? If we're agreed I'll start hacking at this on days off - if we're agreed on the regression we could get a new cut out there quickly and move the intended behavior into the 3.x line.

cc @mgravell @philon-msft @carldc @deepakverma

TimLovellSmith (Collaborator) commented Sep 8, 2021

@NickCraver

the handshake itself will get queued behind other message in the backlog. This means especially in authenticated scenarios we'll cycle failure by issuing commands before the AUTH is gotten to in the pipe.

This is interesting. Rather than agreeing we completely revert the out-of-order fix, I have a couple of other options I am wondering about:

A) Why not have a way to bypass the backlog ordering constraint completely, useful mainly for commands that are run internally to re-establish the connection and maintain connection state, which don't need to care about ordering relative to end-user commands...? This is much like the workaround I had in my original out-of-order-fix PR to address race conditions in connection establishment that were related to these internal commands and were causing test failures. You all might have to remind me what the main objections to that special-casing logic were again...

B) or... we could instead call it by design, that unauthenticated commands can fail for the sake of 'consistency' and 'security' when it seems that AUTH was sent AFTER the other command. (More of a 'least surprise' approach than real security perhaps - but I see possible benefits to avoiding things which LOOK like security bugs. Imagine hearing "my command was sent before auth and yet it succeeded, how?" all the time, and then having to distinguish that from real security issues...)

Sounds like you prefer A rather than B.
I think I'm OK with that in the case where we're talking about AUTH commands issued by the multiplexer based on automatic connection management from configuration -- but I can't help feeling a little doubt about whether this is also the behavior someone issuing their own AUTH commands manually would expect, if anyone does that.

TimLovellSmith (Collaborator):
A follow-on thought... it's sort of like how auto-retry is not really the right retry strategy for everything. Like what? Well, we already have this 'CanRetry' concept, which is an acknowledgement that some commands aren't suitable for auto-retrying, especially internal connection-state-management commands.

        internal bool CanRetry(Message message, Exception ex)
        {
            if ((message.Flags & CommandFlags.NoRetry) != 0
                || ((message.Flags & CommandFlags.RetryIfNotSent) != 0 && message.Status == CommandStatus.Sent)
                || message.IsAdmin
                || message.IsInternalCall
                || !(ex is RedisException)
                || !ShouldRetry(message.Status))
            {
                return false;
            }

            return true;
        }

You might or might not believe that internal management commands would like to have their order preserved relative to each other.... I wonder which?

Does this call for separate backlogs?

NickCraver (Collaborator, Author):
@TimLovellSmith I think if we remove the message backlog at the physical bridge layer altogether, the handshake bits simply won't be subject to the retry policy, since they're lower and writing to the pipe further down. I think this will be okay because they're in order and we're now awaiting each command as it's written down (whereas before we had the races). So instead of separate backlogs, connection handshake stuff just isn't backlogged at all; it goes direct. When it completes, the CommandRetryPolicy should kick in and begin flushing. Does that approach make sense?
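
A minimal sketch of the flow being described (the class and method names below are stand-ins for illustration, not the actual PhysicalBridge/ConnectionMultiplexer internals):

    using System.Threading.Tasks;

    // Sketch only: "handshake goes direct to the pipe; the retry policy flushes after".
    internal sealed class ReconnectFlowSketch
    {
        public async Task OnReconnectedAsync()
        {
            // Handshake commands (AUTH, CLIENT, SELECT, re-subscribes, ...) are written
            // straight down, in order, awaiting each write; no backlog, no retry policy.
            await WriteHandshakeDirectAsync();

            // Only once the handshake has completed does the retry policy start flushing
            // the user commands it captured while the connection was down.
            StartRetryPolicyFlush();
        }

        private Task WriteHandshakeDirectAsync() => Task.CompletedTask; // stand-in
        private void StartRetryPolicyFlush() { }                        // stand-in
    }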

TimLovellSmith (Collaborator):
@NickCraver I wonder if I am understanding it. Is the proposal to remove the backlog altogether? Or is it just to bypass the backlog and retry stuff for handshakes?

My understanding is that a fairly important function of the backlog is also to reduce lock contention for writers by ensuring that async writes don't have to block to acquire the lock on the socket. I.e. performance optimization.

NickCraver (Collaborator, Author):
Is the proposal to remove the backlog altogether?

Yep, this!

My understanding is that a fairly important function of the backlog is also to reduce lock contention for writers by ensuring that async writes don't have to block to acquire the lock on the socket. I.e. performance optimization.

Hmmm I wouldn't say that's a critical function of the backlog - it was originally created to order AUTH correctly in the handshake (which as you helped me debug, wasn't even quite right). If we're backlogged or retrying at all, the amount of time we're spending flushing and sending to Redis should completely dwarf backlog contention - if it's not dwarfing it, we probably have a spiral of death situation with anyone running that hot after a disconnect.

There's still the question of what order things go from callers down to the socket when there is a retry queue: either we disregard ordering completely and let them have it out on the lock, or we buffer into the queue similar to the backlog today. Thoughts?

TimLovellSmith (Collaborator):
The problem I'm thinking of is indeed with the apps that run very hot. Normally they would be fine (looks good enough to put into prod...), but a couple of times in the past I have seen clues in dumps that when usage suddenly spikes, lock contention on the write lock becomes one of their big problems/contributing factors.
The backlog (theoretically) acts as a kind of pressure release / catch-up optimization: once triggered, it basically just has to acquire the lock once to write out all the queued items and be all caught up on writes again, without lots of synchronization or switching overhead.
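
A minimal sketch of that pressure-release behavior, assuming a write lock and a queue of pending writes (illustrative, not the actual PhysicalBridge code):

    using System.Collections.Concurrent;
    using System.Threading;

    // Illustrative: drain the whole backlog under a single lock acquisition, so a burst
    // of queued writes is flushed without per-message contention on the write lock.
    internal sealed class BacklogDrainSketch
    {
        private readonly object _writeLock = new();
        private readonly ConcurrentQueue<byte[]> _backlog = new(); // stand-in for queued messages

        public void Enqueue(byte[] payload) => _backlog.Enqueue(payload);

        public void Drain()
        {
            // If another writer already holds the lock, skip; the real code is more
            // careful about the race where items land just as the holder finishes.
            if (!Monitor.TryEnter(_writeLock)) return;
            try
            {
                // One lock acquisition covers the whole burst of pending writes.
                while (_backlog.TryDequeue(out var payload))
                {
                    WriteToSocket(payload);
                }
            }
            finally
            {
                Monitor.Exit(_writeLock);
            }
        }

        private static void WriteToSocket(byte[] payload) { /* stand-in for the real write */ }
    }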

TimLovellSmith (Collaborator):
PS I am not really sure whether AutoRetry is intended to break guarantees on command ordering...

Guarantees might be achievable, but only by saying there's one backlog for everything, everything stays in it until successfully retried, and so no commands can get sent until all commands in front of them already are known to have succeeded, which is probably going to do terrible things to throughput.

So I kinda assume AutoRetry breaks ordering guarantees.

NickCraver (Collaborator, Author):
If we think that's an issue (and I agree it is if you're seeing it in those logs), we could perhaps provide a similar mechanism for the retry policy to take an exclusive lock during the flush. Do you think that'd give us the best of both worlds?

TimLovellSmith (Collaborator):
I believe the part which is the beneficial optimization is the part where lock contention is avoided for ordinary writes, and I don't know any good way to keep that AND have async APIs, and have no reordering bugs... except a backlog.

So maybe we can reconsider the problem from scratch? Is the problem that AUTH commands get queued behind other new writes that are incoming while we are reconnecting? If that is it, that seems like it was probably always a problem with backlog - regardless of autoretry.

Another way of explaining the same problem: we actually DON'T want consistent ordering of messages, or auto-retry, for things like AUTH commands while we are reconnecting. We want these commands to be special and sent immediately.

The proposed AutoRetry already treats such commands specially. By which I mean it doesn't automatically retry them. My question is therefore why not have the PhysicalBridge treat these special commands specially too? Just have them be completely opted out of consistent delivery order and bypass the backlog, and the problem is solved, isn't it?
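
Roughly what that bypass might look like; a sketch only (the write/backlog helpers and SketchMessage are stand-ins, while IsInternalCall mirrors the existing flag referenced in CanRetry above):

    using System.Threading.Tasks;

    // Sketch of option A: internal connection-management commands (AUTH, handshake PING,
    // etc.) skip the backlog and go straight to the pipe; user commands keep the ordered
    // backlog path. All members here are stand-ins for illustration.
    internal sealed class BridgeWriteSketch
    {
        public Task WriteMessageAsync(SketchMessage message)
        {
            if (message.IsInternalCall)
            {
                // Bypass: no ordering relative to user commands, and no auto-retry either.
                return WriteDirectAsync(message);
            }

            EnqueueBacklog(message);       // stays ordered relative to other user commands
            return ProcessBacklogAsync();  // flushed in order once the writer can run
        }

        private Task WriteDirectAsync(SketchMessage message) => Task.CompletedTask; // stand-in
        private void EnqueueBacklog(SketchMessage message) { }                      // stand-in
        private Task ProcessBacklogAsync() => Task.CompletedTask;                   // stand-in
    }

    internal sealed class SketchMessage
    {
        public bool IsInternalCall { get; init; }
    }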

NickCraver (Collaborator, Author):
I believe the part which is the beneficial optimization is the part where lock contention is avoided for ordinary writes, and I don't know any good way to keep that AND have async APIs, and have no reordering bugs... except a backlog.

I think this is where we may be talking past each other. The idea in my mind is that the CommandRetryPolicy is the backlog... the only backlog, in this proposal. Otherwise, we basically have two levels of backlog, which is hard to even reason about.

NickCraver added a commit that referenced this pull request Sep 10, 2021
See #1856 (comment) for a discussion here.

Note: this DOES NOT remove the PhysicalBridge backlog queue...because there are more issues discovered there. Will write up on GitHub.
NickCraver (Collaborator, Author):

@TimLovellSmith @mgravell I'm getting changes for the above in #1857, but note I have NOT removed the PhysicalBridge backlog, because there's even more fun there digging in. Any chance we could get together early next week and talk through? Changes thus far are cleanup, simplification, status additions, Thread-based flushing, additional tests, and a move to FailedCommand for users to more easily create policies here.

NickCraver (Collaborator, Author):
Closing this in prep for the #1864 approach of a triple queue.

NickCraver closed this Oct 21, 2021
NickCraver deleted the command-retry branch March 7, 2022