
Auto retry message on connection failure #1856

Closed · wants to merge 83 commits

Conversation

NickCraver (Collaborator)

A move of #1755 into the repo so we can collab better (as discussed!).

Here we're getting command retries (at first for reconnects) into the primary library for users that want to enable this.
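
For illustration only, a rough sketch of what opting in might look like from user code; the CommandRetryPolicy option and the AlwaysRetryOnConnectionException policy are assumptions based on this PR's direction, not a shipped StackExchange.Redis API:

    // Hypothetical opt-in sketch; CommandRetryPolicy and AlwaysRetryOnConnectionException
    // are assumptions based on this PR's direction, not shipped StackExchange.Redis API.
    using StackExchange.Redis;

    var options = ConfigurationOptions.Parse("localhost:6379");
    options.CommandRetryPolicy = CommandRetryPolicy.AlwaysRetryOnConnectionException; // assumed option
    using var muxer = await ConnectionMultiplexer.ConnectAsync(options);

    // From here, commands that fail because the connection dropped would be queued
    // and re-issued by the policy once the connection is re-established.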

deepakverma and others added 30 commits starting June 7, 2021, including:
• Allow SetWriteTime for sync messages as well
• Support for string-based IRetry configuration option
    {
        startProcessor = _queue.Count > 0;
    }
    if (startProcessor)
NickCraver (Collaborator, Author):
Can simplify to:

    if (_queue.Count == 0)
    {
        return;
    }

    private readonly int? _maxRetryQueueLength;
    private readonly bool _runRetryLoopAsync;

    internal MessageRetryQueue(IMessageRetryHelper messageRetryHelper, int? maxRetryQueueLength = null, bool runRetryLoopAsync = true)
NickCraver (Collaborator, Author):

@deepakverma Trying to figure out the scenario in which runRetryLoopAsync would be false (triggering a .Wait() below), can you shed more light on that scenario/path?

    {
        var task = Task.Run(ProcessRetryQueueAsync);
        if (task.IsFaulted)
            throw task.Exception;
NickCraver (Collaborator, Author), Sep 7, 2021:
This needs to be a thread kickoff, and we should probably track state internally.

Overall, we probably need state here (like the message backlog processor) that we interlock to change. This can help us know whether an existing thread is started/running (so we don't start multiple), and we can add that state to exception messages (like we do for the message backlog today). This means we'll need to expose the status on the interface in some way.
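
As a rough illustration of the interlocked-state + thread kickoff being described here, a minimal sketch (the class, names, and the ProcessRetryQueue stub are illustrative, not the actual implementation):

    using System.Threading;

    // Minimal sketch of a single-worker thread kickoff guarded by interlocked state.
    internal sealed class RetryQueueWorkerSketch
    {
        private const int NotRunning = 0, Running = 1;
        private int _state = NotRunning;

        // Could be exposed so exception messages can report whether a flush is in flight.
        internal bool IsProcessing => Volatile.Read(ref _state) == Running;

        internal void StartRetryWorkerIfNeeded()
        {
            // Atomically flip NotRunning -> Running so at most one worker thread starts.
            if (Interlocked.CompareExchange(ref _state, Running, NotRunning) != NotRunning)
                return; // a worker is already draining the queue

            var thread = new Thread(() =>
            {
                try { ProcessRetryQueue(); }
                finally { Interlocked.Exchange(ref _state, NotRunning); }
            })
            {
                IsBackground = true,
                Name = "Redis.MessageRetryQueue", // illustrative name
            };
            thread.Start();
        }

        private void ProcessRetryQueue() { /* drain and re-issue queued commands */ }
    }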

    /// <summary>
    /// Returns the current length of the retry queue.
    /// </summary>
    public abstract int CurrentQueueLength { get; }
NickCraver (Collaborator, Author):
For exception messages, we probably also want a CurrentlyProcessing (bool) here.
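
For instance, a minimal sketch of how that status might be attached to exception data (illustrative only; CurrentlyProcessing is the proposed bool and the interface shape is an assumption):

    using System;

    // Illustrative only: attach retry-queue status to an exception being built, similar
    // to how the message backlog count is surfaced today. The IRetryQueueStatus shape
    // and CurrentlyProcessing are assumptions, not existing members.
    internal interface IRetryQueueStatus
    {
        int CurrentQueueLength { get; }
        bool CurrentlyProcessing { get; }
    }

    internal static class RetryQueueExceptionHelper
    {
        internal static void AddRetryQueueStatus(Exception ex, IRetryQueueStatus queue)
        {
            ex.Data["retry-queue-length"] = queue.CurrentQueueLength;
            ex.Data["retry-queue-processing"] = queue.CurrentlyProcessing;
        }
    }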

NickCraver (Collaborator, Author) commented Sep 7, 2021

Had a sync call on this just now; wanted to summarize and get agreement, in no particular order:

  1. I think the changes in #1779 ("messages get sent out-of-order when a backlog is created by another thread") create an issue we failed to realize: the handshake itself will get queued behind other messages in the backlog. This means, especially in authenticated scenarios, we'll cycle failures by issuing commands before the AUTH is reached in the pipe.
  2. The Task-based retry flush is subject to the same problems as #1847 ("Message Backlog: Task pool exhaustion scenario with synchronous calls"), so we should switch to a Thread there.
  3. There's a big interaction question between how the retry policy works (at the multiplexer level) and how the per-bridge retry (via the backlog) works at the bridge level. As it stands, upon failure we'd retry all commands via the PhysicalBridge backlog upon reconnect, but importantly those come back with a status of successful. This means they won't be touched by the retry policy, which thinks they succeeded.
  4. The above is complicated, but ultimately the reason the backlog exists was AUTH command ordering in the handshake phase of a connection, i.e. to keep the configuration and handshake phases of a connection sequence in order. Instead, what if we removed the PhysicalBridge backlog completely?
    • This needs investigation into whether we can, but if all the retry mechanics lived only in the CommandRetryPolicy (and our built-in policy is effectively "always"), then there'd be no backlog at the bridge layer. Commands issued directly down there (e.g. handshakes) would fail, like we want, with no retry (also like we want).
    • Message would need to grow a reference to the Server it was queued for, if it was queued to a specific server (most things are not). If a Server is present, we'd want to retry on that specific server (with something like a `REPLICAOF` or `KEYS` command you really don't want to arbitrarily redirect); see the sketch after this list.
    • Overall, it'd be a huge simplification: system commands issued beneath the multiplexer would not be subject to retry or any of the related mechanics the user specified; e.g. WriteDirectOrQueueFireAndForgetAsync would not be subject to the policy or any backlogging at all.
      • However, I'm not sure this works reliably; it's possible we subtly depend on the backlog today to complete connections in ways we didn't fully reason about on the call. Need to go heads-down on this for a while and try things.
  5. The retry queue needs a bit more status exposed and added to the exception messages. Right now we wouldn't have any idea that there's a ton backlogged and waiting. To this end, we likely need at least a boolean status of whether it's trying to flush at the moment, and to add the existing pending count to the exception data.
  6. This is a big change, and if the above is correct, we've got a regression in the 2.x line. Since we know this is a big change to the default retry mechanics, the proposal is to start on the 3.x branch line and aim this PR at that, allowing us to do alpha releases alongside 2.x, and get out of the critical path. We'll want to hammer these new approaches well and I think their appropriate status is pre-release anyhow.
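
As a rough illustration of the server-affinity idea in point 4 (the sketch referenced above); everything here is a stand-in shape for discussion, not real StackExchange.Redis members:

    using System.Threading.Tasks;

    // Hypothetical sketch of point 4: a failed command remembers the server it was
    // originally queued for, so server-targeted commands (e.g. REPLICAOF, KEYS) are not
    // arbitrarily redirected on retry. All members here are stand-ins for illustration.
    internal sealed class FailedCommandSketch
    {
        private readonly object _message;        // stand-in for the library's Message
        private readonly object? _targetServer;  // stand-in for a server reference; null => any appropriate server

        public FailedCommandSketch(object message, object? targetServer = null)
        {
            _message = message;
            _targetServer = targetServer;
        }

        // On retry, pass along the original server (if any) so the reissuing code can
        // pin the command to it instead of picking a server arbitrarily.
        public Task RetryAsync(IReissuer reissuer) => reissuer.ReissueAsync(_message, _targetServer);
    }

    internal interface IReissuer
    {
        Task ReissueAsync(object message, object? server);
    }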

For number 1, if @TimLovellSmith agrees with the observations here, we should revert that change from the 2.x line and get a release out there with the few other misc fixes in queue. Can others please take a look at the above and provide thoughts? If we're agreed I'll start hacking at this on days off - if we're agreed on the regression we could get a new cut out there quickly and move the intended behavior into the 3.x line.

cc @mgravell @philon-msft @carldc @deepakverma

TimLovellSmith (Collaborator) commented Sep 8, 2021

@NickCraver

the handshake itself will get queued behind other message in the backlog. This means especially in authenticated scenarios we'll cycle failure by issuing commands before the AUTH is gotten to in the pipe.

This is interesting. Rather than agreeing we completely revert the out-of-order fix, I have a couple of other options I am wondering about:

A) Why not have a way to bypass the backlog ordering constraint completely, useful mainly for commands that are run internally to re-establish the connection and maintain connection state, which don't need to care about ordering relative to end-user commands...? This is much like the workaround I had in my original out-of-order-fix PR to address race conditions in connection establishment that were related to these internal commands and were causing test failures. You all might have to remind me what the main objections to that special-casing logic were again...

B) or... we could instead call it by design, that unauthenticated commands can fail for the sake of 'consistency' and 'security' when it seems that AUTH was sent AFTER the other command. (More of a 'least surprise' approach than real security perhaps - but I see possible benefits to avoiding things which LOOK like security bugs. Imagine hearing "my command was sent before auth and yet it succeeded, how?" all the time, and then having to distinguish that from real security issues...)

Sounds like you prefer A rather than B.
I think I'm OK with that in the case where we're talking about AUTH commands issued by the multiplexer based on automatic connection management from configuration -- but I can't help feeling a little doubt about whether this is also the behavior someone issuing their own AUTH commands manually would expect, if anyone does that.

TimLovellSmith (Collaborator):
A follow-on thought... it's sort of like how auto-retry is not really the right retry strategy for everything. Like what? Well, we already have this 'CanRetry' concept, which is an acknowledgement that some commands aren't suitable for auto-retrying, especially internal connection-state-management commands.

        internal bool CanRetry(Message message, Exception ex)
        {
            if ((message.Flags & CommandFlags.NoRetry) != 0
                || ((message.Flags & CommandFlags.RetryIfNotSent) != 0 && message.Status == CommandStatus.Sent)
                || message.IsAdmin
                || message.IsInternalCall
                || !(ex is RedisException)
                || !ShouldRetry(message.Status))
            {
                return false;
            }

            return true;
        }

You might or might not believe that internal management commands would like to have their order preserved relative to each other.... I wonder which?

Does this call for separate backlogs?

NickCraver (Collaborator, Author):
@TimLovellSmith I think if we remove the message backlog at the physical bridge layer altogether, the handshake bits simply won't be subject to the retry policy, since they're lower and writing to the pipe further down. I think this will be okay because they're in order and we're now awaiting each command as it's written down (whereas before we had the races). So instead of separate backlogs, connection handshake stuff just isn't backlogged at all; it goes direct. When it completes, the CommandRetryPolicy should kick in and begin flushing. Does that approach make sense?
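
A minimal sketch of the flow being described (the class and method names below are stand-ins for illustration, not the actual PhysicalBridge/ConnectionMultiplexer internals):

    using System.Threading.Tasks;

    // Sketch only: "handshake goes direct to the pipe; the retry policy flushes after".
    internal sealed class ReconnectFlowSketch
    {
        public async Task OnReconnectedAsync()
        {
            // Handshake commands (AUTH, CLIENT, SELECT, re-subscribes, ...) are written
            // straight down, in order, awaiting each write; no backlog, no retry policy.
            await WriteHandshakeDirectAsync();

            // Only once the handshake has completed does the retry policy start flushing
            // the user commands it captured while the connection was down.
            StartRetryPolicyFlush();
        }

        private Task WriteHandshakeDirectAsync() => Task.CompletedTask; // stand-in
        private void StartRetryPolicyFlush() { }                        // stand-in
    }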

TimLovellSmith (Collaborator):
@NickCraver I wonder if I am understanding it. Is the proposal to remove the backlog altogether? Or is it just to bypass the backlog and retry stuff for handshakes?

My understanding is that a fairly important function of the backlog is also to reduce lock contention for writers by ensuring that async writes don't have to block to acquire the lock on the socket. I.e. performance optimization.

NickCraver (Collaborator, Author):
Is the proposal to remove the backlog altogether?

Yep, this!

My understanding is that a fairly important function of the backlog is also to reduce lock contention for writers by ensuring that async writes don't have to block to acquire the lock on the socket. I.e. performance optimization.

Hmmm I wouldn't say that's a critical function of the backlog - it was originally created to order AUTH correctly in the handshake (which as you helped me debug, wasn't even quite right). If we're backlogged or retrying at all, the amount of time we're spending flushing and sending to Redis should completely dwarf backlog contention - if it's not dwarfing it, we probably have a spiral of death situation with anyone running that hot after a disconnect.

There's still the question of what order things go from callers down to the socket when there is a retry queue: either we disregard ordering completely and let them have it out on the lock, or we buffer into the queue similar to the backlog today. Thoughts?

TimLovellSmith (Collaborator):
The problem I'm thinking of is indeed with the apps that run very hot. Normally they would be fine (looks good enough to put into prod...), but a couple of times in the past I have seen clues in dumps that when usage suddenly spikes, lock contention on the write lock becomes one of their big problems/contributing factors.
The backlog (theoretically) acts as a kind of pressure release / catch-up optimization: once triggered, it basically just has to acquire the lock once to write out all the queued items and be all caught up on writes again, without lots of synchronization or switching overhead.
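
A minimal sketch of that pressure-release behavior, assuming a write lock and a queue of pending writes (illustrative, not the actual PhysicalBridge code):

    using System.Collections.Concurrent;
    using System.Threading;

    // Illustrative: drain the whole backlog under a single lock acquisition, so a burst
    // of queued writes is flushed without per-message contention on the write lock.
    internal sealed class BacklogDrainSketch
    {
        private readonly object _writeLock = new();
        private readonly ConcurrentQueue<byte[]> _backlog = new(); // stand-in for queued messages

        public void Enqueue(byte[] payload) => _backlog.Enqueue(payload);

        public void Drain()
        {
            // If another writer already holds the lock, skip; the real code is more
            // careful about the race where items land just as the holder finishes.
            if (!Monitor.TryEnter(_writeLock)) return;
            try
            {
                // One lock acquisition covers the whole burst of pending writes.
                while (_backlog.TryDequeue(out var payload))
                {
                    WriteToSocket(payload);
                }
            }
            finally
            {
                Monitor.Exit(_writeLock);
            }
        }

        private static void WriteToSocket(byte[] payload) { /* stand-in for the real write */ }
    }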

TimLovellSmith (Collaborator):
PS I am not really sure whether AutoRetry is intended to break guarantees on command ordering...

Guarantees might be achievable, but only by saying there's one backlog for everything, everything stays in it until successfully retried, and so no commands can get sent until all commands in front of them already are known to have succeeded, which is probably going to do terrible things to throughput.

So I kinda assume AutoRetry breaks ordering guarantees.

NickCraver (Collaborator, Author):
If we think that's an issue (and I agree it is if you're seeing it in those logs), we could perhaps provide a similar mechanism for the retry policy to take an exclusive lock during the flush. Do you think that'd give us the best of both worlds?

TimLovellSmith (Collaborator):
I believe the part which is the beneficial optimization is the part where lock contention is avoided for ordinary writes, and I don't know any good way to keep that AND have async APIs, and have no reordering bugs... except a backlog.

So maybe we can reconsider the problem from scratch? Is the problem that AUTH commands get queued behind other new writes that are incoming while we are reconnecting? If that is it, that seems like it was probably always a problem with backlog - regardless of autoretry.

Another way of explaining the same problem: we actually DON'T want consistent ordering of messages, or auto-retry, for things like AUTH commands while we are reconnecting. We want these commands to be special and sent immediately.

The proposed AutoRetry already treats such commands specially. By which I mean it doesn't automatically retry them. My question is therefore why not have the PhysicalBridge treat these special commands specially too? Just have them be completely opted out of consistent delivery order and bypass the backlog, and the problem is solved, isn't it?
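
Roughly what that bypass might look like; a sketch only (the write/backlog helpers and SketchMessage are stand-ins, while IsInternalCall mirrors the existing flag referenced in CanRetry above):

    using System.Threading.Tasks;

    // Sketch of option A: internal connection-management commands (AUTH, handshake PING,
    // etc.) skip the backlog and go straight to the pipe; user commands keep the ordered
    // backlog path. All members here are stand-ins for illustration.
    internal sealed class BridgeWriteSketch
    {
        public Task WriteMessageAsync(SketchMessage message)
        {
            if (message.IsInternalCall)
            {
                // Bypass: no ordering relative to user commands, and no auto-retry either.
                return WriteDirectAsync(message);
            }

            EnqueueBacklog(message);       // stays ordered relative to other user commands
            return ProcessBacklogAsync();  // flushed in order once the writer can run
        }

        private Task WriteDirectAsync(SketchMessage message) => Task.CompletedTask; // stand-in
        private void EnqueueBacklog(SketchMessage message) { }                      // stand-in
        private Task ProcessBacklogAsync() => Task.CompletedTask;                   // stand-in
    }

    internal sealed class SketchMessage
    {
        public bool IsInternalCall { get; init; }
    }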

NickCraver (Collaborator, Author):
I believe the part which is the beneficial optimization is the part where lock contention is avoided for ordinary writes, and I don't know any good way to keep that AND have async APIs, and have no reordering bugs... except a backlog.

I think this is where we may be talking past each other. The idea in my mind is that the CommandRetryPolicy is the backlog... the only backlog, in this proposal. Otherwise, we basically have two levels of backlog, which is hard to even reason about.

NickCraver added a commit that referenced this pull request Sep 10, 2021
See #1856 (comment) for a discussion here.

Note: this DOES NOT remove the PhysicalBridge backlog queue...because there are more issues discovered there. Will write up on GitHub.
NickCraver (Collaborator, Author):

@TimLovellSmith @mgravell I'm getting changes for the above in #1857, but note I have NOT removed the PhysicalBridge backlog, because there's even more fun there digging in. Any chance we could get together early next week and talk through? Changes thus far are cleanup, simplification, status additions, Thread-based flushing, additional tests, and a move to FailedCommand for users to more easily create policies here.

NickCraver (Collaborator, Author):
Closing this in prep for the #1864 approach of a triple queue.

NickCraver closed this Oct 21, 2021
NickCraver deleted the command-retry branch March 7, 2022