
DocumentManager.GetInternalAsync cannot handle Redis server maintenance #10197

Closed
ShaneCourtrille opened this issue Aug 31, 2021 · 18 comments · Fixed by #10338

@ShaneCourtrille
Contributor

ShaneCourtrille commented Aug 31, 2021

We are using Azure Redis Cache Standard Tier as our distributed cache and have noticed OrchardCore requests failing during the maintenance window.

After some investigation, the problem appears to be in the DocumentManager.GetInternalAsync method. The entry => {..} delegate passed to _memoryCache.GetOrCreateAsync has no error handling or retry logic, so when we get a RedisTimeoutException the entire request fails.

An easy solution might be to return a null id if that inner delegate fails. The code would then fall back to the document store to get the document (and could potentially fail again when trying to store the id in the distributed cache, so that's another failure point to handle).
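A minimal sketch of that fallback, assuming a DocumentManager-like class holding an IMemoryCache, an IDistributedCache and an ILogger (the class, field and method names are illustrative, not the actual OrchardCore code):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Logging;

// Illustrative only: not the actual DocumentManager implementation.
public class DocumentIdReader
{
    private readonly IMemoryCache _memoryCache;
    private readonly IDistributedCache _distributedCache;
    private readonly ILogger _logger;

    public DocumentIdReader(IMemoryCache memoryCache, IDistributedCache distributedCache, ILogger<DocumentIdReader> logger)
    {
        _memoryCache = memoryCache;
        _distributedCache = distributedCache;
        _logger = logger;
    }

    public Task<byte[]> GetIdAsync(string key) => _memoryCache.GetOrCreateAsync(key, async entry =>
    {
        try
        {
            // Read the identifier from the distributed (Redis) cache.
            return await _distributedCache.GetAsync(key);
        }
        catch (Exception e)
        {
            // During a Redis maintenance window this is typically a RedisTimeoutException.
            // Returning null lets the caller fall back to the document store instead of
            // failing the whole request.
            _logger.LogWarning(e, "Failed to read the document id '{Key}' from the distributed cache.", key);
            return null;
        }
    });
}
```

Note that GetOrCreateAsync caches whatever the factory returns, so a real change would also need to make sure the null result isn't kept in the memory cache (or is evicted quickly), otherwise the fallback would stick until the entry expires.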

@sebastienros sebastienros added this to the 1.0.x milestone Sep 2, 2021
@sebastienros
Member

Agree with the suggestion. Could you provide a PR?

@ShaneCourtrille
Contributor Author

@sebastienros Yes, I can; working on it now.

@ShaneCourtrille
Contributor Author

@sebastienros If _options.CheckConsistency is true, should it fail fast so that the consistency code in SetInternalAsync (which removes the document when the stored and document identifiers don't match) keeps working?

@jtkech
Member

jtkech commented Sep 5, 2021

@ShaneCourtrille

For info, CheckConsistency is true by default.

Hmm, I would need to look at it again, but I think it's okay to fail fast in SetInternalAsync() and only use the retry logic in GetInternalAsync(). Then, if the retry logic in GetInternalAsync() fails to read the Redis cache, we should throw rather than return null, so that we don't go on to call SetInternalAsync() with a newly built document.
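As a minimal sketch of that split, the read could be retried a few times with an incremental delay and then rethrown, so the caller fails fast rather than building and storing a new document. The helper name, retry count and delays below are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public static class DistributedCacheRetry
{
    // Retries a distributed-cache read with an incremental delay; on the last
    // attempt the exception propagates so the caller can fail fast.
    public static async Task<byte[]> GetWithRetriesAsync(IDistributedCache cache, string key, int maxAttempts = 3)
    {
        var delay = TimeSpan.FromMilliseconds(100);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await cache.GetAsync(key);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Wait a bit longer before each new attempt.
                await Task.Delay(delay);
                delay += delay;
            }
        }
    }
}
```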

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 7, 2021

If the cause was a problem with Redis, it's extremely unlikely that SetInternalAsync() will succeed after GetInternalAsync() has failed. I can make the change anyway for the times when Get fails, but that only helps if it's not a Redis failure (or only a very brief one).

I'm still investigating if there is a way to make the system more reliable during Redis maintenance windows.

@ShaneCourtrille
Contributor Author

@jtkech I'm actually not sure I see the value of the consistency check in SetInternalAsync() specifically when it is called from GetOrCreateImmutableAsync(), since there is very little time between the first call to the document store to get the document and the second one in SetInternalAsync().

Is there really a reasonable chance of a scenario where you get the document from the store and then, when SetInternalAsync() is called, you get a different document? Without being able to try/catch/ignore a Redis exception during the consistency check, I don't see a way to make DocumentManager work during a Redis maintenance window.

@jtkech
Member

jtkech commented Sep 7, 2021

Yes, Redis is transactional in itself and so is the database, but the whole operation is not atomic. That's why we check an identifier, stored under a key and also embedded in the document, to keep the shared cache and the store in sync.

Hmm, maybe a scoped flag saying whether we are disconnected; if disconnected, don't try to read from or write to the Redis cache, just invalidate the local memory cache and only work with the data store.

In our RedisLock implementation we already use a kind of retry logic with an incremental delay; it may be worth looking at.

How long does the maintenance take?

It makes sense to have retry logic for the RedisLock, as the lock itself needs Redis, but I'm not sure about it when reading data from Redis, since the data is also available in the shared store. So maybe not retry logic, but just a fallback to the data store when disconnected in a given scope (plus maybe a warning/error log). Then maybe (not sure it is worth it) a global thread-safe counter (maybe still per document type) that we increment on error and reset/decrement on a successful read, failing as before if it reaches a max count; or just log each failing read.

I will look at it ASAP this week or this weekend.
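A rough sketch of that thread-safe counter idea, assuming a simple shared breaker (the type name and threshold are hypothetical, not an existing OrchardCore API):

```csharp
using System.Threading;

public static class DistributedCacheBreaker
{
    // Hypothetical threshold: after this many consecutive failures, callers stop
    // hitting the distributed cache and work against the data store only.
    private const int FailureThreshold = 3;
    private static int _failureCount;

    public static bool IsOpen => Volatile.Read(ref _failureCount) >= FailureThreshold;

    // Called when a distributed-cache read or write throws.
    public static void MarkFailure() => Interlocked.Increment(ref _failureCount);

    // A successful read resets the counter and closes the breaker again.
    public static void MarkSuccess() => Interlocked.Exchange(ref _failureCount, 0);
}
```

Callers would check IsOpen before touching the distributed cache, call MarkFailure in their catch block and MarkSuccess after a successful read; a per-document-type variant would simply keep one counter per type.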

@ShaneCourtrille
Contributor Author

We are seeing a few minutes on Azure, though the official documentation says failovers should typically finish within 10-15 seconds. A quick read of the Redis Sentinel docs indicates the same time range.

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 9, 2021

@jtkech Do you think there is value in being able to differentiate between a timeout exception, a connection exception and any other exception? This makes things a bit more 'interesting', since OrchardCore.Documents does not have a StackExchange.Redis reference, so you cannot catch the specific exception types.

I've gone back and forth on the value of it, but really, if there were a circuit breaker in place, I don't see any difference between timeouts and connection failures (and I've never seen any other exception from the library). I think your circuit breaker idea has a lot of value based on what we've been seeing in performance tests, as the system definitely starts to experience major issues once Redis timeouts appear.
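If differentiating does turn out to matter, one hypothetical workaround for the missing package reference is to match on the exception's runtime type name rather than its type, for example:

```csharp
using System;

public static class RedisExceptionHelper
{
    // Matches by name because OrchardCore.Documents has no StackExchange.Redis reference
    // and therefore cannot catch RedisTimeoutException/RedisConnectionException directly.
    public static bool IsRedisFailure(Exception e) =>
        e.GetType().Name is "RedisTimeoutException" or "RedisConnectionException";
}
```

It could then be used in an exception filter such as `catch (Exception e) when (RedisExceptionHelper.IsRedisFailure(e))`.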

@jtkech
Member

jtkech commented Sep 10, 2021

Yes, we could have more explicit catches by having our own RedisCache (or at least a wrapper). We have our own RedisLock, but for caching we just use the aspnetcore RedisCache (StackExchangeRedis namespace): https://github.com/dotnet/aspnetcore/blob/e23fd047b4baf3480ce93d7643a897732f960557/src/Caching/StackExchangeRedis/src/RedisCache.cs#L17.

Note: as I remember, there is already retry logic, at least for connecting, and also for reconnecting (maybe as an option).

Hmm, I think that on our side we just need to try/catch any exception and log an error/warning with a more generic message saying that we failed to read from the distributed cache. If we lose the more specific error that way, we can just include the exception message in that log entry for more detailed info.
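A hedged sketch of the wrapper idea, as a hypothetical decorator around the registered IDistributedCache (not an existing OrchardCore or aspnetcore type): read failures become a generic warning with the original exception attached, while the other members simply forward to the inner cache:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;

// Hypothetical decorator: wraps the configured cache (e.g. the aspnetcore RedisCache).
public class LoggingDistributedCache : IDistributedCache
{
    private readonly IDistributedCache _inner;
    private readonly ILogger _logger;

    public LoggingDistributedCache(IDistributedCache inner, ILogger<LoggingDistributedCache> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    public async Task<byte[]> GetAsync(string key, CancellationToken token = default)
    {
        try
        {
            return await _inner.GetAsync(key, token);
        }
        catch (Exception e)
        {
            // Generic message; the attached exception carries the Redis-specific details.
            _logger.LogWarning(e, "Failed to read '{Key}' from the distributed cache.", key);
            return null;
        }
    }

    // The remaining members forward unchanged here; a real wrapper would decide per
    // member whether to swallow, retry or rethrow.
    public byte[] Get(string key) => _inner.Get(key);
    public void Set(string key, byte[] value, DistributedCacheEntryOptions options) => _inner.Set(key, value, options);
    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default) => _inner.SetAsync(key, value, options, token);
    public void Refresh(string key) => _inner.Refresh(key);
    public Task RefreshAsync(string key, CancellationToken token = default) => _inner.RefreshAsync(key, token);
    public void Remove(string key) => _inner.Remove(key);
    public Task RemoveAsync(string key, CancellationToken token = default) => _inner.RemoveAsync(key, token);
}
```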

@ShaneCourtrille
Contributor Author

You are correct about there being retry logic on the connection side of things. I've got things working now, but the impact on performance is huge, as currently each request takes the hit for the connection timeout. I think the global thread-safe counter is going to be important, and I'm looking at that now.

@ShaneCourtrille
Contributor Author

I was able to get things working fairly nicely, but I'm not happy with the amount of clutter that gets added to DocumentManager with the thread-safe cache-bypass/retry logic in place. I think adding our own RedisCache makes more sense, but I won't have the time available for that for a week or two.

@jtkech
Member

jtkech commented Sep 16, 2021

@ShaneCourtrille

Thanks for working on this

Feel free to open a PR even if it is not ready

Then I will help if I have time

@ShaneCourtrille
Contributor Author

@jtkech Done #10297

@sebastienros
Member

Maybe this PR will fix all the issues:
StackExchange/StackExchange.Redis#1856

@ShaneCourtrille
Contributor Author

@sebastienros Sadly it won't resolve the problem for us. I just found this issue today, and it matches what we're seeing: StackExchange/StackExchange.Redis#1848. It looks like it's a won't-fix, and the workaround doesn't work for us since we're on Docker.

The ForceReconnect idea has value, but as mentioned in their issue, the original implementation ignores timeout exceptions, and in this scenario that's exactly what we end up seeing. It would take a bit more logic, and I'm not sure I'd be the best person to come up with it, since our usage of OC is likely quite different than others'.

@jtkech
Member

jtkech commented Sep 23, 2021

Interesting. Hmm, but maybe not incompatible with the breaker pattern suggested here, so that during a temporary outage we only read from the database for a certain amount of time, instead of entering a retry loop too many times.

@ShaneCourtrille
Contributor Author

@jtkech Yeah, I was thinking the implementation from PR #10313 still makes sense as a generic breaker. This more complex, Redis-specific issue could be handled separately in the future.
