
DocumentManager.GetInternalAsync cannot handle Redis server maintenance #10197

Closed
ShaneCourtrille opened this issue Aug 31, 2021 · 18 comments · Fixed by #10338

@ShaneCourtrille
Contributor

ShaneCourtrille commented Aug 31, 2021

We are using Azure Redis Cache Standard Tier as our distributed cache and have noticed OrchardCore requests failing during the maintenance window.

After some investigation, the problem appears to be in the DocumentManager.GetInternalAsync method. The entry => {..} delegate passed to _memoryCache.GetOrCreateAsync has no error handling or retry logic, so when we get a RedisTimeoutException the entire request fails.

An easy solution might be to return a null id if that inner delegate fails. The code would then fall back to the document store to get the document (and could potentially fail again when trying to store the id in the distributed cache, so that's another failure point to handle).
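A minimal sketch of that fallback, assuming a DocumentManager-like class holding an IMemoryCache, an IDistributedCache and an ILogger (the class, field and method names are illustrative, not the actual OrchardCore code):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Logging;

// Illustrative only: not the actual DocumentManager implementation.
public class DocumentIdReader
{
    private readonly IMemoryCache _memoryCache;
    private readonly IDistributedCache _distributedCache;
    private readonly ILogger _logger;

    public DocumentIdReader(IMemoryCache memoryCache, IDistributedCache distributedCache, ILogger<DocumentIdReader> logger)
    {
        _memoryCache = memoryCache;
        _distributedCache = distributedCache;
        _logger = logger;
    }

    public Task<byte[]> GetIdAsync(string key) => _memoryCache.GetOrCreateAsync(key, async entry =>
    {
        try
        {
            // Read the identifier from the distributed (Redis) cache.
            return await _distributedCache.GetAsync(key);
        }
        catch (Exception e)
        {
            // During a Redis maintenance window this is typically a RedisTimeoutException.
            // Returning null lets the caller fall back to the document store instead of
            // failing the whole request.
            _logger.LogWarning(e, "Failed to read the document id '{Key}' from the distributed cache.", key);
            return null;
        }
    });
}
```

Note that GetOrCreateAsync caches whatever the factory returns, so a real change would also need to make sure the null result isn't kept in the memory cache (or is evicted quickly), otherwise the fallback would stick until the entry expires.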

@sebastienros sebastienros added this to the 1.0.x milestone Sep 2, 2021
@sebastienros
Member

Agree with the suggestion. Could you provide a PR?

@ShaneCourtrille
Contributor Author

@sebastienros Yes, I can; working on it now.

@ShaneCourtrille
Contributor Author

@sebastienros If _options.CheckConsistency is true, should it fail fast so that the consistency code in SetInternalAsync (which removes the document when the stored and document identifiers don't match) keeps working?

@jtkech
Member

jtkech commented Sep 5, 2021

@ShaneCourtrille

For info, CheckConsistency is true by default.

Hmm, I would need to look at it again, but I think it's okay to fail fast in SetInternalAsync() and only use the retry logic in GetInternalAsync(). Then, if the retry logic in GetInternalAsync() fails to read the Redis cache, we should throw rather than return null, so that we don't go on to call SetInternalAsync() with a newly built document.
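As a minimal sketch of that split, the read could be retried a few times with an incremental delay and then rethrown, so the caller fails fast rather than building and storing a new document. The helper name, retry count and delays below are hypothetical:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public static class DistributedCacheRetry
{
    // Retries a distributed-cache read with an incremental delay; on the last
    // attempt the exception propagates so the caller can fail fast.
    public static async Task<byte[]> GetWithRetriesAsync(IDistributedCache cache, string key, int maxAttempts = 3)
    {
        var delay = TimeSpan.FromMilliseconds(100);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await cache.GetAsync(key);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Wait a bit longer before each new attempt.
                await Task.Delay(delay);
                delay += delay;
            }
        }
    }
}
```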

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 7, 2021

If the cause was a problem with Redis, it's extremely unlikely that SetInternalAsync() will succeed after GetInternalAsync() has failed. I can make the change anyway for the times when Get fails, but that only helps if it's not a Redis failure (or only a very brief one).

I'm still investigating if there is a way to make the system more reliable during Redis maintenance windows.

@ShaneCourtrille
Contributor Author

@jtkech I'm actually not sure I see the value of the consistency check in SetInternalAsync() specifically when it is called from GetOrCreateImmutableAsync(), since there is very little time between the first call to the document store to get the document and the second one in SetInternalAsync().

Is there really a reasonable chance of a scenario where you get the document from the store and then, when SetInternalAsync() is called, you get a different document? Without being able to try/catch/ignore a Redis exception during the consistency check, I don't see a way to make DocumentManager work during a Redis maintenance window.

@jtkech
Member

jtkech commented Sep 7, 2021

Yes, Redis is transactional in itself and so is the database, but the whole operation is not atomic. That's why we check an identifier, stored under a key and also embedded in the document, to keep the shared cache and the store in sync.

Hmm, maybe a scoped flag saying whether we are disconnected; if disconnected, don't try to read from or write to the Redis cache, just invalidate the local memory cache and only work with the data store.

In our RedisLock implementation we already use a kind of retry logic with an incremental delay; it may be worth looking at.

How long does the maintenance take?

It makes sense to have retry logic for the RedisLock, as the lock itself needs Redis, but I'm not sure about it when reading data from Redis, since the data is also available in the shared store. So maybe not retry logic, but just a fallback to the data store when disconnected in a given scope (plus maybe a warning/error log). Then maybe (not sure it is worth it) a global thread-safe counter (maybe still per document type) that we increment on error and reset/decrement on a successful read, failing as before if it reaches a max count; or just log each failing read.

I will look at it ASAP this week or this weekend.
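A rough sketch of that thread-safe counter idea, assuming a simple shared breaker (the type name and threshold are hypothetical, not an existing OrchardCore API):

```csharp
using System.Threading;

public static class DistributedCacheBreaker
{
    // Hypothetical threshold: after this many consecutive failures, callers stop
    // hitting the distributed cache and work against the data store only.
    private const int FailureThreshold = 3;
    private static int _failureCount;

    public static bool IsOpen => Volatile.Read(ref _failureCount) >= FailureThreshold;

    // Called when a distributed-cache read or write throws.
    public static void MarkFailure() => Interlocked.Increment(ref _failureCount);

    // A successful read resets the counter and closes the breaker again.
    public static void MarkSuccess() => Interlocked.Exchange(ref _failureCount, 0);
}
```

Callers would check IsOpen before touching the distributed cache, call MarkFailure in their catch block and MarkSuccess after a successful read; a per-document-type variant would simply keep one counter per type.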

@ShaneCourtrille
Contributor Author

We are seeing a few minutes on Azure, though the official documentation says failovers should typically finish within 10-15 seconds. A quick read of the Redis Sentinel docs indicates the same time range.

@ShaneCourtrille
Contributor Author

ShaneCourtrille commented Sep 9, 2021

@jtkech Do you think there is value in being able to differentiate between a timeout exception, a connection exception and any other exception? This makes things a bit more 'interesting', since OrchardCore.Documents does not have a StackExchange.Redis reference, so you cannot catch the specific exception types.

I've gone back and forth on the value of it, but really, if there were a circuit breaker in place, I don't see any difference between timeouts and connection failures (and I've never seen any other exception from the library). I think your circuit breaker idea has a lot of value based on what we've been seeing in performance tests, as the system definitely starts to experience major issues once Redis timeouts appear.
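If differentiating does turn out to matter, one hypothetical workaround for the missing package reference is to match on the exception's runtime type name rather than its type, for example:

```csharp
using System;

public static class RedisExceptionHelper
{
    // Matches by name because OrchardCore.Documents has no StackExchange.Redis reference
    // and therefore cannot catch RedisTimeoutException/RedisConnectionException directly.
    public static bool IsRedisFailure(Exception e) =>
        e.GetType().Name is "RedisTimeoutException" or "RedisConnectionException";
}
```

It could then be used in an exception filter such as `catch (Exception e) when (RedisExceptionHelper.IsRedisFailure(e))`.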

@jtkech
Member

jtkech commented Sep 10, 2021

Yes, we could have more explicit catches by having our own RedisCache (or at least a wrapper). We have our own RedisLock, but for caching we just use the aspnetcore RedisCache (StackExchangeRedis namespace): https://github.com/dotnet/aspnetcore/blob/e23fd047b4baf3480ce93d7643a897732f960557/src/Caching/StackExchangeRedis/src/RedisCache.cs#L17.

Note: as I remember, there is already retry logic, at least for connecting, and also for reconnecting (maybe as an option).

Hmm, I think that on our side we just need to try/catch any exception and log an error/warning with a more generic message saying that we failed to read from the distributed cache. If we lose the more specific error that way, we can just include the exception message in that log entry for more detailed info.
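A hedged sketch of the wrapper idea, as a hypothetical decorator around the registered IDistributedCache (not an existing OrchardCore or aspnetcore type): read failures become a generic warning with the original exception attached, while the other members simply forward to the inner cache:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;

// Hypothetical decorator: wraps the configured cache (e.g. the aspnetcore RedisCache).
public class LoggingDistributedCache : IDistributedCache
{
    private readonly IDistributedCache _inner;
    private readonly ILogger _logger;

    public LoggingDistributedCache(IDistributedCache inner, ILogger<LoggingDistributedCache> logger)
    {
        _inner = inner;
        _logger = logger;
    }

    public async Task<byte[]> GetAsync(string key, CancellationToken token = default)
    {
        try
        {
            return await _inner.GetAsync(key, token);
        }
        catch (Exception e)
        {
            // Generic message; the attached exception carries the Redis-specific details.
            _logger.LogWarning(e, "Failed to read '{Key}' from the distributed cache.", key);
            return null;
        }
    }

    // The remaining members forward unchanged here; a real wrapper would decide per
    // member whether to swallow, retry or rethrow.
    public byte[] Get(string key) => _inner.Get(key);
    public void Set(string key, byte[] value, DistributedCacheEntryOptions options) => _inner.Set(key, value, options);
    public Task SetAsync(string key, byte[] value, DistributedCacheEntryOptions options, CancellationToken token = default) => _inner.SetAsync(key, value, options, token);
    public void Refresh(string key) => _inner.Refresh(key);
    public Task RefreshAsync(string key, CancellationToken token = default) => _inner.RefreshAsync(key, token);
    public void Remove(string key) => _inner.Remove(key);
    public Task RemoveAsync(string key, CancellationToken token = default) => _inner.RemoveAsync(key, token);
}
```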

@ShaneCourtrille
Contributor Author

You are correct about there being retry logic on the connection side of things. I've got things working now, but the impact on performance is huge, as currently each request takes the hit for the connection timeout. I think the global thread-safe counter is going to be important, and I'm looking at that now.

@ShaneCourtrille
Contributor Author

I was able to get things working fairly nicely, but I'm not happy with the amount of clutter that gets added to DocumentManager with the thread-safe cache-bypass/retry logic in place. I think adding our own RedisCache makes more sense, but I won't have the time available for that for a week or two.

@jtkech
Member

jtkech commented Sep 16, 2021

@ShaneCourtrille

Thanks for working on this

Feel free to open a PR even if it is not ready

Then I will help if I have time

@ShaneCourtrille
Contributor Author

@jtkech Done #10297

@sebastienros
Member

Maybe this PR will fix all the issues:
StackExchange/StackExchange.Redis#1856

@ShaneCourtrille
Contributor Author

@sebastienros Sadly it won't resolve the problem for us. I just found this issue today, and it matches what we're seeing: StackExchange/StackExchange.Redis#1848. It looks like it's a won't-fix, and the workaround doesn't work for us since we're on Docker.

The ForceReconnect idea has value, but as mentioned in their issue, the original implementation ignores timeout exceptions, and in this scenario that's exactly what we end up seeing. It would take a bit more logic, and I'm not sure I'd be the best person to come up with it, since our usage of OC is likely quite different than others'.

@jtkech
Member

jtkech commented Sep 23, 2021

Interesting. Hmm, but maybe not incompatible with the breaker pattern suggested here, so that during a temporary outage we only read from the database for a certain amount of time, instead of entering a retry loop too many times.

@ShaneCourtrille
Contributor Author

@jtkech Yeah, I was thinking the implementation from PR #10313 still makes sense as a generic breaker. This more complex, Redis-specific issue could be handled separately in the future.
