DocumentManager.GetInternalAsync cannot handle Redis server maintenance #10197
We are using Azure Redis Cache Standard Tier as our distributed cache and have noticed OrchardCore requests failing during the maintenance window.

After some investigation, the problem appears to be the DocumentManager.GetInternalAsync method: the entry => { ... } factory passed to _memoryCache.GetOrCreateAsync has no error handling or retry logic, so when a RedisTimeoutException is thrown the entire request fails.

It looks like an easy solution would be to return a null id if that inner read fails. The method would then just go to the document store to get the document (and could potentially fail again when trying to store the id in the distributed cache, so that is another failure point to handle).
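For illustration, a minimal sketch of that proposed fallback, assuming simplified class, field, and method names rather than the actual DocumentManager code: the factory swallows the distributed-cache failure and returns a null id, so the caller treats it as a cache miss and loads the document from the store.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;

public class DocumentIdReader
{
    private readonly IMemoryCache _memoryCache;
    private readonly IDistributedCache _distributedCache;

    public DocumentIdReader(IMemoryCache memoryCache, IDistributedCache distributedCache)
    {
        _memoryCache = memoryCache;
        _distributedCache = distributedCache;
    }

    // Returns the cached document id, or null when the distributed cache is
    // unreachable, so the caller can fall back to the document store instead
    // of failing the whole request.
    public Task<byte[]> GetIdAsync(string idKey) =>
        _memoryCache.GetOrCreateAsync(idKey, async entry =>
        {
            try
            {
                return await _distributedCache.GetAsync(idKey);
            }
            catch (Exception)
            {
                // e.g. a RedisTimeoutException during a maintenance window.
                // Note: the null result is cached in memory like any other value.
                return null;
            }
        });
}
```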
Agree with the suggestion. Could you provide a PR?
@sebastienros Yes, I can; working on it now.
@sebastienros If _options.CheckConsistency is true, should it fail fast so that the consistency code in SetInternalAsync, which removes the document when the stored and embedded identifiers don't match, keeps working?
For info: hmm, I would need to focus on it again, but I think it's okay to fail fast there.
It's extremely unlikely that SetInternalAsync() will work after GetInternalAsync() fails if the cause was a problem with Redis. I can make the change anyway for the cases where Get fails, but that only helps if it's not a Redis failure (or only a very short-lived one). I'm still investigating whether there is a way to make the system more reliable during Redis maintenance windows.
@jtkech I'm actually not sure I see the value of the consistency check in SetInternalAsync() specifically when it's called from GetOrCreateImmutableAsync(), since there is very little time between the first call to the document store to get the document and the second one in SetInternalAsync. Is there really a reasonable chance of getting the document from the store and then seeing a different document by the time SetInternalAsync is called? Without being able to try/catch/ignore a Redis exception during the consistency check, I don't see a way to make the DocumentManager work during a Redis maintenance window.
Yes, Redis is transactional in its own way and so is the database, but the whole operation is not atomic. That's why we check an identifier stored under a key whose value is also embedded in the document, to keep the shared cache and the store in sync.

Hmm, maybe a scoped flag saying whether we are disconnected, and if disconnected, don't try to read from or write to the Redis cache: just invalidate the local memory cache and work only with the data store. In our RedisLock implementation we already use a kind of retry logic with an incremental delay; it may be worth looking at. How long does the maintenance take? Retry logic makes sense for the RedisLock because the lock itself needs Redis, but I'm not sure it does when reading data from Redis, since the data is also available in the shared store.

So maybe not retry logic, just a fallback to the data store when disconnected in a given scope (plus maybe a log warning or error). Then maybe (not sure it's worth it) a global thread-safe counter (maybe still per document type) that we increment on error and reset or decrement on a successful read; it would fail as before once it reaches a max count. Or just log each failing read. I will look at it ASAP this week or this weekend.
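A rough sketch of that thread-safe counter idea; the class name and the max of 5 consecutive failures are made up for illustration, not actual OrchardCore code.

```csharp
using System.Threading;

public class DistributedCacheGate
{
    private const int MaxFailures = 5;
    private int _failureCount;

    // While true, callers still try the Redis cache; once the max is
    // reached they stop masking the error (or skip the cache), as
    // described above.
    public bool IsOpen => Volatile.Read(ref _failureCount) < MaxFailures;

    // Reset on a successful read (a decrement would also work).
    public void RecordSuccess() => Interlocked.Exchange(ref _failureCount, 0);

    // Increment on each failed Redis read.
    public void RecordFailure() => Interlocked.Increment(ref _failureCount);
}
```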
We are seeing a few minutes on Azure, though the official documentation says failovers should typically finish within 10-15 seconds. A quick read of the Redis Sentinel docs indicates the same time range.
@jtkech Do you think there is value in being able to differentiate between a timeout exception, a connection exception, and any other exception? This makes things a bit more 'interesting' since OrchardCore.Documents does not reference StackExchange.Redis, so you cannot catch the explicit exception types. I've gone back and forth on the value of it, but really, if there were a circuit breaker in place, I can't see any difference between timeouts and connection failures (and I've never seen any other exception from the library). I think your circuit-breaker idea has a lot of value based on what we've been seeing in performance tests, as the system definitely starts to experience major issues once Redis timeouts appear.
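One hedged possibility for differentiating them without a StackExchange.Redis reference: match on the runtime type name in an exception filter. A sketch only, not something the project settled on here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public static class CacheReads
{
    public static async Task<byte[]> TryGetAsync(
        IDistributedCache cache, string key, CancellationToken token = default)
    {
        try
        {
            return await cache.GetAsync(key, token);
        }
        // The exception types can't be named without a StackExchange.Redis
        // reference, but an exception filter can match on the type name.
        catch (Exception ex) when (
            ex.GetType().Name is "RedisTimeoutException" or "RedisConnectionException")
        {
            return null; // treated as a cache miss by the caller
        }
    }
}
```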
Yes, we could have more explicit catches by having our own RedisCache (or at least a wrapper). We have our own RedisLock, but we just use the ASP.NET Core RedisCache (the StackExchangeRedis namespace): https://github.com/dotnet/aspnetcore/blob/e23fd047b4baf3480ce93d7643a897732f960557/src/Caching/StackExchangeRedis/src/RedisCache.cs#L17. Note: as I remember, there is already retry logic, at least for connecting, and also for reconnecting (maybe behind an option). Hmm, I think that on our side we just need a try/catch on any exception and log an error or warning with a more generic message saying that we failed to read from the distributed cache; if we lose the more specific error that way, we could just include the exception message in that log for more detailed info.
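A minimal sketch of that generic catch-and-log approach, assuming an injected IDistributedCache and ILogger; the class and method names are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;

public class SafeCacheReader
{
    private readonly IDistributedCache _cache;
    private readonly ILogger _logger;

    public SafeCacheReader(IDistributedCache cache, ILogger<SafeCacheReader> logger)
    {
        _cache = cache;
        _logger = logger;
    }

    public async Task<byte[]> GetOrDefaultAsync(string key)
    {
        try
        {
            return await _cache.GetAsync(key);
        }
        catch (Exception ex)
        {
            // Generic message; the attached exception keeps the specific details.
            _logger.LogWarning(ex, "Failed to read '{Key}' from the distributed cache.", key);
            return null; // the caller falls back to the data store
        }
    }
}
```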
You are correct about there being retry logic on the connection side of things. I've got things working now, but the performance impact is huge, as currently each request takes the hit for the connection timeout. I think the global thread-safe counter is going to be important, and I'm looking at that now.
I was able to get things working fairly nicely, but I'm not happy with the amount of clutter that gets added to DocumentManager to make the thread-safe cache-bypass/retry logic work. I think adding a RedisCache makes more sense, but I don't have the time for that for a week or two.
Thanks for working on this! Feel free to open a PR even if it is not ready; then I will help if I have time.
Maybe this PR will fix all the issues:
@sebastienros Sadly, it won't resolve the problem for us. I just found this issue today, and it matches what we're seeing: StackExchange/StackExchange.Redis#1848, and it looks like it's a no-fix. The workaround doesn't work for us because we're on Docker. The ForceReconnect idea has value, but as mentioned in their issue, the original implementation ignores timeout exceptions, and in this scenario that's what we end up seeing. So it would take a bit more logic, and I'm not sure I'd be the best person to come up with it, since our usage of OC is likely quite different than others'.
Interesting. Hmm, but maybe it's not incompatible with the breaker pattern suggested here, so that during a temporary outage we only read from the database for a certain amount of time, instead of entering a retry loop too many times.
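A tiny sketch of that time-boxed variant: after a failure, skip the Redis cache for a fixed interval instead of paying a timeout on every request. The class name and the 30-second window are arbitrary assumptions.

```csharp
using System;
using System.Threading;

public class TimedBreaker
{
    private static readonly TimeSpan RetryAfter = TimeSpan.FromSeconds(30);
    private long _reopenAtTicks;

    // True when reads may try the Redis cache again.
    public bool CacheAllowed => DateTime.UtcNow.Ticks >= Volatile.Read(ref _reopenAtTicks);

    // Called when a Redis read fails: pushes the reopen time into the future.
    public void Trip() =>
        Volatile.Write(ref _reopenAtTicks, DateTime.UtcNow.Add(RetryAfter).Ticks);
}
```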