
Cache Failover and Shared options #10338

Merged
jtkech merged 12 commits into main from jtkech/failover on Nov 19, 2021

Conversation

@jtkech (Member) commented on Sep 26, 2021

Fixes #10197. Related to #10297 and #10313.

@ShaneCourtrille Yes, in #10313 you did a generic distributed cache breaker. I also thought about a standalone component, not tied to the distributed cache, to which we could pass delegates to execute depending on its state.

But that's not easy to adapt well to all scenarios. So here I first retried to do the job in the DocumentManager, one goal being to keep using the local memory cache during an outage (only for a retry latency window), instead of querying the database again.

I've only started to work on it, so it's not ready yet, but you can see the idea ;)

@jtkech (Member, Author) commented on Sep 27, 2021

I just did some tests in the failover state: it works and the page is at least rendered, but it takes a long time (e.g. 40s with the blog theme) because there are still many attempts to access the cache, e.g. by our IDynamicCache which uses IDistributedCache, and so on. Would that be okay for you?

Hmm, maybe it would be better to also have a failover pattern in a couple of other services, knowing that the remaining services are more intended to cache data temporarily, so with fewer concerns about keeping the cache in sync with the database.

Anyway, it will be good to keep what I did here around the shared options, which also allow global config options for all document types (easier to configure), but with a lower precedence than the per-document-type configs that we already have.
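
For illustration only, a minimal sketch of that precedence idea, with hypothetical type and member names (not the actual OrchardCore options API): a per-document-type value wins over the shared/global one, which wins over a default.

using System;

// All names here are illustrative assumptions, not the real OrchardCore options types.
public class DocumentCacheOptions
{
    public TimeSpan? FailoverRetryLatency { get; set; }
}

public static class DocumentCacheOptionsExtensions
{
    // Per-document-type value first, then the shared (global) value, then the default.
    public static TimeSpan ResolveFailoverRetryLatency(
        DocumentCacheOptions perType, DocumentCacheOptions shared, TimeSpan defaultValue)
        => perType?.FailoverRetryLatency
            ?? shared?.FailoverRetryLatency
            ?? defaultValue;
}

The point is simply that the shared section is a convenience for configuring all document types at once, never an override of an explicit per-type setting.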

@ShaneCourtrille (Contributor) commented

40s wouldn't work for anyone hosting on Azure Linux containers, due to that other issue that was found, which means we're talking 40s replies for 15 minutes.

I did notice the GetOrCreateMutableAsync path hasn't been dealt with, which might explain part of your 40s. I originally did a similar implementation (without the DocumentNamed... stuff) and didn't see a 40s time, but I don't know whether our code base has other changes that explain that difference.

@jtkech (Member, Author) commented on Sep 27, 2021

@ShaneCourtrille

> I did notice the GetOrCreateMutableAsync path hasn't been dealt with which might explain part of your 40s.

No, this one is only used for a mutation before updating the store (if the document is not marked as volatile).

Okay, I was using the blog theme, which has a menu using the cache / tagCache attributes; that's why the other caching services were hit. By just removing these attributes, it works well while the document manager is in the failover state.

  • What I'm doing now is disallowing updates to the store when the distributed cache can't be accessed; in that case we do a store CancelAsync(), so that the caches (e.g. of other running instances) stay in sync with the store.

    Because of the above, I think it's okay to have a higher default value, e.g. 5 minutes, for the configurable FailoverRetryLatency; at the least it's good to have a value higher than the time taken by the first failing access to the cache.

  • Next step: also use an adapted failover pattern for the dynamic cache, I think a simpler one because the related data is not intended to be cached for a long time, e.g. without having to deal with cache / datastore syncing.

    Then I will test it by re-using the caching attributes of the blog theme menu.

You can already test this PR, as it seems you don't use any Liquid / Razor helpers involving dynamic caching.
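
As a rough sketch of the read path described above, with illustrative names (this is not the actual DocumentManager code): the distributed cache is tried first, and on failure the service flips a failover flag and serves the locally cached value instead of querying the database again. A real implementation would also clear the failover flag after the configured retry latency.

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Logging;

// Illustrative sketch only, not the real OrchardCore DocumentManager.
public class FailoverCacheReader
{
    private readonly IDistributedCache _distributedCache;
    private readonly IMemoryCache _memoryCache;
    private readonly ILogger _logger;
    private bool _isInFailover;

    public FailoverCacheReader(IDistributedCache distributedCache, IMemoryCache memoryCache, ILogger logger)
    {
        _distributedCache = distributedCache;
        _memoryCache = memoryCache;
        _logger = logger;
    }

    public async Task<byte[]> GetAsync(string key)
    {
        if (!_isInFailover)
        {
            try
            {
                var data = await _distributedCache.GetAsync(key);
                if (data != null)
                {
                    return data;
                }
            }
            catch (Exception ex)
            {
                // Enter the failover state and fall back to the local memory cache below.
                _logger.LogWarning(ex, "Failed to read the '{Key}' from the distributed cache", key);
                _isInFailover = true;
            }
        }

        // During an outage, serve the locally cached value if there is one,
        // instead of querying the database again (returns null if nothing is cached).
        return _memoryCache.Get<byte[]>(key);
    }
}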

@jtkech (Member, Author) commented on Sep 28, 2021

@ShaneCourtrille

First, read my previous comment.

I applied a simple failover pattern to the dynamic cache service, so it works well even when I re-use the dynamic cache via the cached menu. In the logs we can see the first 2 failing attempts to read the distributed cache.

2021-09-28 02:45:45.9812|Default|00-50a372c3f3c00f49bd95841ce19ebe5e-fb1fa76597e32642-00||OrchardCore.Documents.DocumentManager|WARN|Failed to read the 'SiteSettings' from the distributed cache

2021-09-28 02:46:12.8066|Default|00-50a372c3f3c00f49bd95841ce19ebe5e-fb1fa76597e32642-00||OrchardCore.DynamicCache.Services.DefaultDynamicCacheService|WARN|Failed to read the 'cachecontext-main-menu' from the dynamic cache

Then if you edit e.g. the site settings and save them, they are not stored and you get

2021-09-28 03:29:50.7981|Default|00-f356576513ee434fa10b6cf3944b271a-6e8bbf14ff3d1c4a-00||Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware|ERROR|An unhandled exception has occurred while executing the request. System.InvalidOperationException: Can't update the SiteSettings if not able to access the distributed cache

I will push it, but the failover retry latency of the dynamic cache is not yet configurable.
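
A minimal sketch of the write-path guard that produces the exception above; the probe key and the cancel delegate are assumptions, not the real implementation:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Illustrative sketch only.
public static class UpdateGuardSketch
{
    public static async Task EnsureDistributedCacheAsync(
        IDistributedCache distributedCache,
        Func<Task> cancelStoreAsync, // e.g. the document store's CancelAsync()
        string documentName)
    {
        try
        {
            // Any cheap round trip is enough to prove the distributed cache is reachable.
            await distributedCache.GetAsync("ID_" + documentName);
        }
        catch
        {
            // Abort the update so the store and the distributed caches of other instances stay in sync.
            await cancelStoreAsync();
            throw new InvalidOperationException(
                $"Can't update the {documentName} if not able to access the distributed cache");
        }
    }
}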

@jtkech (Member, Author) commented on Sep 29, 2021

@ShaneCourtrille

Okay, and also: when Redis is first connected and then disconnected, it is faster to detect that it can't be accessed. In my first tests Redis was not connected from the beginning, so it was always initializing more things before trying to connect.

So I reduced the default retry latency again, this time to 30s, and again tried many connections / disconnections.

All seems to work fine. The goal is to keep the site usable during a Redis outage, while also not allowing a disconnected instance to update the store with a document related to a distributed cache entry.

Can you try it and see if it fits your needs?

jtkech changed the title from "DocumentManager distributed cache Failover" to "Cache Failover and Shared options" on Sep 29, 2021
jtkech removed the "don't merge" label on Oct 2, 2021
@ShaneCourtrille (Contributor) commented

@jtkech I haven't had a chance to look into this as things are a bit crazy, but I'm trying to get to it in the next day or two.

Review thread on the RedisBus connection null check:

if (_redis.Connection == null)
{
    _logger.LogError("Unable to subscribe to the channel '{ChannelName}'.", _channelPrefix + channel);
    return;
}
@ShaneCourtrille (Contributor) commented:

This hides what used to be an explicit exception before. Do you really want that change in behavior?

@ShaneCourtrille (Contributor) commented:

By this I meant: is the bus used in places where you might want to make a different decision based on whether you can actually connect or not? If I'm not mistaken, this is used for things like the distributed reloading of shells, for instance?

@jtkech (Member, Author) commented:

> If I'm not mistaken this is used for things like the distributed reloading of shells for instance?

RedisBus is there but currently we never use it; to keep the distributed cache in sync with the database we use identifiers instead of message bus events to invalidate cache entries.

> This hides what used to be an explicit exception before. Do you really want that change in behavior?

If Connection is still null it means that we can't connect, and that _redis.ConnectAsync(), called just before, has already logged an explicit exception. Then, without this null check, and if Connection is null, the subsequent try/catch block would throw another, non-relevant NRE + stack trace on Connection.GetSubscriber().

So it works as before; it just prevents useless logging.

I did the same in RedisLock and RedisTagCache, where we check .Database instead of .Connection, but that's pretty much the same thing. Then, about using a try/catch block (which was already there): for the Redis lock it still respects the contract, it still uses its retry logic and returns false after the timeout.

The RedisTagCache is more related to dynamic caching (see the related comments), so I think it's okay to use a try/catch block and just log errors. For the RedisBus I'm open to any suggestion, but as said, we opted not to use it to keep data in sync, e.g. in a distributed environment.
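
As an aside, a small sketch of that "check the connection, then log and degrade" approach for a tag-cache-like read, with illustrative names (not the actual RedisTagCache code):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using StackExchange.Redis;

// Illustrative sketch only.
public static class TagCacheSketch
{
    public static async Task<IEnumerable<string>> GetTaggedItemsAsync(
        IDatabase database, ILogger logger, string tag)
    {
        if (database == null)
        {
            // Connection failed earlier; an explicit error was already logged at connect time.
            return Enumerable.Empty<string>();
        }

        try
        {
            var values = await database.SetMembersAsync(tag);
            return values.Select(v => (string)v);
        }
        catch (Exception ex)
        {
            // Just log and return an empty result so the caller degrades gracefully.
            logger.LogError(ex, "Failed to read the tag '{Tag}' from Redis", tag);
            return Enumerable.Empty<string>();
        }
    }
}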

@jtkech (Member, Author) commented on Oct 5, 2021

@ShaneCourtrille

> I haven't had a chance to look into this as things are a bit crazy but I'm trying to get to it in the next day or two

No problem, only when you have time. Thanks for the review, I will answer your questions in the review thread.

  • I forgot to mention what I did around the config options: I just added global options for all document types, not only per document type.

  • Yes, I opted to manage the failover state in the document manager (same in the dynamic cache service) so that it can be better adapted. As I remember, there are not so many places where we check the failover state, which is held in the memory cache with an expiration time set to the configurable retry latency.
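
A tiny sketch of that mechanism, with hypothetical key and helper names: the failover flag is just a memory cache entry whose lifetime equals the configured retry latency, so the failover state clears itself automatically.

using System;
using Microsoft.Extensions.Caching.Memory;

// Illustrative sketch only; the cache key and helper names are assumptions.
public static class FailoverStateSketch
{
    private const string FailoverKey = "DOCUMENT_MANAGER_FAILOVER";

    public static void EnterFailoverState(IMemoryCache memoryCache, TimeSpan failoverRetryLatency)
        => memoryCache.Set(FailoverKey, true, new MemoryCacheEntryOptions
        {
            // The entry expires after the retry latency, which ends the failover state.
            AbsoluteExpirationRelativeToNow = failoverRetryLatency
        });

    public static bool IsInFailoverState(IMemoryCache memoryCache)
        => memoryCache.TryGetValue(FailoverKey, out _);
}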

@jtkech (Member, Author) commented on Oct 9, 2021

@ShaneCourtrille

Did you try it?

@ShaneCourtrille (Contributor) commented

Our migration to 1.0 code base is still in progress. I'm working on backporting the fix to pre 1.x to verify.

@jtkech (Member, Author) commented on Nov 5, 2021

Okay thanks, let us know.

@ShaneCourtrille (Contributor) commented

So I ended up testing it against a 1.0 version, and initial tests seemed good for viewing a tenant, but we've got custom stuff for the admin side of things, which means I can't test that.

@ShaneCourtrille (Contributor) commented on Nov 16, 2021

@jtkech So one thing I am thinking is that without a ForceReconnect pattern implementation, as per the recommendation from StackExchange.Redis, there is a 15-minute period where the failover will continue to be hit every N seconds. Their alternative of modifying the underlying Linux properties unfortunately cannot be implemented when using Azure App Service Linux containers, due to the lack of access to docker run command parameters or the privileged keyword.

@jtkech (Member, Author) commented on Nov 17, 2021

@ShaneCourtrille

Okay, so in that case you can't update e.g. tcp_retries2.

What do you think about merging this PR and then, based on this comment StackExchange/StackExchange.Redis#1848 (comment), trying to implement a ForceReconnect pattern if you have time and suggesting it in another PR?
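
For reference, a simplified sketch of the ForceReconnect idea discussed in that thread; the thresholds, field names, and connection string are placeholders, not a final implementation. When Redis errors keep occurring past a threshold and we haven't reconnected recently, the multiplexer is disposed and recreated instead of waiting for the dead socket to time out.

using System;
using StackExchange.Redis;

// Simplified sketch of a ForceReconnect pattern; not production-ready.
public static class RedisForceReconnect
{
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);
    private static readonly TimeSpan ReconnectErrorThreshold = TimeSpan.FromSeconds(30);
    private static readonly object Lock = new object();

    private static Lazy<ConnectionMultiplexer> _multiplexer = CreateMultiplexer();
    private static DateTimeOffset _lastReconnect = DateTimeOffset.UtcNow;
    private static DateTimeOffset _firstError = DateTimeOffset.MinValue;

    public static ConnectionMultiplexer Connection => _multiplexer.Value;

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer()
        // "localhost" is a placeholder connection string.
        => new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect("localhost"));

    // Call this from catch blocks around Redis operations.
    // (A fuller implementation also checks that errors are still recent before reconnecting.)
    public static void OnError()
    {
        lock (Lock)
        {
            var now = DateTimeOffset.UtcNow;

            if (_firstError == DateTimeOffset.MinValue)
            {
                _firstError = now;
            }

            // Only force a reconnect if errors have persisted long enough
            // and we haven't just reconnected.
            if (now - _firstError < ReconnectErrorThreshold ||
                now - _lastReconnect < ReconnectMinInterval)
            {
                return;
            }

            var old = _multiplexer;
            _multiplexer = CreateMultiplexer();
            _lastReconnect = now;
            _firstError = DateTimeOffset.MinValue;

            if (old.IsValueCreated)
            {
                // Dispose the dead connection; ignore errors while closing it.
                try { old.Value.Dispose(); } catch { }
            }
        }
    }
}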

@ShaneCourtrille (Contributor) commented

That makes sense to me.

jtkech merged commit ace8e09 into main on Nov 19, 2021
jtkech deleted the jtkech/failover branch on November 19, 2021 05:14
Development

Successfully merging this pull request may close these issues.

DocumentManager.GetInternalAsync cannot handle Redis server maintenance