Cache Failover and Shared options #10338
Conversation
Just did some tests in the failover state: it works and at least the page is rendered, but it takes a long time (e.g. 40s with the blog theme) because there are still many attempts to access the cache, e.g. by our …
Hmm, maybe better to also have a failover pattern in a couple of other services, knowing that the remaining services are more intended to cache data temporarily, so with less concern about keeping the cache in sync with the database.
Anyway, it will be good to keep what I did here about the shared options, allowing global config options for all document types (easier to configure) but with a lower precedence than the per-document-type configs that we already have.
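For illustration, a minimal sketch of that precedence rule (the type and member names are hypothetical, not the actual PR code): per-document-type options, when configured, win over the shared global options.

using System;
using System.Collections.Generic;

// Hypothetical sketch: per-document-type options take precedence over the
// shared options that apply to all document types.
public class DocumentOptions
{
    public TimeSpan FailoverRetryLatency { get; set; } = TimeSpan.FromSeconds(30);
}

public class DocumentOptionsResolver
{
    private readonly Dictionary<string, DocumentOptions> _perTypeOptions = new();
    private readonly DocumentOptions _sharedOptions = new();

    public DocumentOptions Resolve(string documentType) =>
        _perTypeOptions.TryGetValue(documentType, out var options)
            ? options         // Per-document-type config wins...
            : _sharedOptions; // ...otherwise fall back to the shared config.
}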
40s wouldn't work for anyone hosting on Azure Linux containers due to that other issue that was found, which means we're talking 40s replies for 15 minutes. I did notice the GetOrCreateMutableAsync path hasn't been dealt with, which might explain part of your 40s: I originally did a similar implementation (without the DocumentNamed... stuff) and didn't see a 40s time taken, but I don't know if our code base has other changes that explain that difference.
No, this one is only used for a mutation before updating the store (if not marked as volatile).
Okay, I was using the blog theme, which has a menu using the cache / tagCache attributes; that's why the other caching services were used. By just removing these attributes, it works well while the document manager is in the failover state.
You can already test this PR, as it seems that you don't use any liquid / razor helpers involving dynamic caching.
First, read my previous comment. I applied a simple failover pattern to the dynamic cache service so that it works well even if I reuse the dynamic cache, by reusing the cached menu. In the logs we can see the first 2 attempts to read the distributed cache.
Then, if you edit e.g. the site settings and save them, they are not stored and you get …
I will push, but the failover retry latency of the dynamic cache is not yet configurable.
Okay. Also, when Redis is first connected and then disconnected, it is faster to check that it can't be accessed; in my first tests Redis was not connected from the beginning, so it was always initializing more things before trying to connect. So I reduced the default retry latency again, but to 30s, and I tried many connections / disconnections again. All seems to work fine, the goal being to keep the site usable during a Redis outage, but also to not allow a disconnected instance to update the store with a document related to a distributed cache entry. Can you try it and see if it fits your needs?
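As a rough illustration of that retry latency (hypothetical names, not the actual PR code), the gate can be as simple as a timestamp check: after a failure, the distributed cache is not touched again until the latency has elapsed.

using System;

// Hedged sketch of the retry latency gate described above.
public class DistributedCacheGate
{
    private static readonly TimeSpan _retryLatency = TimeSpan.FromSeconds(30);
    private DateTime _lastFailureUtc = DateTime.MinValue;

    // True if we are allowed to try the distributed cache again.
    public bool CanAccess() => DateTime.UtcNow - _lastFailureUtc >= _retryLatency;

    // Called when an access to the distributed cache throws.
    public void MarkAsFailed() => _lastFailureUtc = DateTime.UtcNow;
}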
@jtkech I haven't had a chance to look into this as things are a bit crazy, but I'm trying to get to it in the next day or two.
if (_redis.Connection == null)
{
    _logger.LogError("Unable to subscribe to the channel '{ChannelName}'.", _channelPrefix + channel);
    return;
}
This hides what used to be an explicit exception before. Do you really want that change in behavior?
By this I meant: is the bus used in places where you might want to make different decisions based on whether you can actually connect or not? If I'm not mistaken this is used for things like the distributed reloading of shells, for instance?
If I'm not mistaken this is used for things like the distributed reloading of shells for instance?

RedisBus is there, but currently we never use it: to keep the distributed cache in sync with the database we use identifiers in place of message bus events to invalidate cache entries.
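To make that concrete, here is a hedged sketch of the identifier idea (simplified; the key scheme and the database loader are hypothetical, not the actual implementation): each update regenerates an identifier, and a reader only trusts its local memory cache while the identifier stored in the distributed cache still matches.

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;

// Hedged sketch: the distributed cache only stores a small identifier that
// changes on every update, so instances can detect a stale local copy
// without message bus events.
public class IdentifierSyncedCache<TDocument> where TDocument : class
{
    private readonly IDistributedCache _distributedCache;
    private readonly IMemoryCache _memoryCache;
    private readonly Func<Task<TDocument>> _loadFromDatabaseAsync; // hypothetical loader

    public IdentifierSyncedCache(IDistributedCache distributedCache, IMemoryCache memoryCache,
        Func<Task<TDocument>> loadFromDatabaseAsync)
    {
        _distributedCache = distributedCache;
        _memoryCache = memoryCache;
        _loadFromDatabaseAsync = loadFromDatabaseAsync;
    }

    public async Task<TDocument> GetAsync()
    {
        var key = typeof(TDocument).FullName;
        var sharedId = await _distributedCache.GetStringAsync("ID_" + key);

        if (sharedId != null &&
            _memoryCache.TryGetValue<(string Id, TDocument Document)>(key, out var local) &&
            local.Id == sharedId)
        {
            // Identifiers match: the local copy is still in sync.
            return local.Document;
        }

        var document = await _loadFromDatabaseAsync();
        _memoryCache.Set(key, (sharedId, document));
        return document;
    }
}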
This hides what used to be an explicit exception before. Do you really want that change in behavior?

If Connection is still null it means that we can't connect, and that _redis.ConnectAsync(), just before, has already logged an explicit exception. Then, without this null check, if Connection is null, the subsequent try catch block would throw another, irrelevant NRE + stack trace on Connection.GetSubscriber(). So it works as before, it just prevents useless logging.
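For context, a hedged sketch of how that early return sits in the subscribe path (simplified; it assumes the _redis, _logger and _channelPrefix fields from the snippet above, and is not the exact implementation):

public async Task SubscribeAsync(string channel, Action<string, string> handler)
{
    // ConnectAsync() logs an explicit exception itself if the connection fails.
    await _redis.ConnectAsync();

    if (_redis.Connection == null)
    {
        // Early return: without this, Connection.GetSubscriber() below would
        // throw an irrelevant NullReferenceException on top of the logged error.
        _logger.LogError("Unable to subscribe to the channel '{ChannelName}'.", _channelPrefix + channel);
        return;
    }

    try
    {
        var subscriber = _redis.Connection.GetSubscriber();

        await subscriber.SubscribeAsync(_channelPrefix + channel, (redisChannel, redisValue) =>
        {
            handler(channel, redisValue);
        });
    }
    catch (Exception e)
    {
        _logger.LogError(e, "Unable to subscribe to the channel '{ChannelName}'.", _channelPrefix + channel);
    }
}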
I did the same in RedisLock and RedisTagCache, where we check .Database in place of .Connection, but that's quite the same. Then, about using a try catch block (that was already there): for the Redis lock, it still respects the contract; it still uses its retry logic and returns false after the timeout (see the sketch below).

The RedisTagCache is more related to dynamic caching (see the related comments), so I think it's okay to use a try catch block and just log errors. For the RedisBus I'm open to any suggestion, but as said we opted not to use it to keep data in sync, e.g. in a distributed env.
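As a hedged illustration of that lock contract (simplified standalone code, not the actual RedisLock): a Redis failure is swallowed like any failed attempt, so the caller still just gets false once the timeout expires.

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class SimpleRedisLock
{
    private readonly IDatabase _database; // may be null if Redis is unreachable

    public SimpleRedisLock(IDatabase database) => _database = database;

    // Honors the locking contract: true if acquired, false after the timeout.
    public async Task<bool> TryAcquireAsync(string key, TimeSpan timeout, TimeSpan expiry)
    {
        var startUtc = DateTime.UtcNow;

        while (true)
        {
            try
            {
                if (_database != null &&
                    await _database.StringSetAsync(key, "locked", expiry, When.NotExists))
                {
                    return true;
                }
            }
            catch
            {
                // Redis outage: treat it as a failed attempt and keep retrying.
            }

            if (DateTime.UtcNow - startUtc >= timeout)
            {
                return false;
            }

            await Task.Delay(100);
        }
    }
}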
No problem, only when you have time. Thanks for the review, I will answer your questions in the review.
Did you try it?
Our migration to the 1.0 code base is still in progress. I'm working on backporting the fix to pre-1.x to verify.
Okay thanks, let us know.
So I ended up testing it against a 1.0 version, and initial tests seemed good for viewing a tenant, but we've got custom stuff for the admin side of things, which means I can't test that.
@jtkech So one thing I am thinking is that without a ForceReconnect pattern implementation, as per the recommendation from StackExchange.Redis, there is a 15 minute period where the failover will continue to be hit every N seconds. Their alternative of modifying the underlying properties in Linux unfortunately cannot be implemented when using Azure App Services Linux containers, due to a lack of access to docker run command parameters or the privileged keyword.
Okay, so in that case you can't update e.g. the underlying Linux properties.

What do you think about merging this PR and then, based on this comment StackExchange/StackExchange.Redis#1848 (comment), trying to implement a ForceReconnect pattern?
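Roughly along these lines (a hedged sketch of the pattern described in that thread; the holder name and the 60s throttle are illustrative, not a definitive implementation):

using System;
using System.Threading;
using System.Threading.Tasks;
using StackExchange.Redis;

// Hedged sketch of the ForceReconnect pattern: when errors persist, recreate
// the ConnectionMultiplexer (throttled), so a dead socket can't keep failing
// for the whole OS-level TCP retransmission window.
public static class RedisConnectionHolder
{
    private static readonly SemaphoreSlim _reconnectLock = new(1, 1);
    private static readonly TimeSpan _reconnectMinInterval = TimeSpan.FromSeconds(60);
    private static DateTimeOffset _lastReconnectUtc = DateTimeOffset.MinValue;

    public static volatile ConnectionMultiplexer Connection;

    public static async Task ForceReconnectAsync(string configuration)
    {
        // Throttle: at most one reconnect per interval, even under heavy load.
        if (DateTimeOffset.UtcNow - _lastReconnectUtc < _reconnectMinInterval)
        {
            return;
        }

        await _reconnectLock.WaitAsync();
        try
        {
            // Re-check once the lock is held: another caller may have reconnected.
            if (DateTimeOffset.UtcNow - _lastReconnectUtc < _reconnectMinInterval)
            {
                return;
            }

            var old = Connection;
            Connection = await ConnectionMultiplexer.ConnectAsync(configuration);
            _lastReconnectUtc = DateTimeOffset.UtcNow;

            try { old?.Dispose(); } catch { /* The old connection may already be broken. */ }
        }
        finally
        {
            _reconnectLock.Release();
        }
    }
}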
That makes sense to me. |
Fixes #10197. Related to #10297 and #10313.
@ShaneCourtrille Yes, in #10313 you did a generic distributed cache breaker; then I also thought about a component, not related to the distributed cache, to which we could pass delegates to execute depending on the state. But that's not easy to adapt well to all scenarios.

So here I first retried to do the job in the DocManager, one goal being to still use the local memory cache during an outage (only during a retry latency time), in place of querying the database again.

Only started to work on it, so not ready, but you can see the idea ;)
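Roughly, the idea is (a hedged sketch with hypothetical names, reusing the DistributedCacheGate sketched earlier; not the PR code):

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

// Hedged sketch of the failover idea: during a distributed cache outage and
// within the retry latency window, reuse the local memory cache in place of
// querying the database again.
public class FailoverDocumentManager<TDocument> where TDocument : class
{
    private readonly IMemoryCache _memoryCache;
    private readonly Func<Task> _checkDistributedCacheAsync;   // hypothetical identifier check
    private readonly Func<Task<TDocument>> _queryDatabaseAsync; // hypothetical loader
    private readonly DistributedCacheGate _gate = new();

    public FailoverDocumentManager(IMemoryCache memoryCache,
        Func<Task> checkDistributedCacheAsync, Func<Task<TDocument>> queryDatabaseAsync)
    {
        _memoryCache = memoryCache;
        _checkDistributedCacheAsync = checkDistributedCacheAsync;
        _queryDatabaseAsync = queryDatabaseAsync;
    }

    public async Task<TDocument> GetOrCreateAsync()
    {
        var key = typeof(TDocument).FullName;

        if (_gate.CanAccess())
        {
            try
            {
                // Normal path: compare the shared identifier with the local one
                // and evict the local copy if it is stale (elided into a delegate).
                await _checkDistributedCacheAsync();
            }
            catch (Exception)
            {
                // Outage detected: stop hitting the distributed cache for a while.
                _gate.MarkAsFailed();
            }
        }

        // During an outage (or when still in sync), reuse the local copy.
        if (_memoryCache.TryGetValue<TDocument>(key, out var document))
        {
            return document;
        }

        document = await _queryDatabaseAsync();
        return _memoryCache.Set(key, document);
    }
}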