
Cache Failover and Shared options #10338

Merged
jtkech merged 12 commits into main from jtkech/failover on Nov 19, 2021

Conversation

@jtkech (Member) commented on Sep 26, 2021

Fixes #10197. Related to #10297 and #10313.

@ShaneCourtrille Yes, in #10313 you did a generic distributed cache breaker. I also thought about a standalone component, not tied to the distributed cache, to which we could pass delegates to execute depending on its state.

But that's not easy to adapt well to all scenarios. So here I first retried to do the job in the DocumentManager, one goal being to keep using the local memory cache during an outage (only for a retry latency window), instead of querying the database again.

I've only started to work on it, so it's not ready yet, but you can see the idea ;)

@jtkech (Member, Author) commented on Sep 27, 2021

I just did some tests in the failover state: it works and the page is at least rendered, but it takes a long time (e.g. 40s with the blog theme) because there are still many attempts to access the cache, e.g. by our IDynamicCache which uses IDistributedCache, and so on. Would that be okay for you?

Hmm, maybe it would be better to also have a failover pattern in a couple of other services, knowing that the remaining services are more intended to cache data temporarily, so with fewer concerns about keeping the cache in sync with the database.

Anyway, it will be good to keep what I did here around the shared options, which also allow global config options for all document types (easier to configure), but with a lower precedence than the per-document-type configs that we already have.
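
For illustration only, a minimal sketch of that precedence idea, with hypothetical type and member names (not the actual OrchardCore options API): a per-document-type value wins over the shared/global one, which wins over a default.

using System;

// All names here are illustrative assumptions, not the real OrchardCore options types.
public class DocumentCacheOptions
{
    public TimeSpan? FailoverRetryLatency { get; set; }
}

public static class DocumentCacheOptionsExtensions
{
    // Per-document-type value first, then the shared (global) value, then the default.
    public static TimeSpan ResolveFailoverRetryLatency(
        DocumentCacheOptions perType, DocumentCacheOptions shared, TimeSpan defaultValue)
        => perType?.FailoverRetryLatency
            ?? shared?.FailoverRetryLatency
            ?? defaultValue;
}

The point is simply that the shared section is a convenience for configuring all document types at once, never an override of an explicit per-type setting.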

@ShaneCourtrille (Contributor) commented

40s wouldn't work for anyone hosting on Azure Linux containers, due to that other issue that was found, which means we're talking 40s replies for 15 minutes.

I did notice the GetOrCreateMutableAsync path hasn't been dealt with, which might explain part of your 40s. I originally did a similar implementation (without the DocumentNamed... stuff) and didn't see a 40s time, but I don't know whether our code base has other changes that explain that difference.

@jtkech (Member, Author) commented on Sep 27, 2021

@ShaneCourtrille

> I did notice the GetOrCreateMutableAsync path hasn't been dealt with which might explain part of your 40s.

No, this one is only used for a mutation before updating the store (if the document is not marked as volatile).

Okay, I was using the blog theme, which has a menu using the cache / tagCache attributes; that's why the other caching services were hit. By just removing these attributes, it works well while the document manager is in the failover state.

  • What I'm doing now is disallowing updates to the store when the distributed cache can't be accessed; in that case we do a store CancelAsync(), so that the caches (e.g. of other running instances) stay in sync with the store.

    Because of the above, I think it's okay to have a higher default value, e.g. 5 minutes, for the configurable FailoverRetryLatency; at the least it's good to have a value higher than the time taken by the first failing access to the cache.

  • Next step: also use an adapted failover pattern for the dynamic cache, I think a simpler one because the related data is not intended to be cached for a long time, e.g. without having to deal with cache / datastore syncing.

    Then I will test it by re-using the caching attributes of the blog theme menu.

You can already test this PR, as it seems you don't use any Liquid / Razor helpers involving dynamic caching.
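
As a rough sketch of the read path described above, with illustrative names (this is not the actual DocumentManager code): the distributed cache is tried first, and on failure the service flips a failover flag and serves the locally cached value instead of querying the database again. A real implementation would also clear the failover flag after the configured retry latency.

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Logging;

// Illustrative sketch only, not the real OrchardCore DocumentManager.
public class FailoverCacheReader
{
    private readonly IDistributedCache _distributedCache;
    private readonly IMemoryCache _memoryCache;
    private readonly ILogger _logger;
    private bool _isInFailover;

    public FailoverCacheReader(IDistributedCache distributedCache, IMemoryCache memoryCache, ILogger logger)
    {
        _distributedCache = distributedCache;
        _memoryCache = memoryCache;
        _logger = logger;
    }

    public async Task<byte[]> GetAsync(string key)
    {
        if (!_isInFailover)
        {
            try
            {
                var data = await _distributedCache.GetAsync(key);
                if (data != null)
                {
                    return data;
                }
            }
            catch (Exception ex)
            {
                // Enter the failover state and fall back to the local memory cache below.
                _logger.LogWarning(ex, "Failed to read the '{Key}' from the distributed cache", key);
                _isInFailover = true;
            }
        }

        // During an outage, serve the locally cached value if there is one,
        // instead of querying the database again (returns null if nothing is cached).
        return _memoryCache.Get<byte[]>(key);
    }
}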

@jtkech (Member, Author) commented on Sep 28, 2021

@ShaneCourtrille

First, read my previous comment.

I applied a simple failover pattern to the dynamic cache service, so it works well even when I re-use the dynamic cache via the cached menu. In the logs we can see the first 2 failing attempts to read the distributed cache.

2021-09-28 02:45:45.9812|Default|00-50a372c3f3c00f49bd95841ce19ebe5e-fb1fa76597e32642-00||OrchardCore.Documents.DocumentManager|WARN|Failed to read the 'SiteSettings' from the distributed cache

2021-09-28 02:46:12.8066|Default|00-50a372c3f3c00f49bd95841ce19ebe5e-fb1fa76597e32642-00||OrchardCore.DynamicCache.Services.DefaultDynamicCacheService|WARN|Failed to read the 'cachecontext-main-menu' from the dynamic cache

Then if you edit e.g. the site settings and save them, they are not stored and you get

2021-09-28 03:29:50.7981|Default|00-f356576513ee434fa10b6cf3944b271a-6e8bbf14ff3d1c4a-00||Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware|ERROR|An unhandled exception has occurred while executing the request. System.InvalidOperationException: Can't update the SiteSettings if not able to access the distributed cache

I will push it, but the failover retry latency of the dynamic cache is not yet configurable.
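
A minimal sketch of the write-path guard that produces the exception above; the probe key and the cancel delegate are assumptions, not the real implementation:

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

// Illustrative sketch only.
public static class UpdateGuardSketch
{
    public static async Task EnsureDistributedCacheAsync(
        IDistributedCache distributedCache,
        Func<Task> cancelStoreAsync, // e.g. the document store's CancelAsync()
        string documentName)
    {
        try
        {
            // Any cheap round trip is enough to prove the distributed cache is reachable.
            await distributedCache.GetAsync("ID_" + documentName);
        }
        catch
        {
            // Abort the update so the store and the distributed caches of other instances stay in sync.
            await cancelStoreAsync();
            throw new InvalidOperationException(
                $"Can't update the {documentName} if not able to access the distributed cache");
        }
    }
}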

@jtkech (Member, Author) commented on Sep 29, 2021

@ShaneCourtrille

Okay, and also: when Redis is first connected and then disconnected, it is faster to detect that it can't be accessed. In my first tests Redis was not connected from the beginning, so it was always initializing more things before trying to connect.

So I reduced the default retry latency again, this time to 30s, and again tried many connections / disconnections.

All seems to work fine. The goal is to keep the site usable during a Redis outage, while also not allowing a disconnected instance to update the store with a document related to a distributed cache entry.

Can you try it and see if it fits your needs?

jtkech changed the title from "DocumentManager distributed cache Failover" to "Cache Failover and Shared options" on Sep 29, 2021
jtkech removed the "don't merge" label on Oct 2, 2021
@ShaneCourtrille (Contributor) commented

@jtkech I haven't had a chance to look into this as things are a bit crazy, but I'm trying to get to it in the next day or two.

Review thread on the RedisBus connection null check:

if (_redis.Connection == null)
{
    _logger.LogError("Unable to subscribe to the channel '{ChannelName}'.", _channelPrefix + channel);
    return;
}
@ShaneCourtrille (Contributor) commented:

This hides what used to be an explicit exception before. Do you really want that change in behavior?

@ShaneCourtrille (Contributor) commented:

By this I meant: is the bus used in places where you might want to make a different decision based on whether you can actually connect or not? If I'm not mistaken, this is used for things like the distributed reloading of shells, for instance?

@jtkech (Member, Author) commented:

> If I'm not mistaken this is used for things like the distributed reloading of shells for instance?

RedisBus is there but currently we never use it; to keep the distributed cache in sync with the database we use identifiers instead of message bus events to invalidate cache entries.

> This hides what used to be an explicit exception before. Do you really want that change in behavior?

If Connection is still null it means that we can't connect, and that _redis.ConnectAsync(), called just before, has already logged an explicit exception. Then, without this null check, and if Connection is null, the subsequent try/catch block would throw another, non-relevant NRE + stack trace on Connection.GetSubscriber().

So it works as before; it just prevents useless logging.

I did the same in RedisLock and RedisTagCache, where we check .Database instead of .Connection, but that's pretty much the same thing. Then, about using a try/catch block (which was already there): for the Redis lock it still respects the contract, it still uses its retry logic and returns false after the timeout.

The RedisTagCache is more related to dynamic caching (see the related comments), so I think it's okay to use a try/catch block and just log errors. For the RedisBus I'm open to any suggestion, but as said, we opted not to use it to keep data in sync, e.g. in a distributed environment.
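
As an aside, a small sketch of that "check the connection, then log and degrade" approach for a tag-cache-like read, with illustrative names (not the actual RedisTagCache code):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using StackExchange.Redis;

// Illustrative sketch only.
public static class TagCacheSketch
{
    public static async Task<IEnumerable<string>> GetTaggedItemsAsync(
        IDatabase database, ILogger logger, string tag)
    {
        if (database == null)
        {
            // Connection failed earlier; an explicit error was already logged at connect time.
            return Enumerable.Empty<string>();
        }

        try
        {
            var values = await database.SetMembersAsync(tag);
            return values.Select(v => (string)v);
        }
        catch (Exception ex)
        {
            // Just log and return an empty result so the caller degrades gracefully.
            logger.LogError(ex, "Failed to read the tag '{Tag}' from Redis", tag);
            return Enumerable.Empty<string>();
        }
    }
}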

@jtkech (Member, Author) commented on Oct 5, 2021

@ShaneCourtrille

> I haven't had a chance to look into this as things are a bit crazy but I'm trying to get to it in the next day or two

No problem, only when you have time. Thanks for the review, I will answer your questions in the review thread.

  • I forgot to mention what I did around the config options: I just added global options for all document types, not only per document type.

  • Yes, I opted to manage the failover state in the document manager (same in the dynamic cache service) so that it can be better adapted. As I remember, there are not so many places where we check the failover state, which is held in the memory cache with an expiration time set to the configurable retry latency.
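
A tiny sketch of that mechanism, with hypothetical key and helper names: the failover flag is just a memory cache entry whose lifetime equals the configured retry latency, so the failover state clears itself automatically.

using System;
using Microsoft.Extensions.Caching.Memory;

// Illustrative sketch only; the cache key and helper names are assumptions.
public static class FailoverStateSketch
{
    private const string FailoverKey = "DOCUMENT_MANAGER_FAILOVER";

    public static void EnterFailoverState(IMemoryCache memoryCache, TimeSpan failoverRetryLatency)
        => memoryCache.Set(FailoverKey, true, new MemoryCacheEntryOptions
        {
            // The entry expires after the retry latency, which ends the failover state.
            AbsoluteExpirationRelativeToNow = failoverRetryLatency
        });

    public static bool IsInFailoverState(IMemoryCache memoryCache)
        => memoryCache.TryGetValue(FailoverKey, out _);
}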

@jtkech (Member, Author) commented on Oct 9, 2021

@ShaneCourtrille

Did you try it?

@ShaneCourtrille (Contributor) commented

Our migration to 1.0 code base is still in progress. I'm working on backporting the fix to pre 1.x to verify.

@jtkech (Member, Author) commented on Nov 5, 2021

Okay thanks, let us know.

@ShaneCourtrille (Contributor) commented

So I ended up testing it against a 1.0 version, and initial tests seemed good for viewing a tenant, but we've got custom stuff for the admin side of things, which means I can't test that.

@ShaneCourtrille (Contributor) commented on Nov 16, 2021

@jtkech So one thing I am thinking is that without a ForceReconnect pattern implementation, as per the recommendation from StackExchange.Redis, there is a 15-minute period where the failover will continue to be hit every N seconds. Their alternative of modifying the underlying Linux properties unfortunately cannot be implemented when using Azure App Service Linux containers, due to the lack of access to docker run command parameters or the privileged keyword.

@jtkech (Member, Author) commented on Nov 17, 2021

@ShaneCourtrille

Okay, so in that case you can't update e.g. tcp_retries2.

What do you think about merging this PR and then, based on this comment StackExchange/StackExchange.Redis#1848 (comment), trying to implement a ForceReconnect pattern if you have time and suggesting it in another PR?
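
For reference, a simplified sketch of the ForceReconnect idea discussed in that thread; the thresholds, field names, and connection string are placeholders, not a final implementation. When Redis errors keep occurring past a threshold and we haven't reconnected recently, the multiplexer is disposed and recreated instead of waiting for the dead socket to time out.

using System;
using StackExchange.Redis;

// Simplified sketch of a ForceReconnect pattern; not production-ready.
public static class RedisForceReconnect
{
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);
    private static readonly TimeSpan ReconnectErrorThreshold = TimeSpan.FromSeconds(30);
    private static readonly object Lock = new object();

    private static Lazy<ConnectionMultiplexer> _multiplexer = CreateMultiplexer();
    private static DateTimeOffset _lastReconnect = DateTimeOffset.UtcNow;
    private static DateTimeOffset _firstError = DateTimeOffset.MinValue;

    public static ConnectionMultiplexer Connection => _multiplexer.Value;

    private static Lazy<ConnectionMultiplexer> CreateMultiplexer()
        // "localhost" is a placeholder connection string.
        => new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect("localhost"));

    // Call this from catch blocks around Redis operations.
    // (A fuller implementation also checks that errors are still recent before reconnecting.)
    public static void OnError()
    {
        lock (Lock)
        {
            var now = DateTimeOffset.UtcNow;

            if (_firstError == DateTimeOffset.MinValue)
            {
                _firstError = now;
            }

            // Only force a reconnect if errors have persisted long enough
            // and we haven't just reconnected.
            if (now - _firstError < ReconnectErrorThreshold ||
                now - _lastReconnect < ReconnectMinInterval)
            {
                return;
            }

            var old = _multiplexer;
            _multiplexer = CreateMultiplexer();
            _lastReconnect = now;
            _firstError = DateTimeOffset.MinValue;

            if (old.IsValueCreated)
            {
                // Dispose the dead connection; ignore errors while closing it.
                try { old.Value.Dispose(); } catch { }
            }
        }
    }
}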

@ShaneCourtrille (Contributor) commented

That makes sense to me.

jtkech merged commit ace8e09 into main on Nov 19, 2021
jtkech deleted the jtkech/failover branch on November 19, 2021 05:14
Development

Successfully merging this pull request may close these issues.

DocumentManager.GetInternalAsync cannot handle Redis server maintenance