Caffeine cache for request api #2213
Conversation
🚢
Realized a bit after the fact that we should probably have @jschlather and/or @tpetr take a look to make sure this aligns with what we talked about in the critsit post — would you guys mind taking a peek?
I think this would be cleaner if the cache was in
The reason that I didn't put the cache in the
We also have several layers of cache in that call that didn't end up helping us with the 504s: https://github.com/HubSpot/Singularity/blob/master/SingularityService/src/main/java/com/hubspot/singularity/data/RequestManager.java#L628-L634.
Do you know what this value is going to be for a java web app?
For the web app, the user ID is that of the logged-in user. You're logged in through SSO; can you confirm, @pschoenfelder?
So it should be the janus username and stable across different instances of the same deployable?
Yes, just checked the logging around the user ID to confirm.
Okay cool, it would be nice to debounce across callers, but this should prevent one service from taking us down. The other option here would be to put a 1s cache on the getRequests call and then also add a 5s cache to the getRequestsWithHistory calls. I don't have the heap/thread dumps handy, but I'm pretty sure neither the leader cache nor the web cache was active on the instances I looked at.
I included the user in the key because users can have different levels of authorization, and since the past few 504 incidents have been caused by the same IP, I think we should be in the clear with that level of granularity. I'll discuss the two layers of CaffeineCache with the team. The web cache wouldn't have been active because it is only used for the web app, but I'll look more into the LeaderCache. Edit: LeaderCache is only active on a single instance (the scheduler instance).
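The per-user keying described above can be sketched with a small value type. The names below are illustrative, not taken from this PR:

```java
// Hypothetical cache key for the request-list endpoint. Including the
// authenticated user ID matters because two users with different
// authorization levels can receive different results from the same call.
// A record derives equals() and hashCode() from its components, so it is
// safe to use as a map/cache key.
public record RequestCacheKey(String userId, boolean includeFullHistory) { }
```

Two keys built from the same user and flags compare equal, so repeated calls from one user within the TTL hit the same entry, while a different user gets a separate entry.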
@jschlather During the latest slowdown, I found three ZK calls that kept timing out:
- the one ZK call in DeployManager (from RequestHelper)
- ZK call in RequestManager
- ZK call in UserManager (from RequestHelper)
And this was with the caffeine cache?
Yes. I was thinking that it caches the first time for a second, but then we are hit with 100 calls a second immediately after expiry and end up not being able to cache again because of ZK timeouts. We're looking at the heap dump now, and there was one item in the cache, so some things are entering.
Right, my original intention was for the cache to work across callers, so that way we only ever end up with one of these requests in progress. Since the cache is per caller, it seems like we still end up with too many concurrent requests to ZK. Caffeine should also debounce the call to ZK, so if you have two requests for the same cache key then the first one that misses calls to ZK and the second one waits for that. Maybe the answer here is more short-TTL caches.
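The debounce behavior described above (the first caller that misses performs the ZK call, and concurrent callers for the same key wait on that in-flight result) can be sketched with the standard library alone. Class and method names here are hypothetical, not Caffeine's API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of per-key load coalescing: the first caller that misses runs the
// expensive call (e.g. a ZK read) on its own thread, and any concurrent
// caller for the same key joins the in-flight future instead of issuing
// its own call.
public class CoalescingLoader<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

  public V load(K key, Supplier<V> expensiveCall) {
    CompletableFuture<V> myFuture = new CompletableFuture<>();
    CompletableFuture<V> existing = inFlight.putIfAbsent(key, myFuture);
    if (existing != null) {
      // Another caller is already loading this key; wait for its result.
      return existing.join();
    }
    try {
      V value = expensiveCall.get();
      myFuture.complete(value);
      return value;
    } catch (RuntimeException e) {
      myFuture.completeExceptionally(e);
      throw e;
    } finally {
      // Allow a fresh load once this one finishes (debounce, not a cache).
      inFlight.remove(key, myFuture);
    }
  }
}
```

Note that this only coalesces concurrent calls; unlike a TTL cache, a call arriving after the in-flight load completes will hit ZK again.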
I discussed with the team and it's not going to be possible to get the cache to work across all callers because of user settings and admin/non-admin privileges, but we're going to move the CaffeineCaches into the
We aren't too worried about the ZK call to get request history because it is only called if
Sounds good.
In response to Singularity experiencing slowness due to one endpoint getting hammered, we want the ability to cache a request value for at least a second to get some debouncing. To do this, we are using CaffeineCache with an expireAfterWrite of one second, which can be reconfigured to another value if necessary.
Open question: should the cache be in the resource file or somewhere nested in the RequestManager and RequestHelper?
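As a rough illustration of the expireAfterWrite policy the description mentions, here is a minimal standard-library sketch of a per-key TTL cache. It is a hypothetical stand-in to show the behavior, not the PR's implementation, which uses Caffeine itself:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for Caffeine's expireAfterWrite: an entry is served
// for ttlNanos after it was written, then the next reader reloads it.
public class TtlCache<K, V> {
  private static final class Entry<V> {
    final V value;
    final long writtenAtNanos;
    Entry(V value, long writtenAtNanos) {
      this.value = value;
      this.writtenAtNanos = writtenAtNanos;
    }
  }

  private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlNanos;

  public TtlCache(long ttlNanos) {
    this.ttlNanos = ttlNanos;
  }

  // Returns the cached value if it is still fresh, otherwise reloads it.
  // compute() also serializes loads for the same key, so concurrent misses
  // do not each hit the backing store (here, that would be ZooKeeper).
  public V get(K key, Supplier<V> loader) {
    return entries.compute(key, (k, existing) -> {
      long now = System.nanoTime();
      if (existing != null && now - existing.writtenAtNanos < ttlNanos) {
        return existing;
      }
      return new Entry<>(loader.get(), now);
    }).value;
  }
}
```

With a one-second TTL, a burst of identical calls within that second is served by a single backing read, which is the debouncing the description is after.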