[BUG] FlytePropeller 1.13.0 is crashing Flyteadmin and leaving the overall service in a bad state #5606
Comments
(AI-generated response from @RunLLM) The issue you're describing seems to be related to the interaction between FlytePropeller and FlyteAdmin, particularly with the gRPC calls FlytePropeller makes to FlyteAdmin.
I think what is happening here is in the auth interceptor provided by the flyteidl admin client. Inside MaterializeCredentials, the code does a couple of things that could probably be optimized.
@pmahindrakar-oss can you think of any quick fixes here? I think it should be okay to cache it on first read, but I don't know how often these values get refreshed; maybe we refresh the cache whenever we do a token refresh?

On the Admin side, it does seem likely that the OOM is caused by the repeated calling of these endpoints, but that's a side effect: calling these two endpoints thousands of times shouldn't be a problem. We're still looking into it. This is what a pprof looks like right now, so it feels like something Prometheus-related (which would make sense given how simple those two endpoints are), but @eapolinario and I dug for a while and haven't found a culprit yet.
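For reference, the sketch below is roughly what I mean by caching on first read. All names here (`Metadata`, `Fetcher`, `CachedFetcher`) are hypothetical placeholders rather than the real flyteidl types, and tying `Invalidate` to the token refresh is just one option for how the cache gets refreshed:

```go
// Minimal sketch of caching the OAuth2 metadata / public client config on first use,
// so repeated interceptor invocations don't hit Admin every time.
package authcache

import (
	"context"
	"sync"
)

// Metadata stands in for the values returned by GetOauth2Metadata / GetPublicClientConfig.
type Metadata struct {
	TokenEndpoint            string
	AuthorizationMetadataKey string
	Scopes                   []string
}

// Fetcher is whatever actually calls Admin (e.g. the auth metadata service client).
type Fetcher func(ctx context.Context) (Metadata, error)

// CachedFetcher wraps a Fetcher so the remote call happens at most once
// until Invalidate is called (e.g. on token refresh).
type CachedFetcher struct {
	mu    sync.Mutex
	fetch Fetcher
	meta  *Metadata
}

func NewCachedFetcher(f Fetcher) *CachedFetcher {
	return &CachedFetcher{fetch: f}
}

// Get returns the cached metadata, fetching from Admin only on the first call
// (or the first call after Invalidate).
func (c *CachedFetcher) Get(ctx context.Context) (Metadata, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.meta != nil {
		return *c.meta, nil
	}
	m, err := c.fetch(ctx)
	if err != nil {
		return Metadata{}, err
	}
	c.meta = &m
	return m, nil
}

// Invalidate drops the cached value; the next Get re-fetches,
// which is one way to tie refreshes to token refreshes.
func (c *CachedFetcher) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.meta = nil
}
```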
I noticed this PR wasn't upstreamed. That PR fixed "propeller panic when talking to admin, due to client interceptor sending nil token cache". I figured it could be a potential cause of the hammering of NewAuthInterceptor -> MaterializeCredentials -> GetOauth2Metadata + GetPublicClientConfig. I'm unsure whether this is the issue, though, since it seems like it would cause problems for most users on 1.13, and I don't get the worker panics that the PR fixed when running without it.
Do you have any propeller logs from while it was crashing during your upgrade to 1.13? Would you also be able to share logs from any workflows that failed during the upgrade?
@RRap0so I think there are two issues happening here: Admin shouldn't be OOMing, and the client that propeller uses shouldn't be making that many calls to Admin.

With regard to the second one, I asked internally and confirmed that, unless you instantiate a new auth server, the details shouldn't change. To that end, until we add a caching layer or change the logic, you can short-circuit the logic by setting the config fields here and here. This is not ideal, as you'll have these fields twice in your config: once where Admin currently grabs the values (to then serve them from these two endpoints) and once here. Make sure you set scopes as well.

The memory leak we're still investigating; it's not clear to me how the two are related. Will post back here when we have more.
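For anyone following along, here is a sketch of what that short-circuit could look like in the propeller `admin` client config. The field names are taken from the flyteidl admin client config and should be double-checked against your deployed version; the endpoint, token URL, and scope values below are placeholders:

```yaml
admin:
  endpoint: dns:///flyteadmin.flyte.svc.cluster.local:81   # placeholder
  clientId: flytepropeller
  clientSecretLocation: /etc/secrets/client_secret
  # Setting these explicitly means the client no longer has to call
  # GetOauth2Metadata / GetPublicClientConfig on Admin to discover them.
  tokenUrl: https://your-idp.example.com/oauth2/token      # placeholder IdP token endpoint
  authorizationMetadataKey: flyte-authorization
  scopes:
    - all
```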
Hey @pvditt and @wild-endeavor, thank you for all the help! I'll try to find some logs, but honestly I don't think there were any problems with the execution of the workflows. Things were still running; what I'm not sure about is whether propeller was able to publish events to admin.
I feel that it is fairly normal for clients to cache OAuth metadata. That is something I'd be happy to work on, as I've done it before.
Does anyone have an MVP flyte-binary
@eapolinario we just gave it another go and see the same behaviour: the Oauth2Metadata calls spike in flyteadmin, and flyteadmin OOMs and restarts.
@RRap0so if the change we talked about works, I think we should close out this issue. There may be a memory leak when Admin serves up 404s in perpetuity, but I don't think that's the right thing to be optimizing for either. The memory issue identified in this issue with propeller was isolated to metrics, and is related to this issue here that @eapolinario found. I think we should cut a separate ticket to discuss metric life-cycle with @EngHabu, as that discussion is pretty orthogonal. What do you think?
@wild-endeavor so far so good. We've updated our code with the latest flyteidl libs! I'm going to go ahead and close this one.
Describe the bug
We've tried several times to upgrade to Flyte v1.13.0 and have hit some serious issues blocking the upgrade. We don't know the cause yet, but overall it seems that FlytePropeller's communication with FlyteAdmin results in a very large number of calls to two gRPC endpoints:
GetOauth2Metadata
GetPublicClientConfig
What we see in FlyteAdmin is a lot of pressure on the gRPC side and an increase in memory until it gets killed by Kubernetes with an OOM. This in turn causes the FlytePropeller calls to fail with gRPC errors and eventually also causes a crash. We also noticed that this version added a retry mechanism which, in our case, is too aggressive, though we've tuned that down.
It's hard to say just from looking at the code, but I suspect this PR introduced some kind of bug, or a missing config we're unaware of, which makes FlyteAdmin constantly try to fetch a token (and fail):
#5407
When we roll back FlytePropeller to v1.11.0, things are fine and FlyteAdmin behaves nicely (still running v1.13.0).
Expected behavior
FlyteAdmin doesn't get massively impacted by gRPC calls.
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?