Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reintroduce k8s client fallback to cache lookups #4733

Merged
merged 5 commits into from
Jan 17, 2024

Conversation

hamersaw
Copy link
Contributor

@hamersaw hamersaw commented Jan 17, 2024

Tracking issue

fixes #4730

Why are the changes needed?

This PR updated k8s and removed the logic to fallback to the client Reader when the cache Reader was a miss. In high traffic scenarios the k8s informer can not populate the cache fast enough, so the fallback API call is required to ensure correctness.

What changes were proposed in this pull request?

Creating the k8s client with a fallback to API call when an object does not exist in the informer cache.

How was this patch tested?

Faked an object NotFound error because of cache miss.

Setup process

Updated k8s client and cache through the otel wrapper to fail retrievals if the object was created less than 3 seconds ago. Example code:

    func (c *K8sCacheWrapper) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
        ctx, span := NewSpan(ctx, K8sClientTracer, fmt.Sprintf("%s.Cache/Get", k8sSpanPathPrefix))
        defer span.End()

        err := c.Cache.Get(ctx, key, obj, opts...)
        if time.Since(obj.GetCreationTimestamp().Time) < (time.Second * 3) {
            obj = nil
            return k8serrors.NewNotFound(schema.GroupResource{Group: "foo", Resource: "bar"}, key.Name)
        }

        return err
        //return c.Cache.Get(ctx, key, obj, opts...)
    }

Then validated that this causes Pods to be stuck in Terminating state with finalizers still intact:

    hamersaw@ragnarok:~$ kubectl -n flytesnacks-development get pods
    NAME                        READY   STATUS        RESTARTS   AGE
    f5329181f1364487b8a9-n0-0   0/1     Terminating   0          95s
    f5329181f1364487b8a9-n0-1   0/1     Terminating   0          86s
    f5329181f1364487b8a9-n0-2   0/1     Terminating   0          76s

with log

{"json":{"exec_id":"f5329181f1364487b8a9","node":"n0","ns":"flytesnacks-development","res_ver":"11272","routine":"worker-0","src":"plugin_manager.go:269","tasktype":"python-task","wf":"flytesnacks:development:hello-world.my_wf_no_outputs_5"},"level":"warning","msg":"Failed to find the Resource with name: flytesnacks-development/f5329181f1364487b8a9-n0-0. Error: bar.foo \"f5329181f1364487b8a9-n0-0\" not found","ts":"2024-01-16T14:28:07-06:00"}

and UI message
[3/3] currentAttempt done. Last Error: SYSTEM::resource not found, name [flytesnacks-development/f5329181f1364487b8a9-n0-3]. reason: bar.foo "f5329181f1364487b8a9-n0-3" not found

Screenshots

NA

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

NA

Docs link

NA

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jan 17, 2024
Copy link

codecov bot commented Jan 17, 2024

Codecov Report

Attention: 39 lines in your changes are missing coverage. Please review.

Comparison is base (0c8dc61) 58.20% compared to head (7085329) 58.19%.

Files Patch % Lines
flytepropeller/pkg/controller/executors/kube.go 0.00% 39 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4733      +/-   ##
==========================================
- Coverage   58.20%   58.19%   -0.02%     
==========================================
  Files         626      626              
  Lines       53800    53833      +33     
==========================================
+ Hits        31316    31327      +11     
- Misses      19976    20000      +24     
+ Partials     2508     2506       -2     
Flag Coverage Δ
unittests 58.19% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
@honnix
Copy link
Member

honnix commented Jan 17, 2024

Just to confirm my understanding. Does it mean this comment is not correct without this PR?

@hamersaw
Copy link
Contributor Author

Just to confirm my understanding. Does it mean this comment is not correct without this PR?

@honnix correct, previous to this merge it was and ti will be again after we get this PR merged. Right now, we only look at the informer cache and do not fall back to the api reader.

Copy link
Contributor

@kumare3 kumare3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

func NewFallbackClientBuilder(scope promutils.Scope) *FallbackClientBuilder {
return &FallbackClientBuilder{
scope: scope,
func (c fallbackClientReader) List(ctx context.Context, list client.ObjectList, opts ...client.ListOption) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember this code

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 17, 2024
@honnix
Copy link
Member

honnix commented Jan 17, 2024

Just to confirm my understanding. Does it mean this comment is not correct without this PR?

@honnix correct, previous to this merge it was and ti will be again after we get this PR merged. Right now, we only look at the informer cache and do not fall back to the api reader.

This comment successfully tricked me not to look further into client details, lol.

@hamersaw hamersaw merged commit ada7695 into master Jan 17, 2024
42 of 44 checks passed
@hamersaw hamersaw deleted the bug/fallback-kube-client branch January 17, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] flytepropeller fails trying to get pod resource using the kubeClient
3 participants