Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enable drift detection + takeover experience in work applier [full DRAFT] #950

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

michaelawyu
Copy link
Contributor

Description of your changes

This PR includes a new implementation of the work applier that enables the drift detection + takeover experience.

Important:

Please note that this PR is submitted as a draft to provide a preview of the whole implementation. Though the flow has been fully implemented, there might still be incoming changes. The current focus is to provide a fully functional preview for interested parties; additional tests are currently being developed. When the code is ready, it will be submitted in smaller chunks. To make reviewing (and PR submissions) easier, the implementation is organized as a standalone controller that can replace the current work applier, rather than directly patching the existing code, even though codes are being reused as appropriate.

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

N/A

Special notes for your reviewer

@michaelawyu michaelawyu marked this pull request as draft November 11, 2024 17:12
@michaelawyu michaelawyu changed the title feat: enable drift detection + takeover experience in work applier [DRAFT] feat: enable drift detection + takeover experience in work applier [full DRAFT] Nov 11, 2024

// If the Work object is not yet available, reconcile again in 5 seconds.
if !isWorkObjectAvailable(work) {
return ctrl.Result{RequeueAfter: availabilityCheckRequeueAfter}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this interval be tuned based on the type of resources or the progress?

identifier.Kind = manifestObj.GetKind()
identifier.Name = manifestObj.GetName()
identifier.Namespace = manifestObj.GetNamespace()
identifier.GenerateName = manifestObj.GetGenerateName()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? AFAIK, this field and name are mutually exclusive

//
// This check is done on the Work object scope, and is primarily added to address the case
// where duplicate objects might appear in a Fleet resource envelope and lead to unexpected
// behaviors. Duplication is a non-issue without Fleet resource envelopes, as the Fleet hub
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This statement is not true since we apply override policies.

checked[wriStr] = true

// Prepare the manifest conditions for the write-ahead process.
manifestCondForWA := prepareManifestCondForWA(wriStr, bundle.id, work.Generation, existingManifestCondQIdx, work.Status.ManifestConditions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "WA" stand for?

bundle.availabilityResTyp = ManifestProcessingAvailabilityResultTypeFailed
klog.ErrorS(err,
"Failed to track the availability of the applied object in the member cluster",
"work", *workRef, "GVR", *bundle.gvr, "inMemberClusterObj", klog.KObj(bundle.inMemberClusterObj))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why pass workRef as a pointer and then get the value of it?

) {
// Firstly, attempt to find if an object has been created in the member cluster based on the manifest object.
inMemberClusterObj, err := r.findInMemberClusterObjectFor(ctx,
bundle.gvr, bundle.manifestObj, bundle.manifestCondIdx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is bundle.manifestCondIdx set? It's not set in preProcessManifests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably should have set it prepareManifestProcessingBundles?

Comment on lines +123 to +142
inMemberClusterObjCopyJSONBytes, err := inMemberClusterObjCopy.MarshalJSON()
if err != nil {
return nil, fmt.Errorf("failed to marshal the object from the member cluster into JSON: %w", err)
}
appliedObjJSONBytes, err := appliedObjCopy.MarshalJSON()
if err != nil {
return nil, fmt.Errorf("failed to marshal the object returned by the dry-run apply op into JSON: %w", err)
}

// Compare the JSON representations.
patch, err := jsondiff.CompareJSON(appliedObjJSONBytes, inMemberClusterObjCopyJSONBytes)
if err != nil {
return nil, fmt.Errorf("failed to compare the JSON representations of the the object from the member cluster and the object returned by the dry-run apply op: %w", err)
}

details, err := organizeJSONPatchIntoFleetPatchDetails(patch, appliedObjCopy.Object, inMemberClusterObj.Object)
if err != nil {
return nil, fmt.Errorf("failed to organize the JSON patch into Fleet patch details: %w", err)
}
return details, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse this for full comparision?

klog.ErrorS(err, "Failed to decode the manifest", "ordinal", pieces, "work", klog.KObj(work))
bundle.applyErr = fmt.Errorf("failed to decode manifest: %w", err)
bundle.applyResTyp = ManifestProcessingApplyResultTypeDecodingErred
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't return any error back to the caller, thus there are bundles that have no gvr/ManifestObj. However, those are used extensively as pointers in the rest part of the controller logic. It seems that this can lead to Nullptr panic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that there are checks like "bundle.applyErr != nil" on some places but I am not sure if it covers all cases. Maybe we can add a step after preProcessManifests to remove those from the bundle?

// * Fleet reports that the manifest has been applied before (i.e., inMemberClusterObj exists); and
// * The apply strategy dictates that an apply op should only run if there is no
// detected drift; and
// * The hash of the manifest object is consistently with that bookkept in the live object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// * The hash of the manifest object is consistently with that bookkept in the live object
// * The hash of the manifest object is not consistent with that book keeping annotation in the live object

}

inMemberClusterObjLastAppliedManifestObjHash := inMemberClusterObj.GetAnnotations()[fleetv1beta1.ManifestHashAnnotation]
return manifestObjHash == inMemberClusterObjLastAppliedManifestObjHash, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why check if the manifest matches?

@@ -0,0 +1,490 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make sure that the previous test and integration test still works

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants