Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: update v1 migration guide #7220

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

jmdeal
Copy link
Contributor

@jmdeal jmdeal commented Oct 15, 2024

Fixes #N/A

Description
Performs various updates to the v1 migration guide. Highlights include:

  • Reordering sections to better indicate importance of information
  • Attaching rather than replacing the IAM policy during upgrade
  • Added clarification and examples for several steps
  • General formatting changes

How was this change tested?
make website

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jmdeal jmdeal requested a review from a team as a code owner October 15, 2024 16:18
Copy link

netlify bot commented Oct 15, 2024

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 2103343
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/673e60ef38beae0008027b51
😎 Deploy Preview https://deploy-preview-7220--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@jmdeal jmdeal force-pushed the docs/update-migration-guide branch 5 times, most recently from bfc923a to 3f642f6 Compare October 15, 2024 20:19
{{% /alert %}}

Below is the full changelog for v1, copied from the [v1 Migration Upgrade Procedure]({{<ref "v1-migration#upgrade-procedure" >}}).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this to prevent further instances of drift between the two sources. I don't think there's a ton of value in having this duplicated in the first place. Alternatively, we could create a separate file and embed it in both docs.

@jmdeal jmdeal force-pushed the docs/update-migration-guide branch 2 times, most recently from 67e6e96 to 3217409 Compare October 15, 2024 20:26
@coveralls
Copy link

coveralls commented Oct 15, 2024

Pull Request Test Coverage Report for Build 11942748277

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 82.449%

Totals Coverage Status
Change from base Build 11924046371: 0.0%
Covered Lines: 5684
Relevant Lines: 6894

💛 - Coveralls


### Upgrading to `0.37.0`+

{{% alert title="Warning" color="warning" %}}
`0.33.0`+ _only_ supports Karpenter v1beta1 APIs and will not work with existing Provisioner, AWSNodeTemplate or Machine alpha APIs. Do not upgrade to `0.37.0`+ without first [upgrading to `0.32.x`]({{<ref "#upgrading-to-0320" >}}). This version supports both the alpha and beta APIs, allowing you to migrate all of your existing APIs to beta APIs without experiencing downtime.
{{% /alert %}}

{{% alert title="Note" color="primary" %}}
Webhooks have been re-enabled by default starting in `0.37.3` to faciliate migration to `v1.0`.
If your cluster has network policies that blocks Ingress, ports 8000, 8001, 8081, 8443 will need to be allowlisted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If your cluster has network policies that blocks Ingress, ports 8000, 8001, 8081, 8443 will need to be allowlisted.
If your cluster has network policies that block Ingress then ports `8000`, `8001`, `8081`, and `8443` will need to be allowlisted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably also include the bit about security groups here.

website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved
website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved

Once the Karpenter CRDs are upgraded to v1, conversion webhooks are needed to help convert APIs that are stored in etcd from v1 to v1beta1. Also changes to the CRDs will need to at least include the latest version of the CRD in this case being v1. The patch versions of the v1beta1 Karpenter controller that include the conversion wehooks include:
#### General Migration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a note about checking CRD status stored versions? IIRC the update will fail if the migration controller hasn't finished its work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence on rather this should go in the v1 upgrade guide, or just include it in v1.1. There's nothing actionable to do until you're upgrading to v1.1. I did leave a TODO in the doc on this, but I was thinking it could be decoupled from this PR.

Copy link
Contributor

@engedaam engedaam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!!

If your cluster has network policies that blocks Ingress, ports 8000, 8001, 8081, 8443 will need to be allowlisted.
You may still choose to disable webhooks through the helm chart.
{{% /alert %}}

* Karpenter no longer supports using the `karpenter.sh/provisioner-name` label in NodePool labels and requirements or in application node selectors, affinities, or topologySpreadConstraints. If you were previously using this label to target applications to specific Provisioners, you should update your applications to use the `karpenter.sh/nodepool` label instead before upgrading. If you upgrade without changing these labels, you may begin to see pod scheduling failures for these applications.
* Karpenter now tags `spot-instances-request` with the same tags that it tags instances, volumes, and primary ENIs. This means that you will now need to add `ec2:CreateTags` permission for `spot-instances-request`. You can also further scope your controller policy for the `ec2:RunInstances` action to require that it launches the `spot-instances-request` with these specific tags. You can view an example of scoping these actions in the [Getting Started Guide's default CloudFormation controller policy](https://github.com/aws/karpenter/blob/v0.33.0/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml#L61).
* We now recommend that you set the installation namespace for your Karpenter controllers to `kube-system` to denote Karpenter as a critical cluster component. This ensures that requests from the Karpenter controllers are treated with higher priority by assigning them to a different [PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#prioritylevelconfiguration) than generic requests from other namespaces. For more details on API Priority and Fairness, read the [Kubernetes API Priority and Fairness Conceptual Docs](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/). Note: Changing the namespace for your Karpenter release will cause the service account namespace to change. If you are using IRSA for authentication with AWS, you will need to change scoping set in the controller's trust policy from `karpenter:karpenter` to `kube-system:karpenter`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should adjust the wording to point bellow. I think customers can get confused here

website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved
website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved
website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved
website/content/en/docs/upgrading/v1-migration.md Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to do this? iiuc, cfn templating strings are pretty valuable, and this makes it so that users have to edit it themselves after the fact to get the templating again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used in the migration guide for attaching a role without updating the cloudformation stack. I went with this approach for a couple of reasons:

  • It gives users not using CFN a method to extract the policy for a given version
  • The cloudformation stack that has the NodeRole (which we're attaching this policy to) is managed by eksctl, it's not the cloudformation stack with the policy
  • The policy only needs to be attached to the role for a short duration, it's removed from the role in the end of the migration guide

Values are still templated using environment variables, they're just substituted in the shell rather than by cloudformation. This doesn't impact anywhere else we use CFN, it's just for applying the policy in the migration guide.

An alternative would be staticlly providing a policy file for each version we support upgrade through. This has the risk of introducing drift between our latest policies over time without additional release automation.


### Upgrading to `0.37.0`+

{{% alert title="Warning" color="warning" %}}
`0.33.0`+ _only_ supports Karpenter v1beta1 APIs and will not work with existing Provisioner, AWSNodeTemplate or Machine alpha APIs. Do not upgrade to `0.37.0`+ without first [upgrading to `0.32.x`]({{<ref "#upgrading-to-0320" >}}). This version supports both the alpha and beta APIs, allowing you to migrate all of your existing APIs to beta APIs without experiencing downtime.
{{% /alert %}}

{{% alert title="Note" color="primary" %}}
Webhooks have been re-enabled by default starting in `0.37.3` to faciliate migration to `v1.0`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what versions were they disabled by default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v0.33+. I don't think we need to add anything additional there since it's still called out in the v0.33 section of the upgrade guide.

* Karpenter no longer supports using the `karpenter.sh/provisioner-name` label in NodePool labels and requirements or in application node selectors, affinities, or topologySpreadConstraints. If you were previously using this label to target applications to specific Provisioners, you should update your applications to use the `karpenter.sh/nodepool` label instead before upgrading. If you upgrade without changing these labels, you may begin to see pod scheduling failures for these applications.
* Karpenter now tags `spot-instances-request` with the same tags that it tags instances, volumes, and primary ENIs. This means that you will now need to add `ec2:CreateTags` permission for `spot-instances-request`. You can also further scope your controller policy for the `ec2:RunInstances` action to require that it launches the `spot-instances-request` with these specific tags. You can view an example of scoping these actions in the [Getting Started Guide's default CloudFormation controller policy](https://github.com/aws/karpenter/blob/v0.33.0/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml#L61).
* We now recommend that you set the installation namespace for your Karpenter controllers to `kube-system` to denote Karpenter as a critical cluster component. This ensures that requests from the Karpenter controllers are treated with higher priority by assigning them to a different [PriorityLevelConfiguration](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/#prioritylevelconfiguration) than generic requests from other namespaces. For more details on API Priority and Fairness, read the [Kubernetes API Priority and Fairness Conceptual Docs](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/). Note: Changing the namespace for your Karpenter release will cause the service account namespace to change. If you are using IRSA for authentication with AWS, you will need to change scoping set in the controller's trust policy from `karpenter:karpenter` to `kube-system:karpenter`.
* `0.33.0` disables mutating and validating webhooks by default in favor of using [Common Expression Language for CRD validation](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation). The Common Expression Language Validation Feature [is enabled by default on EKS 1.25](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules). If you are using Kubernetes version >= 1.25, no further action is required. If you are using a Kubernetes version below 1.25, you now need to set `DISABLE_WEBHOOK=false` in your container environment variables or `--set webhook.enabled=true` if using Helm. View the [Webhook Support Deprecated in Favor of CEL Section of the v1beta1 Migration Guide]({{<ref "../../v0.32/upgrading/v1beta1-migration#webhook-support-deprecated-in-favor-of-cel" >}}).
* ~~`0.33.0` disables mutating and validating webhooks by default in favor of using [Common Expression Language for CRD validation](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation). The Common Expression Language Validation Feature [is enabled by default on EKS 1.25](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#validation-rules). If you are using Kubernetes version >= 1.25, no further action is required. If you are using a Kubernetes version below 1.25, you now need to set `DISABLE_WEBHOOK=false` in your container environment variables or `--set webhook.enabled=true` if using Helm. View the [Webhook Support Deprecated in Favor of CEL Section of the v1beta1 Migration Guide]({{<ref "../../v0.32/upgrading/v1beta1-migration#webhook-support-deprecated-in-favor-of-cel" >}}).~~
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does ~~ do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strikethrough, I didn't want to remove the entry but this indicates it's no longer relevant.

Karpenter `1.0.0` introduces v1 APIs, including _significant_ changes to the API and upgrade procedures for the Karpenter controllers. **Do not** upgrade to `1.0.0`+ without referencing the [v1 Migration Upgrade Procedure]({{<ref "v1-migration#upgrade-procedure" >}}).

This version adds [conversion webhooks](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) to automatically pull the v1 API version of previously applied v1beta1 NodePools, EC2NodeClasses, and NodeClaims. Karpenter will stop serving the v1beta1 API version at v1.1.0 and will drop the conversion webhooks at that time. Migrate all stored manifests to v1 API versions on Karpenter v1.0+.
Karpenter `1.0.0` introduces the `v1` APIs and uses [conversion webhooks](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) to support existing `v1beta1` APIs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider the following

Suggested change
Karpenter `1.0.0` introduces the `v1` APIs and uses [conversion webhooks](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) to support existing `v1beta1` APIs.
Karpenter `1.0.0` introduces the `v1` APIs and uses [conversion webhooks](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#webhook-conversion) to support automatic in-place conversion from existing `v1beta1` APIs to `v1` APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've moved away from that language since it seems to have led to confusion, e.g. "why can I still see my v1beta1 resources, weren't they converted"? Since the storage version in ETCD is really an implementation detail I'm thinking framing the webhooks as just a way to support both versions is more clear.

@@ -176,233 +291,339 @@ kubectl patch customresourcedefinitions ec2nodeclasses.karpenter.k8s.aws -p "{\"
--set controller.resources.limits.memory=1Gi \
--wait
```
{{% alert title="Note" color="primary" %}}
<!-- Note: don't indent this line to match the indenting of the alert box. Hugo will create a code block. -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol this doesn't show up in the website, right? is it that there's no new line in between the last code block and this that this happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely doesn't show up it's a comment. IIUC a 4 space indent will get you a code block. I can give that a shot, but I think I had tried that.

Copy link
Contributor

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@jmdeal
Copy link
Contributor Author

jmdeal commented Nov 15, 2024

/remove-lifecycle stale

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants