
[Q3] Add retry logic to rancher2_cluster update #1159

Merged: 2 commits merged into rancher:master from fix-tf-cluster-upgrade on Aug 7, 2023

Conversation


@a-blender a-blender commented Jun 27, 2023

Issue: #1040

Problem

Upgrading an RKE cluster with Terraform 1.25 fails with:

Error: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=Operation cannot be fulfilled on clusters.management.cattle.io "c-mhjrs": the object has been modified; please apply your changes to the latest version and try again, baseType=error]

In this case, the error indicates an intermittent race condition on update of the rancher2_cluster resource.

Solution

Add retry logic to the rancher2_cluster resource's Update function. I chose resource.Retry over context.WithTimeout (used elsewhere in this codebase) because the cluster update needs to be retried until either a non-retryable error is returned or the timeout is reached. WithTimeout bounds a single HTTP request, so a failed response, say a 500, simply cancels the operation; we don't want the request cancelled if it fails. We want to retry several times, especially on upgrade, since customers tend to get pretty frustrated if they have to run terraform apply more than once.

I've also moved updateClusterMonitoring into a helper function to improve the readability of Update, and updated comments.

Testing

Engineering Testing

Manual Testing

The race condition in the original issue is difficult to reproduce, but team discussion concluded that adding retry logic to rancher2_cluster Update will fix the failure on this server error.

Test steps

  • Provision an RKE cluster on EC2 nodes with 1 control plane / 1 etcd / 1 worker
  • Verify the cluster provisions after one terraform apply with no HTTP 500 error
  • Update the RKE cluster to increase the worker count to 3 and add a fourth node pool
  • Verify the cluster updates after one terraform apply with no HTTP 500 error
  • Update the RKE cluster to set enable_cluster_monitoring = true (preferably while the previous update is still running, to simulate the race condition)
  • Verify the cluster updates after one terraform apply (it took 27m for me!) with no HTTP 500 error
main.tf
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "1.0.0"
    }
  }
}
provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}
data "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
}
resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
}
resource "rancher2_node_template" "rancher2_node_template" {
  name = var.rke_node_template_name
  amazonec2_config {
    access_key     = var.aws_access_key
    secret_key     = var.aws_secret_key
    region         = var.aws_region
    ami            = var.aws_ami
    security_group = [var.aws_security_group_name]
    subnet_id      = var.aws_subnet_id
    vpc_id         = var.aws_vpc_id
    zone           = var.aws_zone_letter
    root_size      = var.aws_root_size
    instance_type  = var.aws_instance_type
  }
}
resource "rancher2_node_pool" "pool1" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool1"
  hostname_prefix  = "tf-pool1-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true 
  worker           = false 
}
resource "rancher2_node_pool" "pool2" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool2"
  hostname_prefix  = "tf-pool2-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = true
  etcd             = false 
  worker           = false 
}
resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = false 
  worker           = true 
}
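For completeness, the main.tf above assumes variable declarations along these lines. This is a hypothetical variables.tf sketch: the names are taken from the configuration above, but the types and the default are illustrative, not from the PR.

```hcl
# variables.tf -- illustrative declarations for the variables
# referenced in main.tf; adjust types and defaults to your environment.
variable "rancher_api_url" { type = string }
variable "rancher_admin_bearer_token" {
  type      = string
  sensitive = true
}
variable "cloud_credential_name" { type = string }
variable "rke_cluster_name" { type = string }
variable "rke_network_plugin" {
  type    = string
  default = "canal" # assumed default for illustration
}
variable "rke_node_template_name" { type = string }
variable "aws_access_key" {
  type      = string
  sensitive = true
}
variable "aws_secret_key" {
  type      = string
  sensitive = true
}
variable "aws_region" { type = string }
variable "aws_ami" { type = string }
variable "aws_security_group_name" { type = string }
variable "aws_subnet_id" { type = string }
variable "aws_vpc_id" { type = string }
variable "aws_zone_letter" { type = string }
variable "aws_root_size" { type = number }
variable "aws_instance_type" { type = string }
```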
main.tf (sections with cluster updates)
resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
  enable_cluster_monitoring = true
}
...
resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 3
  control_plane    = false
  etcd             = false 
  worker           = true 
}
resource "rancher2_node_pool" "pool4" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool4"
  hostname_prefix  = "tf-pool4-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true
  worker           = true
}

Automated Testing

QA Testing Considerations

Regression Considerations

@a-blender added this to the 2023-Q3-v2.7x - Terraform milestone Jun 27, 2023
@a-blender force-pushed the fix-tf-cluster-upgrade branch 2 times, most recently from 93e6ab0 to 1f1cf71 June 28, 2023 20:44
@a-blender requested review from jakefhyde, HarrisonWAffel and a team June 28, 2023 20:51
@a-blender marked this pull request as ready for review June 28, 2023 20:51

@HarrisonWAffel left a comment


One small nit about not reusing a client, but otherwise lgtm

Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@HarrisonWAffel requested a review from a team June 28, 2023 21:19
@a-blender force-pushed the fix-tf-cluster-upgrade branch from 1f1cf71 to 27a6ccf June 29, 2023 00:08
@snasovich removed this from the 2023-Q3-v2.7x - Terraform milestone Jun 30, 2023
@a-blender changed the title Add retry logic to rancher2_cluster update [DNM Q3] Add retry logic to rancher2_cluster update Jun 30, 2023
@a-blender changed the title [DNM Q3] Add retry logic to rancher2_cluster update [Q3] Add retry logic to rancher2_cluster update Jul 31, 2023
@a-blender removed the request for review from jakefhyde August 3, 2023 20:40
Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@a-blender requested a review from jiaqiluo August 4, 2023 20:53
Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@a-blender force-pushed the fix-tf-cluster-upgrade branch from 7261652 to e4cd06d August 7, 2023 14:22
@a-blender requested a review from jiaqiluo August 7, 2023 14:22
@a-blender merged commit 187a69c into rancher:master Aug 7, 2023