
[Q3] Add retry logic to rancher2_cluster update #1159

Merged: 2 commits merged into rancher:master from fix-tf-cluster-upgrade on Aug 7, 2023

Conversation


@a-blender a-blender commented Jun 27, 2023

Issue: #1040

Problem

Upgrading an RKE cluster with Terraform 1.25 fails with:

Error: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=ServerError, message=Operation cannot be fulfilled on clusters.management.cattle.io "c-mhjrs": the object has been modified; please apply your changes to the latest version and try again, baseType=error]

In this case, the error indicates an intermittent race condition on update of the rancher2_cluster resource.

Solution

Add retry logic to the rancher2_cluster resource's Update function. I chose resource.Retry over context.WithTimeout (used elsewhere in this codebase) because the cluster update needs to be retried until either a non-retryable error is returned or the timeout is reached. WithTimeout bounds a single HTTP request, so a failed response, say a 500, simply cancels the operation; we don't want the request cancelled if it fails. We want to retry several times, especially on upgrade, since customers tend to get pretty frustrated if they have to run terraform apply more than once.

I've also moved updateClusterMonitoring into a helper function to improve the readability of Update, and updated comments.

Testing

Engineering Testing

Manual Testing

The race condition in the original issue is difficult to reproduce, but team discussion concluded that adding retry logic to rancher2_cluster Update will fix the failure on this server error.

Test steps

  • Provision an RKE cluster on EC2 nodes with 1 control plane / 1 etcd / 1 worker
  • Verify the cluster provisions after one terraform apply with no HTTP 500 error
  • Update the RKE cluster to increase the worker count to 3 and add a fourth node pool
  • Verify the cluster updates after one terraform apply with no HTTP 500 error
  • Update the RKE cluster to set enable_cluster_monitoring = true (preferably while the previous update is still running, to simulate the race condition)
  • Verify the cluster updates after one terraform apply (it took 27m for me!) with no HTTP 500 error
main.tf
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "1.0.0"
    }
  }
}
provider "rancher2" {
  api_url   = var.rancher_api_url
  token_key = var.rancher_admin_bearer_token
  insecure  = true
}
data "rancher2_cloud_credential" "rancher2_cloud_credential" {
  name = var.cloud_credential_name
}
resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
}
resource "rancher2_node_template" "rancher2_node_template" {
  name = var.rke_node_template_name
  amazonec2_config {
    access_key     = var.aws_access_key
    secret_key     = var.aws_secret_key
    region         = var.aws_region
    ami            = var.aws_ami
    security_group = [var.aws_security_group_name]
    subnet_id      = var.aws_subnet_id
    vpc_id         = var.aws_vpc_id
    zone           = var.aws_zone_letter
    root_size      = var.aws_root_size
    instance_type  = var.aws_instance_type
  }
}
resource "rancher2_node_pool" "pool1" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool1"
  hostname_prefix  = "tf-pool1-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true 
  worker           = false 
}
resource "rancher2_node_pool" "pool2" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool2"
  hostname_prefix  = "tf-pool2-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = true
  etcd             = false 
  worker           = false 
}
resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = false 
  worker           = true 
}
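For completeness, the main.tf above assumes variable declarations along these lines. This is a hypothetical variables.tf sketch: the names are taken from the configuration above, but the types and the default are illustrative, not from the PR.

```hcl
# variables.tf -- illustrative declarations for the variables
# referenced in main.tf; adjust types and defaults to your environment.
variable "rancher_api_url" { type = string }
variable "rancher_admin_bearer_token" {
  type      = string
  sensitive = true
}
variable "cloud_credential_name" { type = string }
variable "rke_cluster_name" { type = string }
variable "rke_network_plugin" {
  type    = string
  default = "canal" # assumed default for illustration
}
variable "rke_node_template_name" { type = string }
variable "aws_access_key" {
  type      = string
  sensitive = true
}
variable "aws_secret_key" {
  type      = string
  sensitive = true
}
variable "aws_region" { type = string }
variable "aws_ami" { type = string }
variable "aws_security_group_name" { type = string }
variable "aws_subnet_id" { type = string }
variable "aws_vpc_id" { type = string }
variable "aws_zone_letter" { type = string }
variable "aws_root_size" { type = number }
variable "aws_instance_type" { type = string }
```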
main.tf (sections with cluster updates)
resource "rancher2_cluster" "rancher2_cluster" {
  name = var.rke_cluster_name
  rke_config {
    kubernetes_version = "v1.26.4-rancher2-1"
    network {
      plugin = var.rke_network_plugin
    }
  }
  enable_cluster_monitoring = true
}
...
resource "rancher2_node_pool" "pool3" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool3"
  hostname_prefix  = "tf-pool3-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 3
  control_plane    = false
  etcd             = false 
  worker           = true 
}
resource "rancher2_node_pool" "pool4" {
  cluster_id       = rancher2_cluster.rancher2_cluster.id
  name             = "pool4"
  hostname_prefix  = "tf-pool4-"
  node_template_id = rancher2_node_template.rancher2_node_template.id
  quantity         = 1
  control_plane    = false
  etcd             = true
  worker           = true
}

Automated Testing

QA Testing Considerations

Regression Considerations

@a-blender added this to the 2023-Q3-v2.7x - Terraform milestone Jun 27, 2023
@a-blender force-pushed the fix-tf-cluster-upgrade branch 2 times, most recently from 93e6ab0 to 1f1cf71 June 28, 2023 20:44
@a-blender requested review from jakefhyde, HarrisonWAffel and a team June 28, 2023 20:51
@a-blender marked this pull request as ready for review June 28, 2023 20:51

@HarrisonWAffel left a comment


One small nit about not reusing a client, but otherwise lgtm

Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@HarrisonWAffel requested a review from a team June 28, 2023 21:19
@a-blender force-pushed the fix-tf-cluster-upgrade branch from 1f1cf71 to 27a6ccf June 29, 2023 00:08
@snasovich removed this from the 2023-Q3-v2.7x - Terraform milestone Jun 30, 2023
@a-blender changed the title Add retry logic to rancher2_cluster update [DNM Q3] Add retry logic to rancher2_cluster update Jun 30, 2023
@a-blender changed the title [DNM Q3] Add retry logic to rancher2_cluster update [Q3] Add retry logic to rancher2_cluster update Jul 31, 2023
@a-blender removed the request for review from jakefhyde August 3, 2023 20:40
Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@a-blender requested a review from jiaqiluo August 4, 2023 20:53
Review thread on rancher2/resource_rancher2_cluster.go (outdated, resolved)
@a-blender force-pushed the fix-tf-cluster-upgrade branch from 7261652 to e4cd06d August 7, 2023 14:22
@a-blender requested a review from jiaqiluo August 7, 2023 14:22
@a-blender merged commit 187a69c into rancher:master Aug 7, 2023