
google_container_cluster doesn't create default pool, fails after long health check #6842

Closed
kevinohara80 opened this issue Jul 23, 2020 · 5 comments

kevinohara80 commented Jul 23, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.12.24
+ provider.google v3.31.0
+ provider.google-beta v3.31.0
+ provider.random v2.3.0

Affected Resource(s)

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

locals {
  node_pool_oauth_scopes = [
    "https://www.googleapis.com/auth/compute",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/trace.append",
  ]
}

resource "google_container_cluster" "cluster" {
  provider           = google-beta
  name               = "primary-cluster"
  description        = "Primary k8s cluster for ${local.env}"
  location           = var.cluster_location
  project            = local.project_id
  min_master_version = var.k8s_min_version
  # logging_service    = "logging.googleapis.com/kubernetes"
  # monitoring_service = "monitoring.googleapis.com/kubernetes"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  network_policy {
    enabled = true
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  release_channel {
    channel = var.auto_release_k8s_master ? var.auto_release_channel : "UNSPECIFIED"
  }

  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.k8s_ale.self_link
  }

  cluster_autoscaling {
    enabled = var.cluster_autoscaling_enabled

    resource_limits {
      resource_type = "cpu"
      minimum       = 1
      maximum       = 1
    }

    resource_limits {
      resource_type = "memory"
      minimum       = 1
      maximum       = 1
    }
  }

  vertical_pod_autoscaling {
    enabled = var.vertical_pod_autoscaling_enabled
  }

  workload_identity_config {
    identity_namespace = "${local.project_id}.svc.id.goog"
  }

  node_config {
    oauth_scopes = local.node_pool_oauth_scopes
  }

}

resource "google_container_node_pool" "main_pool" {
  provider              = google-beta
  count              = var.node_pool_count
  name_prefix        = "node-pool-${count.index}-"
  project            = local.project_id
  location           = var.cluster_location
  cluster                    = google_container_cluster.cluster.name
  initial_node_count = var.node_pool_node_count
  node_count         = var.node_pool_autoscaling_enabled ? null : var.node_pool_node_count

  dynamic "autoscaling" {
    for_each = var.node_pool_autoscaling_enabled ? [1] : []
    content {
      min_node_count = var.node_pool_autoscaling_min
      max_node_count = var.node_pool_autoscaling_max
    }
  }

  management {
    auto_repair  = var.auto_repair_nodes
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }

  node_config {
    # preemptible = true // this should be false in prod
    disk_size_gb    = 100
    disk_type       = "pd-standard"
    local_ssd_count = 0
    machine_type    = var.node_pool_instance_type

    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = local.node_pool_oauth_scopes

    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Debug Output

https://gist.github.com/kevinohara80/b2a771721d2a476c78f83a181328e12a

Panic Output

Expected Behavior

The cluster should have been created, as this was a working configuration previously. I believe the last time it was run was on provider version 3.6 or so, possibly a later version than that.

Actual Behavior

Cluster creation hangs at the "Health Checks" step, and it looks as if the default node pool never gets created. It hangs for about 20 minutes before finally failing with the following message:

Error: Error waiting for creating GKE cluster: All cluster resources were brought up, 
but: only 0 nodes out of 1 have registered; this is likely due to Nodes failing to start 
correctly; try re-creating the cluster or contact support if that doesn't work.

The GKE UI shows the default node pool being created the entire time, but the nodes never spin up.

(Screenshot attached: Google_Cloud_Platform)

Steps to Reproduce

  1. terraform apply

Important Factoids

References

@ghost added the bug label Jul 23, 2020
@edwardmedia self-assigned this Jul 23, 2020
@edwardmedia (Contributor)

@kevinohara80 can you share your debug log? Thanks

@kevinohara80 (Author)

@edwardmedia Added the gist URL to the original issue body.


imrannayer commented Jul 27, 2020

I had the same issue. It all started when the default version for new clusters on the stable release channel changed to 1.15.12-gke.2 (previously 1.14.10-gke.36) on July 22.
https://cloud.google.com/kubernetes-engine/docs/release-notes#july_22_2020_r24

As a workaround, I solved the issue by removing the release channel and setting min_master_version to 1.14.10-gke.42 or 1.14.10-gke.46.
It is not just the default node pool; this happens for any node pool in the cluster.
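
For reference, a minimal sketch of that workaround applied to the cluster resource from the issue body (which of the two 1.14.10 patch releases to pin is up to you; everything else stays as posted above):

resource "google_container_cluster" "cluster" {
  provider = google-beta
  name     = "primary-cluster"
  location = var.cluster_location
  project  = local.project_id

  # Workaround: pin the master to a known-good 1.14 patch release instead of
  # letting the stable release channel pick up 1.15.12-gke.2.
  min_master_version = "1.14.10-gke.42" # or "1.14.10-gke.46"

  # release_channel block removed, per the workaround above.

  remove_default_node_pool = true
  initial_node_count       = 1

  # ... remaining blocks unchanged ...
}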

@rileykarson (Collaborator)

I'm not sure there's much we can do in Terraform to fix this, unfortunately. We're sending correct requests to the API, but the GCE instances GKE provisions are failing to register themselves correctly as nodes. As @imrannayer's experience shows, it appears to be tied to the GKE K8S version and not the Terraform provider version. I've experienced the same myself intermittently, and it's generally resolved by waiting or changing GKE versions.

I'm going to close this issue since there's nothing we can do provider-side. However, if any of you are consistently experiencing this issue and downgrading your provider works, please share that here!
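
(If anyone does want to test the provider-downgrade theory, in a Terraform 0.12 configuration that would look roughly like the sketch below; the pinned version is only an example of a release earlier than the v3.31.0 reported in this issue.)

provider "google-beta" {
  # Example only: pin an earlier google-beta release to compare behavior.
  version = "~> 3.30.0"
}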


ghost commented Aug 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost locked and limited conversation to collaborators Aug 28, 2020
@github-actions bot added the service/container and forward/review labels Jan 14, 2025