
google_container_cluster doesn't create default pool, fails after long health check #6842

Closed
kevinohara80 opened this issue Jul 23, 2020 · 5 comments

kevinohara80 commented Jul 23, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.12.24
+ provider.google v3.31.0
+ provider.google-beta v3.31.0
+ provider.random v2.3.0

Affected Resource(s)

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files

locals {
  node_pool_oauth_scopes = [
    "https://www.googleapis.com/auth/compute",
    "https://www.googleapis.com/auth/devstorage.read_only",
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
    "https://www.googleapis.com/auth/service.management.readonly",
    "https://www.googleapis.com/auth/servicecontrol",
    "https://www.googleapis.com/auth/trace.append",
  ]
}

resource "google_container_cluster" "cluster" {
  provider           = google-beta
  name               = "primary-cluster"
  description        = "Primary k8s cluster for ${local.env}"
  location           = var.cluster_location
  project            = local.project_id
  min_master_version = var.k8s_min_version
  # logging_service    = "logging.googleapis.com/kubernetes"
  # monitoring_service = "monitoring.googleapis.com/kubernetes"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  network_policy {
    enabled = true
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  release_channel {
    channel = var.auto_release_k8s_master ? var.auto_release_channel : "UNSPECIFIED"
  }

  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.k8s_ale.self_link
  }

  cluster_autoscaling {
    enabled = var.cluster_autoscaling_enabled

    resource_limits {
      resource_type = "cpu"
      minimum       = 1
      maximum       = 1
    }

    resource_limits {
      resource_type = "memory"
      minimum       = 1
      maximum       = 1
    }
  }

  vertical_pod_autoscaling {
    enabled = var.vertical_pod_autoscaling_enabled
  }

  workload_identity_config {
    identity_namespace = "${local.project_id}.svc.id.goog"
  }

  node_config {
    oauth_scopes = local.node_pool_oauth_scopes
  }

}

resource "google_container_node_pool" "main_pool" {
  provider              = google-beta
  count              = var.node_pool_count
  name_prefix        = "node-pool-${count.index}-"
  project            = local.project_id
  location           = var.cluster_location
  cluster                    = google_container_cluster.cluster.name
  initial_node_count = var.node_pool_node_count
  node_count         = var.node_pool_autoscaling_enabled ? null : var.node_pool_node_count

  dynamic "autoscaling" {
    for_each = var.node_pool_autoscaling_enabled ? [1] : []
    content {
      min_node_count = var.node_pool_autoscaling_min
      max_node_count = var.node_pool_autoscaling_max
    }
  }

  management {
    auto_repair  = var.auto_repair_nodes
    auto_upgrade = true
  }

  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }

  node_config {
    # preemptible = true // this should be false in prod
    disk_size_gb    = 100
    disk_type       = "pd-standard"
    local_ssd_count = 0
    machine_type    = var.node_pool_instance_type

    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = local.node_pool_oauth_scopes

    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

Debug Output

https://gist.github.com/kevinohara80/b2a771721d2a476c78f83a181328e12a

Panic Output

Expected Behavior

The cluster should have been created, as this was a working configuration previously. I believe the last time it was run was on provider version 3.6 or so, possibly a later version than that.

Actual Behavior

Cluster creation hangs at the "Health Checks" step, and it looks as if the default node pool never gets created. It hangs for about 20 minutes before finally failing with the following message:

Error: Error waiting for creating GKE cluster: All cluster resources were brought up, 
but: only 0 nodes out of 1 have registered; this is likely due to Nodes failing to start 
correctly; try re-creating the cluster or contact support if that doesn't work.

The GKE UI shows the default node pool being created the entire time, but the nodes never spin up.

(Screenshot attached: Google_Cloud_Platform)

Steps to Reproduce

  1. terraform apply

Important Factoids

References

@ghost added the bug label Jul 23, 2020
@edwardmedia self-assigned this Jul 23, 2020
@edwardmedia (Contributor)

@kevinohara80 can you share your debug log? Thanks

@kevinohara80 (Author)

@edwardmedia Added the gist URL to the original issue body.


imrannayer commented Jul 27, 2020

I had the same issue. It all started when the default version for new clusters on the stable release channel changed to 1.15.12-gke.2 (previously 1.14.10-gke.36) on July 22.
https://cloud.google.com/kubernetes-engine/docs/release-notes#july_22_2020_r24

As a workaround, I solved the issue by removing the release channel and setting min_master_version to 1.14.10-gke.42 or 1.14.10-gke.46.
It is not just the default node pool; this happens for any node pool in the cluster.
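
For reference, a minimal sketch of that workaround applied to the cluster resource from the issue body (which of the two 1.14.10 patch releases to pin is up to you; everything else stays as posted above):

resource "google_container_cluster" "cluster" {
  provider = google-beta
  name     = "primary-cluster"
  location = var.cluster_location
  project  = local.project_id

  # Workaround: pin the master to a known-good 1.14 patch release instead of
  # letting the stable release channel pick up 1.15.12-gke.2.
  min_master_version = "1.14.10-gke.42" # or "1.14.10-gke.46"

  # release_channel block removed, per the workaround above.

  remove_default_node_pool = true
  initial_node_count       = 1

  # ... remaining blocks unchanged ...
}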

@rileykarson (Collaborator)

I'm not sure there's much we can do in Terraform to fix this, unfortunately. We're sending correct requests to the API, but the GCE instances GKE provisions are failing to register themselves correctly as nodes. As @imrannayer's experience shows, it appears to be tied to the GKE K8S version and not the Terraform provider version. I've experienced the same myself intermittently, and it's generally resolved by waiting or changing GKE versions.

I'm going to close this issue since there's nothing we can do provider-side. However, if any of you are consistently experiencing this issue and downgrading your provider works, please share that here!
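
(If anyone does want to test the provider-downgrade theory, in a Terraform 0.12 configuration that would look roughly like the sketch below; the pinned version is only an example of a release earlier than the v3.31.0 reported in this issue.)

provider "google-beta" {
  # Example only: pin an earlier google-beta release to compare behavior.
  version = "~> 3.30.0"
}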


ghost commented Aug 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost locked and limited conversation to collaborators Aug 28, 2020
@github-actions bot added the service/container and forward/review labels Jan 14, 2025