
Unable to use templates with cluster name when passing userdata_override_base64 #67

Closed
z0rc opened this issue May 19, 2021 · 13 comments · Fixed by #84
Labels
bug 🐛 An issue with the system

Comments

z0rc commented May 19, 2021

Describe the Bug

In order to launch a node group with Bottlerocket OS, I need to pass custom userdata, templated like this:

[settings.kubernetes]
api-server = "${api_server}"
cluster-certificate = "${cluster_certificate}"
cluster-name = "${cluster_name}"

This template is used to render a local variable like this:

locals {
  bottlerocket_userdata = base64encode(templatefile("./templates/userdata.tpl",
    {
      api_server          = module.eks.eks_cluster_endpoint
      cluster_certificate = module.eks.eks_cluster_certificate_authority_data
      cluster_name        = module.eks.eks_cluster_id
    }
  ))
}

Which is passed to the module as:

...
userdata_override_base64 = local.bottlerocket_userdata
...

The terraform plan stage fails with:

╷
│ Error: Invalid count argument
│
│   on .terraform/modules/eks_node_group/main.tf line 52, in data "aws_eks_cluster" "this":
│   52:   count = local.get_cluster_data ? 1 : 0
│
│ The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the
│ resources that the count depends on.
╵

Expected Behavior

Terraform plan should execute just fine.

Steps to Reproduce

I was able to narrow down the templating issue:

  • userdata_override_base64 = module.eks.eks_cluster_id fails on plan with the error above
  • userdata_override_base64 = "fixed string" successfully passes plan

Environment

% terraform version
Terraform v0.15.3
on darwin_amd64
+ provider registry.terraform.io/hashicorp/aws v3.40.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.2.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Modules:

  • cloudposse/eks-cluster/aws 0.38.0
  • cloudposse/eks-node-group/aws 0.20.0

As far as I was able to debug, Terraform bubbles the error up to the topmost constructed variable, so the reported location isn't the real cause. The real error is in https://github.com/cloudposse/terraform-aws-eks-node-group/blob/0.20.0/userdata.tf#L49, around the var.userdata_override_base64 == null check.
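For context, a paraphrased sketch of the failing pattern (names taken from the error output above; not the module's verbatim code). count must be resolvable at plan time, but a userdata value built from cluster outputs is unknown during plan, so the == null comparison is also unknown and Terraform cannot size the data source:

locals {
  # unknown at plan time whenever userdata_override_base64 is derived from cluster outputs
  get_cluster_data = local.enabled && var.userdata_override_base64 == null
}

data "aws_eks_cluster" "this" {
  count = local.get_cluster_data ? 1 : 0
  name  = var.cluster_name
}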

z0rc added the bug 🐛 label May 19, 2021
z0rc commented May 20, 2021

As a workaround I'm doing a staged apply: terraform apply -target module.eks first, then a plain terraform apply after. This makes the outputs required for the userdata template available when planning the eks_node_group module.
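For concreteness, the staged apply is just two commands:

% terraform apply -target=module.eks
% terraform apply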

z0rc commented May 20, 2021

Looking into other issues, this looks like another instance of #37.

Nuru commented Jun 30, 2021

@z0rc Yes, this is not really a bug so much as a limitation. This module expects its inputs to be available during the plan phase, and Terraform does not have the output of the template available at plan time. I wonder if this will work around the problem for you:

userdata_override_base64 = coalesce(local.bottlerocket_userdata, "Iw==")
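(Aside: coalesce() returns its first non-null argument, and "Iw==" is base64 for "#", i.e. a comment-only userdata, so the intent is to guarantee the module's null check sees a non-null value.)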

z0rc commented Jul 6, 2021

@Nuru the proposed code didn't have any effect; the error is still reproducible with coalesce. I also tried a somewhat similar approach using the join function, but the error remains.

Also, this isn't about the template per se; a minimal (though logically incorrect) way to reproduce the error:

userdata_override_base64 = module.eks.eks_cluster_id

nnsense commented Jul 8, 2021

I've hit the same issue; basically no derived values can be used.

Terraform version 0.15.5
Module version 0.20.0

lyubomirdimov commented

@z0rc

Check the following snippet (it works for me, with a Bottlerocket AMI):

module "linux_nodegroup_1" {
  source  = ".../terraform-aws-eks-node-group.git?ref=tags/0.22.0"
  enabled = var.linux_nodegroup_1_enabled

  name                              = var.linux_nodegroup_1_name
  ami_image_id                      = var.linux_nodegroup_1_ami_image_id
  subnet_ids                        = var.vpc.private_subnet_ids
  cluster_name                      = module.eks_cluster.eks_cluster_id
  instance_types                    = var.linux_nodegroup_1_instance_types
  desired_size                      = var.linux_nodegroup_1_desired_size
  min_size                          = var.linux_nodegroup_1_min_size
  max_size                          = var.linux_nodegroup_1_max_size
  disk_size                         = var.linux_nodegroup_1_disk_size
  create_before_destroy             = var.linux_nodegroup_1_create_before_destroy
  cluster_autoscaler_enabled        = var.linux_nodegroup_1_cluster_autoscaler_enabled
  existing_workers_role_policy_arns = local.linux_nodegroup_1_existing_workers_role_policy_arns
  userdata_override_base64          = base64encode(data.template_file.linux_nodegroup_1_userdata.rendered)

  context = module.this.context
}

data "template_file" "linux_nodegroup_1_userdata" {
  template = file("${path.module}/templates/bottlerocket_userdata.toml.tpl")
  vars = {
    cluster_name     = module.eks_cluster.eks_cluster_id
    cluster_endpoint = module.eks_cluster.eks_cluster_endpoint
    cluster_ca_data  = module.eks_cluster.eks_cluster_certificate_authority_data
    node_labels      = join("\n", [for label, value in var.linux_nodegroup_1_kubernetes_labels : "\"${label}\" = \"${value}\""])
    node_taints      = join("\n", [for taint, value in var.linux_nodegroup_1_kubernetes_taints : "\"${taint}\" = \"${value}\""])
  }
}

z0rc commented Jul 20, 2021

@lyubomirdimov the only difference I see between your snippet and mine is that I use the templatefile function where you're using data.template_file (which is actually no longer recommended). Still, I tried this approach; unfortunately it didn't work in my case, and the error remains the same.

As I dig into this issue, it looks like it's discussed at hashicorp/terraform#26755, and at this moment fixing it would require refactoring this module so that null checks aren't used for count.
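For reference, a templatefile() equivalent of the data "template_file" block above, reusing the names from that snippet (a sketch only; as noted, the rendered value is still unknown at plan time, so this does not avoid the count problem):

locals {
  # render the same Bottlerocket userdata with the built-in templatefile() function
  linux_nodegroup_1_userdata = templatefile("${path.module}/templates/bottlerocket_userdata.toml.tpl", {
    cluster_name     = module.eks_cluster.eks_cluster_id
    cluster_endpoint = module.eks_cluster.eks_cluster_endpoint
    cluster_ca_data  = module.eks_cluster.eks_cluster_certificate_authority_data
    node_labels      = join("\n", [for label, value in var.linux_nodegroup_1_kubernetes_labels : "\"${label}\" = \"${value}\""])
    node_taints      = join("\n", [for taint, value in var.linux_nodegroup_1_kubernetes_taints : "\"${taint}\" = \"${value}\""])
  })
}

# passed to the module as:
# userdata_override_base64 = base64encode(local.linux_nodegroup_1_userdata)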

z0rc commented Jul 20, 2021

I can "fix" this particular issue by removing && var.userdata_override_base64 == null check from https://github.com/cloudposse/terraform-aws-eks-node-group/blob/0.22.0/userdata.tf#L49, after that terrafrom plan succeeds as expected. But this isn't a fix, as this breaks other use cases of this module.

lyubomirdimov commented Jul 20, 2021

@z0rc

I managed to reproduce the issue in my environment as well. The reason I didn't see it before is that the clusters already existed before I introduced Bottlerocket node groups.
On a clean deployment I get the same error, shown in the screenshot below:
[screenshot: the same "Invalid count argument" error]

z0rc commented Jul 21, 2021

Okay, I give up trying to fix this and am migrating away from this module.

My last attempt was to build my own aws_launch_template and pass it to the module. This opened another can of worms, this time related to custom launch templates:

╷
│ Error: Invalid count argument
│
│   on .terraform/modules/eks_node_group/launch-template.tf line 107, in data "aws_launch_template" "this":
│  107:   count = local.enabled && length(local.configured_launch_template_name) > 0 ? 1 : 0
│
│ The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the
│ resources that the count depends on.
╵

It seems this module only works reliably when using the majority of defaults to run AL2 nodes; running a custom AMI, userdata, or launch template doesn't work and isn't actually tested.

Nuru commented Aug 10, 2021

@z0rc Thank you for all the effort you put into this issue, and I am sorry we have not fixed it yet. As you pointed out, fixing it will require breaking changes, and we have them on the drawing board. We plan to have this module updated in the near future.

It seems this module only works reliably when using the majority of defaults to run AL2 nodes; running a custom AMI, userdata, or launch template doesn't work and isn't actually tested.

This is not fair. All of the inputs were tested and work when given static values. The fairer statement is that this module only works reliably when its inputs are known at plan time. It was imagined that you would be hard-coding values, not supplying derived values. It was also imagined that HashiCorp would fix this general issue.

In any case, the next version will be reworked so that it works with derived values as inputs. We will probably also simplify it so that it always creates a launch template instead of trying to avoid it when possible.

Nuru commented Aug 30, 2021

@z0rc This and all the related issues should now be resolved with the release of version 0.25.0 of this module. Please give it a try and report back.

z0rc commented Aug 30, 2021

@Nuru sorry, but I won't be able to test and report on it. I already migrated away from the module, and migrating back doesn't provide value in my case.
