Pods are sometimes assigned to the incorrect IAM role #46

smelchior · 2017-01-31T13:02:24Z

In our cluster sometimes the pods that have a proper "iam.amazonaws.com/role" annotation do not receive their role when they start up. kube2iam returns the default role to them which in our case does not have any permissions. After some time of the application requests the credentials again, it gets the proper assignment.
Relevant log messages:

level=info msg="Requesting /latest/meta-data/iam/security-credentials/"
level=warning msg="Using fallback role for IP 10.233.109.12"
level=info msg="Requesting /latest/meta-data/iam/security-credentials/kube.no-permissions"
level=warning msg="Using fallback role for IP 10.233.109.12"
.... some time later ....
level=info msg="Requesting /latest/meta-data/iam/security-credentials/"
level=info msg="Requesting /latest/meta-data/iam/security-credentials/kube.kube-system.route53-kubernetes"

I am not really sure how to debug this further, it might be related to the issues described in #32

The text was updated successfully, but these errors were encountered:

jtblin · 2017-02-23T12:01:29Z

I guess it's possible that the pod tried to access its IAM role before the onAdd method is completed which retrieves the annotation via the kubernetes API. Perhaps retrying there for a short amount of time could solve this issue. Will look into this soon.

struz · 2017-03-14T01:04:37Z

After investigating a little bit it seems that the precision on this race condition is quite tight (i.e. sub-second).

time="2017-03-10T00:48:25Z" level=warning msg="Using fallback role for IP 10.254.30.13"
time="2017-03-10T00:48:25Z" level=warning msg="Role:arn:aws:iam::<account>:<role> on namespace: not found."
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnUpdate kube2iam-test - 10.254.30.13"
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnDelete kube2iam-test - "
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnAdd kube2iam-test - 10.254.30.13"
time="2017-03-10T00:48:25Z" level=debug msg="- Role arn:aws:iam::<account>:role/kube-test"

There are two edge cases I can think of here:

pod created, role requested before OnAdd finishes, returns default role even if pod is annotated
pod updated / restarted, restarts and requests role before OnUpdate finishes, applies the old role that was stored

To fix both of these at once I can only think of adding a static delay of somewhere between 0 and 1 second. Exponential backoff isn't really useful to fix the second case. Hopefully it won't have too much of an impact on apps either, since they should be caching credentials.

We could add this delay to replace (or augment) the existing exponential backoff in server.getRole.

smelchior · 2017-03-14T07:38:32Z

Thanks for looking into this. I guess adding the static delay is the best option for now. It would be great if that could be implemented.

jtblin · 2017-03-14T09:05:42Z

I don't think we'd want to add a static delay before getting the IAM role but maybe I misunderstand what you mean. There is another issue that getting the role on-demand (when a request for credentials comes in) triggers timeouts in some AWS clients.

The issue from @smelchior here is specifically about fallback i.e. "Using fallback role for IP 10.233.109.12" so I think an exponential backoff wrapping the block that retrieves from the rolesByIP (https://github.com/jtblin/kube2iam/blob/master/cmd/store.go#L28) map would work and cover for both cases. Perhaps we should wait that we move to IndexInformer before doing that though.

struz · 2017-03-14T09:28:36Z

IndexInformer will probably fix this too, so maybe we should wait.

ash2k · 2017-03-14T10:53:54Z

I agree with Jerome - sleeps are not the best way.
I think a better option is to decouple requests for roles from lookup from store using a channel (a la pub-sub). Something like this:
Each credentials lookup submits a subscription request to a channel. Request itself contains a channel for response and a context to tell when it is no longer needed. Submitting goroutine blocks on read from that response channel (with a timeout of course).
There is a goroutine which reads requests from that channel and keeps record of all the IPs with pending requests. This goroutine either immediately requests credentials if pod is found or puts the requests in the queue (map ip->requests slice). Every time informer gets an event about a pod it looks up the queue to see if there are any interested parties. If there are, then execute the credentials lookup and return the result.
This is somewhat similar to what I've implemented in https://github.com/atlassian/gostatsd/blob/master/pkg/statsd/cloud_handler.go

The best course of action at the moment is to refactor it to use indexed informer, but it will not magically fix this issue (if I understand it correctly).

p.s. I'm sorry if I misunderstood the issue and this is completely off.

ankon · 2017-03-14T11:25:49Z

Looking at the code for updating a pod (OnAdd/OnDelete) and the store logic: isn't the easiest fix here to extend the coverage of the store mutex so that it covers the whole of OnAdd/OnDelete? Now there are points in the process where Get would see an inconsistent state.

struz · 2017-03-14T12:31:34Z

What @ankon said is correct, I didn't even think of that. It will work unless the request for the role is received before the api update is sent. As he said, I suspect they are received almost in tandem right now, and the mutex covering more space might always allow the api server event to complete firing first.

ash2k · 2017-03-19T00:22:07Z

BTW looks like store has quite a few data races - this can be a reason for flaky behaviour as well. What I can see without digging/testing:

rolesByNamespace, namespaceByIP fields are read without holding a mutex in few places while written with the mutex held
DumpRolesBy*() methods return map without copying it under a mutex. Reads from debug endpoint will by racy

jtblin · 2017-03-19T23:42:27Z

Good catch! Although unlikely to be the root cause of this issue as it was opened before namespaces were added.

I don't think the store mutex is going to cover everything e.g. pod making trying to get IAM credentials before it has been added to the store but it would be a good improvement already. The exponential backoff strategy is actually probably not the right answer either because it would delay the fallback even when fallback is desired. I am not sure the channel approach would solve the underlying issue either.

I think the underlying issue is the same as #31. Prefetching credentials would solve the issue here as well, there was a PR originally but I rejected it because it could potentially DDOS the AWS API limit and create too many routines in kube2iam. We may want to review the approach and coming up with a gentler approach that prefetches credentials and refresh them.

ash2k · 2017-03-20T05:35:52Z

I'm pretty sure the channel approach will at least reduce the response latency. It is instantaneous "push" as soon as information is available instead of polling with exponential backoff.

jtblin · 2017-03-20T05:44:49Z

There is no polling with exponential backoff here, it's reading directly from a map which is instantaneous. I was proposing to add exponential backoff instead of the sleep proposed by @struz but as said above, I don't think that's the right strategy either.

The only exponential backoff in the code is around this block for the entire store.Get i.e. it cannot be the original issue here as it would have fallen back to the default role already at this point.

struz · 2017-03-20T06:49:25Z

One problem is that it will fall back to the default role even when a role is specified sometimes. Because the role is asked for before the map is finished updating the system just assumes it has no role and serves it the default one with no back off

jtblin · 2017-03-20T07:12:33Z

It uses a mutex to read and write from the map so I don't think that's correct:

// Get returns the iam role based on IP address.
func (s *store) Get(IP string) (string, error) {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	if role, ok := s.rolesByIP[IP]; ok {
		return role, nil
	}
	if s.defaultRole != "" {
		log.Warnf("Using fallback role for IP %s", IP)
		return s.defaultRole, nil
	}
	return "", fmt.Errorf("Unable to find role for IP %s", IP)
}

func (s *store) AddRoleToIP(pod *api.Pod, role string) {
	s.mutex.Lock()
	s.rolesByIP[pod.Status.PodIP] = role
	s.mutex.Unlock()
}

jtblin · 2017-03-20T07:18:12Z

Sorry you're right for the update case, there could be a race condition that a mutex around delete and add could solve.

pingles · 2017-06-01T12:22:16Z

We had this problem in one of our production clusters just over a week ago- we noticed that some processes would go unhealthy periodically and through the logs noticed they were attempting to use credentials for a different role (and thus being denied access to the AWS resources they needed). We'd been fine for a few months but we crossed some kind of cluster load that meant stuff started breaking.

I started working on moving code around to support broader locks but made enough changes (both to address this and to incorporate prefetching and other bits) it ended up easier to just start from scratch. To address this (the incorrect IAM role) issue we found a thread-safe cache inside k8s.io/client-go that can watch + sync pod state.

kiam (which I'd like to think is something like the child of kube2iam 😄 ) is at https://github.com/uswitch/kiam if its helpful.

jtblin · 2017-06-05T12:13:56Z

Ah... I just added a lock to the update of pods and namespaces as originally pointed out by @ankon. That should be enough to fix the problem.

@pingles I'd be interested to know how many pods you have running on your cluster with the prefetch of credentials and if you're hitting issues with AWS API limits. It's one of the things I'm wary of with this, especially when the daemon starts and need to fetch the credentials for all the pods (potentially on multiple nodes), as after a cursory glance at your code, it doesn't seem you have a rate limiter or backoff with exponential retries.

pingles · 2017-06-05T12:42:35Z

@jtblin we're running 10s of nodes in each cluster (around 15-30 depending on load) with 6,600 pods running now, although a far smaller number would have role annotations (we don't attach default roles to stuff that doesn't need AWS access).

We've not had any errors from AWS API limits as yet. As you say, the number of API calls grows as nodes * roles and the credential cache tracks around 30-40 roles currently. The prefetcher runs in its own go routine so fetches are serial per-node but with a 400ms per AssumeRole call (95th percentile) there's a kind of built-in rate-limit :) Across all clusters we see a peak API call per-second of around 50 so its not optimal but puts it under whatever limit AWS specify internally (although maybe that could be different per-account).

We were planning on wrapping the AWS client with a rate limiter/retry gateway but we needed something we could run asap and we didn't hit the problem so pushed it down the priority.

smelchior · 2017-06-13T07:34:15Z

Thanks for fixing this @jtblin! Can you tell already when there will be a new release that includes the fix?

sibrahim001982 · 2018-05-08T10:58:48Z

Hi

Was this fixed?

jrnt30 · 2018-05-08T15:45:03Z

@sibrahim001982 The known causes for this situation have been fixed and integrated, yes. There have been a few explicit improvements to the way that Kube2iam handles the pod inventory, but there may be times when there is a very large spike in pods that would still result in some observed issues.

If you are experiencing a specific issue I would recommend opening up a ticket and providing as much information as possible about the issue.

Some things to do beforehand are:

Increase the logging level on the server
Enable the debug endpoint and see if that is telling
Ensure you're on the latest version
Provide any reproduction instructions

rkoval mentioned this issue Jun 1, 2017

kube2iam never recovers after node failures #78

Closed

jtblin closed this as completed in e815c20 Jun 5, 2017

damomurf mentioned this issue May 14, 2018

Timing issue causing failures in Pods created by Jobs #136

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pods are sometimes assigned to the incorrect IAM role #46

Pods are sometimes assigned to the incorrect IAM role #46

smelchior commented Jan 31, 2017 •

edited

Loading

jtblin commented Feb 23, 2017

struz commented Mar 14, 2017 •

edited

Loading

smelchior commented Mar 14, 2017

jtblin commented Mar 14, 2017

struz commented Mar 14, 2017

ash2k commented Mar 14, 2017

ankon commented Mar 14, 2017

struz commented Mar 14, 2017

ash2k commented Mar 19, 2017

jtblin commented Mar 19, 2017

ash2k commented Mar 20, 2017

jtblin commented Mar 20, 2017

struz commented Mar 20, 2017

jtblin commented Mar 20, 2017

jtblin commented Mar 20, 2017

pingles commented Jun 1, 2017 •

edited

Loading

jtblin commented Jun 5, 2017

pingles commented Jun 5, 2017

smelchior commented Jun 13, 2017

sibrahim001982 commented May 8, 2018

jrnt30 commented May 8, 2018

Pods are sometimes assigned to the incorrect IAM role #46

Pods are sometimes assigned to the incorrect IAM role #46

Comments

smelchior commented Jan 31, 2017 • edited Loading

jtblin commented Feb 23, 2017

struz commented Mar 14, 2017 • edited Loading

smelchior commented Mar 14, 2017

jtblin commented Mar 14, 2017

struz commented Mar 14, 2017

ash2k commented Mar 14, 2017

ankon commented Mar 14, 2017

struz commented Mar 14, 2017

ash2k commented Mar 19, 2017

jtblin commented Mar 19, 2017

ash2k commented Mar 20, 2017

jtblin commented Mar 20, 2017

struz commented Mar 20, 2017

jtblin commented Mar 20, 2017

jtblin commented Mar 20, 2017

pingles commented Jun 1, 2017 • edited Loading

jtblin commented Jun 5, 2017

pingles commented Jun 5, 2017

smelchior commented Jun 13, 2017

sibrahim001982 commented May 8, 2018

jrnt30 commented May 8, 2018

smelchior commented Jan 31, 2017 •

edited

Loading

struz commented Mar 14, 2017 •

edited

Loading

pingles commented Jun 1, 2017 •

edited

Loading