
[1.1.0] elastic-operator pod gets OOMKilled #2981

Closed
edwardsmit opened this issue Apr 28, 2020 · 13 comments
Labels
>bug Something isn't working

Comments

@edwardsmit

edwardsmit commented Apr 28, 2020

#2819 Bug Report
What did you do?
Upgraded ECK from 1.0.1 to 1.1.0

What did you expect to see?
The elastic-operator-0 pod runs smoothly

What did you see instead? Under which circumstances?
The elastic-operator-0 pod gets OOMKilled

Environment

  • ECK version:

1.1.0 via https://download.elastic.co/downloads/eck/1.1.0/all-in-one.yaml

  • Kubernetes information:

    insert any information about your Kubernetes environment that could help us:

    • Cloud: GKE

    for each of them please give us the version you are using

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:34:25Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.9", GitCommit:"e1af17fd873e15a48769e2c7b9851405f89e3d0d", GitTreeState:"clean", BuildDate:"2020-04-06T20:56:54Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
  • Logs:
Name:           elastic-operator-0
Namespace:      elastic-system
Priority:       0
Node:           gke-****-preemptible-n1-91a14778-r0jl/10.132.0.42
Start Time:     Tue, 28 Apr 2020 18:39:14 +0200
Labels:         control-plane=elastic-operator
                controller-revision-hash=elastic-operator-7fd97d75b8
                statefulset.kubernetes.io/pod-name=elastic-operator-0
Annotations:    co.elastic.logs/raw:
                  [{"type":"container","json.keys_under_root":true,"paths":["/var/log/containers/*${data.kubernetes.container.id}.log"],"processors":[{"conv...
Status:         Running
IP:             10.60.51.57
IPs:            <none>
Controlled By:  StatefulSet/elastic-operator
Containers:
  manager:
    Container ID:  docker://bf5e12ba54b31bd734f49effbe8d583c5c8987e46353e8a1b0b07f911dc975d0
    Image:         docker.elastic.co/eck/eck-operator:1.1.0
    Image ID:      docker-pullable://docker.elastic.co/eck/eck-operator@sha256:ff6bb1fa755fb24f191262dc5c3220e822b9583b342acfe7556c0269a25bd90e
    Port:          9443/TCP
    Host Port:     0/TCP
    Args:
      manager
      --enable-webhook
      --log-verbosity=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 28 Apr 2020 18:40:54 +0200
      Finished:     Tue, 28 Apr 2020 18:40:58 +0200
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     1
      memory:  150Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      OPERATOR_NAMESPACE:  elastic-system (v1:metadata.namespace)
      WEBHOOK_SECRET:      elastic-webhook-server-cert
      OPERATOR_IMAGE:      docker.elastic.co/eck/eck-operator:1.1.0
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from elastic-operator-token-949th (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elastic-webhook-server-cert
    Optional:    false
  elastic-operator-token-949th:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elastic-operator-token-949th
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                             Message
  ----     ------     ----                  ----                                             -------
  Normal   Scheduled  2m53s                 default-scheduler                                Successfully assigned elastic-system/elastic-operator-0 to gke-****-preemptible-n1-91a14778-r0jl
  Normal   Pulled     74s (x5 over 2m53s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Container image "docker.elastic.co/eck/eck-operator:1.1.0" already present on machine
  Normal   Created    73s (x5 over 2m52s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Created container manager
  Normal   Started    73s (x5 over 2m52s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Started container manager
  Warning  BackOff    31s (x10 over 2m42s)  kubelet, gke-****-preemptible-n1-91a14778-r0jl  Back-off restarting failed container

If I remove the memory limit of 150Mi for the manager container in the elastic-operator StatefulSet, everything works.
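
For reference, removing the limit can be done with a patch along these lines (just a sketch; it assumes the default all-in-one install, with manager as the only container at index 0 of the elastic-operator StatefulSet, and re-applying all-in-one.yaml will put the limit back):

kubectl patch statefulset elastic-operator -n elastic-system --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/memory"}]'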

@botelastic botelastic bot added the triage label Apr 28, 2020
@sebgl
Contributor

sebgl commented Apr 29, 2020

Thanks for reporting @edwardsmit. A few questions:

  • did you get any OOMKill before upgrading to 1.1.0?
  • how many Elasticsearch resources does this instance of the operator manage?

@edwardsmit
Author

No problem @sebgl

  • No, the 1.0.1 version worked OOTB
  • One test-cluster with 3 masters and 2 datanodes

@david-kow
Contributor

Hi @edwardsmit, thanks for your comments. Unfortunately we couldn't reproduce this - a few more questions so we can pin it down:

  1. How much memory utilization was there in the cluster? Is it possible that it was OOMKilled before reaching 150Mi?
  2. How many resources are there in the cluster (pods, secrets)?
  3. After removing the limit, what is the memory usage of the manager process? Can you exec into the container and check (see the sketch after this list)?
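
For example, something along these lines should work (just a sketch; the pod name assumes the default install, elastic-operator-0 in the elastic-system namespace, and the manager process is PID 1 in the container):

kubectl exec -n elastic-system elastic-operator-0 -- cat /proc/1/status | grep -E 'VmHWM|VmRSS'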

@david-kow david-kow added the >bug Something isn't working label Apr 30, 2020
@botelastic botelastic bot removed the triage label Apr 30, 2020
@edwardsmit
Author

Hi @david-kow,

  1. I don't know; yes, it is possible that it gets OOMKilled before reaching the 150Mi. Is there a way to check at which usage level the OOM kicked in?
  2. The Elasticsearch cluster or the K8S cluster? We run almost 600 pods in the entire K8S cluster at the moment.
  3. I'm not sure what the best way to expose this is, so after some googling I used cat /proc/1/status. If you want me to run another command in the operator container, let me know.
Name:   elastic-operato
Umask:  0022
State:  S (sleeping)
Tgid:   1
Ngid:   0
Pid:    1
PPid:   0
TracerPid:      0
Uid:    101     101     101     101
Gid:    101     101     101     101
FDSize: 64
Groups:
NStgid: 1
NSpid:  1
NSpgid: 1
NSsid:  1
VmPeak:   551500 kB
VmSize:   551500 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    457104 kB
VmRSS:    404360 kB
RssAnon:          376160 kB
RssFile:           28200 kB
RssShmem:              0 kB
VmData:   515048 kB
VmStk:       132 kB
VmExe:     17856 kB
VmLib:         8 kB
VmPTE:       984 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
Threads:        10
SigQ:   0/104338
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe3bfa3a00
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   f
Cpus_allowed_list:      0-3
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        220
nonvoluntary_ctxt_switches:     76

@barkbay
Contributor

barkbay commented Apr 30, 2020

I don't know; yes, it is possible that it gets OOMKilled before reaching the 150Mi. Is there a way to check at which usage level the OOM kicked in?

You can usually get this kind of information by running the dmesg command on the node where ECK was running and looking for something like this:

[3458703.013937] Task in ... killed as a result of limit of ....
[3458703.039061] memory: usage x, limit x, failcnt x
[3458703.044979] memory+swap: usage x, limit x, failcnt x
[3458703.051495] kmem: usage x, limit x, failcnt 0
[3458703.058078] Memory cgroup stats for x ... active_file:0KB unevictable:0KB
....
[3458703.135532] oom_reaper: reaped process x (x), now anon-rss:x, file-rss:x, shmem-rss:0kB

(you can use dmesg -T in order to have human readable timestamps)
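
Something like this narrows it down (just a sketch; the exact message wording varies by kernel version):

dmesg -T | grep -i -E 'oom|killed process'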

@andersosthus

I see the same issue in our cluster. I upgraded from 1.0.1 to 1.1.0. One Elasticsearch cluster only. Operator gets OOM killed. I raised the memory limit to 500Mi, and now it runs fine.

Attached is a graph of the memory consumption for the operator pod:
[image: ECKOperator]

@sebgl
Contributor

sebgl commented Apr 30, 2020

@andersosthus is this graph showing memory usage for 1.1.0 entirely, or does it represent moving from 1.0.1 to 1.1.0 (there's something happening around 14:23)?

@charith-elastic
Contributor

On a side note, if you are looking at the container_memory_usage_bytes metric, it can be misleading due to a change in the Go runtime. A better approximation of the actual memory usage can be obtained from the container_memory_working_set_bytes instead.

This obviously doesn't explain the OOM kill though.

@andersosthus

andersosthus commented Apr 30, 2020

@sebgl That is only showing v1.1.0 (it was in our test cluster that I upgraded from 1.0.1 to 1.1.0, but I don't have good metrics there).

Here is another screenshot, using container_memory_working_set_bytes. It shows the manager container from the time it starts up. When it starts, there is one Elasticsearch CR that has been submitted but not processed yet, so it starts processing that one right away.

[image: ECKOperator2]

@edwardsmit
Author

@barkbay I'm not able to run dmesg on the node, as I don't have access to the node.
However, it seems others have provided some evidence of similar situations ⬆

@david-kow
Contributor

We run almost 600 pods in the entire K8S cluster at the moment.

I think this is what causes the issue. When watching resources in the cluster for changes, we can't (due to dependent libraries) watch only the resources belonging to ECK. With larger clusters, ECK will need more memory even if it manages the same number of Elastic resources. I've created #3025 and #3026 to track improving the default and documenting the issue. There is also kubernetes-sigs/controller-runtime#244, which describes the limitation we are facing.

I'm not sure why the issue was not visible in 1.0.1 - I couldn't reproduce a significant difference in memory consumption between the two versions. Maybe the cluster's resources increased between the time the 1.0.1 pod was started and the upgrade? The size of the initial spike should correlate with the number of resources in the cluster too.
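
If you want to check the correlation, a rough way to count the cluster-wide objects the operator ends up caching (a sketch; it counts pods and secrets across all namespaces):

kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get secrets --all-namespaces --no-headers | wc -l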

@andersosthus

Yeah, we also run several hundred pods, so that is probably the cause of the OOM kill in our cluster as well. I didn't run the 1.0.1 operator in the production cluster, so I don't know how that one handled it.

I suggest either increasing the default memory limit or just adding a note about it in the docs.

@anyasabo
Contributor

We increased the request and limit in 1.1.1, which should resolve this for most scenarios.
