
[1.1.0] elastic-operator pod gets OOMKilled #2981

Closed
edwardsmit opened this issue Apr 28, 2020 · 13 comments
Labels
>bug Something isn't working

Comments

@edwardsmit

edwardsmit commented Apr 28, 2020

#2819 Bug Report
What did you do?
Upgraded ECK from 1.0.1 to 1.1.0

What did you expect to see?
The elastic-operator-0 pod runs smoothly

What did you see instead? Under which circumstances?
The elastic-operator-0 pod gets OOMKilled

Environment

  • ECK version:

1.1.0 via https://download.elastic.co/downloads/eck/1.1.0/all-in-one.yaml

  • Kubernetes information:

    insert any information about your Kubernetes environment that could help us:

    • Cloud: GKE

    for each of them please give us the version you are using

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:34:25Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-gke.9", GitCommit:"e1af17fd873e15a48769e2c7b9851405f89e3d0d", GitTreeState:"clean", BuildDate:"2020-04-06T20:56:54Z", GoVersion:"go1.12.17b4", Compiler:"gc", Platform:"linux/amd64"}
  • Logs:
Name:           elastic-operator-0
Namespace:      elastic-system
Priority:       0
Node:           gke-****-preemptible-n1-91a14778-r0jl/10.132.0.42
Start Time:     Tue, 28 Apr 2020 18:39:14 +0200
Labels:         control-plane=elastic-operator
                controller-revision-hash=elastic-operator-7fd97d75b8
                statefulset.kubernetes.io/pod-name=elastic-operator-0
Annotations:    co.elastic.logs/raw:
                  [{"type":"container","json.keys_under_root":true,"paths":["/var/log/containers/*${data.kubernetes.container.id}.log"],"processors":[{"conv...
Status:         Running
IP:             10.60.51.57
IPs:            <none>
Controlled By:  StatefulSet/elastic-operator
Containers:
  manager:
    Container ID:  docker://bf5e12ba54b31bd734f49effbe8d583c5c8987e46353e8a1b0b07f911dc975d0
    Image:         docker.elastic.co/eck/eck-operator:1.1.0
    Image ID:      docker-pullable://docker.elastic.co/eck/eck-operator@sha256:ff6bb1fa755fb24f191262dc5c3220e822b9583b342acfe7556c0269a25bd90e
    Port:          9443/TCP
    Host Port:     0/TCP
    Args:
      manager
      --enable-webhook
      --log-verbosity=0
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 28 Apr 2020 18:40:54 +0200
      Finished:     Tue, 28 Apr 2020 18:40:58 +0200
    Ready:          False
    Restart Count:  4
    Limits:
      cpu:     1
      memory:  150Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      OPERATOR_NAMESPACE:  elastic-system (v1:metadata.namespace)
      WEBHOOK_SECRET:      elastic-webhook-server-cert
      OPERATOR_IMAGE:      docker.elastic.co/eck/eck-operator:1.1.0
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from elastic-operator-token-949th (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elastic-webhook-server-cert
    Optional:    false
  elastic-operator-token-949th:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  elastic-operator-token-949th
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                             Message
  ----     ------     ----                  ----                                             -------
  Normal   Scheduled  2m53s                 default-scheduler                                Successfully assigned elastic-system/elastic-operator-0 to gke-****-preemptible-n1-91a14778-r0jl
  Normal   Pulled     74s (x5 over 2m53s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Container image "docker.elastic.co/eck/eck-operator:1.1.0" already present on machine
  Normal   Created    73s (x5 over 2m52s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Created container manager
  Normal   Started    73s (x5 over 2m52s)   kubelet, gke-****-preemptible-n1-91a14778-r0jl  Started container manager
  Warning  BackOff    31s (x10 over 2m42s)  kubelet, gke-****-preemptible-n1-91a14778-r0jl  Back-off restarting failed container

If I remove the memory limit of 150Mi for the manager container in the elastic-operator StatefulSet, everything works.
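
For reference, removing the limit can be done with a patch along these lines (just a sketch; it assumes the default all-in-one install, with manager as the only container at index 0 of the elastic-operator StatefulSet, and re-applying all-in-one.yaml will put the limit back):

kubectl patch statefulset elastic-operator -n elastic-system --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/memory"}]'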

@botelastic botelastic bot added the triage label Apr 28, 2020
@sebgl
Contributor

sebgl commented Apr 29, 2020

Thanks for reporting @edwardsmit. A few questions:

  • did you get any OOMKill before upgrading to 1.1.0?
  • how many Elasticsearch resources does this instance of the operator manage?

@edwardsmit
Author

No problem @sebgl

  • No, the 1.0.1 version worked OOTB
  • One test-cluster with 3 masters and 2 datanodes

@david-kow
Contributor

Hi @edwardsmit, thanks for your comments. Unfortunately we couldn't reproduce this - a few more questions so we can pin it down:

  1. How much memory utilization was there in the cluster? Is it possible that it was OOMKilled before reaching 150Mi?
  2. How many resources are there in the cluster (pods, secrets)?
  3. After removing the limit, what is the memory usage of the manager process? Can you exec into the container and check (see the sketch after this list)?
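
For example, something along these lines should work (just a sketch; the pod name assumes the default install, elastic-operator-0 in the elastic-system namespace, and the manager process is PID 1 in the container):

kubectl exec -n elastic-system elastic-operator-0 -- cat /proc/1/status | grep -E 'VmHWM|VmRSS'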

@david-kow david-kow added the >bug Something isn't working label Apr 30, 2020
@botelastic botelastic bot removed the triage label Apr 30, 2020
@edwardsmit
Author

Hi @david-kow,

  1. I don't know; yes, it is possible that it gets OOMKilled before reaching the 150Mi. Is there a way to check at which usage level the OOM kicked in?
  2. The Elasticsearch cluster or the K8S cluster? We run almost 600 pods in the entire K8S cluster at the moment.
  3. I'm not sure what the best way to expose this is, so after some googling I used cat /proc/1/status. If you want me to run another command in the operator container, let me know.
Name:   elastic-operato
Umask:  0022
State:  S (sleeping)
Tgid:   1
Ngid:   0
Pid:    1
PPid:   0
TracerPid:      0
Uid:    101     101     101     101
Gid:    101     101     101     101
FDSize: 64
Groups:
NStgid: 1
NSpid:  1
NSpgid: 1
NSsid:  1
VmPeak:   551500 kB
VmSize:   551500 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    457104 kB
VmRSS:    404360 kB
RssAnon:          376160 kB
RssFile:           28200 kB
RssShmem:              0 kB
VmData:   515048 kB
VmStk:       132 kB
VmExe:     17856 kB
VmLib:         8 kB
VmPTE:       984 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
Threads:        10
SigQ:   0/104338
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe3bfa3a00
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   f
Cpus_allowed_list:      0-3
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        220
nonvoluntary_ctxt_switches:     76

@barkbay
Contributor

barkbay commented Apr 30, 2020

I don't know; yes, it is possible that it gets OOMKilled before reaching the 150Mi. Is there a way to check at which usage level the OOM kicked in?

You can usually get this kind of information by running the dmesg command on the node where ECK was running and looking for something like this:

[3458703.013937] Task in ... killed as a result of limit of ....
[3458703.039061] memory: usage x, limit x, failcnt x
[3458703.044979] memory+swap: usage x, limit x, failcnt x
[3458703.051495] kmem: usage x, limit x, failcnt 0
[3458703.058078] Memory cgroup stats for x ... active_file:0KB unevictable:0KB
....
[3458703.135532] oom_reaper: reaped process x (x), now anon-rss:x, file-rss:x, shmem-rss:0kB

(you can use dmesg -T in order to have human readable timestamps)
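
Something like this narrows it down (just a sketch; the exact message wording varies by kernel version):

dmesg -T | grep -i -E 'oom|killed process'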

@andersosthus

I see the same issue in our cluster. I upgraded from 1.0.1 to 1.1.0. One Elasticsearch cluster only. Operator gets OOM killed. I raised the memory limit to 500Mi, and now it runs fine.

Attached is a graph of the memory consumption for the operator pod:
[image: ECKOperator]

@sebgl
Contributor

sebgl commented Apr 30, 2020

@andersosthus is this graph showing memory usage for 1.1.0 entirely, or does it represent moving from 1.0.1 to 1.1.0 (there's something happening around 14:23)?

@charith-elastic
Contributor

On a side note, if you are looking at the container_memory_usage_bytes metric, it can be misleading due to a change in the Go runtime. A better approximation of the actual memory usage can be obtained from the container_memory_working_set_bytes instead.

This obviously doesn't explain the OOM kill though.

@andersosthus

andersosthus commented Apr 30, 2020

@sebgl That is only showing v1.1.0 (it was in our test cluster that I upgraded from 1.0.1 to 1.1.0, but I don't have good metrics there).

Here is another screenshot, using container_memory_working_set_bytes. It shows the manager container from the time it starts up. When it starts, there is one Elasticsearch CR that has been submitted but not processed yet, so it starts processing that one right away.

[image: ECKOperator2]

@edwardsmit
Author

@barkbay I'm not able to run dmesg on the node, as I don't have access to the node.
However, it seems others have provided some evidence of similar situations ⬆

@david-kow
Contributor

We run almost 600 pods in the entire K8S cluster at the moment.

I think this is what causes the issue. When watching resources in the cluster for changes, we can't (due to dependent libraries) watch only the resources belonging to ECK. With larger clusters, ECK will need more memory even if it manages the same number of Elastic resources. I've created #3025 and #3026 to track improving the default and documenting the issue. There is also kubernetes-sigs/controller-runtime#244, which describes the limitation we are facing.

I'm not sure why the issue was not visible in 1.0.1 - I couldn't reproduce a significant difference in memory consumption between the two versions. Maybe the cluster's resources increased between the time the 1.0.1 pod was started and the upgrade? The size of the initial spike should correlate with the number of resources in the cluster too.
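
If you want to check the correlation, a rough way to count the cluster-wide objects the operator ends up caching (a sketch; it counts pods and secrets across all namespaces):

kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get secrets --all-namespaces --no-headers | wc -l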

@andersosthus

Yeah, we also run several hundred pods, so that is probably the cause of the OOM kill in our cluster as well. I didn't run the 1.0.1 operator in the production cluster, so I don't know how that one handled it.

I suggest either increasing the default memory limit or just adding a note about it in the docs.

@anyasabo
Contributor

We increased the request and limit in 1.1.1, which should resolve this for most scenarios.
