[1.1.0] elastic-operator pod gets OOMKilled #2981
Thanks for reporting @edwardsmit. A few questions:
No problem @sebgl
Hi @edwardsmit, thanks for your comments. We couldn't repro this unfortunately - a few more questions so we can pin this down:
Hi @david-kow,
Name: elastic-operato
Umask: 0022
State: S (sleeping)
Tgid: 1
Ngid: 0
Pid: 1
PPid: 0
TracerPid: 0
Uid: 101 101 101 101
Gid: 101 101 101 101
FDSize: 64
Groups:
NStgid: 1
NSpid: 1
NSpgid: 1
NSsid: 1
VmPeak: 551500 kB
VmSize: 551500 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 457104 kB
VmRSS: 404360 kB
RssAnon: 376160 kB
RssFile: 28200 kB
RssShmem: 0 kB
VmData: 515048 kB
VmStk: 132 kB
VmExe: 17856 kB
VmLib: 8 kB
VmPTE: 984 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
Threads: 10
SigQ: 0/104338
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: fffffffe3bfa3a00
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: f
Cpus_allowed_list: 0-3
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 220
nonvoluntary_ctxt_switches: 76
You can usually get this kind of information using the ... (you can use ...)
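For reference, a minimal sketch of two ways to collect these numbers (assuming the default elastic-system namespace and elastic-operator-0 pod name from the all-in-one manifest, and that the operator image contains a cat binary):

```sh
# Container-level memory usage as reported by metrics-server (requires metrics-server):
kubectl -n elastic-system top pod elastic-operator-0 --containers

# Process-level view from inside the pod, i.e. the kind of /proc/1/status output pasted above
# (adjust the namespace/pod name if your install is customized):
kubectl -n elastic-system exec elastic-operator-0 -- cat /proc/1/status
```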
@andersosthus is this graph showing memory usage for 1.1.0 entirely, or does it represent moving from 1.0.1 to 1.1.0 (there's something happening around 14:23)?
On a side note, if you are looking at the ... This obviously doesn't explain the OOM kill though.
@sebgl That is only showing v1.1.0 (it was our test cluster that I upgraded from 1.0.1 to 1.1.0, but I don't have good metrics there). Here is another screenshot, using ...
@barkbay I'm not able to run ...
I think this is what causes the issue. When watching resources in the cluster for changes, we can't (due to the libraries we depend on) watch only the resources belonging to ECK. With larger clusters, ECK will need more memory even if it manages the same number of Elastic resources. I've created #3025 and #3026 to track improving the default and documenting the issue. There is also kubernetes-sigs/controller-runtime#244, which describes the limitation we are facing. I'm not sure why the issue was not visible in 1.0.1 - I couldn't repro a significant difference in memory consumption between the two versions. Maybe the number of cluster resources increased between the time the 1.0.1 pod was started and the upgrade? The size of the initial spike should correlate with the number of resources in the cluster too.
Yeah, we also run several hundred pods, so that's probably the cause of the OOM kill in our cluster as well. I didn't run the 1.0.1 operator in the production cluster, so I don't know how that one handled it. I suggest either increasing the default memory limit or just adding a note in the docs about it.
We increased the request and limit in 1.1.1, which should resolve this for most scenarios.
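For anyone who wants to bump the limit manually on 1.1.0 rather than wait for the new defaults, here is a sketch (the namespace, StatefulSet name, and container name below match the stock all-in-one manifest; the 512Mi value is only an illustrative guess, size it to your cluster):

```sh
# Strategic merge patch: raise the manager container's memory limit on the operator StatefulSet.
# The StatefulSet controller rolls the elastic-operator-0 pod with the new resources.
kubectl -n elastic-system patch statefulset elastic-operator --type strategic -p '
spec:
  template:
    spec:
      containers:
      - name: manager
        resources:
          requests:
            memory: "150Mi"
          limits:
            memory: "512Mi"
'
```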
#2819 Bug Report
What did you do?
Upgraded ECK from 1.0.1 to 1.1.0
What did you expect to see?
The elastic-operator-0 pod runs smoothly
What did you see instead? Under which circumstances?
The elastic-operator-0 pod gets OOMKilled
Environment
1.1.0 via https://download.elastic.co/downloads/eck/1.1.0/all-in-one.yaml
Kubernetes information:
If I remove the memory limit of 150Mi for the manager container in the elastic-operator statefulset, everything works.
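A minimal sketch of that workaround, i.e. dropping the memory limit from the manager container (this assumes the manager container is the first container in the StatefulSet, as in the stock all-in-one.yaml; raising the limit as in the earlier sketch is usually the safer option):

```sh
# JSON-patch away the 150Mi memory limit on the first (manager) container;
# the pod is then restarted without a memory limit.
kubectl -n elastic-system patch statefulset elastic-operator --type json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/memory"}]'
```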