This repository has been archived by the owner on May 25, 2023. It is now read-only.

kube batch exit with Resource is not sufficient panic randomly. #659

Closed
TommyLike opened this issue Mar 25, 2019 · 3 comments · Fixed by #660
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


TommyLike commented Mar 25, 2019

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
I use kube-batch as my scheduler, and it exits randomly due to the "Resource is not sufficient" panic in the preempt action.

I0325 08:36:26.107287       1 preempt.go:46] Enter Preempt ...
I0325 08:36:26.107351       1 preempt.go:59] Added Queue <default> for Job <kube-system/fad8bc6c-4ed7-11e9-86bc-02420c101920>
I0325 08:36:26.108084       1 preempt.go:185] Predicates failed for task <test/st-qj-2-st-qj-2-task-0-7> on node <integration-control-plane>: task <test/st-qj-2-st-qj-2-task-0-7> does not tolerate node <integration-control-plane> taints
I0325 08:36:26.108111       1 preempt.go:202] Considering Task <test/st-qj-2-st-qj-2-task-0-7> on Node <integration-worker>.
I0325 08:36:26.108124       1 gang.go:114] Can not preempt task <test/st-qj-1-st-qj-1-task-0-3> because of gang-scheduling
I0325 08:36:26.108129       1 gang.go:114] Can not preempt task <test/st-qj-1-st-qj-1-task-0-6> because of gang-scheduling
I0325 08:36:26.108132       1 gang.go:114] Can not preempt task <test/st-qj-1-st-qj-1-task-0-1> because of gang-scheduling
I0325 08:36:26.108136       1 gang.go:121] Victims from Gang plugins are [Task (fdd825c9-4ed7-11e9-86bc-02420c101920:kube-system/coredns-86c58d9df4-2g7p8): job fdd00b3d-4ed7-11e9-86bc-02420c101920, status Running, pri 0, resreq cpu 100.00, memory 73400320.00, GPU 0.00 Task (fdd8cb77-4ed7-11e9-86bc-02420c101920:kube-system/coredns-86c58d9df4-6cspd): job fdd00b3d-4ed7-11e9-86bc-02420c101920, status Running, pri 0, resreq cpu 100.00, memory 73400320.00, GPU 0.00 Task (ffa9efaa-4ed7-11e9-86bc-02420c101920:kube-system/kube-proxy-r8zzn): job fad8bc6c-4ed7-11e9-86bc-02420c101920, status Running, pri 2000001000, resreq cpu 0.00, memory 0.00, GPU 0.00 Task (a08853a3-4ed8-11e9-86bc-02420c101920:kube-system/weave-net-kgfqf): job fb8d368a-4ed7-11e9-86bc-02420c101920, status Running, pri 0, resreq cpu 20.00, memory 0.00, GPU 0.00 Task (15b62434-4ed8-11e9-86bc-02420c101920:kube-system/integration-admission-5b58cfc4db-p8pm5): job 15b035d0-4ed8-11e9-86bc-02420c101920, status Running, pri 0, resreq cpu 0.00, memory 0.00, GPU 0.00 Task (01a2cf63-4ed8-11e9-86bc-02420c101920:kube-system/tiller-deploy-5b7c66d59c-kt5rs): job 01a145a3-4ed8-11e9-86bc-02420c101920, status Running, pri 0, resreq cpu 0.00, memory 0.00, GPU 0.00]
E0325 08:36:26.108204       1 preempt.go:226] Try to preempt Task <kube-system/coredns-86c58d9df4-2g7p8> for Tasks <test/st-qj-2-st-qj-2-task-0-7>
I0325 08:36:26.108227       1 asm_amd64.s:523] Leaving Preempt ...
I0325 08:36:26.112806       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-5 to (PodScheduled==False)
I0325 08:36:26.119875       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-3 to (PodScheduled==False)
I0325 08:36:26.130717       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-5> in cache.
I0325 08:36:26.131134       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-4 to (PodScheduled==False)
I0325 08:36:26.138254       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-6 to (PodScheduled==False)
I0325 08:36:26.142774       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-3> in cache.
I0325 08:36:26.143172       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-1 to (PodScheduled==False)
I0325 08:36:26.152926       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-2 to (PodScheduled==False)
I0325 08:36:26.322345       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-4> in cache.
I0325 08:36:26.322475       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-0 to (PodScheduled==False)
I0325 08:36:26.719779       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-8 to (PodScheduled==False)
I0325 08:36:27.119165       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-6> in cache.
I0325 08:36:27.119228       1 cache.go:152] Updating pod condition for test/st-qj-2-st-qj-2-task-0-7 to (PodScheduled==False)
I0325 08:36:27.520555       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-1> in cache.
I0325 08:36:27.520588       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-2> in cache.
I0325 08:36:27.520603       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-0> in cache.
I0325 08:36:27.520615       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-8> in cache.
I0325 08:36:27.520818       1 event_handlers.go:214] Updated pod <test/st-qj-2-st-qj-2-task-0-7> in cache.
I0325 08:36:27.524747       1 session.go:140] Close Session 13f2a137-4ed9-11e9-8870-d6f72309612d
E0325 08:36:27.524861       1 runtime.go:69] Observed a panic: &errors.errorString{s:"Resource is not sufficient to do operation: <cpu 1000.00, memory 0.00, GPU 0.00> sub <cpu 100.00, memory 73400320.00, GPU 0.00>"} (Resource is not sufficient to do operation: <cpu 1000.00, memory 0.00, GPU 0.00> sub <cpu 100.00, memory 73400320.00, GPU 0.00>)
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:108
/root/go_projects/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:238
/root/go_projects/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:102
/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:96
/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:82
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: Resource is not sufficient to do operation: <cpu 1000.00, memory 0.00, GPU 0.00> sub <cpu 100.00, memory 73400320.00, GPU 0.00> [recovered]
	panic: Resource is not sufficient to do operation: <cpu 1000.00, memory 0.00, GPU 0.00> sub <cpu 100.00, memory 73400320.00, GPU 0.00>

goroutine 181 [running]:
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x12421a0, 0xc0006902f0)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api.(*Resource).Sub(0xc0008f9660, 0xc0008f9680, 0x13f0100)
	/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:108 +0x10e
volcano.sh/volcano/pkg/scheduler/actions/preempt.preempt(0xc000342700, 0xc00075b970, 0xc0003d8930, 0xc0003cd620, 0xc00075b910, 0x1, 0x0, 0x1)
	/root/go_projects/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:238 +0xf89
volcano.sh/volcano/pkg/scheduler/actions/preempt.(*preemptAction).Execute(0xc00000c048, 0xc000342700)
	/root/go_projects/src/volcano.sh/volcano/pkg/scheduler/actions/preempt/preempt.go:102 +0xc00
volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).runOnce(0xc000456300)
	/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:96 +0x239
volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).runOnce-fm()
	/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:82 +0x2a
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0005aa7e0)
	/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0005aa7e0, 0x3b9aca00, 0x0, 0x1, 0x0)
	/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc0005aa7e0, 0x3b9aca00, 0x0)
	/root/go_projects/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).Run
	/root/go_projects/src/volcano.sh/volcano/vendor/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:82 +0x194
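The stack trace shows `(*Resource).Sub` at `resource_info.go:108` panicking because the preempt action subtracts a victim's request (`cpu 100.00, memory 73400320.00`) from a node resource whose memory is already `0.00`. A minimal sketch of the failure mode, with a clamping variant a caller could use instead (the `Resource` fields and the `SafeSub` helper here are illustrative assumptions, not the real kube-batch API):

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource struct.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Sub mirrors the panicking behavior seen in the trace: it refuses to
// subtract when any dimension of rr exceeds the receiver.
func (r *Resource) Sub(rr *Resource) *Resource {
	if r.MilliCPU < rr.MilliCPU || r.Memory < rr.Memory {
		panic(fmt.Sprintf("Resource is not sufficient to do operation: <cpu %.2f, memory %.2f> sub <cpu %.2f, memory %.2f>",
			r.MilliCPU, r.Memory, rr.MilliCPU, rr.Memory))
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return r
}

// SafeSub clamps each dimension at zero instead of panicking — one
// possible guard (hypothetical helper, not present in the repo).
func SafeSub(l, r *Resource) *Resource {
	out := &Resource{}
	if l.MilliCPU > r.MilliCPU {
		out.MilliCPU = l.MilliCPU - r.MilliCPU
	}
	if l.Memory > r.Memory {
		out.Memory = l.Memory - r.Memory
	}
	return out
}

func main() {
	// Values from the log: idle node resource vs. the coredns victim's request.
	idle := &Resource{MilliCPU: 1000, Memory: 0}
	req := &Resource{MilliCPU: 100, Memory: 73400320}
	// idle.Sub(req) would panic here, since memory 0 < 73400320.
	fmt.Println(SafeSub(idle, req)) // prints &{900 0}
}
```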

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
    kube batch version: 0.4.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 25, 2019
zionwu commented Mar 26, 2019

I used v0.3 and hit a similar issue; there the panic is in the proportion plugin. Below is the log:

```
E0326 07:23:28.012485       1 runtime.go:66] Observed a panic: &errors.errorString{s:"Resource is not sufficient to do operation: <cpu 408833.33, memory 5346978668544.00, GPU 0.00> sub <cpu 608000.00, memory 5153960755200.00, GPU 0.00>"} (Resource is not sufficient to do operation: <cpu 408833.33, memory 5346978668544.00, GPU 0.00> sub <cpu 608000.00, memory 5153960755200.00, GPU 0.00>)
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:522
/usr/local/go/src/runtime/panic.go:513
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:108
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/proportion/proportion.go:138
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/framework.go:36
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:83
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:76
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1333
panic: Resource is not sufficient to do operation: <cpu 408833.33, memory 5346978668544.00, GPU 0.00> sub <cpu 608000.00, memory 5153960755200.00, GPU 0.00> [recovered]
	panic: Resource is not sufficient to do operation: <cpu 408833.33, memory 5346978668544.00, GPU 0.00> sub <cpu 608000.00, memory 5153960755200.00, GPU 0.00>

goroutine 246 [running]:
github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x111cb00, 0xc00102a870)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api.(*Resource).Sub(0xc001003760, 0xc0010037e0, 0xc000564458)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/api/resource_info.go:108 +0x10e
github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/proportion.(*proportionPlugin).OnSessionOpen(0xc000f78360, 0xc00039a9c0)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/plugins/proportion/proportion.go:138 +0xefc
github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework.OpenSession(0x143e180, 0xc0002a01c0, 0xc000e4e280, 0x5, 0x8, 0x0)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/framework/framework.go:36 +0x275
github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).runOnce(0xc000395ce0)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:83 +0xf3
github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).runOnce-fm()
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:76 +0x2a
github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000ca4fd0)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000ca4fd0, 0x3b9aca00, 0x0, 0x1, 0xc0004d6180)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbe
github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000ca4fd0, 0x3b9aca00, 0xc0004d6180)
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubernetes-sigs/kube-batch/pkg/scheduler.(*Scheduler).Run
	/Users/klaus/Workspace/kb_ws/src/github.com/kubernetes-sigs/kube-batch/pkg/scheduler/scheduler.go:76 +0x167

```

zionwu commented Mar 26, 2019

@k82cn The PR #660 does not fix the same panic issue in proportion.go.

hex108 commented Mar 26, 2019

Yes, as I said in that PR, we might need to check all the places that call Sub.
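One way to audit those call sites is to guard each subtraction with a dominance check before calling Sub, rather than letting Sub panic. A sketch of that caller-side pattern, using the values from the first log (the `Resource` fields and the `LessEqual` method are assumptions for illustration, not necessarily the real API):

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's api.Resource.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// LessEqual reports whether every dimension of r fits within rr —
// the dominance check a caller can run before subtracting.
func (r *Resource) LessEqual(rr *Resource) bool {
	return r.MilliCPU <= rr.MilliCPU && r.Memory <= rr.Memory
}

// Sub keeps the existing panicking contract.
func (r *Resource) Sub(rr *Resource) *Resource {
	if !rr.LessEqual(r) {
		panic("Resource is not sufficient to do operation")
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return r
}

func main() {
	allocatable := &Resource{MilliCPU: 1000, Memory: 0}
	victim := &Resource{MilliCPU: 100, Memory: 73400320}
	// Guard the call site instead of crashing the whole scheduler loop:
	if victim.LessEqual(allocatable) {
		allocatable.Sub(victim)
	} else {
		fmt.Println("skip victim: subtracting would underflow allocatable")
	}
}
```

With this shape, an unexpected underflow in preempt.go or proportion.go becomes a skipped candidate (or a logged error) instead of a process-wide panic.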
