This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Provide a better failure message experience for GPU resource request failure #2118

Closed
scarlett2018 opened this issue Feb 1, 2019 · 1 comment
Labels
C-MII PAI-Exp ResourceAllocation GangAllocation and others

Comments

@scarlett2018
Member

http://*.225:9286/view.html?username=Guisu&jobName=bert-pytorch-corn-18

The job requested 6 GPUs, but the server allows a maximum of only 4.
In the application summary log: http://*.225:9286/view.html?username=Scarlett&jobName=scarlett-bert-pytorch-corn-18
[ExitCustomizedDiagnostics]:
onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception:
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=6, maxGPUs=4
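
For illustration, a minimal sketch of how a front end could turn the raw diagnostics above into a clearer message. It is not taken from the PAI or Hadoop code base: the function name `friendly_gpu_error` and the wording of the messages are hypothetical, and it assumes the diagnostics always carry the `requestedGPUs=…, maxGPUs=…` pair in the form quoted in the exception.

```python
import re
from typing import Optional

# Matches the key=value pair as it appears in the exception text quoted above.
# Assumption: the format "requestedGPUs=<int>, maxGPUs=<int>" is stable.
_GPU_REQUEST_RE = re.compile(r"requestedGPUs=(-?\d+),\s*maxGPUs=(-?\d+)")


def friendly_gpu_error(diagnostics: str) -> Optional[str]:
    """Return a user-facing message for an invalid GPU request.

    Hypothetical helper: returns None when the diagnostics do not
    describe a GPU request problem in the expected format.
    """
    match = _GPU_REQUEST_RE.search(diagnostics)
    if match is None:
        return None

    requested = int(match.group(1))
    maximum = int(match.group(2))

    if requested < 0:
        return (
            f"The job requested {requested} GPUs. "
            "The GPU count must not be negative."
        )
    if requested > maximum:
        return (
            f"The job requested {requested} GPUs, but the configured "
            f"maximum is {maximum}. Reduce the request to {maximum} or fewer."
        )
    # The numbers look valid, so the failure has some other cause.
    return None


if __name__ == "__main__":
    raw = (
        "org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: "
        "Invalid resource request, requested GPUs < 0, or requested GPUs > "
        "max configured, requestedGPUs=6, maxGPUs=4"
    )
    print(friendly_gpu_error(raw))
```

With a message like this, the user would see the requested and maximum GPU counts directly instead of having to read them out of the stack trace.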

@scarlett2018
Member Author

Merged with #2326.
