Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action #2326
Work breakdown:
1. Setup Initial Spec Config file
   1. Preserve Known User Container Exit Code
   2. Preserve Unknown User Container Exit Code
   3. Support structured ExitDiagnostics
2. YARN Collects Runtime Dynamic ExitInfo (Runtime.pai.agg.error)
   1. Refine ExitCode
   2. Split platform log from user log
   3. Generate dynamic ExitInfo into the Runtime.pai.agg.error file
3. Print a User Understandable Container Disk Error Message to the DiskCleaner.pai.error file under the Container log directory
4. Setup and Load Spec
5. Get more structured ExitInfo from static ExitSpec and dynamic ExitDiagnostics (see the sketch after this list)
6. Show more structured ExitInfo
7. Add Error Message and Retry info in UI
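As a rough sketch of items 1 and 5 above, the snippet below shows how a static exit-spec entry and the dynamic ExitDiagnostics collected at runtime could be merged into one structured ExitInfo record. All exit codes, field names, and messages here are illustrative assumptions, not the actual OpenPAI exit-spec schema.

```python
# Illustrative sketch only: the exit codes, field names, and messages below are
# assumptions, not the actual OpenPAI exit-spec schema.
STATIC_EXIT_SPEC = {
    1: {
        "phrase": "UserCommandFailed",
        "type": "USER_FAILURE",
        "reason": "The user command exited with a non-zero code.",
        "resolution": "Check the user log for the failing command.",
    },
    137: {
        "phrase": "ContainerKilledBySystem",
        "type": "PLATFORM_FAILURE",
        "reason": "The container was killed, e.g. out of memory or node lost.",
        "resolution": "Check container memory usage, then retry the job.",
    },
}

# Fallback entry that preserves an unknown user container exit code.
UNKNOWN_EXIT_INFO = {
    "phrase": "UnknownContainerExitCode",
    "type": "UNKNOWN_FAILURE",
    "reason": "The container exited with a code that is not in the spec.",
    "resolution": "Check Runtime.pai.agg.error and the user log.",
}


def build_exit_info(exit_code, dynamic_diagnostics=None):
    """Merge the static spec entry for exit_code with dynamic ExitDiagnostics."""
    info = dict(STATIC_EXIT_SPEC.get(exit_code, UNKNOWN_EXIT_INFO))
    info["code"] = exit_code
    if dynamic_diagnostics:
        # Dynamic info (e.g. the content of Runtime.pai.agg.error) refines the entry.
        info["diagnostics"] = dynamic_diagnostics
    return info
```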
=============
Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action
The previous problems are also listed below:
Job Status
info needed:
Error Type (this needs to be discussed and updated after team discussion)
Matched Pattern
Apply user-code analysis rules to classify sub-types of user command errors.
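One way to realize "Matched Pattern" and these analysis rules is a small table of regular expressions applied to the aggregated diagnostics. This is only a sketch: the error types and patterns below are placeholders, not an agreed taxonomy (the GangAllocation message in Test Case 1 below would match the first rule).

```python
import re

# Placeholder rules; the error types and regexes are illustrative only.
ERROR_RULES = [
    ("RESOURCE_NOT_ENOUGH", re.compile(r"GangAllocation cannot be satisfied in time")),
    ("INVALID_RESOURCE_REQUEST", re.compile(r"InvalidResourceRequestException")),
    ("USER_CODE_ERROR", re.compile(r"Traceback \(most recent call last\)")),
]


def classify(diagnostics: str):
    """Return (error_type, matched_pattern) for the first rule that matches."""
    for error_type, pattern in ERROR_RULES:
        match = pattern.search(diagnostics)
        if match:
            return error_type, match.group(0)
    return "UNKNOWN", None
```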
## Test Cases
Test Case 1. An example of a job "waiting" because resources are not enough:
“GangAllocation cannot be satisfied in time: Still waiting for 1 outstanding Tasks after timeout 1100s, maybe current available resource for the application is not enough, please retry later.”
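For illustration only, a rule for this message could also extract the outstanding-task count and the timeout so the UI can show a clearer waiting status; the summary wording below is an assumption.

```python
import re

# Parses the "waiting" diagnostic quoted above (Test Case 1).
WAITING_PATTERN = re.compile(
    r"Still waiting for (\d+) outstanding Tasks after timeout (\d+)s"
)


def describe_waiting(message: str):
    """Return a user-facing waiting summary, or None if the pattern is absent."""
    match = WAITING_PATTERN.search(message)
    if match is None:
        return None
    tasks, timeout = match.groups()
    return (f"Waiting for resources: {tasks} task(s) still unscheduled after {timeout}s. "
            f"The cluster may not have enough free resources right now; please retry later.")
```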
Test Case 2. #2118
http://*.225:9286/view.html?username=Guisu&jobName=bert-pytorch-corn-18
The job requested 6 GPUs, but the server actually has only 4 at most.
In the application summary log: http://*.225:9286/view.html?username=Scarlett&jobName=scarlett-bert-pytorch-corn-18
[ExitCustomizedDiagnostics]:
onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception:
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=6, maxGPUs=4
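Assuming the aggregated diagnostics contain the YARN exception line quoted above, a matcher could turn it into an actionable resolution; the resolution wording here is an assumption.

```python
import re

# Extracts the requested/max GPU counts from the YARN exception text above.
GPU_REQUEST_PATTERN = re.compile(r"requestedGPUs=(\d+), maxGPUs=(\d+)")


def gpu_request_resolution(diagnostics: str):
    """Map InvalidResourceRequestException diagnostics to a user-facing resolution."""
    match = GPU_REQUEST_PATTERN.search(diagnostics)
    if match is None:
        return None
    requested, max_gpus = match.groups()
    return (f"The job requested {requested} GPUs per task, but the cluster allows at most "
            f"{max_gpus}. Reduce the GPU request to {max_gpus} or fewer and resubmit.")
```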
Test Case 3. Need to know when the job is killed by the system: #2333
NNI reports that some jobs failed, but the status is not clear enough to tell whether a job was killed by the system or failed due to user code.
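A minimal sketch of the distinction NNI needs, assuming the structured ExitInfo proposed above carries a field saying who caused the exit; the field name `causer` and the summary strings are illustrative.

```python
def killed_by_system(exit_info: dict) -> bool:
    """True if the platform, not the user's code, terminated the job."""
    return exit_info.get("causer") == "SYSTEM"


def summarize_for_nni(exit_info: dict) -> str:
    """One-line status that separates system kills from user-code failures."""
    if killed_by_system(exit_info):
        return "KILLED_BY_SYSTEM: the platform stopped this job; it is safe to retry."
    return "USER_FAILURE: the job failed in user code; check the user log for details."
```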