
Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action #2326

Closed
7 of 11 tasks
scarlett2018 opened this issue Mar 15, 2019 · 3 comments

Comments

@scarlett2018
Member

scarlett2018 commented Mar 15, 2019

Work breakdown:

=============
Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action

Previous problems are also listed below:

[screenshot: previous problems]

Job Status

  • Job Submitted
    info needed:
    • job start time
    • retry times
  • Job Waiting
    info needed:
    • sub-task status: Task Queueing; Task Allocating Resource; Task Waiting
    • job waiting time
  • Job Running
    info needed:
    • sub-task status: Task Waiting; Task Running; Task Completed; Task Failed; Task Stopped
    • job execution duration
  • Job Stopped
    info needed:
    • job stopped at (time)
    • job stopped by (system/user)
    • job stopped reason
  • Job Succeeded
    info needed:
    • job end time
  • Job Failed
    info needed:
    • job failed at (time)
    • Error Type
    • Error Code
    • Error Message
    • Matched Pattern/resolution

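Below is a rough sketch of how the per-state info above could be carried in a single job status summary payload. It is written in TypeScript only for illustration; every field name here (`startTime`, `retryCount`, `matchedResolution`, etc.) is an assumption, not the actual PAI REST API schema.

```typescript
// Illustrative sketch only: these names are assumptions, not the real PAI REST schema.
type JobState = 'SUBMITTED' | 'WAITING' | 'RUNNING' | 'STOPPED' | 'SUCCEEDED' | 'FAILED';

type TaskState =
  | 'QUEUEING'
  | 'ALLOCATING_RESOURCE'
  | 'WAITING'
  | 'RUNNING'
  | 'COMPLETED'
  | 'FAILED'
  | 'STOPPED';

interface JobStatusSummary {
  state: JobState;
  startTime?: string;          // Job Submitted: job start time (ISO 8601)
  retryCount?: number;         // Job Submitted: retry times
  taskStates?: TaskState[];    // Job Waiting / Job Running: sub-task status
  waitingSeconds?: number;     // Job Waiting: job waiting time
  executionSeconds?: number;   // Job Running: job execution duration
  stoppedAt?: string;          // Job Stopped: stop time
  stoppedBy?: 'system' | 'user';
  stoppedReason?: string;
  endTime?: string;            // Job Succeeded: job end time
  failedAt?: string;           // Job Failed: failure time
  errorType?: string;          // one of the Error Type categories below
  errorCode?: number;
  errorMessage?: string;
  matchedResolution?: string;  // resolution text from the matched pattern, if any
}
```
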
Error Type (this needs to be discussed and updated after team discussion)

  • User Command
  • Launcher preparation failure
  • Docker failure
  • AM and other types (e.g., error code 177)

Matched Pattern
Apply user-code analysis rules to user-command sub-type errors; a rough sketch of such pattern rules follows below.

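One possible shape for these rules is a small table of regexes keyed by error type, each carrying a resolution string. This is only a sketch under assumed names (the rule shape, the example patterns, and `matchDiagnostic` are not an existing PAI implementation); the two patterns are taken from the test cases below.

```typescript
// Sketch only: rule shape and matcher are assumptions, not PAI's actual analyzer.
interface DiagnosticRule {
  errorType: 'USER_COMMAND' | 'LAUNCHER_PREPARATION' | 'DOCKER_FAILURE' | 'AM_OR_OTHER';
  pattern: RegExp;
  resolution: string;
}

const rules: DiagnosticRule[] = [
  {
    // Test Case 1: gang allocation timed out while waiting for resources.
    errorType: 'AM_OR_OTHER',
    pattern: /GangAllocation cannot be satisfied in time/,
    resolution: 'Current available resources may not be enough; please retry later or reduce the request.',
  },
  {
    // Test Case 2: requested GPUs exceed the configured maximum.
    errorType: 'AM_OR_OTHER',
    pattern: /InvalidResourceRequestException.*requestedGPUs=(\d+), maxGPUs=(\d+)/,
    resolution: 'Requested GPUs exceed the configured maximum; lower the GPU count per task.',
  },
];

// Return the first rule whose pattern matches the diagnostic text, if any.
function matchDiagnostic(diagnostics: string): DiagnosticRule | undefined {
  return rules.find((rule) => rule.pattern.test(diagnostics));
}
```
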
## Test Cases
Test Case 1. An example of a job stuck in "waiting" because available resources are not enough:
“GangAllocation cannot be satisfied in time: Still waiting for 1 outstanding Tasks after timeout 1100s, maybe current available resource for the application is not enough, please retry later.”
[screenshot]

Test Case 2. #2118
http://*.225:9286/view.html?username=Guisu&jobName=bert-pytorch-corn-18

The job requested 6 GPUs, but the server actually has only 4 at maximum.
[screenshot]
In the application summary log: http://*.225:9286/view.html?username=Scarlett&jobName=scarlett-bert-pytorch-corn-18
[ExitCustomizedDiagnostics]:
onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception:
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=6, maxGPUs=4
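
Assuming a rule table like the `matchDiagnostic` sketch above, this diagnostic could be classified and surfaced with a concrete resolution, for example:

```typescript
// Hypothetical usage of the matchDiagnostic sketch above against this log line.
const diag =
  'org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: ' +
  'Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, ' +
  'requestedGPUs=6, maxGPUs=4';

const matched = matchDiagnostic(diag);
if (matched) {
  // Expected: errorType 'AM_OR_OTHER' plus a resolution suggesting a lower GPU count.
  console.log(matched.errorType, matched.resolution);
}
```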

Test Case 3. Need to know when the job is killed by the system #2333
NNI reports that some jobs failed, but the status is not clear enough to tell users whether the job was killed by the system or failed due to user code.


@sterowang

How about we display a timeline graph to show these curves together with job status changes?

  1. How many containers have been allocated.
  2. How many containers are pending.
  3. How many containers are running.
  4. How many containers have succeeded.
  5. How many containers have failed.
  6. How many containers have stopped.

@yqwang-ms
Member

This PR covers the latest per-container state:
https://github.com/Microsoft/pai/pull/2306/files

For timeline graph, we may need to store status history first.
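
A minimal sketch of what such a stored status-history record could look like (TypeScript; the field names are assumptions, not an existing schema):

```typescript
// Sketch only: a timestamped snapshot of container counts per state.
// Appending one snapshot per poll would give the data needed to plot the
// container-count curves together with job status changes.
interface ContainerCountSnapshot {
  timestamp: string;   // ISO 8601 time of the snapshot
  jobState: string;    // job status at this point in time
  allocated: number;
  pending: number;
  running: number;
  succeeded: number;
  failed: number;
  stopped: number;
}
```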
