
Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action #2326

Closed
7 of 11 tasks
scarlett2018 opened this issue Mar 15, 2019 · 3 comments

Comments

@scarlett2018
Member

scarlett2018 commented Mar 15, 2019

Work breakdown:

=============
Need Accurate Job Status, Failure Reason (Category), and useful Resolutions for users to take action

Previous problems are also listed below:

[screenshot: previous problems]

Job Status

  • Job Submitted
    info needed:
    • job start time
    • retry times
  • Job Waiting
    info needed:
    • sub-task status: Task Queueing; Task Allocating Resource; Task Waiting
    • job waiting time
  • Job Running
    info needed:
    • sub-task status: Task Waiting; Task Running; Task Completed; Task Failed; Task Stopped
    • job execution duration
  • Job Stopped
    info needed:
    • job stopped at (time)
    • job stopped by (system/user)
    • job stopped reason
  • Job Succeeded
    info needed:
    • job end time
  • Job Failed
    info needed:
    • job failed at (time)
    • Error Type
    • Error Code
    • Error Message
    • Matched Pattern/resolution

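Below is a rough sketch of how the per-state info above could be carried in a single job status summary payload. It is written in TypeScript only for illustration; every field name here (`startTime`, `retryCount`, `matchedResolution`, etc.) is an assumption, not the actual PAI REST API schema.

```typescript
// Illustrative sketch only: these names are assumptions, not the real PAI REST schema.
type JobState = 'SUBMITTED' | 'WAITING' | 'RUNNING' | 'STOPPED' | 'SUCCEEDED' | 'FAILED';

type TaskState =
  | 'QUEUEING'
  | 'ALLOCATING_RESOURCE'
  | 'WAITING'
  | 'RUNNING'
  | 'COMPLETED'
  | 'FAILED'
  | 'STOPPED';

interface JobStatusSummary {
  state: JobState;
  startTime?: string;          // Job Submitted: job start time (ISO 8601)
  retryCount?: number;         // Job Submitted: retry times
  taskStates?: TaskState[];    // Job Waiting / Job Running: sub-task status
  waitingSeconds?: number;     // Job Waiting: job waiting time
  executionSeconds?: number;   // Job Running: job execution duration
  stoppedAt?: string;          // Job Stopped: stop time
  stoppedBy?: 'system' | 'user';
  stoppedReason?: string;
  endTime?: string;            // Job Succeeded: job end time
  failedAt?: string;           // Job Failed: failure time
  errorType?: string;          // one of the Error Type categories below
  errorCode?: number;
  errorMessage?: string;
  matchedResolution?: string;  // resolution text from the matched pattern, if any
}
```
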
Error Type (this needs to be discussed and updated after team discussion)

  • User Command
  • Launcher preparation failure
  • Docker failure
  • AM and other types (e.g., error code 177)

Matched Pattern
Apply user-code analysis rules to user-command sub-type errors; a rough sketch of such pattern rules follows below.

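One possible shape for these rules is a small table of regexes keyed by error type, each carrying a resolution string. This is only a sketch under assumed names (the rule shape, the example patterns, and `matchDiagnostic` are not an existing PAI implementation); the two patterns are taken from the test cases below.

```typescript
// Sketch only: rule shape and matcher are assumptions, not PAI's actual analyzer.
interface DiagnosticRule {
  errorType: 'USER_COMMAND' | 'LAUNCHER_PREPARATION' | 'DOCKER_FAILURE' | 'AM_OR_OTHER';
  pattern: RegExp;
  resolution: string;
}

const rules: DiagnosticRule[] = [
  {
    // Test Case 1: gang allocation timed out while waiting for resources.
    errorType: 'AM_OR_OTHER',
    pattern: /GangAllocation cannot be satisfied in time/,
    resolution: 'Current available resources may not be enough; please retry later or reduce the request.',
  },
  {
    // Test Case 2: requested GPUs exceed the configured maximum.
    errorType: 'AM_OR_OTHER',
    pattern: /InvalidResourceRequestException.*requestedGPUs=(\d+), maxGPUs=(\d+)/,
    resolution: 'Requested GPUs exceed the configured maximum; lower the GPU count per task.',
  },
];

// Return the first rule whose pattern matches the diagnostic text, if any.
function matchDiagnostic(diagnostics: string): DiagnosticRule | undefined {
  return rules.find((rule) => rule.pattern.test(diagnostics));
}
```
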
## Test Cases
Test Case 1. An example of a job stuck in "waiting" because available resources are not enough:
“GangAllocation cannot be satisfied in time: Still waiting for 1 outstanding Tasks after timeout 1100s, maybe current available resource for the application is not enough, please retry later.”
[screenshot]

Test Case 2. #2118
http://*.225:9286/view.html?username=Guisu&jobName=bert-pytorch-corn-18

The job requested 6 GPUs, but the server actually has only 4 at maximum.
[screenshot]
In the application summary log: http://*.225:9286/view.html?username=Scarlett&jobName=scarlett-bert-pytorch-corn-18
[ExitCustomizedDiagnostics]:
onError called into AM from RM due to non-transient error, maybe application is non-compliant.
Exception:
org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, requestedGPUs=6, maxGPUs=4
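
Assuming a rule table like the `matchDiagnostic` sketch above, this diagnostic could be classified and surfaced with a concrete resolution, for example:

```typescript
// Hypothetical usage of the matchDiagnostic sketch above against this log line.
const diag =
  'org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: ' +
  'Invalid resource request, requested GPUs < 0, or requested GPUs > max configured, ' +
  'requestedGPUs=6, maxGPUs=4';

const matched = matchDiagnostic(diag);
if (matched) {
  // Expected: errorType 'AM_OR_OTHER' plus a resolution suggesting a lower GPU count.
  console.log(matched.errorType, matched.resolution);
}
```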

Test Case 3. Need to know when the job is killed by the system #2333
NNI reports that some jobs failed, but the status is not clear enough to tell users whether the job was killed by the system or failed due to user code.


@sterowang

How about we display a timeline graph to show these curves together with job status changes?

  1. How many containers have been allocated.
  2. How many containers are pending.
  3. How many containers are running.
  4. How many containers have succeeded.
  5. How many containers have failed.
  6. How many containers have stopped.

@yqwang-ms
Member

This PR covers the latest per-container state:
https://github.com/Microsoft/pai/pull/2306/files

For timeline graph, we may need to store status history first.
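
A minimal sketch of what such a stored status-history record could look like (TypeScript; the field names are assumptions, not an existing schema):

```typescript
// Sketch only: a timestamped snapshot of container counts per state.
// Appending one snapshot per poll would give the data needed to plot the
// container-count curves together with job status changes.
interface ContainerCountSnapshot {
  timestamp: string;   // ISO 8601 time of the snapshot
  jobState: string;    // job status at this point in time
  allocated: number;
  pending: number;
  running: number;
  succeeded: number;
  failed: number;
  stopped: number;
}
```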
