Roadmap

OpenPAI Roadmap 2019

We typically look out 6 to 12 months and establish topics we want to work on. As we go, we learn, and our assessment of some of the listed topics changes. Thus, we may add or drop topics along the way.

We describe some initiatives as "investigations", which means our goal in the next few months is to better understand the problem and potential solutions before scheduling actual feature work. Once an investigation is done, we will update our plan, either deferring the initiative or committing to it.

Iteration plans can be found under Iteration Plans.

As always, we will listen to your feedback and adapt our plans if needed.

Themes

Our roadmap covers the following themes:

User themes

Easy to use: Simple and just enough UX for data scientists, researchers and students
Easy to debug: An easy-to-understand, easy-to-use experience for debugging training jobs

Admin and Ops themes

Easy to manage: Smooth installation, upgrade, and maintenance experience for IT admins
Easy to operate: Easy-to-use metrics that help operations understand resource usage

Easy to use

Instead of setting GPU, memory, and cores manually and separately, PAI exposes simple SKUs that are available in the PAI instance for users to select #2062 @debuggy @qfyin
Provide filters for jobs list page #302 @Gerhut

Refine Job detail page
- First round of refinement #2211 @sunqinzheng @qfyin
- Tensorboard direct link option in job page
Accurate Job Status, Failure Reason (Category), and useful Resolutions #2326
Provide better documentation and best practices for users
- How to use NFS as PAI's storage
- As an ops, I need to know the best practices for using PAI VC, and I need better manageability of VC. Related issues: #2073, #906
- Best practices for checkpointing, so that a job can retry from its most recent progress
Job list view shows GPU and Task counts
A user home page that provides a consolidated view of all 'my' own related work.
PAIShare and Marketplace scenarios
As a PAI user, I need to better understand how resources will be, are, or were allocated for my jobs, so that I know how to better compose my job scripts. Related issues: #2062, #1943, #1989, #1777, #1819, #1904, #1995, #1968

Easy to debug

Better Job Debugging Experience for End Users #2210 @ydye
- Job debugging reservation when a job fails due to user error. #2213 @ydye
- An option for users to enable or disable debugging reservation for a job. #2214 @ydye
- An approach to collect the container information that is reserved for job debugging. #2215
- Display the debugging reservation status of a job in the web portal #2216
- An approach to notify users when their jobs are in debugging reservation #2217
- Provide detailed information when the job container exits. #2218

Easy to manage

Team-wise storage management support #2204 @ydye @wangdian
Backward-compatible upgrading #2212 @hao1939

Role-based access control
- User account integration with AAD
GPU scheduling with priority
Per job preemption choice
PAI everywhere
- Run PAI on an existing Kubernetes cluster
Support allocating resources to a VC by quantity instead of percentage
HA support for OpenPAI
Cluster auto-maintenance
A complete story for storage support
As a running service, we should not expose too much info to unknown users before login.
Machine auto-maintenance

Easy to operate

Detect and alert for unhealthy GPU #2192 @mzmssg @xudifsd
Provide the ability to query all the jobs on a node in the PAI web portal #2128 @xudifsd
Display per-VC/queue resource utilization metrics in Grafana #2208 @xudifsd

Detect and alert on low-utilization jobs
Ability to generate reports for cluster/vc/resources/users/jobs usage #2127
GPU status summary
List the utilization of all GPUs on a single machine
As an ops, I need the capability to batch create/update/delete user accounts and share with users through their email. Related Issues: #2078, #2085, #921

Better foundation

In addition to the above themes, fundamental architecture improvements are needed to support all of these features and experiences:

End to end job event tracking/logging support

If you have any questions or concerns about this wiki, please open an OpenPAI issue directly.
