This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Roadmap

Scarlett Li edited this page Mar 15, 2019 · 54 revisions

OpenPAI Roadmap 2019

We typically look out 6 to 12 months and establish the topics we want to work on. As we go, we learn, and our assessment of some of the listed topics changes. Thus, we may add or drop topics along the way.

We describe some initiatives as "investigations", which means that our goal over the next few months is to better understand the problem and potential solutions before scheduling actual feature work. Once an investigation is done, we will update our plan, either deferring the initiative or committing to it.

Iteration plans can be found here: Iteration Plans.

As always, we will listen to your feedback and adapt our plans if needed.

Themes

Our roadmap covers the following themes:

User themes

  • Easy to use: A simple, just-enough UX for data scientists, researchers, and students
  • Easy to debug: An easy-to-understand, easy-to-use experience for debugging training jobs

Admin and Ops themes

  • Easy to manage: A smooth installation, upgrade, and maintenance experience for IT admins
  • Easy to operate: Easy-to-use metrics that help Operations understand resource usage

Easy to use

  • 0.11.0 candidate Instead of requiring users to set GPU/memory/core manually and separately, PAI exposes simple SKUs available in this PAI instance for users to select #2062 @debuggy @qfyin
  • 0.11.0 candidate Provide filters for the job list page #302 @Gerhut
  • Refine Job detail page
    • 0.11.0 candidate First round of refinement #2211 @sunqinzheng @qfyin
    • Tensorboard direct link option in job page
  • Accurate Job Status, Failure Reason (Category), and useful Resolutions #2326
  • Provide better documentation and best practices for users
    • How to use NFS as PAI's storage
    • As an ops engineer, I need to know the best practices for using PAI VCs, and I need better manageability of VCs. Related issues: #2073, #906
    • Best practices for checkpointing, so that a job can retry from its most recent progress
  • Job list view shows GPU and Task counts
  • A user home page that provides a consolidated view of all of the user's own work.
  • PAIShare and Marketplace scenarios
  • As a PAI user, I need to better understand how resources will be, are, or were allocated to my jobs, so that I know how to better compose my job scripts. Related issues: #2062, #1943, #1989, #1777, #1819, #1904, #1995, #1968

Easy to debug

  • Better Job Debugging Experience for End Users #2210 @ydye
    • 0.11.0 candidate Job debugging reservation when a job fails due to user error. #2213 @ydye
    • 0.11.0 candidate An option for users to decide whether to enable debugging reservation for a job. #2214 @ydye
    • An approach to collect information about the containers reserved for job debugging. #2215
    • Display the debugging reservation status of a job in the web portal #2216
    • An approach to notify users when their jobs are in debugging reservation #2217
    • Provide detailed information when the job container exits. #2218

Easy to manage

  • 0.11.0 candidate Team-wise storage management support #2204 @ydye @wangdian
  • 0.11.0 candidate Backward-compatible upgrading #2212 @hao1939
  • Role-based access control
    • 0.11.0 candidate User account integration with AAD
  • GPU scheduling with priority
  • Per-job preemption choice
  • PAI everywhere
    • Run PAI on an existing Kubernetes cluster
  • Support allocating resources to a VC by quantity instead of percentage
  • HA support for OpenPAI
  • Cluster auto-maintenance
  • A complete story for storage support
  • As a running service, PAI should not expose too much information to unauthenticated users before login.
  • Machine auto-maintenance

Easy to operate

  • 0.11.0 candidate Detect and alert on unhealthy GPUs #2192 @mzmssg @xudifsd
  • 0.11.0 candidate Provide the ability to query all the jobs on a node in the PAI Web Portal #2128 @xudifsd
  • 0.11.0 candidate Display per-VC/queue resource utilization metrics in Grafana #2208 @xudifsd
  • Detect and alert on low-utilization jobs
  • Ability to generate reports for cluster/vc/resources/users/jobs usage #2127
  • GPU status summary
  • List the utilization of all GPUs on a single machine
  • As an ops engineer, I need the capability to batch create/update/delete user accounts and share them with users by email. Related issues: #2078, #2085, #921

Better foundation

In addition to the above themes, fundamental architecture improvements are needed to support all of these features and experiences:

  • End to end job event tracking/logging support