Comparing changes

base repository: kubeflow/trainer
base: v1.3.0-rc.2
head repository: kubeflow/trainer
compare: v1.3.0
  • 3 commits
  • 5 files changed
  • 1 contributor

Commits on Oct 3, 2021

  1. Cherry pick #1415 #1418 to v1.3-branch (#1428)

    * Feature/support pytorchjob set queue of volcano (#1415)
    
    * support pytorch use volcano-queue
    
    * support pytorch use volcano-queue
    
    Signed-off-by: bert.li <qiankun.li@qq.com>
    
    * set SchedulingPolicy for runPolicy
    
    Signed-off-by: bert.li <qiankun.li@qq.com>
    
    * use pytorchjob.Spec.RunPolicy directly
    
    * fix hyperlinks in the 'overview' section (#1418)
    
    hyperlinks now point to the latest api reference files.
    issue - #1411
    Jeffwan authored Oct 3, 2021

    e0b872e
  2. 760ac11
  3. 3aa19be
6 changes: 3 additions & 3 deletions README.md
@@ -12,9 +12,9 @@ run distributed or non-distributed TensorFlow/PyTorch/MXNet/XGBoost jobs on Kube

 - For a complete reference of the custom resource definitions, please refer to the API Definition.
   - [Tensorflow API Definition](pkg/apis/tensorflow/v1/types.go)
-  - [PyTorch API Definition](pkg/apis/pytorch/v1/types.go)
-  - [MXNet API Definition](pkg/apis/mxnet/v1/types.go)
-  - [XGBoost API Definition](pkg/apis/xgboost/v1/types.go)
+  - [PyTorch API Definition](pkg/apis/pytorch/v1/pytorchjob_types.go)
+  - [MXNet API Definition](pkg/apis/mxnet/v1/mxjob_types.go)
+  - [XGBoost API Definition](pkg/apis/xgboost/v1/xgboostjob_types.go)
 - For details on API design, please refer to the [v1alpha2 design doc](https://github.com/kubeflow/community/blob/master/proposals/tf-operator-design-v1alpha2.md).
 - For details of all-in-one operator design, please refer to the [All-in-one Kubeflow Training Operator](https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit#heading=h.e33ufidnl8z6)
 - For details on its obersibility, please refer to the [monitoring design doc](docs/monitoring/README.md).
2 changes: 1 addition & 1 deletion manifests/overlays/kubeflow/kustomization.yaml
@@ -7,4 +7,4 @@ resources:
 images:
 - name: kubeflow/training-operator
   newName: public.ecr.aws/j1r0q0g6/training/training-operator
-  newTag: "d4423c83124ce7ab58b9a61a2e909b2e9c14c236"
+  newTag: "760ac1171dd30039a7363ffa03c77454bd714da5"
2 changes: 1 addition & 1 deletion manifests/overlays/standalone/kustomization.yaml
@@ -7,4 +7,4 @@ resources:
 images:
 - name: kubeflow/training-operator
   newName: public.ecr.aws/j1r0q0g6/training/training-operator
-  newTag: "d4423c83124ce7ab58b9a61a2e909b2e9c14c236"
+  newTag: "760ac1171dd30039a7363ffa03c77454bd714da5"
11 changes: 1 addition & 10 deletions pkg/controller.v1/mxnet/mxjob_controller.go
@@ -166,17 +166,8 @@ func (r *MXJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl
 		replicas[commonv1.ReplicaType(k)] = v
 	}
 
-	// Construct RunPolicy based on MXJob.Spec
-	runPolicy := &commonv1.RunPolicy{
-		CleanPodPolicy:          mxjob.Spec.RunPolicy.CleanPodPolicy,
-		TTLSecondsAfterFinished: mxjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-		ActiveDeadlineSeconds:   mxjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-		BackoffLimit:            mxjob.Spec.RunPolicy.BackoffLimit,
-		SchedulingPolicy:        nil,
-	}
-
 	// Use common to reconcile the job related pod and service
-	err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, runPolicy)
+	err = r.ReconcileJobs(mxjob, replicas, mxjob.Status, &mxjob.Spec.RunPolicy)
 	if err != nil {
 		logrus.Warnf("Reconcile MX Job error %v", err)
 		return ctrl.Result{}, err
11 changes: 1 addition & 10 deletions pkg/controller.v1/pytorch/pytorchjob_controller.go
@@ -155,17 +155,8 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
 	// Set default priorities to pytorch job
 	r.Scheme.Default(pytorchjob)
 
-	// Construct RunPolicy based on PyTorchJob.Spec
-	runPolicy := &commonv1.RunPolicy{
-		CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
-		TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
-		ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
-		BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
-		SchedulingPolicy:        nil,
-	}
-
 	// Use common to reconcile the job related pod and service
-	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)
+	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, &pytorchjob.Spec.RunPolicy)
 	if err != nil {
 		logrus.Warnf("Reconcile PyTorch Job error %v", err)
 		return ctrl.Result{}, err
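
Note on the two controller changes above: the removed code rebuilt the RunPolicy field by field and hard-coded SchedulingPolicy to nil, so any scheduling settings on the job spec never reached ReconcileJobs. The sketch below illustrates that effect in isolation. It is not the operator's code: the RunPolicy and SchedulingPolicy types are simplified, hypothetical stand-ins for the commonv1 types, the Queue field name is an assumption based on the "volcano-queue" commit message, and rebuildRunPolicy and queueOf are helper names invented for this example.

```go
package main

import "fmt"

// SchedulingPolicy is a hypothetical, reduced stand-in for the commonv1 type.
// The Queue field is an assumption made for illustration.
type SchedulingPolicy struct {
	Queue string
}

// RunPolicy is a hypothetical stand-in that models only the fields
// named in the diff above.
type RunPolicy struct {
	CleanPodPolicy          *string
	TTLSecondsAfterFinished *int32
	ActiveDeadlineSeconds   *int64
	BackoffLimit            *int32
	SchedulingPolicy        *SchedulingPolicy
}

// rebuildRunPolicy mirrors the shape of the removed code: it copies four
// fields and always sets SchedulingPolicy to nil.
func rebuildRunPolicy(spec RunPolicy) *RunPolicy {
	return &RunPolicy{
		CleanPodPolicy:          spec.CleanPodPolicy,
		TTLSecondsAfterFinished: spec.TTLSecondsAfterFinished,
		ActiveDeadlineSeconds:   spec.ActiveDeadlineSeconds,
		BackoffLimit:            spec.BackoffLimit,
		SchedulingPolicy:        nil,
	}
}

// queueOf returns the queue a policy carries, or "" when none is set.
func queueOf(p *RunPolicy) string {
	if p == nil || p.SchedulingPolicy == nil {
		return ""
	}
	return p.SchedulingPolicy.Queue
}

func main() {
	spec := RunPolicy{SchedulingPolicy: &SchedulingPolicy{Queue: "training"}}

	// Old shape: the rebuilt copy has lost the scheduling settings.
	fmt.Printf("rebuilt copy:   queue=%q\n", queueOf(rebuildRunPolicy(spec)))

	// New shape: passing a pointer to the spec's own RunPolicy keeps them.
	fmt.Printf("direct pointer: queue=%q\n", queueOf(&spec))
}
```

In this sketch the rebuilt copy reports an empty queue while the direct pointer reports "training", which matches the intent stated in the commit message ("use pytorchjob.Spec.RunPolicy directly"). Passing the pointer also means that fields added to RunPolicy later are forwarded without further controller changes.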