Training: Reorganized Training Operator Docs #3719

andreyvelich · 2024-04-23T11:52:20Z

Related: kubeflow/training-operator#1998.
I created the following sections for Training Operator docs:

Overview
Installation
Getting Started
User Guides
Reference

A few points:

@StefanoFioravanzo @kubeflow/wg-training-leads Any ideas on what we could add to the "Why Training Operator?" section? Initially, we can just add some basic info.
I didn't move the CRDs to Reference in this PR since we don't have time to discuss how we are going to generate them. What do you think we should do in this PR?
Do we need a working example on the Getting Started page? Would it be too complicated to consume?

/hold for review

/assign @StefanoFioravanzo @kubeflow/wg-training-leads @hbelmiro @kuizhiqing @droctothorpe @franciscojavierarceo
Looking for your feedback!

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow · 2024-04-23T11:52:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

franciscojavierarceo · 2024-04-24T03:16:13Z

content/en/docs/components/training/getting-started.md

+    # Create model.
+    class Net(torch.nn.Module):
+        """Create the Pytorch model"""
+        ...


nit: I'd recommend populating this with even a trivial single-layer NN that actually runs. It helps users who may just be copy-pasting to get started.

Sure, let me try to add something simple.
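
As an illustration of the suggestion above, a trivial single-layer network that actually runs could look like the sketch below. This is not the code that landed in the PR: the layer sizes, the synthetic data, and the `train_one_step` helper are assumptions made for the example, and it runs on a single process without any Training Operator integration.

```python
import torch


class Net(torch.nn.Module):
    """A single linear layer: about the smallest model that still trains."""

    def __init__(self, in_features: int = 4, out_features: int = 2):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


def train_one_step(seed: int = 0) -> tuple[float, float]:
    """Run one SGD step on random data and return (loss_before, loss_after)."""
    torch.manual_seed(seed)
    model = Net()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    # Synthetic inputs and targets (hypothetical data), so nothing is downloaded.
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 2)

    loss_before = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss_before.backward()
    optimizer.step()

    # Re-evaluate on the same batch to see the effect of the update.
    with torch.no_grad():
        loss_after = loss_fn(model(inputs), targets)
    return loss_before.item(), loss_after.item()


if __name__ == "__main__":
    before, after = train_one_step()
    print(f"loss before: {before:.4f}, loss after: {after:.4f}")
```

Keeping the data synthetic means the snippet can be pasted and run as-is; a real Getting Started example would swap in an actual dataset and a full training loop.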

StefanoFioravanzo · 2024-04-24T12:20:09Z

@andreyvelich thanks for this!

Any ideas on what we could add to the "Why Training Operator?" section?

Let's start with something simple and iterate in future PRs. We can start by answering questions like:

How does training operator simplify distributed training with respect to a more traditional approach?
How does Kubernetes help in solving these problems?
Why is Training Operator part of the Kubeflow ecosystem?

I didn't move the CRDs to Reference in this PR since we don't have time to discuss how we are going to generate them. What do you think we should do in this PR?

Makes sense. I'd keep the scope of this PR to the restructuring you already implemented. Let's iterate on content separately. We can address each framework's user guide in dedicated PRs.

Do we need a working example on the Getting Started page? Would it be too complicated to consume?

Getting Started should have a simple but working end-to-end example. Generally, people just want to copy and paste something, run it, and see results. You then typically link to more advanced tutorials or user guides at the end.

andreyvelich · 2024-04-24T16:16:57Z

That makes sense @StefanoFioravanzo. I added initial ideas for "Why Training Operator?", and I also added the AI/ML lifecycle diagram, which we can reuse across Kubeflow components to explain which stage of the lifecycle each component addresses (e.g. Spark Operator, Model Registry, Katib, Notebooks, KServe).
Please let me know what you think, @kubeflow/wg-training-leads @StefanoFioravanzo.

franciscojavierarceo · 2024-04-24T16:28:37Z

content/en/docs/components/training/overview.md

-
-<img src="/docs/components/training/images/distributed-tfjob.drawio.svg"
-  alt="Distributed TFJob"
+<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"


I love this diagram. Based on my comment here, it may make sense to also include feature serving along with model serving (though I note in my comment that there are some architectural choices to be made about the order in which serving should happen). Regardless, I think it's worth stating explicitly that feature serving is a component of model serving.

That makes sense. Do we want to update the diagram once we discuss the architecture for Feature Serving?
For this diagram, I just took the diagram that we worked on together with @ronaldpetty and @zanetworker for the CNCF WG AI white paper:
https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf

Oh nice, looks like that was very recent. @ronaldpetty @zanetworker let me know if you have any thoughts or opinions there. I would love for feature serving to be included in this, as I think it is becoming increasingly important.

@franciscojavierarceo Not sure how to make it simple in this diagram. What do you think about this:

This may be overkill...but here's my attempt at it. It's technically missing the measurement required to actually repeat the process (e.g., a click for a recommendation engine).

That makes sense, thanks @franciscojavierarceo!
A few questions:

Isn't the feature store also used for Model Training and Optimization?

Should we split Model Development, Iteration, and Optimization into:
Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow)?
I ask because I also want to use this diagram in other docs: Kubeflow Notebooks, Kubeflow Katib.

Isn't the feature store also used for Model Training and Optimization?

In some sense, yes. Feature selection is often done during Model Development, Iteration, and Optimization, but that is downstream of Feature Extraction. In short, you have to pull all of the features you want before you can select which ones are best for your model, so this diagram is more representative of how that flow actually works. Let me know if you have additional thoughts there.

Should we split Model Development, Iteration, and Optimization into:
Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow)?

How about this:

That looks good. I would also split Model Experimentation into Choose ML Algorithms + Code Model and HP Tuning + Architecture Search.
That will allow us to explain to users at which stage Kubeflow Notebooks is used and at which stage Kubeflow Katib is used.

Will send you a message on slack

@franciscojavierarceo this sounds like a good discussion to have on #wg-artificial-intelligence :)

If we are looking for abstractions, I'd categorize it as Data prep, model building (training, tuning, experimentation, feature engineering,...), model serving & deployment (pull from registry, deploying inference,...), then operation & eval (monitoring,..).

I.e., there are phases in the lifecycle (building, serving, operations and iteration), components (feature stores, registries, infra,...), and personas (highlighted in the white paper, but also here: https://tag-runtime.cncf.io/wgs/cnaiwg/glossary/#data-engineers, https://tag-runtime.cncf.io/wgs/cnaiwg/glossary/#data-engineers).

Aligning on and reusing the existing language across projects would be great, and if there are language gaps we could patch them in CNAI as well :)
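
To make the proposed categorization concrete, the phases and the activities named in the comment above can be written down as a small lookup table. This is only a sketch: the `LIFECYCLE_PHASES` name and the `phase_of` helper are hypothetical, and the table contains nothing beyond what the comment lists (data prep has no activities spelled out there, so it is left empty).

```python
from typing import Optional

# Phase -> activities, as listed in the comment; insertion order follows the lifecycle.
LIFECYCLE_PHASES: dict[str, list[str]] = {
    "data prep": [],
    "model building": ["training", "tuning", "experimentation", "feature engineering"],
    "model serving & deployment": ["pull from registry", "deploying inference"],
    "operation & eval": ["monitoring"],
}


def phase_of(activity: str) -> Optional[str]:
    """Return the lifecycle phase that lists the given activity, or None if unlisted."""
    for phase, activities in LIFECYCLE_PHASES.items():
        if activity in activities:
            return phase
    return None


if __name__ == "__main__":
    for name in ("tuning", "monitoring", "labeling"):
        print(name, "->", phase_of(name))
```

A shared table like this is one way to keep diagram labels consistent when the same lifecycle picture is reused across several component docs.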

andreyvelich · 2024-04-24T17:06:22Z

@franciscojavierarceo I added a working example. Does it look good?

hbelmiro · 2024-04-25T12:58:04Z

content/en/docs/components/training/overview.md

+| TensorFlow   | [TFJob](/docs/components/training/user-guides/tensorflow/)       |
+| XGBoost      | [XGBoostJob](/docs/components/training/user-guides/xgboost/)     |
+| MPI          | [MPIJob](/docs/components/training/user-guides/mpi/)             |
+| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |


Nice catch!
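
The rows above link each job kind to a user guide by site-absolute path, and a wrong path segment is easy to miss in review. A check along the following lines could flag such links mechanically. It is a minimal sketch, not part of the PR: the `find_broken_links` helper is hypothetical, and it assumes that a link such as `/docs/x/y/` resolves to `x/y.md`, `x/y/_index.md`, or `x/y/index.md` under a content root like `content/en`.

```python
import re
from pathlib import Path

# Matches Markdown links whose target is a site-absolute /docs/ path,
# e.g. [TFJob](/docs/components/training/user-guides/tensorflow/)
LINK_RE = re.compile(r"\[([^\]]+)\]\((/docs/[^)\s#]*)")


def find_broken_links(markdown: str, content_root: Path) -> list[str]:
    """Return /docs/ link targets that have no matching page under content_root."""
    broken = []
    for _label, target in LINK_RE.findall(markdown):
        rel = target.strip("/")
        # Assumed page layouts: a single file, a section index, or a leaf bundle.
        candidates = [
            content_root / f"{rel}.md",
            content_root / rel / "_index.md",
            content_root / rel / "index.md",
        ]
        if not any(path.is_file() for path in candidates):
            broken.append(target)
    return broken


if __name__ == "__main__":
    overview = Path("content/en/docs/components/training/overview.md")
    if overview.is_file():
        for link in find_broken_links(overview.read_text(), Path("content/en")):
            print("broken link:", link)
```

Run from the repository root, this prints every `/docs/` link in the overview page that does not map to a file on disk.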

andreyvelich · 2024-04-25T18:26:09Z

@franciscojavierarceo and I made a few changes to the AI/ML lifecycle diagram so that it is easier to use in other Kubeflow component docs (e.g. Katib, FEAST, Model Registry, KServe, Notebooks, Spark Operator).

andreyvelich · 2024-04-25T23:46:06Z

This PR should be ready unless you have any other comments.
/hold cancel

hbelmiro

/lgtm

andreyvelich added 3 commits April 23, 2024 12:33

Training: Reorganized Training Operator Docs

2b07727

Signed-off-by: Andrey Velichkevich <[email protected]>

Create reference page

4ff14ab

Signed-off-by: Andrey Velichkevich <[email protected]>

Update links for TensorFlow

b144a43

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow bot added the do-not-merge/hold label Apr 23, 2024

google-oss-prow bot requested review from johnugeorge, sperlingxx and terrytangyuan April 23, 2024 11:52

google-oss-prow bot added approved size/XL labels Apr 23, 2024

andreyvelich added 4 commits April 23, 2024 20:51

Fix more links

0291b66

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix links in getting started

69c7bd6

Signed-off-by: Andrey Velichkevich <[email protected]>

Modify Getting Started

ff8f6c4

Signed-off-by: Andrey Velichkevich <[email protected]>

Add stable status notice to overview page

7418950

Signed-off-by: Andrey Velichkevich <[email protected]>

franciscojavierarceo reviewed Apr 24, 2024

Add Why Training Operator Section

88efb7b

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich added 2 commits April 24, 2024 17:17

Fix Lifecycle Diagram

254dcd7

Signed-off-by: Andrey Velichkevich <[email protected]>

Change text font

a02b346

Signed-off-by: Andrey Velichkevich <[email protected]>

franciscojavierarceo reviewed Apr 24, 2024

Add working example

0412a56

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow bot added size/XXL and removed size/XL labels Apr 24, 2024

andreyvelich mentioned this pull request Apr 24, 2024

Not getting Kubeflow Training SDK v1.7 when installing kubeflow-training kubeflow/training-operator#2082

Closed

Add 2nd version for AI/ML lifecycle

49493dc

Signed-off-by: Andrey Velichkevich <[email protected]>

hbelmiro reviewed Apr 25, 2024

andreyvelich added 2 commits April 25, 2024 19:19

Update AI/ML Lifecycle

5da2fcd

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix install command

93bdb28

Signed-off-by: Andrey Velichkevich <[email protected]>

Remove old diagram

64cbe2b

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow bot removed the do-not-merge/hold label Apr 25, 2024

hbelmiro reviewed Apr 26, 2024

google-oss-prow bot assigned hbelmiro Apr 26, 2024

google-oss-prow bot added the lgtm label Apr 26, 2024

google-oss-prow bot merged commit 8fe2bd2 into kubeflow:master Apr 26, 2024
6 checks passed

andreyvelich deleted the training-improve-docs branch April 26, 2024 15:06

andreyvelich mentioned this pull request May 2, 2024

Implement a Reusable E2E Kubeflow ML Lifecycle #3728

Merged

StefanoFioravanzo mentioned this pull request May 6, 2024

Improve docs for Training Operator 1.8 kubeflow/training-operator#1998

Closed

4 tasks

hbelmiro left a comment

Suggested change:

-| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |
+| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddle/) |
