Training: Reorganized Training Operator Docs #3719
Signed-off-by: Andrey Velichkevich <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich
# Create model.
class Net(torch.nn.Module):
    """Create the Pytorch model"""
    ...
nit: my recommendation would be to populate this even with a trivial single layer NN that would actually run for this example. It helps users get started that may just be executing copy-pasta.
Sure, let me try to add something simple.
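For illustration, a trivial single-layer model of the kind suggested above might look like the sketch below. It is not the code that was eventually added to the PR: the layer sizes, the random stand-in data, and the `train_one_step` helper are assumptions made for this example, and it runs in a single process rather than as a distributed job.

```python
import torch


class Net(torch.nn.Module):
    """Trivial single-layer model (illustrative; sizes are assumptions)."""

    def __init__(self, in_features: int = 784, num_classes: int = 10):
        super().__init__()
        # One fully connected layer mapping flattened inputs to class logits.
        self.fc = torch.nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Flatten everything except the batch dimension, then apply the layer.
        return self.fc(torch.flatten(x, start_dim=1))


def train_one_step(model, inputs, labels, lr=0.1):
    """Hypothetical helper: run one SGD step and return the loss before the update."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = Net()
    # Random stand-in data shaped like a batch of 28x28 grayscale images.
    inputs = torch.randn(8, 1, 28, 28)
    labels = torch.randint(0, 10, (8,))
    print(f"loss: {train_one_step(model, inputs, labels):.4f}")
```

Something this small is enough for a copy-paste run to succeed, and a real dataset and distributed setup can replace the stand-in data later.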
@andreyvelich thanks for this!
Let's start with something simple and iterate in future PRs. We can start by answering questions like:
Makes sense. I'd keep the scope of this PR to the restructuring you already implemented. Let's iterate on content separately. We can address each framework's user guide in dedicated PRs.
Getting Started should have an end-to-end working (yet simple) example. Generally people just want to copy-paste some stuff, run it, and see results. Then you typically link some more advanced tutorials or user guides at the end.
That makes sense @StefanoFioravanzo, I added initial ideas for |
<img src="/docs/components/training/images/distributed-tfjob.drawio.svg"
alt="Distributed TFJob"
<img src="/docs/components/training/images/ml-lifecycle-training-operator.drawio.svg"
I love this diagram. Based on my comment here, it may make sense to also include feature serving alongside model serving (though I note in my comment that there are some architectural choices to be made about the order in which serving should happen). Regardless, I think it's worth stating explicitly that feature serving is a component of model serving.
That makes sense, do we want to update the diagram once we discuss the architecture for Feature Serving?
For this diagram, I just took the diagram that we worked on together with @ronaldpetty and @zanetworker for the CNCF WG AI white paper:
https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf
Oh nice, looks like that was very recent. @ronaldpetty @zanetworker let me know if you have any thoughts/opinions there. Would love for feature serving to be included in this, as I think it is becoming increasingly important.
@franciscojavierarceo Not sure how to make it simple in this diagram. What do you think about this:
That makes sense, thanks @franciscojavierarceo!
A few questions:
- Isn't the feature store also used for Model Training and Optimization?
- Should we split the Model Development, Iteration, and Optimization with:
Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow).
Since I also want to use this diagram in other docs: Kubeflow Notebooks, Kubeflow Katib.
Isn't the feature store also used for Model Training and Optimization?
In some sense, yes. Feature selection is often done during Model Development, Iteration, and Optimization, but that is downstream of Feature Extraction. In short, you have to pull all of the features you want first before you can select which ones are best for your model, so this diagram is more representative of how that flow actually works. Let me know if you have additional thoughts there.
Should we split the Model Development, Iteration, and Optimization with:
Model Experimentation and Development (Kubeflow Notebooks) --> Model Optimization and Hyperparameter Tuning (Katib) (similar to this flow: https://www.kubeflow.org/docs/started/architecture/#introducing-the-ml-workflow).
That looks good, I would also split Model Experimentation into Choose ML Algorithms + Code Model and HP Tuning + Architecture Search.
That will let us explain to users at which stage Kubeflow Notebooks is used and at which stage Kubeflow Katib is used.
Will send you a message on slack
@franciscojavierarceo this sounds like a good discussion to have on #wg-artificial-intelligence :)
If we are looking for abstractions, I'd categorize it as Data prep, model building (training, tuning, experimentation, feature engineering,...), model serving & deployment (pull from registry, deploying inference,...), then operation & eval (monitoring,..).
I.e., there are phases in the lifecycle (building, serving, operations and iteration), components (feature stores, registries, infra,...), and personas (highlighted in the white paper, but also here: https://tag-runtime.cncf.io/wgs/cnaiwg/glossary/#data-engineers, https://tag-runtime.cncf.io/wgs/cnaiwg/glossary/#data-engineers).
Aligning on and reusing the existing language would be great, and if there are language gaps we could patch them in cnai as well :)
@franciscojavierarceo I added a working example, does it look good?
| TensorFlow | [TFJob](/docs/components/training/user-guides/tensorflow/) |
| XGBoost | [XGBoostJob](/docs/components/training/user-guides/xgboost/) |
| MPI | [MPIJob](/docs/components/training/user-guides/mpi/) |
| PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |
Broken link.
- | PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddlepaddle/) |
+ | PaddlePaddle | [PaddleJob](/docs/components/training/user-guides/paddle/) |
Nice catch!
@franciscojavierarceo and I made a few changes to the AI/ML lifecycle diagram so that it is easier to reuse in other Kubeflow component docs (e.g. Katib, FEAST, Model Registry, KServe, Notebooks, Spark Operator).
This PR should be ready unless you have any other comments.
/lgtm
Related: kubeflow/training-operator#1998.
I created the following sections for Training Operator docs:
A few points:
Why Training Operator ? section ? Initially, we can just add some basic info.
/hold for review
/assign @StefanoFioravanzo @kubeflow/wg-training-leads @hbelmiro @kuizhiqing @droctothorpe @franciscojavierarceo
Looking for your feedback!