[GSoC 2024] Scheduling AI workload among multiple clusters #369
Comments
Hi @haoqing0110, I am interested in working on this project. Can we discuss this further? |
I went through the project and learned that it requires an addon to collect and score clusters based on GPU/TPU(contribute to |
Hello @haoqing0110 :)
I went through the official OCM page and tried some of its functions: installing OCM, deploying Kubernetes resources to a specific cluster with ManifestWork, and creating a Placement to manage a set of clusters (distributing the deployments across both clusters). All of these worked successfully.
All in all, I am aware that this project is more challenging than building a FaaS, and I am ready to learn and work on it! Thank you for taking the time to read this. I look forward to your reply. |
Hello @Sayanjones @z1ens, thanks for being interested in this project. Feel free to join our community Slack channel if you want to have further discussion. @z1ens Thank you for your question, below are some of my thoughts:
|
cc @qiujian16 |
Hi @haoqing0110, my name is Khai. I came across this project in GSoC 2024, and I would love to be a contributor. I tried to join the Slack page but I ran into the error "It looks like there isn’t an account on Kubernetes tied to this email address.". I look forward to discussing more with you! |
https://communityinviter.com/apps/kubernetes/community @k2nt you can get an invite here for the Slack channel. |
Hi @mikeshng. Thank you for your email (and post)! I hope you can point me to the correct channel for this project (I assume that it is open-cluster-mgmt). I am posting here instead of replying via email so that other contributors can see this also. |
Thanks @k2nt yes, the channel is |
Hello, @haoqing0110 |
Hi all, @haoqing0110 is going to talk more about this topic in this week's community meeting. Please feel free to ask any questions here or during the meeting. You can find the community meeting schedule here: |
This has been selected to participate in this year's Google Summer of Code! 🎉 cncf/mentoring#1221 |
/assign @z1ens |
@qiujian16 @haoqing0110 The resource-usage-collect agent needs to consider the available resources of each node. Sometimes the cluster resources are sufficient, but the node resources are insufficient. |
@ivan-cai yes, I suppose @z1ens 's PR open-cluster-management-io/addon-contrib#20 has changed to calculate the score based on the max node resource. We also had a discussion about whether we need both a cluster resource score and a node resource score; it seems the node resource score is more useful. |
@ivan-cai Exactly as @haoqing0110 mentioned, I’ve implemented a scoring strategy in the resource-usage-collect-addon that includes both node scope and cluster scope scores. In Kubernetes, a job can only be scheduled if a single node in the cluster has resources >= the job's request. Therefore, linking the scoring mechanism to the node with the maximum available resources is logical. I also developed a cluster scope score that assesses the total available resources in the cluster, as sometimes cluster admins want to spread workloads across multiple clusters or nodes to enhance resource utilization. |
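The difference between the two scopes described above can be illustrated with a minimal Go sketch. It is not the code from the resource-usage-collect-addon: the `Node` type, the function names, the linear `Normalize` mapping, and its `upperBound` parameter are all hypothetical and chosen only for illustration. The one borrowed detail is the [-100, 100] range, which is the value range of an AddOnPlacementScore item.

```go
package main

import "fmt"

// Node holds hypothetical per-node figures for one resource, e.g. GPUs.
type Node struct {
	Name        string
	Allocatable int64 // schedulable units on the node
	Requested   int64 // units already requested by running pods
}

// available returns the free units on a node, never below zero.
func (n Node) available() int64 {
	if free := n.Allocatable - n.Requested; free > 0 {
		return free
	}
	return 0
}

// NodeScopeAvailable returns the free units on the single node with the
// most headroom. A job fits only if its request is <= this value.
func NodeScopeAvailable(nodes []Node) int64 {
	var best int64
	for _, n := range nodes {
		if a := n.available(); a > best {
			best = a
		}
	}
	return best
}

// ClusterScopeAvailable returns the free units summed over all nodes.
func ClusterScopeAvailable(nodes []Node) int64 {
	var total int64
	for _, n := range nodes {
		total += n.available()
	}
	return total
}

// Normalize maps available units linearly onto [-100, 100].
// upperBound is an assumed capacity ceiling, not part of any real API.
func Normalize(available, upperBound int64) int32 {
	if upperBound <= 0 {
		return -100
	}
	if available < 0 {
		available = 0
	}
	if available > upperBound {
		available = upperBound
	}
	return int32(200*available/upperBound - 100)
}

func main() {
	nodes := []Node{
		{Name: "a", Allocatable: 8, Requested: 6},
		{Name: "b", Allocatable: 4, Requested: 0},
		{Name: "c", Allocatable: 8, Requested: 7},
	}
	fmt.Println("node scope available:", NodeScopeAvailable(nodes))
	fmt.Println("cluster scope available:", ClusterScopeAvailable(nodes))
	fmt.Println("node scope score:", Normalize(NodeScopeAvailable(nodes), 8))
}
```

With these example nodes, a job requesting 5 GPUs fits the cluster-wide total (7 free) but fits no single node (at most 4 free), which is the case raised by @ivan-cai and the reason the node scope score tracks the node with the most available resources.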
Congratulations to @z1ens for completing the Google Summer of Code 2024 and contributing to the Open Cluster Management community. The following PRs have been merged to our repos: These contributions are also an important part of two KubeCon topics. Thanks again for your contributions! |
/close |
@qiujian16: Closing this issue. In response to this:
This is one of the GSoC 2024 projects.
Announcement
cncf/mentoring#1221
Google Summer of Code 2024 Timeline
https://developers.google.com/open-source/gsoc/timeline
Description
Open Cluster Management (OCM) focuses on multicluster and multicloud management scenarios for Kubernetes applications. Open APIs are evolving within this project for cluster registration, workload distribution, dynamic placement of policies and workloads, and much more. The placement concept is used to dynamically select a set of clusters so that higher level users can either replicate Kubernetes resources to the member clusters or run their advanced workload. For example: as an application developer, I can deploy my workload to clusters with the most allocatable memory and CPU.
Now, with the rise of AI technology, there’s a growing need to schedule AI workload based on GPU/TPU resources. In this project we want you to use the placement extensible scheduling mechanism to implement a GPU/TPU resource collector addon by addon template and provide an AddonPlacementScore to make placement decision based on GPU/TPU resources. We also want you to propose a customized external Kueue Admission Check controller to consume the placement decision to schedule AI workload among multiple clusters based on GPU/TPU resources.
Expected Outcome
Develop the GPU/TPU resource collector addon, which includes documentation of the addon architecture and describing the AddonPlacementScore usage. Also, implement the addon using the addon template and contribute the code to the addon-contrib repository.
Deliver a proposal for the external Kueue Admission Check controller. The proposal should outline the API design and explain how the controller uses the OCM scheduling result and interacts with Kueue. The proposal needs to be finally reviewed in OCM community meeting. Also, you need to deliver a prototype based on the proposal.
Recommended Skills
Golang, Kubernetes, Scheduling
Mentor(s)
Qing Hao (@haoqing0110, [email protected]) - primary
Jian Qiu (@qiujian16, [email protected])
References
Open Cluster Management
Placement concept
AddOn concept
Placement extensible scheduling mechanism
Build an addon with addon template
GPU on *KS, for example GPUs in GKE
Kueue Admission Check
Discussion
Feel free to raise your questions here. You can also reach out to us in the Slack channel. Failed to join via the link? See solutions at #369 (comment) .