Add topologySpreadConstraints #2091

@jbhalodia-slack jbhalodia-slack commented Jul 22, 2024

Purpose of this PR

It's good to spread the Spark Operator pods across failure domains in the cluster, such as regions, zones, nodes, and other user-defined topology domains. This helps achieve high availability as well as efficient resource utilization.

Proposed changes:

Change Category

Indicate the type of change by marking the applicable boxes:

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Production deployments should enable topologySpreadConstraints so that their workloads run in HA and are resilient to node- or AZ-specific failures.
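For illustration, here is a minimal sketch of what a pod template with spread constraints across zones and nodes might look like. The field names (`maxSkew`, `topologyKey`, `whenUnsatisfiable`, `labelSelector`) are standard Kubernetes pod spec fields. The pod label and the chosen values are assumptions for this example and are not taken from this PR; how the chart exposes these settings in `values.yaml` should be checked against the chart README.

```yaml
# Fragment of a Deployment pod template (spec.template.spec).
# Assumption: the operator pods carry the label below; adjust to the real labels.
topologySpreadConstraints:
  # Prefer an even spread across availability zones, but still schedule
  # if the constraint cannot be satisfied.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: spark-operator
  # Require that replicas do not pile up on a single node.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: spark-operator
```

Spread constraints only have a visible effect when more than one replica is running, so a setup like this would normally be paired with a replica count of at least 2.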

Checklist

Before submitting your PR, please review the following:

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

Github Issue: #2086
Slack Thread: https://cloud-native.slack.com/archives/C074588U7EG/p1721240818494049

ChenYi015 and others added 4 commits July 22, 2024 12:30
* Update docs

Signed-off-by: Yi Chen <[email protected]>

* Remove docs and update README

Signed-off-by: Yi Chen <[email protected]>

* Add link to monthly community meeting

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
* Add PodDisruptionBudget to chart

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>

* PR comments

Signed-off-by: Carlos Sánchez Páez <[email protected]>

---------

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
@jbhalodia-slack jbhalodia-slack force-pushed the jigar/set-topologySpreadConstraints branch from 0728f55 to e119dcd Compare July 22, 2024 16:30
@google-oss-prow google-oss-prow bot added size/XXL and removed size/L labels Jul 22, 2024
Signed-off-by: jbhalodia-slack <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
@jbhalodia-slack jbhalodia-slack force-pushed the jigar/set-topologySpreadConstraints branch from dc1427c to 2c4b7d2 Compare July 22, 2024 17:28
@google-oss-prow google-oss-prow bot added size/M and removed size/L labels Jul 22, 2024
@jbhalodia-slack jbhalodia-slack changed the title Set topologySpreadConstraints Add topologySpreadConstraints Jul 22, 2024
Signed-off-by: jbhalodia-slack <[email protected]>
@jbhalodia-slack jbhalodia-slack force-pushed the jigar/set-topologySpreadConstraints branch from 0c0ba32 to 00a26df Compare July 22, 2024 17:55
@@ -17,21 +17,22 @@ tests:

  - it: Should render spark operator podDisruptionBudget if podDisruptionBudget.enable is true
    set:
      replicaCount: 2

PDB tests were failing on the master branch so these are fixes to get them to pass.

@jbhalodia-slack

Hi @vara-bonthu @andreyvelich @ChenYi015 @yuchaoran2011, could you please review this PR? 🙇‍♂️

@vara-bonthu vara-bonthu left a comment

Thanks for the PR @jbhalodia-slack
/approve

@yuchaoran2011 @ChenYi015 Please review

@ChenYi015 ChenYi015 left a comment

Thanks for the contribution!
/lgtm

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChenYi015, vara-bonthu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ChenYi015,vara-bonthu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 4108f54 into kubeflow:master Jul 26, 2024
7 checks passed
ChenYi015 pushed a commit to ChenYi015/spark-operator that referenced this pull request Aug 1, 2024
* Update README and documentation (kubeflow#2047)

* Update docs

Signed-off-by: Yi Chen <[email protected]>

* Remove docs and update README

Signed-off-by: Yi Chen <[email protected]>

* Add link to monthly community meeting

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Add PodDisruptionBudget to chart (kubeflow#2078)

* Add PodDisruptionBudget to chart

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>

* PR comments

Signed-off-by: Carlos Sánchez Páez <[email protected]>

---------

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Set topologySpreadConstraints

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README and increase patch version

Signed-off-by: jbhalodia-slack <[email protected]>

* Revert replicaCount change

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README after master merger

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README

Signed-off-by: jbhalodia-slack <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
Co-authored-by: Carlos Sánchez Páez <[email protected]>
(cherry picked from commit 4108f54)
google-oss-prow bot pushed a commit that referenced this pull request Aug 1, 2024
* Update helm docs (#2081)

Signed-off-by: Carlos Sánchez Páez <[email protected]>
(cherry picked from commit eca3fc8)

* Update the process to build api-docs, generate CRD manifests and code (#2046)

* Update .gitignore

Signed-off-by: Yi Chen <[email protected]>

* Update .dockerignore

Signed-off-by: Yi Chen <[email protected]>

* Update Makefile

Signed-off-by: Yi Chen <[email protected]>

* Update the process to generate api docs

Signed-off-by: Yi Chen <[email protected]>

* Update the workflow to generate api docs

Signed-off-by: Yi Chen <[email protected]>

* Use controller-gen to generate CRD and deep copy related methods

Signed-off-by: Yi Chen <[email protected]>

* Update helm chart CRDs

Signed-off-by: Yi Chen <[email protected]>

* Update workflow for building spark operator

Signed-off-by: Yi Chen <[email protected]>

* Update README.md

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 779ea3d)

* Add topologySpreadConstraints (#2091)

* Update README and documentation (#2047)

* Update docs

Signed-off-by: Yi Chen <[email protected]>

* Remove docs and update README

Signed-off-by: Yi Chen <[email protected]>

* Add link to monthly community meeting

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Add PodDisruptionBudget to chart (#2078)

* Add PodDisruptionBudget to chart

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>

* PR comments

Signed-off-by: Carlos Sánchez Páez <[email protected]>

---------

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Set topologySpreadConstraints

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README and increase patch version

Signed-off-by: jbhalodia-slack <[email protected]>

* Revert replicaCount change

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README after master merger

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README

Signed-off-by: jbhalodia-slack <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
Co-authored-by: Carlos Sánchez Páez <[email protected]>
(cherry picked from commit 4108f54)

* Use controller-runtime to reconsturct spark operator (#2072)

* Use controller-runtime to reconstruct spark operator

Signed-off-by: Yi Chen <[email protected]>

* Update helm charts

Signed-off-by: Yi Chen <[email protected]>

* Update examples

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 0dc641b)

---------

Co-authored-by: Carlos Sánchez Páez <[email protected]>
Co-authored-by: jbhalodia-slack <[email protected]>
YanivKunda pushed a commit to YanivKunda/spark-operator that referenced this pull request Aug 5, 2024
sigmarkarl pushed a commit to spotinst/spark-on-k8s-operator that referenced this pull request Aug 7, 2024
jbhalodia-slack added a commit to jbhalodia-slack/spark-operator that referenced this pull request Oct 4, 2024
…ubeflow#2108)

* Update helm docs (kubeflow#2081)

Signed-off-by: Carlos Sánchez Páez <[email protected]>
(cherry picked from commit eca3fc8)

* Update the process to build api-docs, generate CRD manifests and code (kubeflow#2046)

* Update .gitignore

Signed-off-by: Yi Chen <[email protected]>

* Update .dockerignore

Signed-off-by: Yi Chen <[email protected]>

* Update Makefile

Signed-off-by: Yi Chen <[email protected]>

* Update the process to generate api docs

Signed-off-by: Yi Chen <[email protected]>

* Update the workflow to generate api docs

Signed-off-by: Yi Chen <[email protected]>

* Use controller-gen to generate CRD and deep copy related methods

Signed-off-by: Yi Chen <[email protected]>

* Update helm chart CRDs

Signed-off-by: Yi Chen <[email protected]>

* Update workflow for building spark operator

Signed-off-by: Yi Chen <[email protected]>

* Update README.md

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 779ea3d)

* Add topologySpreadConstraints (kubeflow#2091)

* Update README and documentation (kubeflow#2047)

* Update docs

Signed-off-by: Yi Chen <[email protected]>

* Remove docs and update README

Signed-off-by: Yi Chen <[email protected]>

* Add link to monthly community meeting

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Add PodDisruptionBudget to chart (kubeflow#2078)

* Add PodDisruptionBudget to chart

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>

* PR comments

Signed-off-by: Carlos Sánchez Páez <[email protected]>

---------

Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>

* Set topologySpreadConstraints

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README and increase patch version

Signed-off-by: jbhalodia-slack <[email protected]>

* Revert replicaCount change

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README after master merger

Signed-off-by: jbhalodia-slack <[email protected]>

* Update README

Signed-off-by: jbhalodia-slack <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
Signed-off-by: jbhalodia-slack <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Signed-off-by: Carlos Sánchez Páez <[email protected]>
Co-authored-by: Yi Chen <[email protected]>
Co-authored-by: Carlos Sánchez Páez <[email protected]>
(cherry picked from commit 4108f54)

* Use controller-runtime to reconsturct spark operator (kubeflow#2072)

* Use controller-runtime to reconstruct spark operator

Signed-off-by: Yi Chen <[email protected]>

* Update helm charts

Signed-off-by: Yi Chen <[email protected]>

* Update examples

Signed-off-by: Yi Chen <[email protected]>

---------

Signed-off-by: Yi Chen <[email protected]>
(cherry picked from commit 0dc641b)

---------

Co-authored-by: Carlos Sánchez Páez <[email protected]>
Co-authored-by: jbhalodia-slack <[email protected]>