feat: implement CRD with embedded PodSpec #452
Conversation
I'm very much in favour of this! My only thought is I wonder if we can PR
(force-pushed from 65ec2f9 to 42c1be3)
Ok, so the progress so far is as follows:

However, I'm having some difficulty with the tests, particularly the operator integration tests. @jacobtomlinson @Matt711 have you guys encountered these issues before with testing? Any thoughts on how to fix them? Left TODO:
Yeah, I'm on board with this, but I don't have heaps of time to contribute there - for now I'm relying on subprocess for testing and the pre-commit hooks.
This is looking great. I left a couple of small comments.
Not entirely sure why the tests are failing. We are continuing to develop this and I'm actively trying to make the tests more stable, so perhaps try pulling from `main` and resolving your conflicts and things may improve.

I also see things are failing the linting step. Could you ensure you have pre-commit set up and run it on everything (`pre-commit run --all` might be helpful rather than doing it at commit time)?
It would be good to see docs for the helm chart in this PR. But if you want to defer the deployment CI stuff to a follow up PR that may help to get this merged sooner.
doc/source/operator.rst (outdated)

```diff
@@ -16,8 +16,8 @@ To install the the operator first we need to create the Dask custom resources:

 .. code-block:: console

-   $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/customresources/daskcluster.yaml
-   $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/customresources/daskworkergroup.yaml
+   $ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/resources/manifests/daskcluster.yaml
```
Given that a few different things live in this repo I would like to keep everything nested under operator instead of moving it up to the top level.
No worries, I've moved the resources into dask_kubernetes/operator/deployment
doc/source/operator.rst (outdated)
```yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple-cluster
spec:
  worker:
    replicas: 2
    spec:
      containers:
        - name: worker
          image: "daskdev/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-worker
            # Note the name of the cluster service, which adds "-service" to the end
            - tcp://simple-cluster-service.default.svc.cluster.local:8786
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "daskdev/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-scheduler
          ports:
            - name: comm
              containerPort: 8786
              protocol: TCP
            - name: dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            tcpSocket:
              port: comm
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: comm
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: simple-cluster
        dask.org/component: scheduler
      ports:
        - name: comm
          protocol: TCP
          port: 8786
          targetPort: "comm"
        - name: dashboard
          protocol: TCP
          port: 8787
          targetPort: "dashboard"
```
This is one of the things that made me nervous. This boilerplate is going to be mostly the same in 99% of deployments, and it's a shame it's so verbose. But I take your point about folks being used to specs being verbose.

The cluster name in the service selector has to match the one set in the metadata at the top, right? Could we use a YAML anchor here to avoid the repetition?
```yaml
...
metadata:
  name: &cluster simple-cluster
...
spec:
  ...
  scheduler:
    ...
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: *cluster
        dask.org/component: scheduler
```
Yeah, that's the nature of declarative structures I think. I don't know if YAML anchors work with k8s manifests - will have to experiment and see if the API will accept it.
This question suggests that anchors are supported.
https://devops.stackexchange.com/questions/12837/kubernetes-configuration-with-yaml-anchors
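For what it's worth, anchors are expanded by the YAML parser itself, so by the time kubectl has converted a manifest to JSON the API server only ever sees the resolved values. A quick PyYAML check (purely illustrative, not part of the original discussion) shows the alias disappearing at load time:

```python
import yaml  # PyYAML

manifest = """
metadata:
  name: &cluster simple-cluster
spec:
  scheduler:
    service:
      selector:
        dask.org/cluster-name: *cluster
        dask.org/component: scheduler
"""

# The *cluster alias is expanded at parse time; the loaded dict contains plain strings.
print(yaml.safe_load(manifest)["spec"]["scheduler"]["service"]["selector"])
# {'dask.org/cluster-name': 'simple-cluster', 'dask.org/component': 'scheduler'}
```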
I am happy to do this.
(force-pushed from db82465 to dd7b23d)
I've rebased against the upstream main now, so this PR is up-to-date.

The linting step is failing due to my attempt to create a

Sure thing. Currently the helm chart is just in the source, but eventually it may be published to the
(force-pushed from 11ee196 to cd7fa3f)
Hi @jacobtomlinson, any chance you could do a review on this one and consider the changes? I'm having a hard time keeping up with the rebasing of the main branch - there were several conflicts this time and I'm worried that I'll be undoing stuff from your work.

That's fair. We are moving pretty fast right now so I appreciate it's tricky tracking the upstream changes. I'm mostly focused on getting the tests/CI more stable in the hopes that it resolves what is happening here. I'll grab a copy locally for review. If you don't mind I may push commits here to try and get things unstuck.

I've made a few fixes and pushed up to see how the CI goes. For me the operator tests are passing locally but the experimental cluster manager ones are failing, I'll keep digging.
I've managed to get most tests passing locally; the ones that are still failing I'll continue digging into tomorrow. I'm noticing a lot of this error in the operator logs; it might be the culprit. @samdyzon does anything stand out to you that might be causing it?
I found a couple of bugs related to names of services and the worker groups inside

Awesome thanks, tests are passing locally for me now. I'll start digging into the CI complaints. I've also looked a little into the
Nearly there! Looks like the fixes from #461 had been accidentally reverted. Hopefully this will be it!

🚀🚀🚀🚀🚀🚀🚀🚀 Thanks for all your efforts here @samdyzon!

Sweet! Thanks for helping get it over the line! We'll be able to start using it in anger soon - especially with the autoscaling functionality in the works. Cheers!
Sure thing, thanks for sticking around and responding to reviews. Yeah looking forward to autoscaling. |
The following is my formal design proposal for the implementation of reference-able k8s specs embedded within the operator CRD files.
Goals
CRD Templates
Currently there are two main custom resources created for the operator: `DaskCluster` and `DaskWorkerGroup`. This proposal does not change the type of custom resources, but aims to update their structure slightly.
First, we should create "pseudo custom resources" which are templates for Dask resources, but are not fully defined CRDs themselves.
`dask_kubernetes/operator/customresources/templates.yaml`

This YAML file provides the same style of resource definition that k8s uses in the `k8s-1.21.1.json` specification files. The goal of building these definitions is to allow reuse of them in the real custom resources.
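To illustrate the idea (a toy sketch with made-up definition names, not `k8s-crd-resolver`'s actual code), the resolver effectively inlines each referenced template definition into the CRD schema before the CRD is applied:

```python
import copy

# Hypothetical, heavily trimmed stand-ins for a templates.yaml definition and a
# CRD schema fragment that references it; the real files are far more detailed.
templates = {
    "DaskWorker": {
        "type": "object",
        "properties": {
            "replicas": {"type": "integer"},
            "spec": {"type": "object"},  # the embedded PodSpec in the real build
        },
    }
}

crd_fragment = {
    "type": "object",
    "properties": {
        "worker": {"$ref": "#/definitions/DaskWorker"},
    },
}


def resolve_refs(node, definitions):
    """Recursively replace {"$ref": "#/definitions/<Name>"} with the named definition."""
    if isinstance(node, dict):
        if "$ref" in node:
            name = node["$ref"].rsplit("/", 1)[-1]
            return resolve_refs(copy.deepcopy(definitions[name]), definitions)
        return {key: resolve_refs(value, definitions) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, definitions) for item in node]
    return node


# The resolved schema has the DaskWorker template inlined under "worker".
print(resolve_refs(crd_fragment, templates))
```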
`dask_kubernetes/operator/customresources/daskworkergroup.yaml`

`dask_kubernetes/operator/customresources/daskcluster.yaml`
By abstracting the `DaskScheduler` and `DaskWorker` we get a small amount of code reuse in the main CRD templates, and a more semantic description of resources in those templates.

Edit: We also needed to include some "patch" files in the build process, as mentioned in the `k8s-crd-resolver` documentation. In this case, I created a patch file for each CRD template containing the patches needed for that resource: `dask_kubernetes/operator/customresources/daskcluster.patch.yaml`. These patch files are necessary to prevent errors during installation of resources for the CRD (not the CRD itself). The `k8s-crd-resolver` command has to be updated to include this patch file (see the updated build scripts below).

Some notes:

- For `DaskCluster`, I split the worker/scheduler definitions into separate properties to be more explicit about what belongs to which resource; this will affect the operator implementation.

Building the CRD
```bash
# Assuming running from the root directory, and a temp folder called "dist" exists
k8s-crd-resolver -r -j dask_kubernetes/operator/customresources/daskcluster.patch.yaml dask_kubernetes/operator/customresources/daskcluster.yaml dist/daskcluster.yaml
k8s-crd-resolver -r -j dask_kubernetes/operator/customresources/daskworkergroup.patch.yaml dask_kubernetes/operator/customresources/daskworkergroup.yaml dist/daskworkergroup.yaml
```
Operator Changes
Once the CRDs have been generated, the operator expects the new pod specification to work. The only change needed in the operator is to update the `build_scheduler_pod_spec`, `build_scheduler_service_spec`, and `build_worker_pod_spec` methods. Since the operator will have all of the detail needed to create a spec, these methods simply need to pass the spec object and return the dictionary result. For example:
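The code block that originally followed here did not come through, so the snippet below is only a rough sketch of the pass-through idea; the helper name comes from the text above, but the signature and labels are assumptions rather than the operator's actual code.

```python
def build_worker_pod_spec(name, namespace, cluster_name, worker_spec):
    """Sketch: wrap the user-supplied PodSpec from the custom resource.

    `worker_spec` is the `spec.worker.spec` dictionary taken verbatim from the
    DaskCluster/DaskWorkerGroup resource, so the operator no longer constructs
    containers itself.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {
                "dask.org/cluster-name": cluster_name,
                "dask.org/component": "worker",
            },
        },
        "spec": worker_spec,
    }
```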
The `build_scheduler_service_spec` is slightly more complicated, because the `selector` property depends on metadata that has been generated in the `build_scheduler_pod_spec` and is not provided by the developer. We must add documentation warning users that they must provide the correct values for `selector`, or else the service will not know which pod to connect to.

Edit: The decision was made to use the above approach, as I believe it is the more idiomatic one and k8s developers will be aware of the purpose of the service `selector` property.

We must also update all of the places these helper methods are called to pass in the appropriate values.
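A matching sketch for the service helper (again, the signature and details are assumptions, not the final code): the user's service spec is passed straight through, which is exactly why the `selector` has to be filled in correctly in the custom resource.

```python
def build_scheduler_service_spec(name, namespace, service_spec):
    """Sketch: wrap the user-supplied Service spec from the DaskCluster resource.

    The `selector` inside `service_spec` must match the labels applied to the
    scheduler pod (e.g. dask.org/cluster-name and dask.org/component),
    otherwise the Service will not route traffic to the scheduler.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": service_spec,
    }
```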
Testing
In order to run the test suite we have to update the `customresources` fixture defined in `dask_kubernetes/operator/tests/conftest.py` to run the build process first. This algorithm will look like the following (a sketch of the updated fixture is included after the list):

1. `k8s-crd-resolver -r dask_kubernetes/operator/customresources/daskcluster.yaml {temp_directory}/daskcluster.yaml`
2. `k8s-crd-resolver -r dask_kubernetes/operator/customresources/daskworkergroup.yaml {temp_directory}/daskworkergroup.yaml`
3. Apply the resolved manifests with the `k8s_cluster.kubectl("apply", "-f", temp_directory)` method
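As a sketch of what that could look like in `conftest.py` (assuming pytest and the existing `k8s_cluster` fixture; the paths and the teardown step shown here are illustrative, not the final implementation):

```python
import pathlib
import subprocess
import tempfile

import pytest

CRD_DIR = pathlib.Path("dask_kubernetes/operator/customresources")


@pytest.fixture
def customresources(k8s_cluster):
    """Resolve the CRD templates with k8s-crd-resolver, then apply them to the test cluster."""
    with tempfile.TemporaryDirectory() as temp_directory:
        for crd in ("daskcluster", "daskworkergroup"):
            subprocess.check_call(
                [
                    "k8s-crd-resolver",
                    "-r",
                    str(CRD_DIR / f"{crd}.yaml"),
                    f"{temp_directory}/{crd}.yaml",
                ]
            )
        k8s_cluster.kubectl("apply", "-f", temp_directory)
        yield
        # Clean-up step is an assumption; the real fixture may handle teardown differently.
        k8s_cluster.kubectl("delete", "-f", temp_directory)
```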
@jacobtomlinson @Matt711 let me know what you guys think about this - it currently (and deliberately) excludes the build/publish part of the process since we're still working through that in #447.