Helm Upgrade Causes the Spark Operator To Stop Working #1554

Closed
estiller opened this issue Jun 23, 2022 · 4 comments

Comments


estiller commented Jun 23, 2022

Hi everyone,
We install the Spark Operator Helm chart (v1.1.24) as part of our system's umbrella chart. Our problem is that when we run helm upgrade on the umbrella chart, the upgrade causes the Spark Operator chart to recreate the operator's Service Account, Cluster Role, and Cluster Role Binding. This is because of the hook-delete-policy introduced in #1384.
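For context, the hook annotations in question look roughly like this on the operator's RBAC resources (a paraphrased sketch, not the chart's exact template; the before-hook-creation policy makes Helm delete the previous object and create a fresh one on every upgrade):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  annotations:
    "helm.sh/hook": pre-install, pre-upgrade
    "helm.sh/hook-delete-policy": hook-failed, before-hook-creation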

This causes the Spark Operator itself to stop functioning, since the pod is not recreated or restarted. Because the service account was recreated, its token changes after the upgrade, and the Spark Operator pod can no longer access the Kubernetes API. You can see it in the logs after it happens:

E0623 01:55:52.879781      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: the server has asked for the client to provide credentials (get sparkapplications.sparkoperator.k8s.io)
E0623 01:55:53.808330      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:55:55.566717      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:55:59.732919      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:56:06.900876      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: the server has asked for the client to provide credentials (get pods)
E0623 01:56:07.875561      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:56:09.434049      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:56:10.088295      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:56:14.792479      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:56:23.007482      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:56:26.308825      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:56:39.562384      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:57:09.297253      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:57:10.760446      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:57:45.396531      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:57:51.592719      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
E0623 01:58:32.519788      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1beta2.SparkApplication: failed to list *v1beta2.SparkApplication: Unauthorized
E0623 01:58:47.470333      11 reflector.go:127] pkg/mod/k8s.io/client-go@<version>/tools/cache/reflector.go:156: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized

This is very similar to an issue I found in another project.

The issue is only resolved by manually restarting the pod, or by waiting 60 minutes until the access token is refreshed and the Spark Operator resumes operation.

Our current workaround is to restart the pod after each deployment. This is not ideal, as it requires us to run an extra "manual" command after the Helm deployment completes.
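For reference, that extra step amounts to something like the following (the deployment name and namespace depend on the release, so both are illustrative here):

kubectl rollout restart deployment/my-release-spark-operator -n spark-operator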

Possible solutions include:

  1. Not deleting the service account after every deployment/upgrade.
  2. Restarting the pod after every deployment, as described in the Helm documentation (see the sketch below).

Assuming that option 2 is the better choice due to Helm hook limitations, I can open a PR implementing this solution. What do you think?
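For reference, the pattern from the Helm documentation ("automatically roll deployments") would look roughly like this in the operator's Deployment template (a sketch; the /rbac.yaml template path is illustrative):

spec:
  template:
    metadata:
      annotations:
        # Any change to the rendered RBAC templates changes this checksum,
        # which forces the operator pod to roll on upgrade.
        checksum/rbac: {{ include (print $.Template.BasePath "/rbac.yaml") . | sha256sum }}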


anry007 commented Aug 31, 2022

Proposed solution 2 can be implemented by setting spark-operator.podAnnotations.checksum/build to a constantly changing value in your deployment script, e.g.:

helm upgrade my-umbrella-chart . --set spark-operator.podAnnotations.checksum/build=${MY_BUILD_VERSION}

This will trigger the Spark Operator pod to restart.
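For example, in a CI pipeline (a sketch; assumes the deploy step runs from a git checkout):

# Use the commit SHA as the annotation value so it changes on every deploy
helm upgrade my-umbrella-chart . \
  --set spark-operator.podAnnotations.checksum/build="$(git rev-parse --short HEAD)"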

DerekTBrown pushed a commit to DerekTBrown/spark-on-k8s-operator that referenced this issue Oct 14, 2022

julienlau commented Feb 22, 2023

Isn't this a problem of out-of-order operations?
I think the problem is that the service account is always recreated on helm upgrade, but the Spark Operator pods do not wait for the service account to be recreated before restarting on upgrade.
So most of the time the pods start with the previous service account, just before it is deleted and recreated...

Maybe not recreating the SA would be a better solution? I see that the chart already has options:

  • serviceAccounts.sparkoperator.create
  • serviceAccounts.spark.create

Maybe these could be forced to false when upgrading?
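If so, forcing them off on upgrade might look roughly like this (a sketch; it assumes the service accounts were created by the initial install and are left untouched afterwards):

helm upgrade my-umbrella-chart . \
  --set spark-operator.serviceAccounts.sparkoperator.create=false \
  --set spark-operator.serviceAccounts.spark.create=false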


github-actions bot commented Sep 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot commented Sep 23, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions bot closed this as not planned on Sep 23, 2024