Helm Upgrade Causes the Spark Operator To Stop Working #1554
Proposed solution #2 can be implemented by setting spark-operator.podAnnotations.checksum/build to a value that changes on every deployment in your deployment script, e.g.:
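The original snippet is missing from this copy of the thread, so here is a minimal sketch of what such a deployment script might look like. The release name, chart path, and the DRY_RUN switch are hypothetical, and a Unix timestamp is just one way to get a value that differs on every run. `--set-string` is used so that the numeric-looking value is passed as a string, since annotation values must be strings.

```shell
#!/bin/sh
# Hypothetical release name and chart path; adjust to your setup.
RELEASE="${RELEASE:-my-release}"
CHART="${CHART:-./umbrella-chart}"

# A value that changes on every deployment (assumption: a Unix timestamp is good enough).
BUILD_ID="$(date +%s)"
SET_FLAG="spark-operator.podAnnotations.checksum/build=${BUILD_ID}"

# By default only print the command; set DRY_RUN=0 to actually run the upgrade.
if [ "${DRY_RUN:-1}" = "0" ]; then
  helm upgrade "$RELEASE" "$CHART" --set-string "$SET_FLAG"
else
  echo "helm upgrade $RELEASE $CHART --set-string $SET_FLAG"
fi
```

Because the annotation ends up in the pod template, a changed value makes the Deployment's pod spec differ from the previous revision, which rolls the pod.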
This will trigger the Spark Operator to restart.
Isn't this a problem of operations running out of order? Maybe not recreating the service account would be a better solution?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Hi everyone,
We install the Spark Operator Helm template (v1.1.24) as part of our system's umbrella chart. Our problem is that when we run helm upgrade on the umbrella chart, the upgrade causes the Spark Operator template to recreate the operator's Service Account, Cluster Role and Cluster Role Binding. This is because of the hook-delete-policy introduced in #1384.
This causes the Spark Operator itself to stop functioning, since the pod itself is not recreated or restarted. Because the service account was recreated, its token is different after the upgrade, and the Spark Operator pod can no longer access the Kubernetes API. You can see it in the logs after it happens.
This is very similar to an issue I found in another project.
The issue is only resolved by manually restarting the pod, or waiting for 60 minutes until the access token is refreshed and the Spark Operator resumes operation.
Our current workaround is to restart the pod after each deployment. This is not ideal as it requires us to run an extra "manual" command after the Helm deployment completes.
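For reference, the restart step can be scripted roughly as follows. This is only a sketch: the namespace and Deployment name are hypothetical (they depend on how the chart is installed), and the DRY_RUN switch is an addition for illustration.

```shell
#!/bin/sh
# Hypothetical namespace and Deployment name; check yours with `kubectl get deployments -A`.
NAMESPACE="${NAMESPACE:-spark-operator}"
DEPLOYMENT="${DEPLOYMENT:-spark-operator}"

# Restart the operator pod, then wait until the new pod is ready.
RESTART_CMD="kubectl rollout restart deployment/${DEPLOYMENT} --namespace ${NAMESPACE}"
STATUS_CMD="kubectl rollout status deployment/${DEPLOYMENT} --namespace ${NAMESPACE} --timeout=120s"

# By default only print the commands; set DRY_RUN=0 to actually run them.
if [ "${DRY_RUN:-1}" = "0" ]; then
  $RESTART_CMD && $STATUS_CMD
else
  echo "$RESTART_CMD"
  echo "$STATUS_CMD"
fi
```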
Possible solutions include:
Assuming that option 2 is the better choice due to Helm hook limitations, I can open a PR implementing this solution. What do you think?