-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify service account for kaniko jobs #352
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @shydefoo . Can you test it in dev? The e2e test in github doesn't cover pyfunc image build.
apiVersion: v1 | ||
kind: ServiceAccount | ||
metadata: | ||
name: {{ .Values.merlin.imageBuilder.serviceAccount.name }}-{{ .Release.Name }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Is there any reason not to use full name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, i think this is wrong, will change to just use
{{ .Values.merlin.imageBuilder.serviceAccount.name }}
without the {{ .Release.Name }}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But without the release name, there could be conflict of service account when there is multiple merlin release in same namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that was my initial thought, but then when specifying service account specified externally it should respect the name provided. Pushing a fix for this.
@@ -152,6 +152,10 @@ spec: | |||
- name: IMG_BUILDER_CONTEXT_SUB_PATH | |||
value: "{{ .Values.merlin.imageBuilder.contextSubPath }}" | |||
{{- end }} | |||
{{- if .Values.merlin.imageBuilder.serviceAccount.name }} | |||
- name: IMG_BUILDER_KANIKO_SERVICE_ACCOUNT | |||
value: "{{ .Values.merlin.imageBuilder.serviceAccount.name }}-{{ .Release.Name }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Similar question to above
Yep i have tested it in dev environment manually, image building is successful for both model service and batch predictor. |
Perfect! |
* upstream/main: Support resource configuration for imagebuilder (#351)
@@ -348,7 +348,7 @@ jobs: | |||
LOCAL_REGISTRY: "dev.localhost" | |||
INGRESS_HOST: "127.0.0.1.nip.io" | |||
MLP_CHART_VERSION: 0.3.4 | |||
E2E_PYTHON_VERSION: "3.10" | |||
E2E_PYTHON_VERSION: "3.10.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version is specified to as leaving it as 3.10 gives this error: see this run https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. @ariefrahmansyah I think it's got to do with how the CI use cache key https://github.com/gojek/merlin/blob/36c63d3775b5b08d518dde87d35385c5c8cb00e0/.github/workflows/merlin.yml#L365 , I am not sure why we use ${{ matrix.python-version }}
, this task doesn't have matrix right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate more on why specifying E2E_PYTHON_VERSION to 3.10.6 solve the issue?
-> I am not sure why we use ${{ matrix.python-version }}, this task doesn't have matrix right?
Yeah, we should change the cache key to not using matrix.python-version, but still have to be different from other python jobs cache key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure on this. This is the error that happens consistently when 3.10
is used:
Notice: A new release of pip is available: 23.0 -> 23.0.1
Notice: To update, run: pip install --upgrade pip
Installing dependencies from Pipfile...
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.10/x64/bin/pipenv", line 8, in <module>
sys.exit(cli())
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/cli/options.py", line 56, in main
return super().main(*args, **kwargs, windows_expand_args=False)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/decorators.py", line [84](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:85), in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/cli/command.py", line 233, in install
do_install(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 2064, in do_install
do_init(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 1333, in do_init
do_install_dependencies(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line [86](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:87)1, in do_install_dependencies
batch_install(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 732, in batch_install
deps_to_install = [
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 733, in <listcomp>
dep for dep in deps_to_install if not project.environment.is_satisfied(dep)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/project.py", line 304, in environment
self._environment = self.get_environment(allow_global=allow_global)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/project.py", line 2[90](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:91), in get_environment
environment = Environment(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/environment.py", line 79, in __init__
self._base_paths = self.get_paths()
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/environment.py", line 390, in get_paths
c = subprocess_run(command)
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/utils/processes.py", line 75, in subprocess_run
return subprocess.run(
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line 503, in run
with Popen(*popenargs, **kwargs) as process:
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line [97](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:98)1, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line 1847, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/.local/share/virtualenvs/sdk-ywGzPJ0F/bin/python'
There was a similar issue raised here, pypa/pipenv#5075, but the solution is not clear to me. Since the python version here is 3.10.10
, i thought of specifying a fixed version and it seems to be able to install the dependencies.
@@ -33,4 +33,4 @@ kubectl create namespace ${E2E_PROJECT_NAME} --dry-run=client -o yaml | kubectl | |||
cd ../../python/sdk | |||
pip install pipenv==2022.8.19 | |||
pipenv install --dev --skip-lock --python ${PYTHON_VERSION} | |||
pipenv run pytest -n=8 -W=ignore --cov=merlin -m "not (feast or batch or pyfunc or local_server_test or cli or customtransformer)" | |||
pipenv run pytest -n=6 -W=ignore --cov=merlin -m "not (feast or batch or pyfunc or local_server_test or cli or customtransformer)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced number of parallel processes to 6 so less inference services are created at the same time, so less chances of deployment create timing out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have timeout issue now 😱
@ariefrahmansyah
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Our e2e test in GitHub is only deploying standard models though. Need more analysis on the root cause. I'll put it on our e2e test improvement doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry maybe timeout is the wrong word..I'm not 100% if it's a timeout issue, but i keep seeing this similar error in the debug log whenever the test fails:
merlin-log
{"level":"error","ts":1677142137.5520687,"caller":"cluster/controller.go:167","msg":"unable to create inference service logger-s-1 Internal error occurred: failed calling webhook \"inferenceservice.kserve-webhook-server.validator\": Post \"[https://kserve-webhook-server-service.kserve.svc:443/validate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s\](https://kserve-webhook-server-service.kserve.svc/validate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s\)": EOF","stacktrace":"github.com/gojek/merlin/cluster.(*controller).Deploy\n\t/src/api/cluster/controller.go:167\ngithub.com/gojek/merlin/queue/work.(*ModelServiceDeployment).Deploy\n\t/src/api/queue/work/model_service_deployment.go:115\ngithub.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:81"}
{"level":"error","ts":1677142137.5602424,"caller":"work/model_service_deployment.go:117","msg":"unable to deploy version endpoint for model: logger-s, version: 1, reason: error creating inference service","stacktrace":"github.com/gojek/merlin/queue/work.(*ModelServiceDeployment).Deploy\n\t/src/api/queue/work/model_service_deployment.go:117\ngithub.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:81"}
{"level":"error","ts":1677142137.5852926,"caller":"queue/worker.go:82","msg":"Job 1 failed with error: error creating inference service","stacktrace":"github.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:82"}
model-log
Error from server (BadRequest): container "kserve-container" in pod "multi-env-1-predictor-default-00001-deployment-64dc8f997c-s4hdl" is waiting to start: PodInitializing
e2e-tests finally passing, @pradithya will merge this PR if that's ok |
Please do. Thanks @shydefoo ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks, @shydefoo!
What this PR does / why we need it:
This PR adds the option to set the service account for kaniko jobs using the
IMG_BUILDER_KANIKO_SERVICE_ACCOUNT
environment variable. If this environment variable is empty, the existing behaviour will be used (ie mounting of K8s secret containing Google service account key into the pod).The helm chart has been modified to create the Kaniko job K8s SA if needed. Else, the K8s SA can be created externally and specified in the helm chart values for merlin to use.
Previously, all kaniko jobs will have:
GOOGLE_APPLICATION_CREDENTIALS
env var setgcloud auth activate-service-account
commandWhen this environment variable is set, the K8s kaniko image builder jobs (prediction and batch) will use the specified K8s service account, and not mount the K8s secret as part of the job's PodSpec.
The dockerfile used by batch-prediction and pyfunc-server is modified to run
gcloud auth activate-service-account
command only ifGOOGLE_APPLICATION_CREDENTIALS
argument is set.Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?:
NONE
Checklist