This repository has been archived by the owner on Aug 17, 2023. It is now read-only.
Kubeflow Fairing TrainJob creates an image with Root user and fairing job pod will not execute on AKS which has policy to not allow Docker containers running as Root user #525
Labels
/kind bug
What steps did you take and what happened:
I am running a simple Fairing example (shown here) with the Microsoft Azure backend.
from kubeflow import fairing
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowAzureBackend
from kubeflow.fairing.kubernetes.utils import get_resource_mutator

class Trainer(object):
    def train(self):
        print("hello world!")

from kubeflow.fairing.builders.cluster.azurestorage_context import StorageContextSource
BuildContext = StorageContextSource(
    region=AZURE_REGION, resource_group_name=AZURE_RESOURCE_GROUP,
    storage_account_name=AZURE_STORAGE_ACCOUNT
)
job = TrainJob(Trainer,
               input_files=['ames_dataset/train.csv', "requirements.txt"],
               docker_registry=DOCKER_REGISTRY, base_docker_image=None,
               backend=KubeflowAzureBackend(build_context_source=BuildContext))
job.submit()
When job.submit() executes, I get the following messages (no errors). The command then never finishes, and nothing happens beyond this point.
[I 200722 19:15:28 azure:156] Creating secret 'storage-credentials-5a318d6e' in namespace 'pshah'
[W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
[W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
[W 200722 19:15:29 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
[W 200722 19:15:31 manager:298] Waiting for fairing-builder-5zzdt-mxn9b to start...
When I checked the status of the Fairing job using kubectl, I noticed the following:
state:
  waiting:
    message: container has runAsNonRoot and image will run as root
    reason: CreateContainerConfigError
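For anyone hitting the same hang, the waiting state above can also be pulled out of the pod JSON programmatically. The following is only a minimal sketch, assuming the pod was dumped with `kubectl get pod <name> -o json`; the helper name `blocked_containers`, the set of reasons treated as fatal, and the container name in the sample are my own placeholders and not part of Fairing.

```python
# Waiting reasons that normally do not clear up without a spec or image change.
# This set is an assumption for illustration and is not exhaustive.
FATAL_WAITING_REASONS = {
    "CreateContainerConfigError",
    "CreateContainerError",
    "InvalidImageName",
}


def blocked_containers(pod):
    """Return (name, reason, message) for each container stuck in a fatal waiting state.

    `pod` is the dict obtained by parsing `kubectl get pod <name> -o json`.
    """
    status = pod.get("status") or {}
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    blocked = []
    for container_status in statuses:
        waiting = (container_status.get("state") or {}).get("waiting")
        if waiting and waiting.get("reason") in FATAL_WAITING_REASONS:
            blocked.append(
                (
                    container_status.get("name"),
                    waiting["reason"],
                    waiting.get("message", ""),
                )
            )
    return blocked


if __name__ == "__main__":
    # Sample mirroring the status in this report; the container name is a placeholder.
    sample_pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "example-container",
                    "state": {
                        "waiting": {
                            "reason": "CreateContainerConfigError",
                            "message": "container has runAsNonRoot and image will run as root",
                        }
                    },
                }
            ]
        }
    }
    for name, reason, message in blocked_containers(sample_pod):
        print(f"{name}: {reason}: {message}")
```

A pod that is merely in `ContainerCreating` is not reported, so the helper only flags states that are unlikely to resolve by waiting.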
I checked with our cluster team, and they confirmed that our AKS cluster has a policy that does not allow Docker containers to run as the root user, so the pod is scheduled but never executes. When Fairing builds an image, the image runs as the root user by default.
What did you expect to happen:
The error should have been clearly displayed when executing TrainJob.submit(). The command should not remain stuck waiting forever. Also, Kubeflow Fairing commands (including TrainJob.submit()) need a setting through which we can specify a non-root user for the Docker image that Fairing builds, pushes to the registry, and executes on AKS.
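To make the expected behavior concrete, here is a rough sketch of a wait loop that fails fast on a fatal waiting reason and gives up after a timeout instead of polling forever. It is not Fairing's actual implementation: `wait_for_pod_start`, `PodStartError`, and the `get_waiting_reason` callable are hypothetical names introduced for illustration, and the default list of fatal reasons is an assumption.

```python
import time


class PodStartError(RuntimeError):
    """Raised when a pod is stuck in a waiting state that will not resolve by itself."""


def wait_for_pod_start(
    get_waiting_reason,
    timeout_s=300.0,
    poll_s=2.0,
    fatal_reasons=("CreateContainerConfigError",),
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until the pod starts; raise on a fatal reason or when the timeout expires.

    `get_waiting_reason` is a hypothetical callable that returns None once the
    container has started, or a (reason, message) tuple while it is waiting.
    `sleep` and `clock` are injectable so the loop can be exercised without delays.
    """
    deadline = clock() + timeout_s
    while True:
        waiting = get_waiting_reason()
        if waiting is None:
            return  # The container has started.
        reason, message = waiting
        if reason in fatal_reasons:
            # Surface the kubelet message instead of logging "Waiting..." forever.
            raise PodStartError(f"{reason}: {message}")
        if clock() >= deadline:
            raise TimeoutError(
                f"pod still waiting after {timeout_s}s (last reason: {reason})"
            )
        sleep(poll_s)
```

With a loop of this shape, the runAsNonRoot failure in this report would be raised to the caller of submit() together with the kubelet message, rather than only being visible through kubectl.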
Anything else you would like to add:
How can I run Fairing's TrainJob.submit() successfully if my cluster has a policy that does not allow Docker images that run as the root user?
Environment:
Fairing version (python -c "import kubeflow.fairing; print(kubeflow.fairing.__version__)"): 1.0.1
Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.7", GitCommit:"169db3bff4b5fb7722e967c5b6356713f05f15ed", GitTreeState:"clean", BuildDate:"2020-04-03T16:14:09Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
OS (/etc/os-release):
NOTE: If you are using Fairing from master, please provide us the git commit hash.