Skip to content

Latest commit

 

History

History
360 lines (273 loc) · 16.1 KB

README.adoc

File metadata and controls

360 lines (273 loc) · 16.1 KB

Istio, Kubeflow on OpenShift 4.2 / CodeReady Containers

The above solution uses Dex as OIDC provider. It should be relatively straightforward to use RH-SSO/Keycloak. Please find labs and videos with RH-SSO on OpenShift 4.2 at http://bit.ly/33rjfry

Preparation

For detailed instructions and videos on setting up CodeReady Containers / OpenShift 4.2 on bare metal servers, please see:

crc version && oc version && kfctl version

version: 1.0.0+575079b
OpenShift version:4.2.0 (embedded in binary)
Client Version: v4.3.0
Server Version: 4.2.0
Kubernetes Version: v1.14.6+2e5ed54
kfctl v0.7.0-rc.5-7-gc66ebff3

Before installing Kubeflow, we need to take some further actions. The disk space of the VM is too low, so we will expand it first:

sudo dnf -y install libguestfs libguestfs-tools
export CRC_MACHINE_IMAGE=${HOME}/.crc/machines/crc/crc
#example : export CRC_MACHINE_IMAGE=/home/demouser/.crc/machines/crc/crc
crc stop
#adding 500G
sudo qemu-img resize ${CRC_MACHINE_IMAGE} +500G
sudo cp ${CRC_MACHINE_IMAGE} ${CRC_MACHINE_IMAGE}.ORIGINAL
sudo virt-resize --expand /dev/sda3 ${CRC_MACHINE_IMAGE}.ORIGINAL ${CRC_MACHINE_IMAGE}
sudo rm ${CRC_MACHINE_IMAGE}.ORIGINAL
crc setup
crc start
crc status

Now login as kubeadmin:

oc login -u kubeadmin -p wyozw-5ywAy-5yoap-7rj8q https://api.crc.testing:6443

After the VM starts, we need to take some actions to ensure everything runs correctly:

  • The PVs provided by CRC are of type hostPath. This means that securityContext settings like fsGroup won’t work correctly, which is a problem for Jupyter Notebooks and the OIDC AuthService. To fix this, we ssh into the VM and make the PV folder read/writeable by all users:

ssh -i ~/.crc/machines/crc/id_rsa [email protected]
sudo chmod "0777" -R /mnt/pv-data/
exit
# istio-system
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-ingress-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:default
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:grafana
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:prometheus
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-egressgateway-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-citadel-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-ingressgateway-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-cleanup-old-ca-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-mixer-post-install-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-mixer-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-pilot-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-sidecar-injector-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-galley-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-multi
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:builder
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:deployer
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-cleanup-secrets-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-grafana-post-install-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:istio-security-post-install-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:istio-system:kiali-service-account

oc adm policy add-scc-to-user anyuid -z istio-egressgateway-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-citadel-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-ingressgateway-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-cleanup-old-ca-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-mixer-post-install-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-mixer-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-pilot-service-account -n istio-system
oc adm policy add-scc-to-user anyuid -z istio-sidecar-injector-service-account -n istio-system

# kubeflow
oc adm policy add-scc-to-user anyuid system:serviceaccount:kubeflow:default
oc adm policy add-scc-to-user anyuid system:serviceaccount:kubeflow:admission-webhook-service-account
oc adm policy add-scc-to-user anyuid system:serviceaccount:kubeflow:katib-controller
oc adm policy add-scc-to-user anyuid system:serviceaccount:kubeflow:katib-ui

oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:ml-pipeline
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:pipeline-runner
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:admission-webhook-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:application-controller-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:builder
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:centraldashboard
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:default
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:deployer
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:jupyter-web-app-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:argo
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:argo-ui
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:metadata-ui
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:ml-pipeline
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:ml-pipeline-persistenceagent
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:ml-pipeline-ui
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:ml-pipeline-viewer-crd-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:notebook-controller-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:profiles-controller-service-account
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:spartakus
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:tf-job-dashboard
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:tf-job-operator
oc adm policy add-role-to-user cluster-admin  system:serviceaccount:kubeflow:pytorch-operator



oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:ml-pipeline
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:pipeline-runner
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:admission-webhook-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:application-controller-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:builder
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:centraldashboard
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:default
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:deployer
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:jupyter-web-app-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:argo
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:argo-ui
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:metadata-ui
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:ml-pipeline
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:ml-pipeline-persistenceagent
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:ml-pipeline-ui
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:ml-pipeline-viewer-crd-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:notebook-controller-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:profiles-controller-service-account
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:spartakus
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:tf-job-dashboard
oc adm policy add-scc-to-user privileged   system:serviceaccount:kubeflow:tf-job-operator

Install Kubeflow

The instructions are available in the existing_arrikto config docs (https://www.kubeflow.org/docs/started/k8s/kfctl-existing-arrikto/). We copy them here for the sake of reproducibility.

# Download the kfctl binary
wget 'https://github.com/kubeflow/kubeflow/releases/download/v0.7.0-rc.6/kfctl_v0.7.0-rc.5-7-gc66ebff3_linux.tar.gz'
tar -xvf kfctl_v0.7.0-rc.5-7-gc66ebff3_linux.tar.gz


# Add kfctl to PATH, to make the kfctl binary easier to use.
# Use only alphanumeric characters or - in the directory name.
export PATH=$PATH:"<path-to-kfctl>"

# Set the following kfctl configuration file:
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_existing_arrikto.0.7.0.yaml"

# Set KF_NAME to the name of your Kubeflow deployment. You also use this
# value as directory name when creating your configuration directory.
# For example, your deployment name can be 'my-kubeflow' or 'kf-test'.
export KF_NAME=<your choice of name for the Kubeflow deployment>

# Set the path to the base directory where you want to store one or more
# Kubeflow deployments. For example, /opt/.
# Then set the Kubeflow application directory for this deployment.
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}

mkdir -p ${KF_DIR}
cd ${KF_DIR}

# Download the config file and change the default login credentials.
wget -O kfctl_existing_arrikto.yaml $CONFIG_URI
export CONFIG_FILE=${KF_DIR}/kfctl_existing_arrikto.yaml

# Credentials for the default user are [email protected]:12341234
# To change them, please edit the dex-auth application parameters
# inside the KfDef file.
vim $CONFIG_FILE

kfctl apply -V -f ${CONFIG_FILE}

Post-Install Fixes

  • Add permissions for notebooks/finalizers on notebook-controller-role ClusterRole.

oc edit clusterrole notebook-controller-role -n kubeflow
  • Add permissions for workflow delete and workflows/finalizers on argo ClusterRole.

oc edit clusterrole argo -n kubeflow
  • Add permissions for experiments, trials and suggestions finalizers on katib-controller ClusterRole.

oc edit clusterrole katib-controller -n kubeflow
  • Add pods, pods/status, pods/finalizers resources on tf-job-operator ClusterRole.

oc edit clusterrole tf-job-operator
  • After installing, you may notice that some istio Pods are in CrashLoopBackoff. This happens when Istio Pods don’t have enough memory and end up getting OOMKilled. To fix it, please allocate more RAM to those Pods by editing their deployments. A proposed value is 256Mi for requests and 512Mi for limits.

oc edit deployment istio-ingressgateway -n istio-system
oc edit deployment istio-egressgateway -n istio-system
oc edit deployment istio-pilot -n istio-system
oc edit deployment istio-policy -n istio-system
...
  • When creating a notebook, you may notice that it can’t assign the fsGroup it desires. To give it the necessary permissions, add it to the nonroot scc:

NS=<ns>
oc adm policy add-scc-to-user anyuid system:serviceaccount:${NS}:default-editor

oc adm policy add-scc-to-user privileged -z default-editor  -n ${NS}

Connect to Kubeflow

After setting up everything, you can connect to Kubeflow by exposing the istio-ingressgateway Service.

oc expose service istio-ingressgateway --port 80 -n istio-system

You can also expose the ingressgateway via port-forward:

oc port-forward -n istio-system svc/istio-ingressgateway 8080:80

If you run CRC in a VM, you can use a SOCKS5 proxy to access the Kubeflow website:

ssh -D 127.0.0.1:12345 <user>@<public-ip>
google-chrome --incognito --user-data-dir=/tmp/delme --proxy-server=socks5://127.0.0.1:12345 --dns-prefetch-disable

Change the container runtime executor from docker to pns

oc edit cm workflow-controller-configmap -n kubeflow

containerRuntimeExecutor: pns

Compile and deploy pipelines

On RHEL 8.2:
yum install @python36
sudo pip3 install https://storage.googleapis.com/ml-pipeline/release/latest/kfp.tar.gz --upgrade
wget https://raw.githubusercontent.com/kubeflow/pipelines/master/samples/contrib/volume_ops/volumeop_sequential.py
dsl-compile --py volumeop_sequential.py --output volumeop.tar.gz
Upload the compiled pipeline (volumeop.tar.gz) to Kubeflow
Run the volumeop pipeline and validate that everything works
oc get pvc
NAME                               STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
katib-mysql                        Bound    pv0014   10Gi       RWO,ROX,RWX                   5h43m
metadata-mysql                     Bound    pv0015   10Gi       RWO,ROX,RWX                   5h43m
minio-pv-claim                     Bound    pv0003   20Gi       RWO,ROX,RWX                   5h43m
mypvc                              Bound    pv0024   10Gi       RWO,ROX,RWX                   110m
mysql-pv-claim                     Bound    pv0002   20Gi       RWO,ROX,RWX                   5h43m
newpvc                             Bound    pv0017   10Gi       RWO,ROX,RWX                   4h29m
volumeop-sequential-2zz4q-newpvc   Bound    pv0022   10Gi       RWO,ROX,RWX                   5h17m
volumeop-sequential-52whw-newpvc   Bound    pv0028   10Gi       RWO,ROX,RWX                   45m
volumeop-sequential-k9tvz-newpvc   Bound    pv0016   10Gi       RWO,ROX,RWX                   13m
volumeop-sequential-qls6g-newpvc   Bound    pv0018   10Gi       RWO,ROX,RWX                   33m
$ ssh -i ~/.crc/machines/crc/id_rsa [email protected]
Red Hat Enterprise Linux CoreOS 42.80.20191010.0
[core@crc-847lc-master-0 ~]$ cd /mnt/pv-data/
[core@crc-847lc-master-0 pv-data]$ ls
pv0001  pv0003  pv0005  pv0007  pv0009  pv0011  pv0013  pv0015  pv0017  pv0019  pv0021  pv0023  pv0025  pv0027  pv0029
pv0002  pv0004  pv0006  pv0008  pv0010  pv0012  pv0014  pv0016  pv0018  pv0020  pv0022  pv0024  pv0026  pv0028  pv0030
[core@crc-847lc-master-0 pv-data]$ cd pv0016
[core@crc-847lc-master-0 pv0016]$ ls
file1  file2
[core@crc-847lc-master-0 pv0016]$ cat file1
1

Hyperparameter tuning with Katib

See short video at https://youtu.be/zMQFOjrfhrU

git clone https://github.com/kubeflow/katib.git
cd katib/examples/v1alpha3
oc logs pytorchjob-example-z6qzbgv6-master-0 -c metrics-collector
....
I1117 00:31:19.116058      23 main.go:78] Train Epoch: 1 [28160/60000 (47%)]	loss=0.0890
I1117 00:31:25.963848      23 main.go:78] Train Epoch: 1 [28800/60000 (48%)]	loss=0.0514
I1117 00:31:35.932306      23 main.go:78] Train Epoch: 1 [29440/60000 (49%)]	loss=0.1868
....

Modify existing examples to be configurable and run well in distributed mode

https://github.com/kubeflow/examples/tree/master/mnist#prepare-model