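Host environment:
The xdl GPU Docker image currently provided appears to have the following problems:
The image is missing the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES, so when a container is started with nvidia-docker (docker run --runtime=nvidia), the host's nvidia-driver libraries and tools, including nvidia-smi, are not loaded;
The image ships its own nvidia-driver, which causes a conflict.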
According to the official nvidia-docker documentation:
https://github.com/NVIDIA/nvidia-docker
https://devblogs.nvidia.com/gpu-containers-runtime/
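the container should use the host's nvidia-driver directly, and the image should not contain an nvidia-driver of its own. If the two environment variables from the first problem above are added when starting the image with nvidia-docker, the runtime tries to load the host's nvidia-driver libraries and tools, these conflict with the identical tools already present in the image, and the image fails to start.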
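According to the documentation on GitHub: https://github.com/alibaba/x-deeplearning/wiki/%E7%BC%96%E8%AF%91%E5%AE%89%E8%A3%85
and the Dockerfile on GitHub: https://github.com/alibaba/x-deeplearning/blob/master/xdl/docker/Dockerfile
installing cuda=9.0.176-1 causes nvidia-driver to be installed again inside the image. The fix is to follow the approach of NVIDIA's official image Dockerfile (https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/9.0/base/Dockerfile) and install only the required CUDA compute libraries separately.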
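The CUDA 9 libraries in the image are at version 410.72, which places requirements on the host's nvidia-driver version. The fix is to reinstall the CUDA components when building the image;
pip in the image cannot be used directly.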
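To confirm the first problem from inside a running container, a small check like the one below may help. It is only a sketch: the function name check_nvidia_env is made up for illustration, and it only tests whether the two variables named above are set, not whether the driver libraries actually load.

```shell
#!/bin/sh
# Hypothetical helper: report which of the two NVIDIA runtime variables
# are missing from the current environment.
# Prints one line per missing variable; returns 1 if any is unset or empty.
check_nvidia_env() {
    missing=0
    for name in NVIDIA_VISIBLE_DEVICES NVIDIA_DRIVER_CAPABILITIES; do
        # Indirect variable lookup that works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_nvidia_env in a shell inside the container prints nothing and returns 0 when both variables are present.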
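Solution:
A patch to the Dockerfile containing the fixes above has been submitted.
In addition, XDL depends on HDFS, so before actually running a model the following environment variables also need to be set: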
export HADOOP_USER_NAME=hdfs
export HADOOP_HOME="/home/ec2-user/chenhe/TDM/x-deeplearning/xdl/test/binary/hadoop-2.8.5"
export HADOOP_HDFS_HOME="/home/ec2-user/chenhe/TDM/x-deeplearning/xdl/test/binary/hadoop-2.8.5"
export PATH=$HADOOP_HOME/bin:$PATH
export CLASSPATH=$(hadoop classpath --glob):$CLASSPATH
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export LD_LIBRARY_PATH="/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server":$LD_LIBRARY_PATH
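Before launching a model, it can be worth verifying that these settings took effect. The following is a minimal sketch, not part of XDL: the function name check_hdfs_env is hypothetical, and it only checks that the variables are non-empty and that a hadoop launcher is reachable on PATH.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check the Hadoop/Java environment set above.
# Prints one line per problem found; returns 1 if anything is wrong.
check_hdfs_env() {
    rc=0
    for name in HADOOP_HOME HADOOP_HDFS_HOME JAVA_HOME CLASSPATH; do
        # Indirect variable lookup that works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            rc=1
        fi
    done
    # `hadoop classpath --glob` only works if the launcher is on PATH.
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH"
        rc=1
    fi
    return "$rc"
}
```

Running check_hdfs_env after the export lines should print nothing; any output names the variable or tool that still needs attention.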