Article:
mp.weixin.qq.com/s/-Drh7PnYfW8BElgjILx9MA
Hands-on:
project-hami.io/zh/docs/installation/online-installation
Notes:
1. hami-device-plugin conflicts with nvidia-device-plugin. To deploy hami-device-plugin, uninstall nvidia-device-plugin first.
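A minimal sketch of how the conflicting plugin could be found and removed before installing HAMi. The function names are hypothetical, and the match pattern `nvidia-device-plugin` is an assumption about how the DaemonSet was named in your cluster.

```shell
# Print "<namespace> <name>" for every DaemonSet whose name contains
# "nvidia-device-plugin". The pattern is an assumption; adjust it to the
# name your installation actually used.
find_nvidia_device_plugin() {
  kubectl get ds -A --no-headers |
    awk '$2 ~ /nvidia-device-plugin/ { print $1, $2 }'
}

# Delete every DaemonSet reported by find_nvidia_device_plugin.
remove_nvidia_device_plugin() {
  find_nvidia_device_plugin | while read -r ns name; do
    kubectl delete ds "$name" -n "$ns"
  done
}
```

Run `find_nvidia_device_plugin` first and review its output before calling `remove_nvidia_device_plugin`. If the plugin was installed through a package manager such as Helm, uninstalling that release is probably the cleaner route than deleting the DaemonSet directly.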
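2. hami-device-plugin-xxxxx does not start after deployment: no hami-device-plugin Pod is running in the cluster.
Solution:
Label every node that has a GPU with gpu=on, because the DaemonSet (ds) has a nodeSelector that requires this label.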
# kubectl get ds -n kube-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
hami-device-plugin   1         1         1       1            1           gpu=on          132m
Command:
kubectl label nodes worker2 gpu=on
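When several GPU nodes need the label, the command above can be wrapped in a small helper. This is only a sketch: `label_gpu_nodes` is a hypothetical name, and `--overwrite` is added so that re-running it on an already labelled node does not fail.

```shell
# Label each node passed as an argument with gpu=on, then list the nodes
# that carry the label so the result can be checked by eye.
label_gpu_nodes() {
  local node
  for node in "$@"; do
    kubectl label nodes "$node" gpu=on --overwrite
  done
  kubectl get nodes -l gpu=on
}
```

For example, `label_gpu_nodes worker2` labels the node used in this note. Once a node carries the label, the DaemonSet should schedule a hami-device-plugin Pod onto it.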
3. hami-device-plugin-xxxxx starts, but the Pod status is PostStartHookError or CrashLoopBackOff, and the log contains:
I0506 14:54:55.508785 7402 main.go:395] Retrieving plugins.
E0506 14:54:55.508868 7402 factory.go:135] Incompatible strategy detected auto
E0506 14:54:55.508875 7402 factory.go:136] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0506 14:54:55.508879 7402 factory.go:137] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0506 14:54:55.508883 7402 factory.go:138] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0506 14:54:55.508887 7402 factory.go:139] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0506 14:54:55.508956 7402 main.go:201] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy
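The cause is that the container cannot read the NVIDIA library files. The host library directory has to be mounted into the container, and the container needs permission to read it.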
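Before changing the DaemonSet, it may help to confirm both halves of this diagnosis. The sketch below uses two hypothetical helpers: one checks a log stream for the failure message quoted above, the other checks whether a directory contains an NVML library file. The file name pattern `libnvidia-ml.so*` and the default directory are assumptions based on a typical Debian or Ubuntu driver install.

```shell
# Succeeds if the log text on stdin contains the failure signature
# from the log excerpt in this note.
has_strategy_error() {
  grep -q 'invalid device discovery strategy'
}

# Succeeds if the given directory contains an NVML library file.
# Default directory: the host path used in this note (an assumption
# for other distributions).
host_has_nvml() {
  local dir="${1:-/usr/lib/x86_64-linux-gnu}"
  ls "$dir"/libnvidia-ml.so* >/dev/null 2>&1
}
```

A possible use is `kubectl logs -n kube-system hami-device-plugin-xxxxx -c device-plugin | has_strategy_error && echo matched`, followed by running `host_has_nvml` on the GPU node itself. If the library is on the host but the log still shows the error, the missing mount is the likely culprit.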
Solution:
device-plugin container:
name: device-plugin
securityContext:
  privileged: true
...
volumeMounts:
- mountPath: /usr/lib/x86_64-linux-gnu
  name: nvidia-libs
...
vgpu-monitor container:
name: vgpu-monitor
resources: {}
securityContext:
  privileged: true
  ...
...
volumeMounts:
- mountPath: /usr/lib/x86_64-linux-gnu
  name: nvidia-libs
...
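volumes configuration:
volumes:
- hostPath:
    path: /usr/lib/x86_64-linux-gnu
    type: ""
  name: nvidia-libs
After the configuration is applied, delete the Pod stuck in CrashLoopBackOff; the new Pod that replaces it runs normally.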
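The final step, deleting the failing Pods, can be sketched as follows. `recreate_hami_pods` is a hypothetical helper; it assumes the Pods live in kube-system (the namespace of the DaemonSet shown earlier) and that their names start with `hami-device-plugin-`.

```shell
# Delete every hami-device-plugin Pod in kube-system. The DaemonSet
# controller then recreates them from the updated spec.
recreate_hami_pods() {
  local pod
  for pod in $(kubectl get pods -n kube-system -o name |
                 grep '^pod/hami-device-plugin-'); do
    kubectl delete -n kube-system "$pod"
  done
}
```

Afterwards, `kubectl get pods -n kube-system` should show the replacement Pod reaching Running instead of CrashLoopBackOff.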