GPU
Install & use GPU on-prem
Official solution: How to install the NVIDIA GPU Operator with OpenShift
Entitle your OpenShift cluster
If you don't, you run into: Bug 1835446 - Special resource operator gpu-driver-container pod error related to elfutils-libelf-devel
These instructions assume you downloaded an entitlement encoded in base64 from access.redhat.com or extracted it from an existing node.
In the following commands, the entitlement certificate is copied to nvidia.pem, but it can be copied to any accessible location.
# On a RHEL8 machine, pick the entitlement from /etc/pki/entitlement/
# To inspect the entitlement: rct cat-cert /etc/pki/entitlement/xxx.pem
cat /etc/pki/entitlement/xxxxxx*.pem > nvidia.pem
curl -O https://raw.githubusercontent.com/openshift-psap/blog-artifacts/master/how-to-use-entitled-builds-with-ubi/0003-cluster-wide-machineconfigs.yaml.template
sed -i -f - 0003-cluster-wide-machineconfigs.yaml.template << EOF
s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g
EOF
oc apply -f 0003-cluster-wide-machineconfigs.yaml.template
Based on NVIDIA docs
Wait until the machineconfigpool is updated
oc wait --timeout=1800s --for=condition=Updated machineconfigpool/worker
# or check via oc get
oc get machineconfigpool/worker
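Once the worker pool has finished updating, it can be worth confirming that entitled content is visible from inside a container before installing the operator. This is only a sketch (the pod name is arbitrary, and it assumes the default UBI8 image is reachable): an entitled pod should be able to find RHEL packages such as kernel-devel.
# Throwaway pod: with a working entitlement, dnf should list kernel-devel packages
oc run entitlement-test --rm -it --restart=Never \
  --image=registry.access.redhat.com/ubi8:latest \
  -- dnf search kernel-devel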
Install NVIDIA GPU Operator
NVIDIA Documentation: OpenShift on NVIDIA GPU Accelerated Clusters
Create new project/namespace
oc new-project gpu-operator-resources
Install Operators
Actions:
Install NVIDIA GPU Operator (installs the Node Feature Discovery Operator as a dependency)
Instantiate Node Feature Discovery Operator
Instantiate NVIDIA GPU Operator
oc apply -f - <<EOF
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  dcgmExporter:
    image: dcgm-exporter
    repository: nvidia
    version: 1.7.2-2.0.0-rc.11-ubi8
  devicePlugin:
    image: k8s-device-plugin
    repository: nvidia
    version: 1.0.0-beta6-ubi8
  driver:
    image: driver
    repository: nvidia
    version: 440.64.00
  operator:
    defaultRuntime: crio
  toolkit:
    image: container-toolkit
    repository: nvidia
    version: 1.0.2-ubi8
EOF
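The driver and toolkit daemonsets are only scheduled on nodes that Node Feature Discovery has labeled with an NVIDIA PCI device (vendor ID 10de). A quick sanity check, assuming the default NFD label set:
# GPU nodes should carry the NFD label for PCI vendor 10de (NVIDIA)
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true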
Check that all pods are Running or Completed
$ oc get pods
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-operator-76d9bd6c65-7782f              1/1     Running     0          33m
nfd-master-jkdmw                           1/1     Running     0          81m
nfd-master-l82dp                           1/1     Running     0          81m
nfd-master-sdggd                           1/1     Running     0          81m
nfd-operator-684fcd5c8d-p7qcc              1/1     Running     0          82m
nfd-worker-6btp5                           1/1     Running     0          81m
nfd-worker-ch56g                           1/1     Running     0          81m
nfd-worker-jc572                           1/1     Running     1          81m
nfd-worker-llggg                           1/1     Running     0          81m
nvidia-container-toolkit-daemonset-jbv2b   1/1     Running     0          82m
nvidia-dcgm-exporter-mfnb9                 1/1     Running     0          2m23s
nvidia-device-plugin-daemonset-wdfhq       1/1     Running     0          82m
nvidia-device-plugin-validation            0/1     Completed   0          18m
nvidia-driver-daemonset-2tzpb              1/1     Running     0          29m
nvidia-driver-validation                   0/1     Completed   0          32m
Note
If the pod nvidia-device-plugin-validation is stuck in Pending, the cause can be
that the GPU node reports no nvidia.com/gpu capacity. A reboot of the node helped.
oc debug node/...
chroot /host
reboot
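After the reboot, the node should advertise GPU capacity again; <gpu-node> below is a placeholder for the actual node name:
# Capacity and Allocatable should both list nvidia.com/gpu with a non-zero count
oc describe node <gpu-node> | grep nvidia.com/gpu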
Run test workload
$ oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
  - image: nvidia/cuda
    name: nvidia-smi
    command: [ "nvidia-smi" ]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
EOF
$ oc logs nvidia-smi
Thu Jun  4 07:24:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    24W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
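Once the output looks good, the test pod can be removed:
oc delete pod nvidia-smi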