
General OpenShift & GPU Nodes

Known Issues

ImagePullBackOff of nvidia-driver-daemonset-* pods

$ oc describe pods -l app=nvidia-driver-daemonset  | grep image
  Normal   Pulling         116s (x4 over 3m36s)  kubelet, compute-0  Pulling image "nvidia/driver:440.64.00-rhel8.2"
  Warning  Failed          112s (x4 over 3m31s)  kubelet, compute-0  Failed to pull image "nvidia/driver:440.64.00-rhel8.2": rpc error: code = Unknown desc = Error reading manifest 440.64.00-rhel8.2 in docker.io/nvidia/driver: manifest unknown: manifest unknown
  Normal   BackOff         86s (x6 over 3m31s)   kubelet, compute-0  Back-off pulling image "nvidia/driver:440.64.00-rhel8.2"

Double check if the image is available:

$ curl -s https://registry.hub.docker.com/v1/repositories/nvidia/driver/tags | jq -r ' .[] | .name' | grep rhel8
418.87.01--rhel8
418.87.01-4.18.0-80.11.2.el8_0.x86_64-rhel8
418.87.01-rhel8
440.33.01-1.0.0-custom-rhel8
440.33.01-4.18.0-80.11.2.el8_0.x86_64-rhel8
440.33.01-rhel8
440.64.00-1.0.0-custom-rhel8
440.64.00-1.0.0-rhel8

Image is missing
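
The listing above is filtered with a substring grep, so the missing tag has to be spotted by eye. A minimal sketch of an exact-match check follows; the helper name has_tag is hypothetical, and the tag list is copied from the output above rather than fetched live.

```shell
# has_tag TAG: succeeds only if TAG appears as a complete line on stdin.
# (Hypothetical helper, not part of any NVIDIA or OpenShift tooling.)
has_tag() {
  grep -Fxq -- "$1"
}

# Tags copied from the listing above.
tags='418.87.01--rhel8
418.87.01-4.18.0-80.11.2.el8_0.x86_64-rhel8
418.87.01-rhel8
440.33.01-1.0.0-custom-rhel8
440.33.01-4.18.0-80.11.2.el8_0.x86_64-rhel8
440.33.01-rhel8
440.64.00-1.0.0-custom-rhel8
440.64.00-1.0.0-rhel8'

if printf '%s\n' "$tags" | has_tag "440.64.00-rhel8.2"; then
  echo "tag found"
else
  echo "tag missing"
fi
```

Against a live registry, the output of the curl and jq pipeline shown above could be piped into has_tag in place of the pasted list.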

How the GPU operator builds the image tag:

1) All fields of clusterpolicy.spec.driver
2) The OS version of the node. On OpenShift it uses the node labels feature.node.kubernetes.io/system-os_release.VERSION_ID and feature.node.kubernetes.io/system-os_release.ID (Source)

To check the node labels, run:

oc get nodes -L feature.node.kubernetes.io/system-os_release.VERSION_ID -L feature.node.kubernetes.io/system-os_release.ID
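
As a rough illustration of how those pieces could combine into the image reference seen in the pod events, here is a minimal sketch. The function name driver_image is hypothetical, and the concatenation order is inferred from the reference "nvidia/driver:440.64.00-rhel8.2" on this page, not taken from the operator source, so other operator versions may differ.

```shell
# driver_image REPOSITORY IMAGE VERSION OS_ID OS_VERSION_ID
# Assumption: the reference has the form <repository>/<image>:<version>-<ID><VERSION_ID>.
driver_image() {
  printf '%s/%s:%s-%s%s\n' "$1" "$2" "$3" "$4" "$5"
}

# The values "rhel" and "8.2" are inferred from the failing tag above,
# not read from a real node label.
driver_image nvidia driver 440.64.00 rhel 8.2
# prints: nvidia/driver:440.64.00-rhel8.2
```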

Solution: use a local image copy with a matching tag

Create local image copy with matching tag:

oc -n gpu-operator-resources import-image \
  nvidia-driver:440.64.00-rhel8.2 \
  --from=docker.io/nvidia/driver:440.64.00-rhcos4.5 \
  --reference-policy=local --confirm

Update clusterpolicy/cluster-policy (oc edit clusterpolicy/cluster-policy) to:

spec:
...
  driver:
    image: nvidia-driver
    repository: image-registry.openshift-image-registry.svc:5000/gpu-operator-resources
    version: 440.64.00
...
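
If an interactive edit is inconvenient, the same three fields could probably be set with a merge patch. The sketch below only builds and prints the patch document; the helper name driver_patch is hypothetical, and the oc patch call is left commented out because it needs a live cluster.

```shell
# driver_patch REPOSITORY IMAGE VERSION
# Prints a JSON merge patch for the spec.driver fields shown above.
# (Hypothetical helper; values must not contain double quotes.)
driver_patch() {
  printf '{"spec":{"driver":{"repository":"%s","image":"%s","version":"%s"}}}\n' "$1" "$2" "$3"
}

patch=$(driver_patch \
  image-registry.openshift-image-registry.svc:5000/gpu-operator-resources \
  nvidia-driver \
  440.64.00)
echo "$patch"

# To apply it against a cluster (not executed here):
# oc patch clusterpolicy/cluster-policy --type merge -p "$patch"
```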

Error: Unable to find a match: kernel-headers-4.18.0-193.23.1.el8_2.x86_64 kernel-devel-4.18.0-193.23.1.el8_2.x86_64

The package kernel-headers-4.18.0-193.23.1.el8_2.x86_64 is only available in these repos:

* rhocp-4.5-for-rhel-8-x86_64-rpms
* rhocp-4.3-for-rhel-8-x86_64-rpms

rhocp-4.5-for-rhel-8-x86_64-rpms

Try to install by hand:

$ oc debug nvidia-driver-daemonset-95bfc
Starting pod/nvidia-driver-daemonset-95bfc-debug, command was: nvidia-driver init
Pod IP: 10.131.0.24
If you don't see a command prompt, try pressing enter.
sh-4.4# dnf -q -y install kernel-headers-4.18.0-193.23.1.el8_2.x86_64 kernel-devel-4.18.0-193.23.1.el8_2.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-193.23.1.el8_2.x86_64 kernel-devel-4.18.0-193.23.1.el8_2.x86_64
sh-4.4# dnf install --enablerepo=rhocp-4.5-for-rhel-8-x86_64-rpms -q -y kernel-headers-4.18.0-193.23.1.el8_2.x86_64 kernel-devel-4.18.0-193.23.1.el8_2.x86_6
sh-4.4#

Adding --enablerepo=rhocp-4.5-for-rhel-8-x86_64-rpms solves the problem, so let's patch the driver image.

Patching driver image

Fork the repo https://gitlab.com/nvidia/container-images/driver

Warning

This is the upstream version and differs from the product, so it might be hard to get running.

Adjust the script rhel8/nvidia-driver in your fork.
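
The exact change depends on the script's current contents, which are not reproduced here. As a hedged sketch of the idea only, the kernel package install line could take an optional extra repository. The function name kernel_install_cmd is hypothetical and is not the name used in the upstream script. It only prints the command instead of running dnf.

```shell
# kernel_install_cmd KERNEL_VERSION [EXTRA_REPO]
# Prints the dnf command line, adding --enablerepo only when a repo is given.
# (Hypothetical helper for illustration; the real script runs dnf directly.)
kernel_install_cmd() {
  cmd="dnf -q -y install"
  if [ -n "$2" ]; then
    cmd="$cmd --enablerepo=$2"
  fi
  echo "$cmd kernel-headers-$1 kernel-devel-$1"
}

kernel_install_cmd 4.18.0-193.23.1.el8_2.x86_64 rhocp-4.5-for-rhel-8-x86_64-rpms
```

Keeping the repository as a parameter means the same script still works on nodes whose kernel packages are in the default repos.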

Build the container image:

oc create imagestream nvidia-driver -n gpu-operator-resources
oc apply -n gpu-operator-resources -f - <<EOF
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: nvidia-driver
  labels:
    name: nvidia-driver
spec:
  triggers:
    - type: ConfigChange
  source:
    contextDir: "rhel8/"
    type: Git
    git:
      uri: https://gitlab.com/rbohne/driver.git
  strategy:
    type: Docker
    dockerStrategy:
      env:
      - name: DRIVER_VERSION
        value: 440.64.00
      buildArgs:
      - name: DRIVER_VERSION
        value: 440.64.00
  output:
    to:
      kind: ImageStreamTag
      name: 'nvidia-driver:440.64.00-rhel8.2'
EOF

Update clusterpolicy/cluster-policy (oc edit clusterpolicy/cluster-policy) to:

spec:
...
  driver:
    image: nvidia-driver
    repository: image-registry.openshift-image-registry.svc:5000/gpu-operator-resources
    version: 440.64.00
...

Problems with OCP >=4.4.11

Log output from nvidia-driver-daemonset (Server Version: 4.4.12, kernel 4.18.0-147.20.1.el8_1.x86_64):

...
Installing Linux kernel headers...
+ dnf -q -y install kernel-headers-4.18.0-147.20.1.el8_1.x86_64 kernel-devel-4.18.0-147.20.1.el8_1.x86_64
Error: Unable to find a match: kernel-headers-4.18.0-147.20.1.el8_1.x86_64 kernel-devel-4.18.0-147.20.1.el8_1.x86_64
...

| OpenShift Version | CoreOS Version | Kernel Version | Kernel headers available in repo |
|---|---|---|---|
| 4.4.5 | 44.81.202005180831-0 | 4.18.0-147.8.1.el8_1.x86_64 | rhel-8-for-x86_64-baseos-rpms,... |
| 4.4.12 | 44.81.202007070223-0 | 4.18.0-147.20.1.el8_1.x86_64 | rhel-8-for-x86_64-baseos-eus-rpms (8.1) |
| 4.5.2 | 45.82.202007141718-0 | 4.18.0-193.13.2.el8_2.x86_64 | rhel-8-for-x86_64-baseos-rpms,... |
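
The el8_1 or el8_2 fragment of the kernel release shows which RHEL minor release the kernel belongs to, which helps when deciding whether an EUS repo for that minor release is needed. A minimal sketch of extracting it follows; the function name rhel_minor_from_kernel is hypothetical, and it prints nothing for kernel strings without an "_N" suffix.

```shell
# rhel_minor_from_kernel KERNEL_RELEASE
# Maps e.g. "...el8_1.x86_64" to "8.1". Prints nothing if there is no
# "elN_M" fragment in the string.
rhel_minor_from_kernel() {
  printf '%s\n' "$1" |
    sed -n 's/.*\.el\([0-9][0-9]*\)_\([0-9][0-9]*\)\..*/\1.\2/p'
}

rhel_minor_from_kernel 4.18.0-147.20.1.el8_1.x86_64   # prints: 8.1
rhel_minor_from_kernel 4.18.0-193.13.2.el8_2.x86_64   # prints: 8.2
```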

Get OS Version of OpenShift Release

$ oc adm release info 4.5.3 -o jsonpath="{.displayVersions.machine-os.Version}"
45.82.202007171855-0

Red Hat Internal Links: OpenShift Release page => 45.82.202007171855-0 => OS Content

Problems with OpenShift 4.5.x

NVIDIA does not provide a suitable CoreOS driver image.


Last update: October 15, 2020