
NVML mismatch error on a mixed-GPU k8s worker node

Tags
Ops
Infra
k8s
Debug
Wrote
2024.04

Issue

  • I am experiencing intermittent issues, particularly on nodes equipped with both RTX 4090 and RTX A6000 GPUs. Occasionally GPU pods, and sometimes even the worker nodes themselves, cannot access the GPU and show the following error:
NVML: Driver/library version mismatch
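A quick way to see whether a particular GPU pod is affected is to run nvidia-smi both on the host and inside the pod. This is a minimal sketch; the pod and namespace names are placeholders:

# Check the host first
nvidia-smi | head -n 3

# Then check from inside a suspect GPU pod (pod/namespace are placeholders)
kubectl exec -n <namespace> <gpu-pod> -- nvidia-smi
# An affected pod typically prints something like:
#   Failed to initialize NVML: Driver/library version mismatch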
 

How to solve it once (or manually)?

  • I wrote a script to detect the NVML mismatch error and reload the NVIDIA modules:
#!/bin/bash

# Log start
echo "Starting NVIDIA module check at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log

# Initial check for the NVML mismatch error (capture stderr too, match case-insensitively)
initial_check=$(nvidia-smi 2>&1)
if echo "$initial_check" | grep -qi "mismatch"; then
    echo "NVML mismatch detected at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log

    # Switch to multi-user target to stop graphical sessions
    sudo systemctl isolate multi-user.target

    # Kill all processes using NVIDIA devices
    sudo lsof /dev/nvidia* | awk 'NR > 1 {print $2}' | sudo xargs kill

    # Unload NVIDIA kernel modules
    sudo modprobe -r nvidia-drm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia

    # Restart the graphical target
    sudo systemctl start graphical.target

    # Recheck and send a Slack notification if the reset succeeded
    output=$(nvidia-smi 2>&1)
    if echo "$output" | grep -qi "mismatch"; then
        echo "NVML mismatch still present after reset at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
    else
        echo "NVIDIA module reset process completed successfully at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
        curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"An NVML mismatch occurred on {nodename}, so the nvidia modules were reloaded"}' \
            ${SLACK_HOOK_URL}
    fi
else
    echo "No NVML mismatch detected at $(date), no action taken." >> /home/dudaji/kade-lab/nvidia-reset.log
fi

# Final log for the run
echo "Final check completed at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
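Before wiring the script into cron, it can be tested once by hand. Note that ${SLACK_HOOK_URL} must be present in the environment for the Slack notification to fire; the webhook URL below is only a placeholder:

chmod +x /home/dudaji/kade-lab/nvml-resolver.sh

# Placeholder webhook URL; sudo -E keeps the exported variable
export SLACK_HOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
sudo -E /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh

# Inspect the log to confirm the result
tail -n 5 /home/dudaji/kade-lab/nvidia-reset.log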
 
  • crontab -e
*/10 * * * * sudo /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh
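One caveat (my assumption, not from the original entry): cron runs jobs with a minimal environment and sudo resets it by default, so ${SLACK_HOOK_URL} will not reach the script unless it is hard-coded in the script or the job lives in root's crontab with the variable defined there, e.g.:

# Run from root's crontab (sudo crontab -e); the URL below is a placeholder
SLACK_HOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
*/10 * * * * /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh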
 
  • The script above resolves the error at the worker-node (host) level, but it cannot recover GPU pods that have already been disconnected from the GPU device (unix socket); those pods still need to be recreated (see the sketch below).
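A rough sketch of how I would recreate the affected GPU pods after the modules are reloaded. This is my assumption rather than part of the original script, and the node name, namespace, and pod name are placeholders:

# List GPU pods running on the affected node (node name is a placeholder)
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide

# Delete the pods that can no longer see the GPU so their controllers recreate them
kubectl delete pod <gpu-pod> -n <namespace>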
 

How to prevent the error from recurring (WIP, experiment)

 
  • I saw a discussion of this issue in the Stack Overflow post below
 
  • I got a hint by using the dmesg command:
dmesg | grep nvidia
...
...
[16510886.124665] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
[16510886.833410] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
 
dmesg | grep -i nvrm
[16486187.189440] NVRM: API mismatch: the client has the version 535.171.04, but
                  NVRM: this kernel module has the version 535.161.07. Please
                  NVRM: make sure that this kernel module and all NVIDIA driver
                  NVRM: components have the same version.
 
  • dkms status
dkms status
nvidia, 535.171.04, 5.15.0-102-generic, x86_64: installed
nvidia, 535.171.04, 5.15.0-84-generic, x86_64: installed
 
  • /var/lib/dkms
tree /var/lib/dkms/
/var/lib/dkms/
├── dkms_dbversion
└── nvidia
    ├── 535.171.04
    │   ├── 5.15.0-102-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   ├── 5.15.0-84-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   └── source -> /usr/src/nvidia-535.171.04
    ├── kernel-5.15.0-102-generic-x86_64 -> 535.171.04/5.15.0-102-generic/x86_64
    └── kernel-5.15.0-84-generic-x86_64 -> 535.171.04/5.15.0-84-generic/x86_64
 
  • From this I could tell that the node is still running the 535.161.07 kernel module, while the client (userspace) libraries are trying to use 535.171.04 (nvidia-smi also reports 535.171.04).
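To confirm which versions are loaded versus installed, these standard checks can be used (my addition, not part of the original notes):

# Version of the NVIDIA kernel module currently loaded into the running kernel
cat /proc/driver/nvidia/version

# Version of the nvidia.ko file on disk for the running kernel
modinfo nvidia | grep -i ^version

# Version of the userspace NVML library that nvidia-smi links against
ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*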
 
  • My guess is that this occurred when I installed the gpu-operator on the k8s node and did not restart the machine afterwards.
 
nvidia-smi
Fri Apr 19 11:06:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 46%   40C    P8              12W / 450W |   6278MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:A1:00.0 Off |                  Off |
| 30%   39C    P8               7W / 300W |      0MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 45%   39C    P8              22W / 450W |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     494725      C   /opt/conda/bin/python3                    6272MiB |
+---------------------------------------------------------------------------------------+
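To check whether the userspace driver packages really were upgraded behind my back (for example by unattended-upgrades), the APT logs can be inspected. This is my own suggestion rather than something from the original troubleshooting session:

# When the nvidia driver packages were last upgraded (history.log is rotated, hence zgrep)
zgrep -i "nvidia" /var/log/apt/history.log* | tail

# Whether unattended-upgrades pulled them in
zgrep -i "nvidia" /var/log/unattended-upgrades/unattended-upgrades.log* | tail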
 

So what do I need to do?

Prevent unintended upgrades of the driver version
 
 
  • Solution 1: Disable unattended-upgrades entirely
sudo vi /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
sudo dpkg-reconfigure unattended-upgrades
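To verify that the change took effect, a quick check like the following should work (my own verification step, not from the original notes):

# Should show APT::Periodic::Unattended-Upgrade "0"; after the change
apt-config dump | grep -i unattended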
 
  • Solution 2: Add the nvidia packages to the unattended-upgrades blacklist
sudo vi /etc/apt/apt.conf.d/50unattended-upgrades
// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
    // The following matches all packages starting with linux-
    // "linux-";

    // Use $ to explicitely define the end of a package name. Without
    // the $, "libc6" would match all of them.
    // "libc6$";
    // "libc6-dev$";
    // "libc6-i686$";

    // Special characters need escaping
    // "libstdc\+\+6$";

    // The following matches packages like xen-system-amd64, xen-utils-4.1,
    // xenstore-utils and libxenstore3.0
    // "(lib)?xen(store)?";

    // For more information about Python regular expressions, see
    // https://docs.python.org/3/howto/regex.html
};
sudo dpkg-reconfigure unattended-upgrades
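Finally, a rough way to confirm that the blacklist is being picked up (again my own verification step; the grep pattern is only illustrative):

# nvidia packages should now appear as excluded in the dry-run output
sudo unattended-upgrade --dry-run --debug 2>&1 | grep -i nvidia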