Issue
- I am seeing an intermittent problem, mainly on nodes equipped with RTX 4090 and RTX A6000 GPUs. Occasionally GPU pods, and sometimes even the worker nodes themselves, cannot access the GPU and show the following error:
NVML: Driver/library version mismatch
How to fix it once (manually)?
- I wrote a script that detects the NVML mismatch error and reloads the NVIDIA kernel modules
#!/bin/bash
# Reload the NVIDIA kernel modules when nvidia-smi reports an NVML version mismatch.
LOG=/home/dudaji/kade-lab/nvidia-reset.log
# Log start
echo "Starting NVIDIA module check at $(date)" >> "$LOG"
# Initial check for the NVML mismatch error.
# nvidia-smi prints "... Driver/library version mismatch" (lowercase "mismatch"),
# so match case-insensitively and capture stderr as well.
initial_check=$(nvidia-smi 2>&1)
if echo "$initial_check" | grep -qi "mismatch"; then
    echo "NVML mismatch detected at $(date)" >> "$LOG"
    # Switch to multi-user target to stop graphical sessions
    sudo systemctl isolate multi-user.target
    # Kill all processes using NVIDIA devices (-r: do not run kill when no PID is found)
    sudo lsof /dev/nvidia* | awk 'NR > 1 {print $2}' | sort -u | sudo xargs -r kill
    # Unload NVIDIA kernel modules, dependents first
    # (the earlier "modprobe -r nvidia-drm" was redundant with the first rmmod)
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia
    # Restart the graphical target
    sudo systemctl start graphical.target
    # Recheck operation and send Slack notification if successful
    output=$(nvidia-smi 2>&1)
    if echo "$output" | grep -qi "mismatch"; then
        echo "NVML mismatch still present after reset at $(date)" >> "$LOG"
    else
        echo "NVIDIA module reset process completed successfully at $(date)" >> "$LOG"
        # SLACK_HOOK_URL must be defined for cron; the node name is assumed to be the hostname
        curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"NVML mismatch occurred on $(hostname), so the nvidia modules were reloaded\"}" "${SLACK_HOOK_URL}"
    fi
else
    echo "No NVML mismatch detected at $(date), no action taken." >> "$LOG"
fi
# Final log for operation
echo "Final check completed at $(date)" >> "$LOG"
- crontab -e
*/10 * * * * sudo /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh
- The script above fixes the error at the worker-node host level, but it cannot fix a GPU pod that has been disconnected from the GPU device (unix socket)
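Since the host-level reset does not reach pods that already lost the device, it can help to at least see which pods on the node request a GPU. The following is a minimal sketch, not a verified fix: the function name `list_gpu_pods` is hypothetical, it assumes `kubectl` is configured on the machine where it runs, that the Kubernetes node name equals the hostname, and that GPU pods declare `nvidia.com/gpu` under `resources.limits`. Whether recreating those pods restores GPU access is an assumption that still needs to be tested.

```shell
#!/bin/bash
# Print "namespace/name" for every pod on the given node that has a
# container with an nvidia.com/gpu limit.
list_gpu_pods() {
    local node=$1
    kubectl get pods --all-namespaces \
        --field-selector "spec.nodeName=${node}" \
        -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.nvidia\.com/gpu}{"\n"}{end}' |
        awk -F'\t' '$2 != "" {print $1}'   # keep only rows with a non-empty GPU limit
}

# Example run: only attempted when kubectl is available.
# Assumption: the node name is the same as the hostname.
if command -v kubectl >/dev/null 2>&1; then
    list_gpu_pods "$(hostname)"
fi
```

Each printed entry can then be inspected or deleted by hand (for example with `kubectl delete pod -n <namespace> <name>`) so that its controller recreates it.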
How to prevent the error from recurring (WIP, experiment)
- I found a discussion of this error in a Stack Overflow post
- The dmesg output gave me a hint
dmesg | grep nvidia
...
...
[16510886.124665] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
[16510886.833410] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
dmesg | grep -i nvrm
[16486187.189440] NVRM: API mismatch: the client has the version 535.171.04, but
NVRM: this kernel module has the version 535.161.07. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
- dkms status
dkms status
nvidia, 535.171.04, 5.15.0-102-generic, x86_64: installed
nvidia, 535.171.04, 5.15.0-84-generic, x86_64: installed
- /var/lib/dkms
tree /var/lib/dkms/
/var/lib/dkms/
├── dkms_dbversion
└── nvidia
    ├── 535.171.04
    │   ├── 5.15.0-102-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   ├── 5.15.0-84-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   └── source -> /usr/src/nvidia-535.171.04
    ├── kernel-5.15.0-102-generic-x86_64 -> 535.171.04/5.15.0-102-generic/x86_64
    └── kernel-5.15.0-84-generic-x86_64 -> 535.171.04/5.15.0-84-generic/x86_64
- This shows that the node is still running the 535.161.07 kernel module while the client side is trying to use 535.171.04 (nvidia-smi also reports 535.171.04)
- My guess is that this happened when I installed gpu-operator on the k8s node and did not reboot the machine afterwards
nvidia-smi
Fri Apr 19 11:06:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 46%   40C    P8              12W / 450W |   6278MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:A1:00.0 Off |                  Off |
| 30%   39C    P8               7W / 300W |      0MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 45%   39C    P8              22W / 450W |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    494725      C   /opt/conda/bin/python3                     6272MiB |
+---------------------------------------------------------------------------------------+
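The diagnosis above (a loaded kernel module that is older than the driver installed on disk) can be checked without reading dmesg. This is a minimal sketch under stated assumptions: the loaded driver is assumed to expose its version at `/sys/module/nvidia/version`, `modinfo -F version nvidia` is assumed to resolve the module built for the running kernel, and the function names are hypothetical.

```shell
#!/bin/bash
# Version of the nvidia module currently loaded in the kernel (empty if not loaded).
loaded_version() {
    cat /sys/module/nvidia/version 2>/dev/null
}

# Version of the nvidia module installed on disk for the running kernel.
installed_version() {
    modinfo -F version nvidia 2>/dev/null
}

# Compare the two version strings and print a one-line verdict.
# Returns 0 on match, 1 on mismatch, 2 when either value is missing.
report_versions() {
    local loaded=$1 installed=$2
    if [ -z "$loaded" ] || [ -z "$installed" ]; then
        echo "unknown: loaded='${loaded}' installed='${installed}'"
        return 2
    elif [ "$loaded" = "$installed" ]; then
        echo "ok: ${loaded}"
        return 0
    else
        echo "mismatch: loaded=${loaded} installed=${installed}"
        return 1
    fi
}

# Report only; "|| true" keeps the exit status at 0 whatever the verdict.
report_versions "$(loaded_version)" "$(installed_version)" || true
```

With the versions seen in the dmesg output, `report_versions 535.161.07 535.171.04` reports a mismatch, which means the modules have to be reloaded (or the node rebooted) before the new driver takes effect.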
So what should I do?
Prevent unintended upgrades of the driver version
- Solution 1: Disable unattended-upgrades entirely
sudo vi /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
sudo dpkg-reconfigure unattended-upgrades
- Solution 2: Add the NVIDIA packages to the unattended-upgrades blacklist
sudo vi /etc/apt/apt.conf.d/50unattended-upgrades
// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
    // The following matches all packages starting with linux-
    // "linux-";
    // Use $ to explicitely define the end of a package name. Without
    // the $, "libc6" would match all of them.
    // "libc6$";
    // "libc6-dev$";
    // "libc6-i686$";
    // Special characters need escaping
    // "libstdc\+\+6$";
    // The following matches packages like xen-system-amd64, xen-utils-4.1,
    // xenstore-utils and libxenstore3.0
    // "(lib)?xen(store)?";
    // For more information about Python regular expressions, see
    // https://docs.python.org/3/howto/regex.html
};
sudo dpkg-reconfigure unattended-upgrades
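To see which installed packages the two blacklist patterns would actually cover, a rough check like the one below may help. It is only a sketch: the function name `is_blacklisted` is hypothetical, and it approximates the Python regular expression matching with `grep -E` anchored at the start of the package name, which is my understanding of how the blacklist is applied (the comments in the config file about `$` suggest prefix matching). Complex patterns may behave differently in the real tool.

```shell
#!/bin/bash
# Succeeds if the package name matches any of the given patterns
# at the START of the name (approximation of the blacklist matching).
is_blacklisted() {
    local pkg=$1
    shift
    local pat
    for pat in "$@"; do
        if printf '%s\n' "$pkg" | grep -Eq "^(${pat})"; then
            return 0
        fi
    done
    return 1
}

# Report every installed package with "nvidia" in its name,
# using the two patterns from the Package-Blacklist above.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package}\n' | grep -i nvidia | while read -r pkg; do
        if is_blacklisted "$pkg" "nvidia-" "libnvidia-"; then
            echo "excluded:    $pkg"
        else
            echo "NOT covered: $pkg"
        fi
    done
fi
```

Packages reported as "NOT covered" (for example names that only contain `nvidia` in the middle) would need their own pattern. Running `sudo unattended-upgrade --dry-run --debug` afterwards should show what the real tool decides without installing anything.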