
NVML mismatch error on a mixed-GPU k8s worker node

Tags
Ops
Infra
k8s
Debug
Wrote
2024.04

Issue

  • I am experiencing intermittent issues, particularly on nodes equipped with both RTX 4090 and RTX A6000 GPUs. Occasionally GPU pods, and sometimes even the worker nodes themselves, cannot access the GPU and show the following error:
NVML: Driver/library version mismatch
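A quick way to see whether a particular GPU pod is affected is to run nvidia-smi both on the host and inside the pod. This is a minimal sketch; the pod and namespace names are placeholders:

# Check the host first
nvidia-smi | head -n 3

# Then check from inside a suspect GPU pod (pod/namespace are placeholders)
kubectl exec -n <namespace> <gpu-pod> -- nvidia-smi
# An affected pod typically prints something like:
#   Failed to initialize NVML: Driver/library version mismatch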
 

How to solve it once (or manually)?

  • I wrote a script to detect the NVML mismatch error and reload the NVIDIA modules:
#!/bin/bash

# Log start
echo "Starting NVIDIA module check at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log

# Initial check for the NVML mismatch error (capture stderr too, match case-insensitively)
initial_check=$(nvidia-smi 2>&1)
if echo "$initial_check" | grep -qi "mismatch"; then
    echo "NVML mismatch detected at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log

    # Switch to multi-user target to stop graphical sessions
    sudo systemctl isolate multi-user.target

    # Kill all processes using NVIDIA devices
    sudo lsof /dev/nvidia* | awk 'NR > 1 {print $2}' | sudo xargs kill

    # Unload NVIDIA kernel modules
    sudo modprobe -r nvidia-drm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia_uvm
    sudo rmmod nvidia

    # Restart the graphical target
    sudo systemctl start graphical.target

    # Recheck and send a Slack notification if the reset succeeded
    output=$(nvidia-smi 2>&1)
    if echo "$output" | grep -qi "mismatch"; then
        echo "NVML mismatch still present after reset at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
    else
        echo "NVIDIA module reset process completed successfully at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
        curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"An NVML mismatch occurred on {nodename}, so the nvidia modules were reloaded"}' \
            ${SLACK_HOOK_URL}
    fi
else
    echo "No NVML mismatch detected at $(date), no action taken." >> /home/dudaji/kade-lab/nvidia-reset.log
fi

# Final log for the run
echo "Final check completed at $(date)" >> /home/dudaji/kade-lab/nvidia-reset.log
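Before wiring the script into cron, it can be tested once by hand. Note that ${SLACK_HOOK_URL} must be present in the environment for the Slack notification to fire; the webhook URL below is only a placeholder:

chmod +x /home/dudaji/kade-lab/nvml-resolver.sh

# Placeholder webhook URL; sudo -E keeps the exported variable
export SLACK_HOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"
sudo -E /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh

# Inspect the log to confirm the result
tail -n 5 /home/dudaji/kade-lab/nvidia-reset.log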
 
  • crontab -e
*/10 * * * * sudo /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh
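One caveat (my assumption, not from the original entry): cron runs jobs with a minimal environment and sudo resets it by default, so ${SLACK_HOOK_URL} will not reach the script unless it is hard-coded in the script or the job lives in root's crontab with the variable defined there, e.g.:

# Run from root's crontab (sudo crontab -e); the URL below is a placeholder
SLACK_HOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
*/10 * * * * /bin/bash /home/dudaji/kade-lab/nvml-resolver.sh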
 
  • The script above resolves the error at the worker-node (host) level, but it cannot recover GPU pods that have already been disconnected from the GPU device (unix socket); those pods still need to be recreated (see the sketch below).
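A rough sketch of how I would recreate the affected GPU pods after the modules are reloaded. This is my assumption rather than part of the original script, and the node name, namespace, and pod name are placeholders:

# List GPU pods running on the affected node (node name is a placeholder)
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide

# Delete the pods that can no longer see the GPU so their controllers recreate them
kubectl delete pod <gpu-pod> -n <namespace>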
 

How to prevent the error from recurring (WIP, experiment)

 
  • I saw a discussion of this issue in the Stack Overflow post below
 
  • I got a hint by using the dmesg command:
dmesg | grep nvidia
...
...
[16510886.124665] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
[16510886.833410] nvidia-modeset: Version mismatch: nvidia.ko(535.161.07) nvidia-modeset.ko(535.171.04)
 
dmesg | grep -i nvrm
[16486187.189440] NVRM: API mismatch: the client has the version 535.171.04, but
                  NVRM: this kernel module has the version 535.161.07. Please
                  NVRM: make sure that this kernel module and all NVIDIA driver
                  NVRM: components have the same version.
 
  • dkms status
dkms status
nvidia, 535.171.04, 5.15.0-102-generic, x86_64: installed
nvidia, 535.171.04, 5.15.0-84-generic, x86_64: installed
 
  • /var/lib/dkms
tree /var/lib/dkms/
/var/lib/dkms/
├── dkms_dbversion
└── nvidia
    ├── 535.171.04
    │   ├── 5.15.0-102-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   ├── 5.15.0-84-generic
    │   │   └── x86_64
    │   │       ├── log
    │   │       │   └── make.log
    │   │       └── module
    │   │           ├── nvidia-drm.ko
    │   │           ├── nvidia.ko
    │   │           ├── nvidia-modeset.ko
    │   │           ├── nvidia-peermem.ko
    │   │           └── nvidia-uvm.ko
    │   └── source -> /usr/src/nvidia-535.171.04
    ├── kernel-5.15.0-102-generic-x86_64 -> 535.171.04/5.15.0-102-generic/x86_64
    └── kernel-5.15.0-84-generic-x86_64 -> 535.171.04/5.15.0-84-generic/x86_64
 
  • From this I could tell that the node is still running the 535.161.07 kernel module, while the client (userspace) libraries are trying to use 535.171.04 (nvidia-smi also reports 535.171.04).
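To confirm which versions are loaded versus installed, these standard checks can be used (my addition, not part of the original notes):

# Version of the NVIDIA kernel module currently loaded into the running kernel
cat /proc/driver/nvidia/version

# Version of the nvidia.ko file on disk for the running kernel
modinfo nvidia | grep -i ^version

# Version of the userspace NVML library that nvidia-smi links against
ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*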
 
  • My guess is that this occurred when I installed the gpu-operator on the k8s node and did not restart the machine afterwards.
 
nvidia-smi
Fri Apr 19 11:06:08 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:41:00.0 Off |                  Off |
| 46%   40C    P8              12W / 450W |   6278MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:A1:00.0 Off |                  Off |
| 30%   39C    P8               7W / 300W |      0MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 45%   39C    P8              22W / 450W |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     494725      C   /opt/conda/bin/python3                    6272MiB |
+---------------------------------------------------------------------------------------+
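To check whether the userspace driver packages really were upgraded behind my back (for example by unattended-upgrades), the APT logs can be inspected. This is my own suggestion rather than something from the original troubleshooting session:

# When the nvidia driver packages were last upgraded (history.log is rotated, hence zgrep)
zgrep -i "nvidia" /var/log/apt/history.log* | tail

# Whether unattended-upgrades pulled them in
zgrep -i "nvidia" /var/log/unattended-upgrades/unattended-upgrades.log* | tail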
 

So what do I need to do?

Prevent unintended upgrades of the driver version
 
 
  • Solution 1: Disable unattended-upgrades entirely
sudo vi /etc/apt/apt.conf.d/20auto-upgrades
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "0";
sudo dpkg-reconfigure unattended-upgrades
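To verify that the change took effect, a quick check like the following should work (my own verification step, not from the original notes):

# Should show APT::Periodic::Unattended-Upgrade "0"; after the change
apt-config dump | grep -i unattended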
 
  • Solution 2: Add the nvidia packages to the unattended-upgrades blacklist
sudo vi /etc/apt/apt.conf.d/50unattended-upgrades
// Python regular expressions, matching packages to exclude from upgrading
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
    // The following matches all packages starting with linux-
    // "linux-";

    // Use $ to explicitely define the end of a package name. Without
    // the $, "libc6" would match all of them.
    // "libc6$";
    // "libc6-dev$";
    // "libc6-i686$";

    // Special characters need escaping
    // "libstdc\+\+6$";

    // The following matches packages like xen-system-amd64, xen-utils-4.1,
    // xenstore-utils and libxenstore3.0
    // "(lib)?xen(store)?";

    // For more information about Python regular expressions, see
    // https://docs.python.org/3/howto/regex.html
};
sudo dpkg-reconfigure unattended-upgrades
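Finally, a rough way to confirm that the blacklist is being picked up (again my own verification step; the grep pattern is only illustrative):

# nvidia packages should now appear as excluded in the dry-run output
sudo unattended-upgrade --dry-run --debug 2>&1 | grep -i nvidia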