Results

  • Developed an NPU chip monitoring tool as a shell script, similar to nvidia-smi
  • The script reads each value from files under /sys/class/xxx
# Collect every device directory (one path per array element), then print each device's status
readarray -t controls < <(find /sys/class/xxx/ -iname "xxx[0-9]" | sort)

date=$(date '+%Y-%m-%d %H:%M:%S')
echo "$date"
printf "+-----------------------------------------------+\n"
printf "| xxx ID\txxx NAME / BOARD NAME\t\t|\n"
printf "| Fan\tTemp\tPwr:Usage/Cap\t\t\t|\n"
printf "|===============================================|\n"

for control_name in "${controls[@]}"
do
    string=$(cat "${control_name}/xxx_status_monitor")
    declare -ai status
    readarray -td ' ' status < <(echo -n $string);
    id=$((${control_name//[!0-9]/}))
    #status[0] = watt
    #status[1] = prod_id
    #status[2] = tmu
    #status[3] = temp0
    #status[4] = temp1
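    # Raw reading is in units of 0.025 W; quote the expression so "*" is not glob-expanded
    power=$(echo "${status[0]} * 0.025" | bc)
    power=${power%.*}   # truncate to whole watts
    power=${power:-0}   # bc prints values below 1 as ".975", which truncates to an empty string
    # PROJECT_NAME, BOARD_NAME and CAP_POWER are not defined in this snippet; set them before this point
    printf "| %-2d\t\t%-20s\t\t|\n" "$id" "$PROJECT_NAME / $BOARD_NAME"
    printf "| N/A\t%02dC\t%3dW / %3dW\t\t\t|\n" "${status[3]}" "$power" "$CAP_POWER"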
    printf "+-----------------------------------------------+\n"
done

printf "+-----------------------------------------------+\n"
printf "| Processes:\t\t\t\t\t|\n"
printf "| xxx ID\tPID\tProcess Name\t\t|\n"
printf "|===============================================|\n"
for control_name in "${controls[@]}"
do
    string=$(cat "${control_name}/xxx_status_monitor")
    declare -ai status
    readarray -td ' ' status < <(echo -n $string);
    pid=$((status[5]))
    id=$((${control_name//[!0-9]/}))
    if [ ${pid} -eq 0 ]; then
        continue
    fi
    # "comm=" suppresses the header line, so only the command name is returned
    name=$(ps -p "${pid}" -o comm=)
    printf "| %-2d\t\t%-5d\t%-10s\t\t|\n" "$id" "$pid" "$name"
    printf "+-----------------------------------------------+\n"
done
  • Built a Prometheus exporter in Go

(base) kade/npu-exporter/exporter$ go run cmd/npu-exporter/main.go 
2024/07/11 13:58:42 Beginning to serve on port 9400
Collected metrics for NPU xxx0:
  Power usage: 4.98 W
  Temperature 0: 43.50 °C
  Temperature 1: 48.00 °C
  Process ID: 1036902
Collected metrics for NPU xxx0:
  Power usage: 4.95 W
  Temperature 0: 43.50 °C
  Temperature 1: 48.00 °C
  Process ID: 1036902
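
The exporter source is not shown here, so the following is only a minimal sketch of how a collector might turn one xxx_status_monitor line into the values logged above. It assumes the same field order and the same 0.025 W scale that the shell script uses; the `Status` type, the `ParseStatus` function, and the sample line are hypothetical, and the temperature fields are kept as raw integers because their scale is not documented on this page.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// wattsPerUnit is the raw-to-watt scale taken from the shell script (raw * 0.025).
const wattsPerUnit = 0.025

// Status mirrors the field order documented in the shell script:
// watt, prod_id, tmu, temp0, temp1, pid. The type itself is hypothetical.
type Status struct {
	PowerWatts float64
	ProdID     int
	TMU        int
	Temp0      int // raw value; unit and scale are assumptions
	Temp1      int // raw value; unit and scale are assumptions
	PID        int
}

// ParseStatus parses one space-separated status line into a Status.
func ParseStatus(line string) (Status, error) {
	fields := strings.Fields(line)
	if len(fields) < 6 {
		return Status{}, fmt.Errorf("expected at least 6 fields, got %d", len(fields))
	}
	vals := make([]int, 6)
	for i := range vals {
		v, err := strconv.Atoi(fields[i])
		if err != nil {
			return Status{}, fmt.Errorf("field %d: %w", i, err)
		}
		vals[i] = v
	}
	return Status{
		PowerWatts: float64(vals[0]) * wattsPerUnit,
		ProdID:     vals[1],
		TMU:        vals[2],
		Temp0:      vals[3],
		Temp1:      vals[4],
		PID:        vals[5],
	}, nil
}

func main() {
	// Hypothetical sample line; on a real host it would be read with
	// os.ReadFile from the device's xxx_status_monitor file.
	s, err := ParseStatus("198 1 0 43 48 1036902")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("Power usage: %.2f W\n", s.PowerWatts)
	fmt.Printf("Process ID: %d\n", s.PID)
}
```

Returning an error for short or non-numeric lines lets the caller skip a device whose status file is temporarily unreadable instead of exporting a misleading zero.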

Key Learnings and Insights

Through this project, I gained a deeper understanding of Linux's sysfs interface and how it exposes hardware state through ordinary file reads. By extracting values from the AI chip's entries under /sys/class/xxx, I built a lightweight monitoring tool in shell script, similar to nvidia-smi.

Additionally, implementing a Prometheus exporter in Golang provided hands-on experience with instrumenting system metrics for real-time monitoring. This project enhanced my understanding of how Prometheus exporters work internally, particularly in terms of data collection, metric exposition, and efficient scraping by monitoring systems.

Overall, this project strengthened my ability to interface with hardware, extract real-time performance data, and integrate it into scalable monitoring solutions, which are crucial skills for AI infrastructure and system observability.
