kade.im
AI CHIP (NPU) Monitoring Tool (Chip Pulse)

Project Years: 2023
Tags: NPU, AICHIP, Research
Skills: Docker, k8s, Python, NextJS

Project Overview

Led the development of a prototype web app for monitoring NPU cards used for AI inference. The platform is the first of its kind in Korea to integrate multiple NPU chip brands into a single monitoring tool.
 

Key Responsibilities

  • Project Management: Led the entire project lifecycle, ensuring seamless communication and issue resolution with Vietnamese designers and developers.
  • Dashboard and Architecture Design: Crafted intuitive dashboard screens and architected data structures to manage users, clusters, servers, NPUs, storage, and inference endpoints.
  • API Development: Created NPU inference endpoint APIs using Kubernetes and Istio to streamline AI inference workflows.
  • Object Storage Integration: Implemented features for user-specific Object Storage (MinIO) creation and deletion to manage inference data effectively.
  • Time-Series Data Collection: Deployed InfluxDB for capturing and analyzing NPU utilization metrics over time.
  • User Interface Development: Built a Streamlit-based UI for NPU inference tasks, enabling video uploads for object detection and returning detailed inference results.
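
As a rough illustration of the data structures mentioned above, the entities (users, clusters, servers, NPUs, storage, and inference endpoints) could be modeled along these lines. This is a minimal sketch, not the project's actual schema: every class name, field name, and relationship here is a hypothetical choice made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NPU:
    """One NPU card in a server (illustrative fields only)."""
    index: int
    vendor: str   # hypothetical vendor label, e.g. "vendor-a"
    model: str


@dataclass
class Server:
    hostname: str
    npus: List[NPU] = field(default_factory=list)


@dataclass
class Cluster:
    name: str
    servers: List[Server] = field(default_factory=list)

    def npu_count(self) -> int:
        # Total number of NPU cards across all servers in the cluster.
        return sum(len(server.npus) for server in self.servers)


@dataclass
class InferenceEndpoint:
    name: str
    cluster: str  # name of the cluster serving this endpoint
    url: str


@dataclass
class User:
    username: str
    bucket: Optional[str] = None  # per-user object storage bucket, if created
    clusters: List[Cluster] = field(default_factory=list)
    endpoints: List[InferenceEndpoint] = field(default_factory=list)
```

Keeping the hierarchy explicit (user, cluster, server, NPU) makes it straightforward for a dashboard to aggregate at any level, for example counting NPUs per cluster with `npu_count()`.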
 

Achievements

  • Market Innovation: Developed the first-ever Prometheus-compatible exporter, CLI tool, and unified monitoring dashboard in Korea’s emerging NPU and AI-Chip market.
  • Efficiency Gains: Reduced the project timeline by one-third through optimized collaboration with the Vietnamese branch.
  • Cloud Inference Demo Service: Delivered a Kubernetes cluster-based NPU inference demo platform, showcasing real-time AI capabilities to stakeholders.
 
 

Key Learnings and Insights

This project provided deep technical exposure to NPU monitoring, AI inference infrastructure, and multi-vendor AI chip integration, reinforcing my expertise in observability, system architecture, and cloud-based inference services.
  1. Building a Unified Monitoring Platform for Various AI Chips
      • Designed a multi-brand NPU monitoring system, gaining a comprehensive understanding of how different AI chips handle and export performance data.
      • Learned vendor-specific logging, telemetry, and data extraction mechanisms, allowing seamless integration into a unified dashboard.
  2. Understanding Prometheus and Time-Series Data Collection
      • Developed custom Prometheus exporters for various NPU chips, ensuring real-time performance tracking.
      • Gained a deeper understanding of Prometheus’s data collection architecture and applied similar principles to build custom monitoring tools.
      • Integrated InfluxDB for historical data analysis, enabling time-series insights into NPU utilization and inference workloads.
  3. Enhancing Expertise in Kubernetes and Cloud-Based AI Inference
      • Designed Kubernetes-based inference APIs, improving AI workload efficiency by leveraging Istio and service mesh concepts.
      • Implemented user-specific MinIO storage solutions, allowing efficient management of inference results and AI model artifacts.
      • Built a Streamlit-based UI, enhancing user interaction by enabling real-time video-based AI inference testing.
  4. Scaling AI Infrastructure and Tooling Capabilities
      • Gained proficiency in developing exporters, monitoring tools, and cloud-based inference services.
      • Strengthened DevOps and MLOps skills, ensuring scalable and high-performance AI infrastructure for real-world applications.
      • Successfully managed cross-functional teams, collaborating efficiently with Vietnamese developers and designers to reduce project timelines by one-third.
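
The multi-brand integration described in the list above comes down to translating each vendor's telemetry into one common record that the dashboard and exporters can consume. The following is a minimal sketch of that idea under stated assumptions: the two vendors, their output formats (JSON with fractional utilization, and CSV with percentages), and all names are invented for the example and do not reflect any real NPU vendor's tooling.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class NpuReading:
    """Vendor-neutral record shared by every parser."""
    vendor: str
    device: str
    utilization_percent: float
    temperature_celsius: float


def parse_vendor_a(raw: str) -> List[NpuReading]:
    """Hypothetical vendor A: JSON list, utilization as a 0..1 fraction."""
    return [
        NpuReading("vendor-a", str(d["id"]), d["util"] * 100.0, float(d["temp_c"]))
        for d in json.loads(raw)
    ]


def parse_vendor_b(raw: str) -> List[NpuReading]:
    """Hypothetical vendor B: CSV lines of 'device,util_percent,temp_c'."""
    readings = []
    for line in raw.strip().splitlines():
        device, util, temp = line.split(",")
        readings.append(NpuReading("vendor-b", device, float(util), float(temp)))
    return readings


# Registry mapping a vendor key to its parser; adding a vendor means adding one entry.
PARSERS: Dict[str, Callable[[str], List[NpuReading]]] = {
    "vendor-a": parse_vendor_a,
    "vendor-b": parse_vendor_b,
}


def collect(raw_by_vendor: Dict[str, str]) -> List[NpuReading]:
    """Normalize raw telemetry from several vendors into one list."""
    readings: List[NpuReading] = []
    for vendor, raw in raw_by_vendor.items():
        readings.extend(PARSERS[vendor](raw))
    return readings
```

With this shape, unit differences (fractions versus percentages) are resolved once inside each parser, so everything downstream only ever sees `NpuReading` values.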
This project significantly expanded my expertise in AI infrastructure monitoring, multi-chip compatibility, and cloud-based AI inference, setting a strong foundation for scalable and efficient AI system management.