kade.im
AI Platform (Solution) Development and Operation (name : CAP)

Project Years: 2021-2024
Tags: AI, MLOps, Cloud, Infra, Distributed System, BigData
Skills: Kubeflow, Python, PyTorch, Golang, MinIO, Harbor, K8s
doc.gocap.kr ← I wrote this documentation (Korean only)

Project Overview

Managed and developed an in-house Kubernetes-based solution for machine learning model development, retraining, testing, and deployment.

Key Responsibilities

  • Developed and migrated the platform's main features.
  • Wrote YAML scripts and shell scripts for platform installation.
  • Developed user and project-specific sharing functionalities using Kubernetes RBAC.
  • Resolved platform errors and improved reliability through chaos engineering and issue tracking.
  • Created tailored JupyterLab images for machine learning development.
  • Managed container images and Helm charts via a dedicated Harbor repository.
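
As an illustration of the RBAC-based sharing mentioned above, here is a minimal sketch of how per-project access might be expressed. It is not the CAP implementation: the function name `project_share_manifests`, the role names, the resource list, and the assumption that each project maps to one Kubernetes namespace are all hypothetical. Only the `rbac.authorization.k8s.io/v1` `Role` and `RoleBinding` fields are standard Kubernetes.

```python
import json


def project_share_manifests(project, user, read_only=False):
    """Build a Role and RoleBinding that share one project namespace with one user.

    Assumptions (hypothetical, not from the CAP platform): a project is a
    namespace, and `user` is a simple name that is valid in object names.
    """
    verbs = ["get", "list", "watch"]
    if not read_only:
        verbs += ["create", "update", "patch", "delete"]
    role_name = "project-viewer" if read_only else "project-editor"

    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": project},
        "rules": [
            {
                # "" is the core API group (pods, services, PVCs).
                "apiGroups": ["", "apps"],
                "resources": [
                    "pods",
                    "services",
                    "persistentvolumeclaims",
                    "deployments",
                ],
                "verbs": verbs,
            }
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{user}", "namespace": project},
        "subjects": [
            {
                "kind": "User",
                "name": user,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        # roleRef must point at the Role defined above, in the same namespace.
        "roleRef": {
            "kind": "Role",
            "name": role_name,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return [role, binding]


if __name__ == "__main__":
    # Hypothetical project and user names, printed as JSON manifests.
    for manifest in project_share_manifests("vision-team", "alice", read_only=True):
        print(json.dumps(manifest, indent=2))
```

Because the binding is namespaced, access granted this way stays within the shared project and does not extend to the rest of the cluster. The printed JSON could be saved to a file and applied with `kubectl apply -f`.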

Achievements

  • Centralized the development environment and improved GPU resource utilization.
  • Streamlined project sharing and provided diverse services within a unified domain and cluster.
 

Key Learnings and Insights

This project was a pivotal milestone in my career, as it allowed me to build an AI platform from the ground up, integrate MLOps and DevOps principles, and enhance my troubleshooting and problem-solving abilities.
  1. Building and Sustaining a Core AI Development Platform
      • Started development immediately upon joining the company; the platform remains widely used today.
      • The platform played a key role in company growth and revenue, proving its value as a core product.
      • It was used for internal AI model training, enterprise solutions, and education, demonstrating its versatility.
  2. Mastering MLOps, DevOps, and Open-Source Integration
      • Designed workflows for model development, training, and deployment that combine MLOps and DevOps practices.
      • Learned how to integrate and extend open-source tools such as Kubernetes, JupyterLab, Helm, and Harbor for enterprise AI applications.
      • Learned best practices for RBAC (Role-Based Access Control) to enable secure, multi-user collaboration.
  3. Enhancing Troubleshooting and Platform Reliability
      • Resolved critical platform issues using chaos engineering and real-time monitoring, improving system reliability.
      • Continuously optimized the containerized environments to keep GPU utilization efficient and downtime minimal.
      • Developed comprehensive user guides and hands-on project examples, making the platform more accessible to AI engineers and data scientists.
  4. Satisfaction from Seeing Users Benefit
      • Personally wrote detailed documentation and training materials, enabling smooth onboarding for new users.
      • Found great fulfillment in seeing AI engineers and data scientists adopt the platform and use it successfully.
      • The combination of technical work, user-centric design, and hands-on support made this one of the most rewarding projects of my career.
This experience strengthened my expertise in AI infrastructure, cloud-native platforms, and operational automation, while also reinforcing the importance of intuitive documentation and user engagement in enterprise software development.