kade.im
AI Platform (Solution) Development and Operation (name: CAP)

Project Years
2021-2024
Tags
AI
MLOps
Cloud
Infra
Distributed System
BigData
Skills
Kubeflow
Python
PyTorch
Golang
Minio
Harbor
K8s
doc.gocap.kr ← I wrote this documentation (available in Korean only)

Project Overview

Managed and developed an in-house Kubernetes-based solution for machine learning model development, retraining, testing, and deployment.

Key Responsibilities

  • Wrote YAML manifests and shell scripts for platform installation.
  • Developed per-user and per-project sharing features using Kubernetes RBAC.
  • Resolved platform errors and improved reliability through chaos engineering and issue tracking.
  • Created tailored JupyterLab images for machine learning development.
  • Managed container images and Helm charts via a dedicated Harbor repository.
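
As a rough illustration of the RBAC-based sharing mentioned above, here is a minimal sketch that builds a RoleBinding manifest granting one user access to one project namespace. It is not CAP's actual implementation. It assumes each project maps to its own namespace, and the function name `build_share_binding`, the `share-...` naming scheme, and the access levels "read" and "write" are hypothetical. Only the built-in Kubernetes ClusterRoles `view` and `edit` are taken as given.

```python
import json

# Mapping access levels to the built-in "view"/"edit" ClusterRoles is an assumption.
ACCESS_TO_CLUSTER_ROLE = {"read": "view", "write": "edit"}


def build_share_binding(project_namespace: str, user: str, access: str = "read") -> dict:
    """Return a RoleBinding manifest granting `user` access to one project namespace."""
    if access not in ACCESS_TO_CLUSTER_ROLE:
        raise ValueError(f"unknown access level: {access!r}")
    role = ACCESS_TO_CLUSTER_ROLE[access]
    # Rough sanitization so the user identifier can be part of an object name.
    safe_user = "".join(c if c.isalnum() else "-" for c in user.lower()).strip("-")
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            "name": f"share-{safe_user}-{role}",  # hypothetical naming scheme
            "namespace": project_namespace,       # binding is scoped to this namespace only
        },
        "subjects": [
            {"kind": "User", "name": user, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": role,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


if __name__ == "__main__":
    manifest = build_share_binding("project-a", "alice@example.com", "write")
    print(json.dumps(manifest, indent=2))
```

Because a RoleBinding is namespaced, binding a ClusterRole this way limits the grant to the shared project, and revoking access amounts to deleting that single object.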

Achievements

  • Centralized the development environment and improved GPU resource utilization.
  • Streamlined project sharing and provided diverse services under a single domain and cluster.