doc.gocap.kr ← I wrote this documentation (Korean only)

Project Overview
Managed and developed an in-house Kubernetes-based solution for machine learning model development, retraining, testing, and deployment.
Key Responsibilities
- Developed and migrated the platform's main features.
- Wrote YAML scripts and shell scripts for platform installation.
- Developed user and project-specific sharing functionalities using Kubernetes RBAC.
- Resolved platform errors and improved reliability through chaos engineering and issue tracking.
- Created tailored JupyterLab images for machine learning development.
- Managed container images and Helm charts via a dedicated Harbor repository.
Achievements
- Centralized the development environment and improved GPU resource utilization.
- Streamlined project sharing and provided diverse services within a unified domain and cluster.
Key Learnings and Insights
This project was a pivotal milestone in my career, as it allowed me to build an AI platform from the ground up, integrate MLOps and DevOps principles, and enhance my troubleshooting and problem-solving abilities.
- Building and Sustaining a Core AI Development Platform
  - Started development immediately upon joining the company, and the platform remains widely used today.
  - Played a crucial role in driving company growth and revenue, proving its value as a key product.
  - Used the platform for internal AI model training, enterprise solutions, and educational purposes, demonstrating its versatility.
- Mastering MLOps, DevOps, and Open-Source Integration
  - Designed seamless workflows for model development, training, and deployment, combining MLOps and DevOps methodologies.
  - Gained valuable insights into how to integrate and extend open-source tools like Kubernetes, JupyterLab, Helm, and Harbor for enterprise AI applications.
  - Learned best practices for RBAC (Role-Based Access Control) to enable secure, multi-user collaboration.
- Enhancing Troubleshooting and Platform Reliability
  - Resolved critical platform issues using chaos engineering and real-time monitoring, improving system reliability.
  - Continuously optimized containerized environments, ensuring efficient GPU resource utilization and minimal downtime.
  - Developed comprehensive user guides and hands-on project examples, making the platform more accessible to AI engineers and data scientists.
- Satisfaction from Seeing Users Benefit
  - Personally wrote detailed documentation and training materials, enabling smooth onboarding for new users.
  - Found great fulfillment in seeing AI engineers and data scientists adopt and successfully utilize the platform.
  - The combination of technical expertise, user-centric design, and hands-on support made this one of the most rewarding projects of my career.
This experience strengthened my expertise in AI infrastructure, cloud-native platforms, and operational automation, while also reinforcing the importance of intuitive documentation and user engagement in enterprise software development.