Unstructured Data Analysis System for University

Project Years
2020-Present
Tags
MLOPS
Distributed System
BigData
DataAnalysis
Operation
Infra
Skills
CookieCutter
Docker
DVC
gitlab-ci
K8s
Kubeflow
Python
PyTorch
Tensorboard
Tensorflow
Scikit-learn
Harbor
[Image: API pods and logs from Kubernetes]
 

Project Overview

Implemented a big data analysis platform (Kubeflow) for statistical and AI services within a closed network. I joined this project at its start and have been operating it for more than three years.
  • Developed a course recommendation model for current students by analyzing data from graduates and enrolled students.
  • Built a predictive model to analyze the probability of student attrition (e.g., withdrawal, expulsion) for currently enrolled students.
  • Analyzed system usage logs to calculate the most frequently used programs in real-time, daily, and weekly intervals.
  • Evaluated course registration logs to compute competition rates for popular courses during enrollment periods.
  • Created a model to provide popular book lists by department and grade using library data.
All models were implemented using linear regression techniques (e.g., Scikit-learn) instead of deep learning.
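As a rough illustration of that linear-model approach, here is a minimal sketch of the attrition model in scikit-learn. The feature names, the CSV file, and the choice of LogisticRegression (scikit-learn's linear classifier, used here because the target is a probability) are assumptions for this sketch, not the exact production code.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical columns; the real feature set came from the university database.
    df = pd.read_csv("students.csv")
    X = df[["gpa", "credits_taken", "absence_count"]]
    y = df["left_school"]  # 1 = withdrawal/expulsion, 0 = still enrolled

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Probability of attrition for each student in the test split.
    attrition_prob = model.predict_proba(X_test)[:, 1]
    print("accuracy:", model.score(X_test, y_test))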
 

Key Responsibilities

  • Built CI/CD environments for model development and deployment on a closed network with isolated clusters, using DVC, MinIO, GitLab Community Edition, Harbor Registry, and Kubeflow.
  • Developed statistical models using Pandas and NumPy.
  • Created APIs for AI and statistical models using Python Flask (see the sketch after this list).
  • Load-tested the APIs with Locust and spread requests round-robin across multiple Deployments (a Locust sketch also follows this list).
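A minimal sketch of how one of the model APIs could look in Flask; the route, model file name, and input fields are assumptions for illustration.

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical artifact name; the real models were versioned with DVC.
    with open("attrition_model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict/attrition", methods=["POST"])
    def predict_attrition():
        payload = request.get_json()
        features = [[payload["gpa"], payload["credits_taken"], payload["absence_count"]]]
        prob = float(model.predict_proba(features)[0][1])
        return jsonify({"attrition_probability": prob})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)

And a minimal Locust sketch for the overload test, reusing the same hypothetical endpoint and payload:

    from locust import HttpUser, between, task

    class ApiUser(HttpUser):
        wait_time = between(0.5, 2.0)

        @task
        def predict_attrition(self):
            self.client.post(
                "/predict/attrition",
                json={"gpa": 3.1, "credits_taken": 15, "absence_count": 2},
            )

Pointing Locust at the Kubernetes Service address exercises the round-robin spread across the replicas behind it.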

Achievements

  • Automated the model service redeployment process, reducing time from 1 hour to 10 minutes.
  • Enhanced project stability and security by segregating dev, stage, and production environments.
 

Details

Model

  • Extracted data in CSV and pkl format using Kubeflow Notebooks and served it through an API (a short export sketch follows this list).
  • Limited use of Kubeflow features, relying primarily on Notebooks.
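A short sketch of that export step, with hypothetical file and column names: the notebook trains a model, then writes the CSV and pkl artifacts that the API later loads.

    import pickle

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file and column names; the real data came from the university database.
    df = pd.read_csv("course_history.csv")
    X, y = df[["year", "grade_level"]], df["credits_earned"]
    model = LinearRegression().fit(X, y)

    # Persist the artifacts so the API pods can load them without retraining.
    df.describe().to_csv("course_history_summary.csv")
    with open("course_model.pkl", "wb") as f:
        pickle.dump(model, f)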

Kubernetes (Version 1.15–1.16)

  • Consisted of 3 master nodes and 2 worker nodes.
  • Utilized Rook-Ceph as the storage class.
  • Managed API services with separate environments for loc, dev, and prod, each using distinct StatefulSets, Services, and VirtualServices.
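A small sketch of inspecting the per-environment API pods with the Kubernetes Python client; the namespace and the env label are assumptions for this sketch.

    from kubernetes import client, config

    config.load_kube_config()  # inside the cluster: config.load_incluster_config()
    v1 = client.CoreV1Api()

    # Hypothetical namespace and label; each environment ran its own StatefulSet.
    for env in ("loc", "dev", "prod"):
        pods = v1.list_namespaced_pod("api", label_selector=f"env={env}")
        for pod in pods.items:
            print(env, pod.metadata.name, pod.status.phase)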

Kubeflow (Version 1.0 or 1.1)

  • Only the Notebook functionality was actively used.
  • Workflow:
    • Model development in Notebooks → Push to GitLab → Build and deploy.
    • Queried the database directly from Notebooks to extract data.
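A sketch of that in-notebook extraction, assuming a hypothetical connection string and table name; the real query ran against the university database inside the closed network.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical DSN, query, and table name.
    engine = create_engine("postgresql://user:password@db-host:5432/university")
    df = pd.read_sql("SELECT student_id, course_id, grade FROM enrollment", engine)
    df.to_csv("enrollment.csv", index=False)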

GitLab

  • Deployed GitLab on Kubernetes for internal network use only.
  • Implemented CI/CD pipelines.
  • Automated workflow: updating model data and pushing code changes trigger the build and deployment.
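The pipeline itself is defined in .gitlab-ci.yml and is not reproduced here; as a rough Python sketch of what a deploy job under this workflow does (the image name, workload name, and namespace are assumptions):

    import subprocess

    # Hypothetical image and workload names; the steps mirror the workflow above.
    IMAGE = "harbor.internal/analytics/model-api:latest"

    subprocess.run(["dvc", "pull"], check=True)                        # fetch model artifacts from MinIO
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)  # build the API image
    subprocess.run(["docker", "push", IMAGE], check=True)              # push to the Harbor registry
    subprocess.run(
        ["kubectl", "rollout", "restart", "statefulset/model-api", "-n", "api"],
        check=True,
    )  # roll the StatefulSet so the pods pick up the new image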

MinIO

  • Adopted for DVC (Data Version Control), enabling storage and retrieval of DVC data via the internal MinIO setup.
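A sketch of how a DVC-tracked artifact could be fetched from the internal MinIO remote through DVC's Python API; the repo URL, path, and remote name are assumptions.

    import pickle

    import dvc.api

    # Hypothetical repo URL, artifact path, and remote name.
    with dvc.api.open(
        "models/course_model.pkl",
        repo="http://gitlab.internal/analytics/models.git",
        remote="minio",
        mode="rb",
    ) as f:
        model = pickle.load(f)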

Harbor

  • Operated a private Harbor registry for internal use.

PyPI Uploads

  • Managed an internal PyPI server for package uploads, as external network access was restricted.
 

I update the models easily (each semester)

  1. Extract data from the database (Python)
  2. Run notebooks to train the models
  3. Test the models with scripts (a minimal smoke-test sketch follows this list)
  4. dvc add models
  5. dvc push
  6. git add .
  7. git commit
  8. git push
  9. GitLab CI/CD updates the model services with the newest data
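The test script in step 3 can be as small as a smoke test; this sketch (with hypothetical file and feature names) only checks that the freshly trained model loads and returns a valid probability before the artifacts are committed.

    import pickle

    # Hypothetical artifact and feature names from the training notebook.
    with open("attrition_model.pkl", "rb") as f:
        model = pickle.load(f)

    sample = [[3.1, 15, 2]]  # gpa, credits_taken, absence_count
    prob = model.predict_proba(sample)[0][1]
    assert 0.0 <= prob <= 1.0, "model output is not a probability"
    print("smoke test passed:", prob)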
 