Unstructured Data Analysis System for University

Project Years
2020-Present
Tags
MLOPS
Distributed System
BigData
DataAnalysis
Operation
Infra
Skills
CookieCutter
Docker
DVC
gitlab-ci
K8s
Kubeflow
Python
PyTorch
Tensorboard
Tensorflow
Scikit-learn
Harbor
[Image: API Pods and logs from Kubernetes]
 

Project Overview

Implemented a big data analysis platform (Kubeflow) for statistical and AI services within a closed network. I joined this project at its start and have operated it for more than three years.
  • Developed a course recommendation model for current students by analyzing data from graduates and enrolled students.
  • Built a predictive model to analyze the probability of student attrition (e.g., withdrawal, expulsion) for currently enrolled students.
  • Analyzed system usage logs to rank the most frequently used programs over real-time, daily, and weekly intervals.
  • Evaluated course registration logs to compute competition rates for popular courses during enrollment periods.
  • Created a model to provide popular book lists by department and grade using library data.
All models were implemented with classical techniques such as linear regression (via Scikit-learn) rather than deep learning.
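As a rough sketch of this modeling style (the features and data below are hypothetical stand-ins, not the actual schema):

```python
# Sketch of the modeling style (hypothetical features and synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: GPA, credits earned, attendance rate.
X = rng.random((500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(0.0, 0.1, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```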
 

Main Features

The data analysis models were designed to predict student outcomes and recommend personalized academic paths using machine learning techniques. The core processes involved the following steps:
  1. Data Collection and Preprocessing
      • Extracted student data from the database using 99 different SQL queries.
      • Encrypted sensitive student information (e.g., personal details, student IDs) to ensure security.
      • Cleaned and normalized data (handled missing values, removed outliers, and standardized formats).
  2. Student Clustering for Similarity Analysis
      • Clustered current and graduated students based on academic performance, course selections, attendance records, and extracurricular activities.
      • Used clustering techniques (e.g., K-Means, DBSCAN) to identify groups with similar academic behaviors (sketched below).
  3. Graduate Outcome Analysis
      • Identified key factors influencing final student outcomes, such as employment, graduate school enrollment, withdrawal, or expulsion.
      • Analyzed how academic performance, course choices, and engagement impacted career trajectories using machine learning models.
  4. Academic and Career Prediction Model
      • Developed a model that compares a current student’s data with similar graduates to provide:
        • Course recommendations tailored to career goals.
        • Early dropout risk prediction to help institutions provide proactive support.
  5. Automated Model Training and Deployment
      • Used Kubeflow Notebooks for data extraction and model training.
      • Managed model versioning and updates with DVC (Data Version Control) and MinIO.
      • Automated model retraining and deployment every semester via GitLab CI/CD.
These models empowered students with personalized academic insights and helped institutions enhance their student support systems with data-driven decision-making.
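The clustering sketch referenced in step 2 above, with hypothetical stand-in features for the academic records:

```python
# Sketch of clustering students by academic behavior (hypothetical features).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical per-graduate features: GPA, attendance rate, credits/semester.
graduates = rng.random((1000, 3))

scaler = StandardScaler()
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
kmeans.fit(scaler.fit_transform(graduates))

# A current student is mapped to the cluster of the most similar graduates;
# those graduates' outcomes then drive recommendations and risk estimates.
current_student = rng.random((1, 3))
cluster = kmeans.predict(scaler.transform(current_student))[0]
print(f"Most similar graduate cluster: {cluster}")
```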
 

Key Responsibilities

  • Built CI/CD environments for model development and deployment in a closed network with isolated clusters, using DVC, MinIO, GitLab Community Edition, Harbor Registry, and Kubeflow.
  • Developed statistical models using Pandas and NumPy.
  • Created APIs for AI and statistical models using Python Flask (a minimal sketch follows this list).
  • Load-tested the APIs with Locust and split requests round-robin across multiple Deployments.
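A minimal sketch of what one of these Flask model APIs looks like; the route, feature layout, and model path are illustrative assumptions, not the production code:

```python
# Minimal Flask API sketch (hypothetical route, features, and model path).
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Model trained in a Kubeflow Notebook and versioned with DVC.
with open("models/dropout_risk.pkl", "rb") as f:  # hypothetical path
    model = pickle.load(f)


@app.route("/predict/dropout-risk", methods=["POST"])  # hypothetical route
def predict_dropout_risk():
    features = request.get_json()["features"]
    # Assumes a binary classifier (e.g., logistic regression) behind the API.
    risk = float(model.predict_proba([features])[0][1])
    return jsonify({"dropout_risk": risk})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```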
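And a matching Locust load-test sketch against that hypothetical endpoint:

```python
# locustfile.py — load-test sketch for the hypothetical endpoint above.
from locust import HttpUser, between, task


class ModelApiUser(HttpUser):
    wait_time = between(0.5, 2)

    @task
    def predict_dropout_risk(self):
        self.client.post(
            "/predict/dropout-risk",
            json={"features": [3.4, 0.92, 15.0]},  # hypothetical feature vector
        )
```

Run with `locust -f locustfile.py --host http://<api-service>`; the Kubernetes Service in front of the API Deployments spreads these requests round-robin across the replicas.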
 

Achievements

  • Automated model service redeployment, reducing deployment time from 1 hour to 10 minutes.
  • Improved project stability and security by segregating dev, stage, and production environments.
  • Increased student satisfaction by providing personalized course recommendations and dropout risk predictions, helping students make better academic decisions.
  • Enabled data-driven decision-making for school administrators, leading to improved student support services and academic planning.
  • Facilitated additional service contracts with the institution by demonstrating the platform’s effectiveness in enhancing student success and optimizing academic resources.
 

Details

Model

  • Extracted data in CSV and pkl format using Kubeflow Notebooks and served it through an API.
  • Limited use of Kubeflow features, relying primarily on Notebooks.

Kubernetes (Version 1.15–1.16)

  • Consisted of 3 master nodes and 2 worker nodes.
  • Utilized Rook-Ceph as the storage class.
  • Managed API services with separate environments for loc, dev, and prod, each using distinct StatefulSets, Services, and VirtualServices.

Kubeflow (Version 1.0 or 1.1)

  • Only the Notebook functionality was actively used.
  • Workflow:
    • Model development in Notebooks → Push to GitLab → Build and deploy.
    • Queried the database directly from Notebooks to extract data (sketched below).
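A sketch of that notebook-side extraction step; the connection string, query, and cleaning rules are placeholders, and the hashing shown is a stand-in for the field-level encryption actually used:

```python
# Notebook-side extraction sketch (placeholder connection string and query).
import hashlib

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@db-host/university")  # placeholder

# One of the ~99 extraction queries (placeholder SQL).
df = pd.read_sql("SELECT student_id, gpa, credits FROM enrollment", engine)

# Pseudonymize identifiers before data leaves the notebook; hashing here is
# a stand-in for the encryption used in the project.
df["student_id"] = df["student_id"].map(
    lambda s: hashlib.sha256(str(s).encode()).hexdigest()
)

# Basic cleaning: drop missing rows, clip obvious outliers (placeholder rules).
df = df.dropna()
df["gpa"] = df["gpa"].clip(0.0, 4.5)

df.to_csv("data/enrollment.csv", index=False)
df.to_pickle("data/enrollment.pkl")  # later served through the API
```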

GitLab

  • Deployed GitLab on Kubernetes for internal network use only.
  • Implemented CI/CD pipelines.
  • Automated workflow: Updating model data and pushing code changes triggers build and deployment.

MinIO

  • Adopted for DVC (Data Version Control), enabling storage and retrieval of DVC data via the internal MinIO setup.
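A sketch of how a service could pull a versioned model through DVC's Python API, with MinIO serving as the remote behind it; the repo URL, revision tag, and path are hypothetical:

```python
# Sketch: load a DVC-versioned model from the MinIO-backed remote.
# The repo URL, revision tag, and path are hypothetical.
import pickle

import dvc.api

model_bytes = dvc.api.read(
    "models/dropout_risk.pkl",  # hypothetical DVC-tracked path
    repo="http://gitlab.internal/ai/models.git",  # hypothetical internal repo
    rev="semester-2024-1",  # hypothetical tag for a semester's retrain
    mode="rb",
)
model = pickle.loads(model_bytes)
```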

Harbor

  • Operated a private Harbor registry for internal use.

PyPI Uploads

  • Managed an internal PyPI server for package uploads, as external network access was restricted.
 

Updating the models each semester is straightforward:

  1. Extract data from the database (Python)
  2. Run notebooks to train models
  3. Test models with scripts
  4. dvc add models
  5. dvc push
  6. git add .
  7. git commit
  8. git push
  9. GitLab CI/CD updates the models with the newest data
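Those steps script naturally; a sketch, with the extraction and test script names as placeholders:

```python
# Sketch: scripted semester update (extract/test script names are placeholders).
import subprocess


def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


run("python", "extract_data.py")                                   # step 1
run("jupyter", "nbconvert", "--execute", "--to", "notebook", "train.ipynb")  # 2
run("python", "test_models.py")                                    # step 3
run("dvc", "add", "models")                                        # step 4
run("dvc", "push")                                                 # step 5
run("git", "add", ".")                                             # step 6
run("git", "commit", "-m", "Update models for the new semester")   # step 7
run("git", "push")  # step 8 — triggers GitLab CI/CD (step 9)
```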
 
 

Key Learnings and Insights

This project provided significant experience in CI/CD pipeline automation, end-to-end deployment, and AI service management in a closed-network environment.
  1. Comprehensive Understanding of End-to-End CI/CD Pipelines
      • Gained hands-on experience in designing, implementing, and automating CI/CD workflows for data models and AI services.
      • Understood how to integrate GitLab, DVC, MinIO, Harbor, and Kubernetes to build a seamless development-to-deployment pipeline.
      • Automated the model training and deployment process, reducing update time from 1 hour to 10 minutes.
  2. Deep Insights into AI Service and Platform Development
      • Developed a data-driven approach to predictive analytics and student performance modeling, gaining insights into real-world AI applications.
      • Learned how to bridge the gap between development and operations, ensuring efficient deployment and maintenance of AI-powered services.
      • Understood the challenges of managing versioned AI models in an enterprise setting and how automated retraining pipelines improve long-term accuracy.
  3. Future Perspectives on AI and Platform Services
      • Recognized the evolution of AI service deployment, from manual operations to fully automated pipelines.
      • Understood how data engineering, model training, and API service management must be tightly integrated for scalable AI platforms.
      • Gained insights into best practices for managing AI/ML operations, which will be crucial for future AI-driven services.