PillTrack is an end-to-end MLOps pipeline for real-time pill identification on edge devices.
Instead of a fixed-class classifier, the system uses Deep Metric Learning to produce robust vector embeddings of pills, so new pill types can be identified from a few reference examples without full retraining. To keep inference latency low on edge hardware, Knowledge Distillation compresses a heavy teacher model (ResNet) into a lightweight student model.
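A minimal sketch of how embedding-based identification lets a new pill type be added without retraining, assuming cosine similarity over unit-length embeddings and a fixed rejection threshold. `PillGallery`, the 0.8 threshold, the pill names, and the 3-dimensional toy vectors are illustrative assumptions, not part of PillTrack; real embeddings would come from the student model and the lookup would go through the vector search index.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class PillGallery:
    """Hypothetical in-memory stand-in for the vector search index."""

    def __init__(self, threshold=0.8):
        # Assumed cut-off: a best match scoring below it is reported as unknown.
        self.threshold = threshold
        self.entries = []  # list of (label, unit-length embedding)

    def enroll(self, label, embedding):
        """Register a pill type from one reference embedding; no retraining."""
        self.entries.append((label, l2_normalize(embedding)))

    def identify(self, embedding):
        """Return (label, score) for the nearest entry, or (None, score) if too far."""
        query = l2_normalize(embedding)
        best_label, best_score = None, -1.0
        for label, ref in self.entries:
            score = sum(q * r for q, r in zip(query, ref))
            if score > best_score:
                best_label, best_score = label, score
        if best_score < self.threshold:
            return None, best_score
        return best_label, best_score


gallery = PillGallery(threshold=0.8)
gallery.enroll("aspirin", [0.9, 0.1, 0.0])      # made-up label and vector
gallery.enroll("ibuprofen", [0.0, 0.2, 0.9])    # made-up label and vector

print(gallery.identify([0.8, 0.2, 0.1]))  # close to the "aspirin" reference
print(gallery.identify([0.5, 0.5, 0.5]))  # far from both, so label is None
```

Enrolling a new pill type is a single `enroll` call, which is the property the metric-learning approach is chosen for; a softmax classifier would need a new output class and retraining.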
graph LR
subgraph Development_and_DataOps ["1. Development & DataOps"]
Dev["Developer<br/>(Git Flow)"]
GitHub["GitHub Actions<br/>(CI/CD)"]
DVC["DVC<br/>(Data Versioning)"]
S3["AWS S3<br/>(Remote Storage)"]
end
subgraph Training_Pipeline ["2. Knowledge Distillation & Training"]
direction TB
Teacher["Teacher (ResNet)"]
Student["Student (Lightweight)"]
MLflow["MLflow (Tracking)"]
Metric["Metric Learning"]
Teacher -->|Distill| Student
Student -->|Log| MLflow
Metric -->|Embed| Student
end
subgraph Edge_Deployment ["3. Edge Inference"]
Edge["Edge Device"]
VectorDB[("Vector Search")]
end
%% Interactions
GitHub -->|Trigger dvc repro| Teacher
Dev -->|Push Code| GitHub
Dev -->|Push Data| DVC
DVC -.->|Store| S3
Student -->|Deploy| Edge
Edge <-->|Search| VectorDB
%% Styling
style Development_and_DataOps fill:#f9f9f9,stroke:#333
style Training_Pipeline fill:#e1f5fe,stroke:#01579b
style Edge_Deployment fill:#fff3e0,stroke:#e65100
The pipeline follows a reproducible Data-centric AI approach using DVC for data versioning and Git for code versioning.
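To illustrate why data versioning can stay aligned with Git commits: DVC records a content hash (MD5) of each tracked file in a small pointer file that Git versions, so any change to the dataset changes the pointer. The sketch below only demonstrates that idea with Python's standard library; it is not DVC's implementation, and the file name is hypothetical.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a throwaway file standing in for the dataset archive.
with tempfile.TemporaryDirectory() as tmp:
    dataset = os.path.join(tmp, "dataset.zip")  # hypothetical file name
    with open(dataset, "wb") as handle:
        handle.write(b"hello")
    hash_v1 = file_md5(dataset)

    # Appending data simulates a dataset update.
    with open(dataset, "ab") as handle:
        handle.write(b" world")
    hash_v2 = file_md5(dataset)

print(hash_v1 != hash_v2)  # the hash changes whenever the content changes
```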
Tech Stack & Engineering Decisions
Data Version Control (DVC):
Decision: Decouples large datasets (.zip) from the codebase while maintaining version history aligned with Git commits. Ensures reproducibility of every experiment.
Deep Metric Learning:
Decision: Used instead of Softmax classification to handle the "open-set" problem (new pills appearing in the future) via Vector Search similarity.
Knowledge Distillation:
Decision: Compresses model size by transferring knowledge from a heavy Teacher network to a lightweight Student network, optimizing for Edge latency constraints.
GitHub Actions (CI/CD):
Decision: Automates the training pipeline (dvc repro) on Pull Requests to ensure model convergence before merging.
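The Knowledge Distillation decision above can be illustrated with a small loss function. This is a sketch under an assumption: it uses logit-based distillation with a softening temperature (KL divergence between teacher and student distributions, scaled by T squared). PillTrack's actual objective is not shown in this document and may instead match embeddings; the function names and the default temperature of 4.0 are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exponentiating
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must have the same number of outputs")
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


teacher = [5.0, 1.0, -2.0]           # toy logits, not real model outputs
untrained_student = [0.0, 0.0, 0.0]

print(distillation_loss(untrained_student, teacher))  # positive: student disagrees
print(distillation_loss(teacher, teacher))            # zero: student matches teacher
```

A higher temperature flattens the teacher distribution, exposing more of its relative ranking among non-target classes; the T squared factor keeps gradient magnitudes comparable across temperatures.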
Getting Started
- Environment Setup
Manage dependencies using Conda to ensure cross-platform compatibility.
conda env create -f environment.yaml
conda activate pilltrack-conda
conda env update --file environment.yaml --prune
- Configuration (Secrets)
To run the pipeline locally or in CI/CD, ensure the following environment variables are set (e.g., in .env or GitHub Secrets):
export AWS_ACCESS_KEY_ID="your_key"
export AWS_SECRET_ACCESS_KEY="your_secret"
export AWS_REGION="ap-southeast-1"
export MLFLOW_TRACKING_URI="your_mlflow_server"
Development Workflow (Git Flow + DVC)
We follow a strict Feature Branch Workflow. Direct pushes to main are prohibited to maintain pipeline integrity.
Step 1: Start a New Feature
Always create a new branch for model experiments or bug fixes.
git checkout main
git pull origin main
git checkout -b feature/improved-resnet-backbone
Step 2: Reproduce Pipeline & Train
Run the DVC pipeline to execute stages (train, convert, enroll) defined in dvc.yaml.
dvc repro
Step 3: Commit & Push Changes
Case A: Code or Hyperparameters Changed ONLY
If you only modified .py files or params.yaml:
dvc status
dvc push
git add .
git commit -m "feat: optimize distillation temperature"
git push -u origin feature/improved-resnet-backbone
Case B: Dataset Changed
If you updated the raw dataset (e.g., data/pills_dataset_resnet.zip):
dvc add --force data/pills_dataset_resnet.zip
dvc push
git add data/pills_dataset_resnet.zip.dvc .gitignore
git commit -m "chore: update dataset v2 with new pill types"
git push
CI/CD Pipeline
On every git push, the CI pipeline executes:
- DVC Pull: Fetches data from AWS S3.
- Reproduction: Runs dvc repro to validate the training pipeline.
- Reporting: Pushes metrics to MLflow and comments results on the PR.
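The ordering of these steps can be sketched as a small runner that stops at the first failing command, so reporting only happens after a successful reproduction. This is an illustration, not the project's workflow file: `run_ci`, `CI_COMMANDS`, and the `report` callback are hypothetical names, and the reporting step is left as a callback because this document does not show how metrics are sent to MLflow or posted to the PR.

```python
import subprocess

# Commands taken from the steps listed above.
CI_COMMANDS = [
    ["dvc", "pull"],   # fetch data from remote storage
    ["dvc", "repro"],  # re-run the pipeline stages
]


def run_ci(run_command=None, report=None):
    """Run each command in order; return the first non-zero exit code, else 0."""
    if run_command is None:
        # Default: execute the command for real and return its exit code.
        run_command = lambda cmd: subprocess.run(cmd).returncode
    for cmd in CI_COMMANDS:
        code = run_command(cmd)
        if code != 0:
            return code  # fail fast: later steps and reporting are skipped
    if report is not None:
        report()
    return 0


# Dry run with a fake runner, so the example works without DVC installed.
executed = []


def fake_runner(cmd):
    executed.append(" ".join(cmd))
    return 0


status = run_ci(run_command=fake_runner, report=lambda: executed.append("report"))
print(status, executed)
```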
Sitta Boonkaew
AI Engineer Intern @ AI SmartTech
© 2025 AI SmartTech. All Rights Reserved.