A data preprocessing pipeline for TCAS admission data. This project leverages an LLM (Typhoon AI) for advanced, fine-grained text filtering and comparison against traditional Regex methods. Features a live Flask dashboard for displaying LLM-processed insights

sitta07/LLM-Regex-DataPrep

TCAS Dashboard

Data-driven analysis of TCAS Engineering Programs using Flask, Pandas, and Typhoon LLM

TCAS Dashboard is a web application for analyzing and comparing per-semester tuition costs of Computer Engineering programs across universities in Thailand. The project combines Data Engineering, NLP, and LLM-assisted information extraction into a single, end-to-end pipeline.

This repository is designed with a production-oriented mindset, focusing on:

  • Clear data pipeline separation
  • Real-world Thai unstructured text challenges
  • Explicit comparison between rule-based (Regex) and LLM-based extraction

The raw data is collected from the MyTCAS API and processed using Typhoon AI (LLM) before being visualized through a clean and interactive Flask Dashboard.


Key Features

  • Automated Data Collection from MyTCAS API
  • Text Cleaning & Normalization (Regex vs LLM)
  • LLM-assisted Information Extraction using Typhoon AI
  • Interactive Dashboard with tables, charts, and cost rankings
  • Experiment-ready Architecture that can be extended to other engineering majors

Dashboard Preview

Example visualizations: tables and charts ranked by per-semester tuition cost

[Images: Dashboard Preview 1, Dashboard Preview 2]
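
The ranking behind these views can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's actual code (the project itself uses Pandas); the field names `program` and `cost_per_semester` and the sample rows are hypothetical.

```python
from typing import Dict, List


def rank_by_cost(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return programs sorted from cheapest to most expensive.

    Rows whose cost could not be extracted (None) are left out of the ranking.
    """
    priced = [r for r in rows if r.get("cost_per_semester") is not None]
    return sorted(priced, key=lambda r: r["cost_per_semester"])


# Hypothetical sample data, not taken from the real dataset.
sample = [
    {"program": "University A", "cost_per_semester": 30000},
    {"program": "University B", "cost_per_semester": None},
    {"program": "University C", "cost_per_semester": 18000},
]

for rank, row in enumerate(rank_by_cost(sample), start=1):
    print(rank, row["program"], row["cost_per_semester"])
```

Dropping rows without a cost keeps failed extractions from showing up as the "cheapest" programs.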


Regex vs LLM: Design Rationale

This repository intentionally implements two different approaches for extracting tuition cost information from Thai text.

| Approach | Description | Limitations |
| --- | --- | --- |
| Regex-based | Rule-based pattern matching | Brittle, hard to scale, sensitive to text variations |
| LLM-based (Typhoon) | Context-aware extraction using an LLM | Requires API usage and incurs cost |

👉 Only the LLM-based results are used in the production dashboard, as they provide significantly better robustness and coverage for real-world data.
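
To make the brittleness of the rule-based approach concrete, here is a minimal sketch of a regex extractor. The pattern and the sample sentences are assumptions written for illustration; they are not taken from `scrap_regex.ipynb`.

```python
import re
from typing import Optional

# Hypothetical pattern: an amount, then "บาท" (baht), then a per-semester marker
# such as "/ภาคการศึกษา", "ต่อภาคเรียน", or "/เทอม".
PER_SEMESTER = re.compile(
    r"(\d[\d,]*)\s*บาท\s*(?:/|ต่อ)\s*(?:ภาคการศึกษา|ภาคเรียน|เทอม)"
)


def extract_per_semester_cost(text: str) -> Optional[int]:
    """Return the per-semester fee in baht, or None if the pattern does not match."""
    match = PER_SEMESTER.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Matches the phrasing the pattern was written for.
print(extract_per_semester_cost("ค่าเล่าเรียน 25,000 บาท/ภาคการศึกษา"))  # 25000
# Same meaning, different word order ("per semester: 25,000 baht"): no match.
print(extract_per_semester_cost("ภาคการศึกษาละ 25,000 บาท"))  # None
# Whole-program fee: correctly not reported as a per-semester cost.
print(extract_per_semester_cost("ตลอดหลักสูตร 200,000 บาท"))  # None
```

Each new phrasing needs another pattern or alternation, which is the scaling problem the table describes.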


Project Structure

TCAS_dashboard/
│
├── app.py                 # Flask application for rendering the dashboard
├── scraping_typhoon.py    # Data pipeline: fetch → clean → extract using Typhoon LLM
│
├── data/                  # Cleaned datasets (.csv / .xlsx)
│   ├── regex_cleaned/
│   └── llm_cleaned/
│
├── scripts/               # Notebooks and scripts for scraping and preprocessing
│
├── templates/             # HTML templates and dashboard assets
│   ├── dashboard.html
│   └── *.jpeg
│
├── experimental/          # Experiments extending the same logic to other engineering fields
│                           # (e.g., Electrical, Civil Engineering)
│
├── scrap_regex.ipynb      # Baseline: Regex-only extraction (no LLM)
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variable template
├── .gitignore
└── README.md

Installation

git clone https://github.com/xooooiz7/TCAS_dashboard.git
cd TCAS_dashboard

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Usage

1️⃣ Run the Dashboard

python app.py

Open your browser at: http://127.0.0.1:5000


2️⃣ Baseline: Regex-only Extraction

To understand the limitations of rule-based text extraction:

jupyter notebook scrap_regex.ipynb

3️⃣ LLM-based Extraction (Typhoon AI)

cp .env.example .env
# Add your API key
# TYPHOON_API_KEY=YOUR_TYPHOON_API_KEY

python scraping_typhoon.py

The cleaned output will be stored in the data/ directory and used by the dashboard.
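
The extraction step can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of `scraping_typhoon.py`: the prompt wording, the JSON reply shape, and the `complete` callable (a stand-in for whatever function sends a prompt to Typhoon using `TYPHOON_API_KEY`) are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical prompt; the doubled braces are literal braces after .format().
PROMPT_TEMPLATE = (
    "Extract the per-semester tuition fee in Thai baht from the text below. "
    'Reply with JSON only, in the form {{"cost_per_semester": <integer or null>}}.\n\n'
    "Text: {text}"
)


def extract_with_llm(text: str, complete: Callable[[str], str]) -> Optional[int]:
    """Ask the model for a structured answer and validate it.

    `complete` is an assumed stand-in for the real API call: prompt in, reply out.
    """
    reply = complete(PROMPT_TEMPLATE.format(text=text))
    try:
        value = json.loads(reply).get("cost_per_semester")
    except (json.JSONDecodeError, AttributeError):
        return None  # the reply was not the JSON object we asked for
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


def fake_complete(prompt: str) -> str:
    """Canned reply so the sketch runs without network access or an API key."""
    return '{"cost_per_semester": 25000}'


print(extract_with_llm("ภาคการศึกษาละ 25,000 บาท", fake_complete))  # 25000
```

Passing the client in as a function keeps the parsing and validation testable offline, and treats malformed replies as "unknown" instead of crashing the pipeline.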


Engineering Design Notes

  • Clear separation between data ingestion, processing, and visualization layers
  • Easily swappable data sources and extraction strategies
  • LLM usage is scoped only to tasks where rule-based methods do not scale
  • Repository structure is designed for extensibility, not just demo purposes
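
One way to picture "swappable extraction strategies" is a pipeline that accepts any function with the same signature. The sketch below is illustrative only; the names `Extractor`, `run_pipeline`, `first_number`, and the record fields are hypothetical and do not come from the repository.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

# Any extraction strategy is a function from raw fee text to an optional amount.
Extractor = Callable[[str], Optional[int]]


def first_number(text: str) -> Optional[int]:
    """Naive rule-based strategy: take the first number found in the text."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def run_pipeline(
    records: Iterable[Dict[str, str]], extract: Extractor
) -> List[Dict[str, object]]:
    """Apply the chosen strategy to every record; the pipeline does not care which."""
    return [
        {"program": r["program"], "cost_per_semester": extract(r["fee_text"])}
        for r in records
    ]


records = [{"program": "University A", "fee_text": "25,000 baht per semester"}]
print(run_pipeline(records, first_number))
# [{'program': 'University A', 'cost_per_semester': 25000}]
```

Swapping in an LLM-backed extractor then means passing a different function, with no change to ingestion or visualization code.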

Future Work

  • Add filtering by university and region
  • Compare tuition costs with curriculum quality proxies
  • Cache LLM responses to reduce API costs
  • Production deployment (Docker + Gunicorn)
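
As an illustration of the response-caching idea, here is a minimal file-based sketch. It is not implemented in the repository; `cached_complete`, the cache layout, and the `complete` callable (a stand-in for the real API call) are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_complete(prompt: str, complete: Callable[[str], str], cache_dir: Path) -> str:
    """Return a cached reply for this exact prompt, calling the API only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["reply"]
    reply = complete(prompt)
    path.write_text(
        json.dumps({"prompt": prompt, "reply": reply}, ensure_ascii=False),
        encoding="utf-8",
    )
    return reply


calls: List[str] = []


def fake_complete(prompt: str) -> str:
    """Canned reply that records how often the 'API' was actually called."""
    calls.append(prompt)
    return '{"cost_per_semester": 25000}'


with tempfile.TemporaryDirectory() as tmp:
    cached_complete("same prompt", fake_complete, Path(tmp))
    cached_complete("same prompt", fake_complete, Path(tmp))

print(len(calls))  # 1
```

Keying on a hash of the full prompt means any change to the prompt wording invalidates old entries automatically, at the cost of re-querying everything after a prompt edit.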

Disclaimer

This project is for educational and engineering experimentation purposes only. It is not an official system of TCAS or MyTCAS.


Built with ❤️ using Flask, Pandas, and Typhoon AI
