An end-to-end MLOps project that fetches YouTube comments, analyzes their sentiment, trains a model, and serves it via a live Streamlit application.
This repository contains a complete, end-to-end Machine Learning Operations (MLOps) project. The goal is to analyze the sentiment of comments from a YouTube video—in this case, Squeezie's famous "QUI EST L'IMPOSTEUR ?"—and provide an interactive web application for on-demand analysis of any YouTube video.
This project isn't just about building a model; it's about building a robust, automated, and reproducible system around it.
- **Automated Data Pipeline:** Fetches, preprocesses, and annotates data automatically using DVC.
- **CI/CD with GitHub Actions:**
  - **Continuous Integration (CI):** Every push to the `main` branch automatically lints, tests, and validates the entire data pipeline.
  - **Continuous Delivery (CD):** Pushing a new Git tag (e.g., `v1.1`) automatically trains the model, packages it, and creates a new release on GitHub.
- **Interactive Web App:** A Streamlit application that allows anyone to analyze a YouTube video's comments using the latest trained model.
- **Version Control for Data & Models:** DVC and MLflow ensure that every component (code, data, and models) is versioned and tracked.
This project leverages a modern MLOps stack.
The project is designed with a clear separation of concerns, following MLOps best practices.
- **Development & CI Loop (Left):**
  - Code is written and pushed to GitHub.
  - The CI pipeline (`main.yml`) runs tests and linters to ensure code quality.
- **Training & CD Loop (Middle):**
  - A Git tag (e.g., `v1.0`) triggers the CD pipeline (`release.yml`).
  - The full DVC pipeline runs, from data fetching to model training.
  - The final model and its metrics are packaged and published as a GitHub Release.
- **Inference Loop (Right):**
  - The Streamlit app automatically fetches the latest release from GitHub.
  - It loads the pre-trained model.
  - The user provides a YouTube URL, and the app performs fast inference to display sentiment analysis results.
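The first step of the inference loop can be sketched as below. This is an illustrative sketch, not the app's actual code: the asset suffix `.joblib` and the helper names are assumptions, and only the release-locating logic is shown.

```python
import json
from typing import Optional
from urllib.request import urlopen

# GitHub's "latest release" endpoint for this repository.
RELEASES_URL = (
    "https://api.github.com/repos/BryanBradfo/youtube-sentiment-mlops/releases/latest"
)

def find_model_asset(release: dict, suffix: str = ".joblib") -> Optional[str]:
    """Return the download URL of the first release asset whose name ends with `suffix`."""
    for asset in release.get("assets", []):
        if asset.get("name", "").endswith(suffix):
            return asset.get("browser_download_url")
    return None

def latest_model_url() -> Optional[str]:
    """Query the GitHub API for the latest release and locate the model file."""
    with urlopen(RELEASES_URL) as resp:  # network call at app startup
        release = json.load(resp)
    return find_model_asset(release)
```

The returned URL would then be downloaded once and cached so that each user request only pays the cost of local inference.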
The core logic is orchestrated by DVC. The stages are defined in `dvc.yaml`, and running `dvc repro` executes the following steps:
1. `fetch`: Fetches thousands of comments from a specific YouTube video using the YouTube Data API.
2. `preprocess`: Cleans the text data by removing URLs and mentions, and normalizing emojis.
3. `annotate`: Uses a pre-trained Hugging Face model (`twitter-xlm-roberta-base-sentiment`) to assign an initial sentiment label (Positive, Negative, Neutral) to each comment. This serves as our ground truth.
4. `train`: Trains a `scikit-learn` pipeline, which includes a `TfidfVectorizer` and a `StackingClassifier`, on the annotated data. The experiment is logged with MLflow.
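The `train` stage can be sketched as follows. The choice of base estimators and hyperparameters here is an assumption for illustration; the actual configuration lives in the repository's training script.

```python
# Sketch of the `train` stage: TF-IDF features feeding a stacking ensemble.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_pipeline() -> Pipeline:
    # Hypothetical base estimators; the real stack may differ.
    stack = StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=2,  # small cv so the toy run below works on a dozen samples
    )
    return Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),
        ("clf", stack),
    ])

# Toy run on a handful of annotated comments (real data comes from `annotate`).
comments = [
    "love this video", "great content", "amazing editing", "so funny",
    "terrible audio", "boring and bad", "worst video ever", "awful pacing",
    "it was okay", "nothing special", "average video", "fine I guess",
]
labels = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4
model = build_pipeline().fit(comments, labels)
# In the real stage, parameters and metrics would be logged to MLflow here.
```

Stacking lets the meta-learner weigh the base models' predicted probabilities, which often beats any single classifier on noisy comment text.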
Want to run the project on your own machine? Here's how.
- Conda installed.
- A YouTube Data API v3 key from the Google Cloud Console.
1. Clone the repository:

   ```bash
   git clone https://github.com/BryanBradfo/youtube-sentiment-mlops.git
   cd youtube-sentiment-mlops
   ```

2. Set up the Conda environment: This project uses a Conda environment defined in `ci_environment.yml`.

   ```bash
   conda env create -f ci_environment.yml
   conda activate sentiment-mlops
   ```

3. Configure your API key: Create a file named `.env` in the root directory and add your API key:

   ```bash
   # .env
   YOUTUBE_API_KEY="YOUR_API_KEY_HERE"
   ```
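At runtime, the pipeline reads `YOUTUBE_API_KEY` from the environment. Projects like this typically use a library such as `python-dotenv` for that; as a stdlib-only illustration (not the project's actual loader), a minimal `.env` reader looks like:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ.

    A stdlib-only stand-in for python-dotenv: skips blanks and comments,
    strips surrounding quotes, and lets already-set variables win.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
```

Keeping the key in `.env` (which is git-ignored) rather than in code prevents it from ever being committed.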
To run the entire pipeline from start to finish, simply use the DVC command:
```bash
dvc repro
```

This will generate all the data files and log a new MLflow experiment in the `mlruns/` directory.
To see the results of your training runs, launch the MLflow UI:
```bash
mlflow ui
```

Then, open your browser to http://127.0.0.1:5000.
Contributions are welcome! If you have an idea for an improvement or find a bug, please feel free to:
- Fork the repository.
- Create a new branch (`git checkout -b feature/AmazingFeature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.