An end-to-end MLOps project that fetches YouTube comments, analyzes their sentiment, trains a model, and serves it via a live Streamlit application.
This repository contains a complete, end-to-end Machine Learning Operations (MLOps) project. The goal is to analyze the sentiment of comments from a YouTube video—in this case, Squeezie's famous "QUI EST L'IMPOSTEUR ?"—and provide an interactive web application for on-demand analysis of any YouTube video.
This project isn't just about building a model; it's about building a robust, automated, and reproducible system around it.
- **Automated Data Pipeline:** Fetches, preprocesses, and annotates data automatically using DVC.
- **CI/CD with GitHub Actions:**
  - **Continuous Integration (CI):** Every push to the `main` branch automatically lints, tests, and validates the entire data pipeline.
  - **Continuous Delivery (CD):** Pushing a new Git tag (e.g., `v1.1`) automatically trains the model, packages it, and creates a new release on GitHub.
- **Interactive Web App:** A Streamlit application that allows anyone to analyze a YouTube video's comments using the latest trained model.
- **Version Control for Data & Models:** DVC and MLflow ensure that every component (code, data, and models) is versioned and tracked.
This project leverages a modern MLOps stack.
The project is designed with a clear separation of concerns, following MLOps best practices.
- **Development & CI Loop (Left):**
  - Code is written and pushed to GitHub.
  - The CI pipeline (`main.yml`) runs tests and linters to ensure code quality.
- **Training & CD Loop (Middle):**
  - A Git tag (e.g., `v1.0`) triggers the CD pipeline (`release.yml`).
  - The full DVC pipeline runs, from data fetching to model training.
  - The final model and its metrics are packaged and published as a GitHub Release.
- **Inference Loop (Right):**
  - The Streamlit app automatically fetches the latest release from GitHub.
  - It loads the pre-trained model.
  - The user provides a YouTube URL, and the app performs fast inference to display sentiment analysis results.
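The first step of the inference loop can be sketched as below. This is an illustrative sketch, not the app's actual code: the asset suffix `.joblib` and the helper names are assumptions, and only the release-locating logic is shown.

```python
import json
from typing import Optional
from urllib.request import urlopen

# GitHub's "latest release" endpoint for this repository.
RELEASES_URL = (
    "https://api.github.com/repos/BryanBradfo/youtube-sentiment-mlops/releases/latest"
)

def find_model_asset(release: dict, suffix: str = ".joblib") -> Optional[str]:
    """Return the download URL of the first release asset whose name ends with `suffix`."""
    for asset in release.get("assets", []):
        if asset.get("name", "").endswith(suffix):
            return asset.get("browser_download_url")
    return None

def latest_model_url() -> Optional[str]:
    """Query the GitHub API for the latest release and locate the model file."""
    with urlopen(RELEASES_URL) as resp:  # network call at app startup
        release = json.load(resp)
    return find_model_asset(release)
```

The returned URL would then be downloaded once and cached so that each user request only pays the cost of local inference.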
The core logic is orchestrated by DVC. The stages are defined in `dvc.yaml`, and running `dvc repro` executes the following steps:
1. `fetch`: Fetches thousands of comments from a specific YouTube video using the YouTube Data API.
2. `preprocess`: Cleans the text data by removing URLs and mentions, and normalizing emojis.
3. `annotate`: Uses a pre-trained Hugging Face model (`twitter-xlm-roberta-base-sentiment`) to assign an initial sentiment label (Positive, Negative, Neutral) to each comment. This serves as our ground truth.
4. `train`: Trains a `scikit-learn` pipeline, which includes a `TfidfVectorizer` and a `StackingClassifier`, on the annotated data. The experiment is logged with MLflow.
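The `train` stage can be sketched as follows. The choice of base estimators and hyperparameters here is an assumption for illustration; the actual configuration lives in the repository's training script.

```python
# Sketch of the `train` stage: TF-IDF features feeding a stacking ensemble.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def build_pipeline() -> Pipeline:
    # Hypothetical base estimators; the real stack may differ.
    stack = StackingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=2,  # small cv so the toy run below works on a dozen samples
    )
    return Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),
        ("clf", stack),
    ])

# Toy run on a handful of annotated comments (real data comes from `annotate`).
comments = [
    "love this video", "great content", "amazing editing", "so funny",
    "terrible audio", "boring and bad", "worst video ever", "awful pacing",
    "it was okay", "nothing special", "average video", "fine I guess",
]
labels = ["Positive"] * 4 + ["Negative"] * 4 + ["Neutral"] * 4
model = build_pipeline().fit(comments, labels)
# In the real stage, parameters and metrics would be logged to MLflow here.
```

Stacking lets the meta-learner weigh the base models' predicted probabilities, which often beats any single classifier on noisy comment text.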
Want to run the project on your own machine? Here's how.
- Conda installed.
- A YouTube Data API v3 key from the Google Cloud Console.
1. Clone the repository:

   ```bash
   git clone https://github.com/BryanBradfo/youtube-sentiment-mlops.git
   cd youtube-sentiment-mlops
   ```

2. Set up the Conda environment: This project uses a Conda environment defined in `ci_environment.yml`.

   ```bash
   conda env create -f ci_environment.yml
   conda activate sentiment-mlops
   ```

3. Configure your API key: Create a file named `.env` in the root directory and add your API key:

   ```bash
   # .env
   YOUTUBE_API_KEY="YOUR_API_KEY_HERE"
   ```
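At runtime, the pipeline reads `YOUTUBE_API_KEY` from the environment. Projects like this typically use a library such as `python-dotenv` for that; as a stdlib-only illustration (not the project's actual loader), a minimal `.env` reader looks like:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ.

    A stdlib-only stand-in for python-dotenv: skips blanks and comments,
    strips surrounding quotes, and lets already-set variables win.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))
```

Keeping the key in `.env` (which is git-ignored) rather than in code prevents it from ever being committed.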
To run the entire pipeline from start to finish, simply use the DVC command:
```bash
dvc repro
```

This will generate all the data files and log a new MLflow experiment in the `mlruns/` directory.
To see the results of your training runs, launch the MLflow UI:
```bash
mlflow ui
```

Then, open your browser to http://127.0.0.1:5000.
Contributions are welcome! If you have an idea for an improvement or find a bug, please feel free to:
- Fork the repository.
- Create a new branch (`git checkout -b feature/AmazingFeature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.