A machine learning project to classify emails as spam or ham using the SpamAssassin Public Corpus. Built with Scikit-learn, NLTK, and custom NLP preprocessing, this project achieves 98.5% accuracy in detecting spam emails.
This project implements an NLP-based spam email classifier using logistic regression. It processes 3,000 emails (2,500 ham, 500 spam), an imbalanced 5:1 split, and applies custom preprocessing to handle HTML and multipart message formats.
- Dataset: 3,000 emails (2,500 ham, 500 spam) from SpamAssassin Public Corpus
- Preprocessing: NLTK stemming, HTML-to-text, URL/number replacement, word vectorization
- Model: Logistic Regression with Scikit-learn pipeline
- Performance:
  - Accuracy: 98.5%
  - Precision: 96.88%
  - Recall: 97.89%
- Optional Deployment: Flask API for real-time predictions
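The preprocessing steps listed above (HTML-to-text, URL/number replacement) can be illustrated with a small standard-library sketch. This is not the notebook's actual code: the function names `html_to_text` and `normalize_email_text`, the regular expressions, and the `URL` / `NUMBER` placeholder tokens are assumptions made for illustration, and NLTK stemming is left out to keep the example dependency-free.

```python
import html
import re

# Hypothetical patterns; the notebook may detect URLs differently (e.g. with urlextract).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
TAG_RE = re.compile(r"<[^>]+>")
SCRIPT_STYLE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def html_to_text(markup: str) -> str:
    """Drop script/style blocks and tags, then decode HTML entities."""
    text = SCRIPT_STYLE_RE.sub(" ", markup)
    text = TAG_RE.sub(" ", text)
    return html.unescape(text)


def normalize_email_text(body: str, is_html: bool = False) -> str:
    """Lowercase the text and replace URLs and numbers with placeholder tokens."""
    text = html_to_text(body) if is_html else body
    text = text.lower()
    text = URL_RE.sub(" URL ", text)        # placeholders stay uppercase
    text = NUMBER_RE.sub(" NUMBER ", text)
    text = re.sub(r"[^\w\s]", " ", text)    # strip remaining punctuation
    return " ".join(text.split())           # collapse whitespace
```

For example, `normalize_email_text("<p>Win $1,000 NOW at http://scam.com/x?y=1 &amp; more!</p>", is_html=True)` returns `"win NUMBER now at URL more"`. Collapsing every URL and number into one token keeps the vocabulary small, which tends to help a bag-of-words model generalize.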
- Python 3.7+
- Git
- Internet connection (for downloading dataset)
- Clone the Repository

      git clone https://github.com/[your-username]/[your-repo-name].git
      cd [your-repo-name]

- Install Dependencies

      pip install -r requirements.txt

- Download Dataset

  Run the `SpamClassifier.ipynb` notebook to automatically download:
  - `20030228_easy_ham.tar.bz2` (2,500 ham emails)
  - `20030228_spam.tar.bz2` (500 spam emails)

  Alternatively, manually download from the SpamAssassin Public Corpus and extract into `datasets/spam/`.
Launch Jupyter with `jupyter notebook`, then open `SpamClassifier.ipynb` and run all cells. This will:
- Download & preprocess the dataset
- Train the logistic regression model
- Evaluate performance (accuracy, precision, recall)
- Save the model as `spam_classifier.pkl`
- Generate a confusion matrix at `images/SpamClassifier/confusion_matrix.png`
Make sure spam_classifier.pkl is in the project root.
Start the API:

    python app.py

- Source: SpamAssassin Public Corpus
- Files Used:
  - `20030228_easy_ham.tar.bz2` (non-spam)
  - `20030228_spam.tar.bz2` (spam)
- Total Size: 3,000 emails
- Format: Includes plain text, HTML, and multipart formats
| Metric | Score |
|---|---|
| Accuracy | 98.5% |
| Precision | 96.88% |
| Recall | 97.89% |
- Confusion Matrix: `images/SpamClassifier/confusion_matrix.png`
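As a reminder of how the three metrics in the table relate to a confusion matrix, here is a short sketch with spam as the positive class. The counts in the usage note are hypothetical values chosen for illustration only; they are not the project's actual confusion matrix, and the `ConfusionCounts` class is not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary confusion matrix, with spam as the positive class."""
    true_positive: int   # spam predicted as spam
    false_positive: int  # ham predicted as spam
    false_negative: int  # spam predicted as ham
    true_negative: int   # ham predicted as ham

    @property
    def total(self) -> int:
        return (self.true_positive + self.false_positive
                + self.false_negative + self.true_negative)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all emails classified correctly."""
    return (c.true_positive + c.true_negative) / c.total if c.total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of emails flagged as spam that really are spam."""
    flagged = c.true_positive + c.false_positive
    return c.true_positive / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of actual spam emails that were caught."""
    actual_spam = c.true_positive + c.false_negative
    return c.true_positive / actual_spam if actual_spam else 0.0
```

With the made-up counts `ConfusionCounts(true_positive=90, false_positive=5, false_negative=10, true_negative=495)`, accuracy is 585/600 = 0.975, precision is 90/95 ≈ 0.947, and recall is 90/100 = 0.9. On an imbalanced dataset like this one, precision and recall are usually more informative than accuracy alone.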
- URL: `POST /predict`
- Request Payload (JSON):

      {
        "email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"
      }

- Response:

      {
        "prediction": "spam",
        "probability": 0.95
      }

Example request:

    curl -X POST http://localhost:5000/predict \
      -H "Content-Type: application/json" \
      -d '{"email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"}'

| File / Folder | Description |
|---|---|
| `SpamClassifier.ipynb` | Main Jupyter notebook for data pipeline |
| `app.py` | Flask API for real-time predictions |
| `spam_classifier.pkl` | Saved ML model (generated by notebook) |
| `requirements.txt` | Python dependencies |
| `images/SpamClassifier/` | Contains confusion matrix image |
| `datasets/spam/` | Folder for extracted dataset |
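The `POST /predict` endpoint served by `app.py` can also be called from Python. The sketch below uses only the standard library and assumes the API is running locally at `http://localhost:5000`; the helper names `build_predict_request` and `classify_email` are hypothetical and not part of the repository.

```python
import json
import urllib.request

# Assumes the Flask app is running locally on port 5000.
API_URL = "http://localhost:5000/predict"


def build_predict_request(email_text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying the raw email text."""
    payload = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_email(email_text: str, url: str = API_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response into a dict."""
    request = build_predict_request(email_text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the API to be running):
# result = classify_email("Subject: Win a free prize!\nClick here: http://scam.com")
# print(result["prediction"], result["probability"])
```

Splitting request construction from sending makes the payload easy to inspect without a running server.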
Key dependencies listed in requirements.txt:
- scikit-learn >= 1.0.1
- nltk
- urlextract
- pandas
- flask
- matplotlib
Confusion Matrix
Shows true/false positives and negatives for spam and ham classification.
Path: images/SpamClassifier/confusion_matrix.png
- Use TF-IDF or BERT for more robust vectorization
- Address class imbalance with SMOTE or class weighting
- Deploy the Flask API on Heroku or AWS for live use
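Two of the ideas above, TF-IDF vectorization and class weighting, can be tried with a small change to a Scikit-learn pipeline. The following is a minimal sketch of one possible setup, not something the project currently implements; the function name `build_weighted_tfidf_pipeline` and the chosen parameters are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_weighted_tfidf_pipeline() -> Pipeline:
    """TF-IDF features plus logistic regression with balanced class weights."""
    return Pipeline([
        # sublinear_tf dampens the effect of words repeated many times in one email.
        ("tfidf", TfidfVectorizer(sublinear_tf=True)),
        # class_weight="balanced" reweights samples inversely to class frequency,
        # so the minority spam class counts as much as ham during training.
        ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])


# Usage with a tiny, made-up imbalanced sample (1 = spam, 0 = ham):
# texts = ["meeting agenda for monday"] * 10 + ["win free prize now"] * 2
# labels = [0] * 10 + [1] * 2
# model = build_weighted_tfidf_pipeline().fit(texts, labels)
# model.predict_proba(["claim your free prize"])
```

Class weighting needs no extra dependency, whereas SMOTE would require the separate `imbalanced-learn` package.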
- Dataset: Apache SpamAssassin Public Corpus
- Libraries: Scikit-learn, NLTK, urlextract, Flask
For questions or feedback, feel free to open an issue or contact Daniels Shashkov via GitHub.