📧 Email Spam Classifier with NLP

A machine learning project that classifies emails as spam or ham (legitimate mail) using the SpamAssassin Public Corpus. Built with Scikit-learn, NLTK, and custom NLP preprocessing, it reports 98.5% accuracy at detecting spam emails.

🚀 Project Overview

This project implements an NLP-based spam email classifier using logistic regression. It processes 3,000 emails (2,500 ham, 500 spam), an imbalanced 5:1 split, and applies custom preprocessing to handle HTML and multipart formats.

🔍 Highlights

- Dataset: 3,000 emails (2,500 ham, 500 spam) from SpamAssassin Public Corpus
- Preprocessing: NLTK stemming, HTML-to-text, URL/number replacement, word vectorization
- Model: Logistic Regression with Scikit-learn pipeline
- Performance:
  - Accuracy: 98.5%
  - Precision: 96.88%
  - Recall: 97.89%
- Optional Deployment: Flask API for real-time predictions
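The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `html_to_text` and `email_to_word_counts`, the regular expressions, and the `url` / `number` placeholder tokens are all assumptions. NLTK's `PorterStemmer` is used when NLTK is installed; otherwise the words are left unstemmed.

```python
import html
import re
from collections import Counter

try:
    from nltk.stem import PorterStemmer  # optional; needs no corpus download
    _stemmer = PorterStemmer()
except ImportError:
    _stemmer = None  # fall back to unstemmed tokens

# Hypothetical patterns; the notebook may use urlextract and other rules.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def html_to_text(markup):
    """Crude HTML-to-text: drop tags, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", markup))


def email_to_word_counts(body):
    """Turn one email body into a bag of (optionally stemmed) word counts."""
    text = html_to_text(body).lower()
    text = URL_RE.sub(" url ", text)        # every link becomes one token
    text = NUMBER_RE.sub(" number ", text)  # every number becomes one token
    words = re.findall(r"[a-z]+", text)
    if _stemmer is not None:
        words = [_stemmer.stem(word) for word in words]
    return Counter(words)


counts = email_to_word_counts("<p>Win 100 dollars at http://scam.com now</p>")
print(counts)
```

Replacing URLs and numbers with a single token keeps the vocabulary small: the classifier learns that "a link is present" matters rather than memorizing individual addresses.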

🛠️ Setup Instructions

📦 Prerequisites

Python 3.7+
Git
Internet connection (for downloading dataset)

🧪 Installation

Clone the Repository

git clone https://github.com/[your-username]/[your-repo-name].git
cd [your-repo-name]

Install Dependencies

pip install -r requirements.txt

Download Dataset

Run the SpamClassifier.ipynb notebook to automatically download:

20030228_easy_ham.tar.bz2 (2,500 ham emails)
20030228_spam.tar.bz2 (500 spam emails)

Alternatively, download both archives manually from the SpamAssassin Public Corpus and extract them into datasets/spam/.
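For anyone who prefers to script the manual route, here is a hedged sketch using only the standard library. The base URL in `ROOT_URL` is an assumption about where the corpus is hosted, so check it before relying on it. The helper names `extract_archive` and `fetch_spam_data` are hypothetical and are not taken from the notebook. The last few lines exercise the extraction step offline on a throwaway archive, so nothing is downloaded unless `fetch_spam_data()` is called.

```python
import io
import tarfile
import tempfile
import urllib.request
from pathlib import Path

# Assumption: the archives are served from this base URL; verify before use.
ROOT_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = ("20030228_easy_ham.tar.bz2", "20030228_spam.tar.bz2")


def extract_archive(archive_path, dest_dir):
    """Unpack one .tar.bz2 archive and return the paths of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = [member.name for member in tar.getmembers() if member.isfile()]
        tar.extractall(path=dest)
    return [dest / name for name in names]


def fetch_spam_data(dest_dir="datasets/spam", root_url=ROOT_URL):
    """Download any missing archive into dest_dir, then extract all of them."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for filename in ARCHIVES:
        archive_path = dest / filename
        if not archive_path.exists():  # skip archives that are already present
            urllib.request.urlretrieve(root_url + filename, archive_path)
        extract_archive(archive_path, dest)


# Offline check of the extraction step on a throwaway one-file archive.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"Subject: hello\n\nJust checking in.\n"
    sample = Path(tmp) / "sample.tar.bz2"
    with tarfile.open(sample, "w:bz2") as tar:
        info = tarfile.TarInfo("easy_ham/0001")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    extracted = extract_archive(sample, Path(tmp) / "out")
    round_trip_ok = extracted[0].read_bytes() == payload
```

Calling `fetch_spam_data()` from the project root would then populate datasets/spam/, assuming the URL is reachable.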

🧠 Running the Project

▶️ Step 1: Run the Jupyter Notebook

jupyter notebook

Open SpamClassifier.ipynb and run all cells. This will:

Download & preprocess the dataset
Train the logistic regression model
Evaluate performance (accuracy, precision, recall)
Save the model as spam_classifier.pkl
Generate a confusion matrix at images/SpamClassifier/confusion_matrix.png
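The train, evaluate, and save steps above might look roughly like this in Scikit-learn. It is a sketch under stated assumptions, not the notebook's code: the toy corpus is made up, `CountVectorizer` stands in for the project's custom preprocessing, the 75/25 split is arbitrary, and the model is assumed to be saved with `pickle`. Scores on the toy data say nothing about the real results.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for the parsed emails (1 = spam, 0 = ham).
ham = ["meeting notes attached for review", "lunch tomorrow at noon",
       "please review the pull request", "see you at the conference"] * 5
spam = ["win a free prize now", "claim your free reward today",
        "cheap loans click here", "free money click now"] * 5
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)

# Stratify so both classes appear in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))

# Persist the fitted pipeline. A temp dir keeps this sketch from
# overwriting a real spam_classifier.pkl in the project root.
model_path = Path(tempfile.mkdtemp()) / "spam_classifier.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)
```

Saving the whole pipeline, rather than the classifier alone, means the vectorizer's vocabulary travels with the model, so the API can pass raw text straight to `predict`.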

🌐 Step 2: Run the Flask API (Optional)

Make sure spam_classifier.pkl is in the project root.

Start the API:

python app.py
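app.py itself is not reproduced in this README, so the following is only a guess at its general shape. `create_app`, the `KeywordModel` stub, the 0.5 threshold, and the error message are all hypothetical. The sketch also assumes the pickled model exposes `predict_proba` and that column 1 is the spam class.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build the API around any object exposing predict_proba(list_of_texts)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        email_text = payload.get("email") if isinstance(payload, dict) else None
        if not isinstance(email_text, str) or not email_text.strip():
            return jsonify({"error": "JSON body needs a non-empty 'email' string"}), 400
        # Assumption: column 1 of predict_proba is the spam class.
        spam_probability = float(model.predict_proba([email_text])[0][1])
        label = "spam" if spam_probability >= 0.5 else "ham"
        return jsonify({"prediction": label, "probability": spam_probability})

    return app


class KeywordModel:
    """Hypothetical stand-in so the sketch runs without spam_classifier.pkl."""

    def predict_proba(self, texts):
        return [[0.05, 0.95] if "free" in text.lower() else [0.9, 0.1]
                for text in texts]


app = create_app(KeywordModel())

# To serve a real model instead, something along these lines:
#   import pickle
#   with open("spam_classifier.pkl", "rb") as fh:
#       app = create_app(pickle.load(fh))
#   app.run(port=5000)
```

Passing the model into a factory function keeps the route testable with Flask's built-in test client, without needing the pickle file on disk.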

📚 Dataset Details

- Source: SpamAssassin Public Corpus
- Files Used:
  - 20030228_easy_ham.tar.bz2 (non-spam)
  - 20030228_spam.tar.bz2 (spam)
- Total Size: 3,000 emails
- Format: Includes plain text, HTML, and multipart formats
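To illustrate what "multipart" means in practice, this sketch parses a hand-written sample message with Python's standard `email` package and summarizes its MIME layout. The sample message and the helper name `get_email_structure` are illustrative assumptions; the notebook may inspect the corpus differently.

```python
from email import message_from_bytes, policy


def get_email_structure(message):
    """Describe a message's MIME layout, recursing into multipart containers."""
    if message.is_multipart():
        parts = ", ".join(get_email_structure(part) for part in message.iter_parts())
        return f"multipart({parts})"
    return message.get_content_type()


# Hand-written sample: one message offering plain-text and HTML alternatives.
RAW = (
    b"Subject: Win a free prize!\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="sep"\r\n'
    b"\r\n"
    b"--sep\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"Click here\r\n"
    b"--sep\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<p>Click here</p>\r\n"
    b"--sep--\r\n"
)

message = message_from_bytes(RAW, policy=policy.default)
structure = get_email_structure(message)
print(structure)
```

Counting these structure strings separately for ham and spam is a quick way to see how often each class relies on HTML or nested multipart bodies.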

📈 Performance Metrics

| Metric | Score |
|--------|-------|
| Accuracy | 98.5% |
| Precision | 96.88% |
| Recall | 97.89% |

Confusion Matrix: images/SpamClassifier/confusion_matrix.png
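For readers new to these metrics, the sketch below shows how accuracy, precision, and recall follow from the four cells of a confusion matrix, with spam as the positive class. The counts in `example` are made up for illustration and are not the project's results. The `baseline` line uses the dataset's real class sizes to show what a classifier that labels everything as ham would score.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Of the emails flagged as spam, how many really were spam?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real spam emails, how many were caught?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up counts, for illustration only.
example = classification_metrics(tp=45, fp=5, fn=15, tn=535)

# "Always predict ham" on the full 2,500 ham / 500 spam dataset.
baseline = classification_metrics(tp=0, fp=0, fn=500, tn=2500)

print(example)
print(baseline)
```

The baseline reaches about 83% accuracy while catching no spam at all, which is why precision and recall are reported alongside accuracy on imbalanced data like this.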

🔌 Using the Flask API

📨 Endpoint

URL: POST /predict
Request Payload (JSON):

{
  "email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"
}

Response:

{
  "prediction": "spam",
  "probability": 0.95
}

🔁 Example curl Command

curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"}'
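The same request can be sent from Python with only the standard library. This is a sketch: the helper names `build_predict_request` and `classify` are hypothetical, and the default base URL simply mirrors the curl example above. Building the request does not touch the network; only `classify` does, and it needs the API to be running.

```python
import json
import urllib.request


def build_predict_request(email_text, base_url="http://localhost:5000"):
    """Create the POST /predict request without sending it."""
    body = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(email_text, base_url="http://localhost:5000", timeout=10):
    """Send the email to the running API and return the decoded JSON reply."""
    request = build_predict_request(email_text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_predict_request(
    "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"
)
print(req.get_method(), req.full_url)
```

With the server started via `python app.py`, `classify("...")` should return a dictionary with the `prediction` and `probability` keys shown in the response example.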

📁 Repository Contents

| File / Folder | Description |
|---------------|-------------|
| `SpamClassifier.ipynb` | Main Jupyter notebook for data pipeline |
| `app.py` | Flask API for real-time predictions |
| `spam_classifier.pkl` | Saved ML model (generated by notebook) |
| `requirements.txt` | Python dependencies |
| `images/SpamClassifier/` | Contains confusion matrix image |
| `datasets/spam/` | Folder for extracted dataset |

📦 Requirements

Key dependencies listed in requirements.txt:

scikit-learn >= 1.0.1
nltk
urlextract
pandas
flask
matplotlib

📊 Visualizations

Confusion Matrix: shows true/false positives and negatives for spam and ham classification.
Path: images/SpamClassifier/confusion_matrix.png

🔮 Future Improvements

Use TF-IDF or BERT for more robust vectorization
Address class imbalance with SMOTE or class weighting
Deploy the Flask API on Heroku or AWS for live use
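As a rough idea of what class weighting would mean for this dataset, the sketch below reproduces the arithmetic behind Scikit-learn's `class_weight="balanced"` option, which weights each class by `n_samples / (n_classes * class_count)`. The helper `balanced_class_weights` is hypothetical and exists only to show the numbers for a 2,500 ham / 500 spam split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = ["ham"] * 2500 + ["spam"] * 500
weights = balanced_class_weights(labels)
print(weights)
```

Each spam email would count five times as much as each ham email in the loss. In Scikit-learn the same effect should be available by passing `class_weight="balanced"` to `LogisticRegression`, with no manual computation needed.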

🙏 Acknowledgments

Dataset: Apache SpamAssassin Public Corpus
Libraries: Scikit-learn, NLTK, urlextract, Flask

📬 Contact

For questions or feedback, feel free to open an issue or contact Daniels Shashkov via GitHub.
