DanShash/SpamClassifier

Email Spam Classifier with NLP

A machine learning project to classify emails as spam or ham using the SpamAssassin Public Corpus. Built with Scikit-learn, NLTK, and custom NLP preprocessing, this project achieves 98.5% accuracy in detecting spam emails.


Project Overview

This project implements an NLP-based spam email classifier using logistic regression. It processes 3,000 emails (2,500 ham, 500 spam), a roughly 5:1 class imbalance, and applies custom preprocessing to handle HTML and multipart formats.

๐Ÿ” Highlights

  • Dataset: 3,000 emails (2,500 ham, 500 spam) from SpamAssassin Public Corpus
  • Preprocessing: NLTK stemming, HTML-to-text, URL/number replacement, word vectorization
  • Model: Logistic Regression with Scikit-learn pipeline
  • Performance:
    • Accuracy: 98.5%
    • Precision: 96.88%
    • Recall: 97.89%
  • Optional Deployment: Flask API for real-time predictions
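
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the notebook's code: the function names and regular expressions are hypothetical, the HTML handling is deliberately crude, and the NLTK stemming step is left out to keep the example dependency-free.

```python
import html
import re
from collections import Counter

# Hypothetical patterns for illustration; the notebook may use different ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[A-Za-z]+")


def html_to_text(markup):
    """Crudely convert HTML to plain text: drop tags, then decode entities."""
    return html.unescape(TAG_RE.sub(" ", markup))


def email_to_word_counts(body):
    """Map an email body to a bag-of-words Counter (no stemming in this sketch)."""
    text = html_to_text(body)
    text = URL_RE.sub(" URL ", text)        # replace each URL with a placeholder token
    text = NUMBER_RE.sub(" NUMBER ", text)  # replace each number with a placeholder token
    return Counter(WORD_RE.findall(text.lower()))


counts = email_to_word_counts(
    "<p>Win 100 prizes! Click <a href='http://scam.com'>here</a> "
    "or visit http://scam.com now</p>"
)
print(counts)
```

Replacing URLs and numbers with placeholder tokens keeps the vocabulary small, since the classifier only needs to know that a URL or number was present, not which one.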

๐Ÿ› ๏ธ Setup Instructions

Prerequisites

  • Python 3.7+
  • Git
  • Internet connection (for downloading dataset)

Installation

  1. Clone the Repository
git clone https://github.com/[your-username]/[your-repo-name].git
cd [your-repo-name]
  2. Install Dependencies
pip install -r requirements.txt
  3. Download Dataset: run the SpamClassifier.ipynb notebook to automatically download:
  • 20030228_easy_ham.tar.bz2 (2,500 ham emails)
  • 20030228_spam.tar.bz2 (500 spam emails)

Alternatively, manually download from the SpamAssassin Public Corpus and extract into datasets/spam/.
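
For the manual route, a download-and-extract helper might look like the sketch below. The base URL is an assumption about where the corpus archives are hosted, and the function names are hypothetical; only the archive file names and the datasets/spam/ target come from this README.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: the public corpus archives are served from this base URL.
CORPUS_ROOT = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = ("20030228_easy_ham.tar.bz2", "20030228_spam.tar.bz2")


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.bz2 archive into dest_dir and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:bz2") as tar:
        tar.extractall(path=str(dest))
    return dest


def fetch_spam_data(dest_dir="datasets/spam", root=CORPUS_ROOT):
    """Download each archive (if not already present) and extract it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        archive_path = dest / name
        if not archive_path.is_file():  # skip archives that were already downloaded
            urllib.request.urlretrieve(root + name, str(archive_path))
        extract_archive(archive_path, dest)
    return dest
```

Calling `fetch_spam_data()` from the project root would populate datasets/spam/; nothing is downloaded until the function is called.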


Running the Project

Step 1: Run the Jupyter Notebook

jupyter notebook

Open SpamClassifier.ipynb and run all cells. This will:

  • Download & preprocess the dataset
  • Train the logistic regression model
  • Evaluate performance (accuracy, precision, recall)
  • Save the model as spam_classifier.pkl
  • Generate a confusion matrix at images/SpamClassifier/confusion_matrix.png
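
The train-and-save steps above could look roughly like this. It is a sketch under stated assumptions rather than the notebook's code: the six toy messages are made-up stand-ins for the SpamAssassin emails, `CountVectorizer` stands in for the custom word vectorization, and the model is pickled into a temporary directory so the example does not overwrite a real spam_classifier.pkl.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "claim your free reward",
    "free prize click here",
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch with the team tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# Vectorize word counts, then fit a logistic regression on them.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Save the fitted pipeline, then load it back the way an API process would.
model_path = Path(tempfile.mkdtemp()) / "spam_classifier.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["free prize"]), restored.predict_proba(["free prize"]))
```

Pickling the whole pipeline, rather than only the classifier, keeps the vectorizer's vocabulary together with the model weights, so raw text can be passed straight to `predict`.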

๐ŸŒ Step 2: Run the Flask API (Optional)

Make sure spam_classifier.pkl is in the project root.

Start the API:

python app.py

Dataset Details

  • Source: SpamAssassin Public Corpus

  • Files Used:

    • 20030228_easy_ham.tar.bz2 (non-spam)
    • 20030228_spam.tar.bz2 (spam)
  • Total Size: 3,000 emails

  • Format: Includes plain text, HTML, and multipart formats


Performance Metrics

| Metric | Score |
| --- | --- |
| Accuracy | 98.5% |
| Precision | 96.88% |
| Recall | 97.89% |

  • Confusion Matrix: images/SpamClassifier/confusion_matrix.png
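
As a reminder of how these three metrics follow from a confusion matrix, here is a small worked example. The counts are illustrative only and are not this project's actual confusion matrix; "positive" means spam.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,  # share of all emails classified correctly
        "precision": tp / (tp + fp),    # share of predicted spam that really is spam
        "recall": tp / (tp + fn),       # share of real spam that was caught
    }


# Illustrative counts (hypothetical, not taken from the notebook).
metrics = classification_metrics(tp=90, fp=5, fn=10, tn=495)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With imbalanced classes, accuracy alone can look high even when spam is missed, which is why precision and recall are reported alongside it.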

Using the Flask API

Endpoint

  • URL: POST /predict
  • Request Payload (JSON):
{
  "email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"
}
  • Response:
{
  "prediction": "spam",
  "probability": 0.95
}
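
A handler producing this request/response shape might be written along these lines. This is a hedged sketch, not the contents of app.py: `create_app`, the 0.5 threshold, the 400 error body, and the choice to report the spam-class probability are all assumptions, and the model is any object with a scikit-learn style `predict_proba`.

```python
from flask import Flask, jsonify, request


def create_app(model):
    """Build a Flask app around any object exposing predict_proba(list_of_str)."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        email = payload.get("email")
        if not isinstance(email, str):
            # Assumption: reject requests that lack an "email" string.
            return jsonify({"error": "JSON body must contain an 'email' string"}), 400
        # Column 1 is assumed to hold the spam-class probability.
        spam_probability = float(model.predict_proba([email])[0][1])
        label = "spam" if spam_probability >= 0.5 else "ham"
        return jsonify({"prediction": label, "probability": round(spam_probability, 2)})

    return app
```

In a script like app.py, one could load spam_classifier.pkl with `pickle.load` and call `create_app(model).run(port=5000)`.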

๐Ÿ” Example curl Command

curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"email": "Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!"}'
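
The same call can be made from Python using only the standard library. The helper names below are hypothetical, and the base URL assumes the API is running locally on port 5000 as in the curl example.

```python
import json
import urllib.request


def build_predict_request(email_text, base_url="http://localhost:5000"):
    """Build the POST /predict request without sending it."""
    body = json.dumps({"email": email_text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(email_text, base_url="http://localhost:5000"):
    """Send the request and return the decoded JSON reply (requires the API to be running)."""
    with urllib.request.urlopen(build_predict_request(email_text, base_url)) as resp:
        return json.load(resp)
```

For example, `classify("Subject: Win a free prize!\nClick here: http://scam.com to claim your reward!")` would return the parsed response dictionary once the server is up.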

๐Ÿ“ Repository Contents

| File / Folder | Description |
| --- | --- |
| SpamClassifier.ipynb | Main Jupyter notebook for data pipeline |
| app.py | Flask API for real-time predictions |
| spam_classifier.pkl | Saved ML model (generated by notebook) |
| requirements.txt | Python dependencies |
| images/SpamClassifier/ | Contains confusion matrix image |
| datasets/spam/ | Folder for extracted dataset |

Requirements

Key dependencies listed in requirements.txt:

  • scikit-learn >= 1.0.1
  • nltk
  • urlextract
  • pandas
  • flask
  • matplotlib

Visualizations

Confusion Matrix: shows true/false positives and negatives for spam and ham classification. Path: images/SpamClassifier/confusion_matrix.png


Future Improvements

  • Use TF-IDF or BERT for more robust vectorization
  • Address class imbalance with SMOTE or class weighting
  • Deploy the Flask API on Heroku or AWS for live use
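
To make the class-weighting idea concrete, the sketch below computes weights with the "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn applies for `class_weight="balanced"`. The function name is hypothetical; the counts are this project's 2,500 ham and 500 spam emails.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


weights = balanced_class_weights({"ham": 2500, "spam": 500})
print(weights)  # {'ham': 0.6, 'spam': 3.0}
```

Under this scheme each spam email would count five times as much as a ham email during training, offsetting the 5:1 imbalance.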

๐Ÿ™ Acknowledgments


Contact

For questions or feedback, feel free to open an issue or contact Daniels Shashkov via GitHub.
