A compact Python scraper that turns selected subreddits into daily CSV + human-friendly HTML reports, with optional GitHub Actions automation.

REDDIT DAILY DIGEST

A clean, modular Python project that scrapes informative subreddits, produces CSV + HTML daily reports, and (optionally) commits those reports back to the repository using GitHub Actions. Built to be reproducible, testable, and presentable, which makes it well suited for demos, portfolio projects, or lightweight research tooling.

Repository: https://github.com/Ani-404/Reddit-Daily-Digest


Why this project

This project collects posts from selected subreddits, extracts each post's title, URL, score, and content, and generates human-friendly HTML reports plus CSVs for downstream analysis. It demonstrates:

  • robust scraping via Selenium + webdriver-manager,
  • automated scheduling via GitHub Actions,
  • clear project structure and reproducible dependencies,
  • a minimal reporting layer for sharing insights.

It is intentionally small but production-minded: a good vehicle for demonstrating real engineering practices.


Features

  • Scrapes multiple subreddit URLs configured in src/config.json
  • Extracts: title, url, score, content, and source
  • Outputs: daily data/YYYY-MM-DD.csv and data/YYYY-MM-DD.html
  • Automated daily runs with GitHub Actions (.github/workflows/daily-digest.yml)
  • Simple, dependency-free HTML report generator (safe HTML escaping)
  • Pushes generated reports back to repo (via Actions)
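As a sketch of the output side, a minimal writer for the daily CSV, using only the standard library; the field names come from the list above, while `save_csv` itself is a hypothetical helper, not necessarily the project's actual function:

```python
import csv
from datetime import date
from pathlib import Path

def save_csv(posts, output_dir="data"):
    """Write posts to <output_dir>/YYYY-MM-DD.csv (field names per the README)."""
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{date.today().isoformat()}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["title", "url", "score", "content", "source"]
        )
        writer.writeheader()
        writer.writerows(posts)
    return path
```

`newline=""` is the documented way to avoid blank rows on Windows, and `encoding="utf-8"` keeps non-ASCII post titles intact.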

Sample output

(screenshot: sample-output image, included in the repository)


Repo layout

(screenshot: repo-layout image, included in the repository)


Getting started

  1. Clone the repo:
git clone https://github.com/Ani-404/Reddit-Daily-Digest.git
cd Reddit-Daily-Digest
  2. Create and activate a virtual environment:
python -m venv scraper-env
.\scraper-env\Scripts\activate      (Windows)
source scraper-env/bin/activate     (macOS/Linux)
  3. Install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
  4. Ensure Google Chrome is installed (or change the settings to use a different browser); webdriver-manager will download the matching ChromeDriver automatically.

Configuration

Example config used by main.py:

{
  "output_dir": "data",
  "sites": [
    {
      "name": "MachineLearning",
      "url": "https://old.reddit.com/r/MachineLearning/",
      "posts_to_scrape": 10
    },
    {
      "name": "DataScience",
      "url": "https://old.reddit.com/r/datascience/",
      "posts_to_scrape": 10
    }
  ]
}

Notes

  • Using old.reddit.com provides a stable, simpler DOM for scraping.
  • output_dir should match the folder your main.py expects (default: data).
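Loading and sanity-checking this config takes only a few lines. This is a sketch: `load_config` is a hypothetical helper name, assuming main.py reads src/config.json with the keys shown in the example above:

```python
import json

def load_config(path="src/config.json"):
    """Load the scraper config and fail fast on missing keys (sketch)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Validate up front so a typo fails here rather than mid-scrape.
    if not isinstance(cfg.get("sites"), list):
        raise ValueError("config must contain a 'sites' list")
    for site in cfg["sites"]:
        missing = {"name", "url", "posts_to_scrape"} - site.keys()
        if missing:
            raise ValueError(f"site entry missing keys: {missing}")
    return cfg
```

Validating eagerly keeps the scraper loop itself free of defensive checks.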

Run locally

From repo root (with venv active):

python main.py

Run with Docker (optional)

A simple Dockerfile makes the environment reproducible. Note that python:3.11-slim ships no browser, so Chromium and its driver must be installed for Selenium to work. Example:

FROM python:3.11-slim
# Selenium needs a browser: install Debian's Chromium and matching driver.
RUN apt-get update \
    && apt-get install -y --no-install-recommends chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]

Build with docker build -t reddit-daily-digest . and run with docker run --rm -v "$PWD/data:/app/data" reddit-daily-digest (the volume mount keeps the generated reports on the host).

GitHub Actions

The repo contains .github/workflows/daily-digest.yml. It:

  • runs daily at the cron schedule defined in the workflow file
  • sets up Python and installs requirements.txt
  • runs python main.py
  • commits data/ back to the repository using the built-in GITHUB_TOKEN

Manual runs are available under Actions → Daily Digest → Run workflow.
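A workflow with those steps might look roughly like the sketch below. The cron time, action versions, and commit message are illustrative; the repository's actual daily-digest.yml is authoritative:

```yaml
name: Daily Digest
on:
  schedule:
    - cron: "0 6 * * *"    # illustrative time; the repo's file sets the real schedule
  workflow_dispatch:        # enables the manual "Run workflow" button
jobs:
  digest:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # lets the built-in GITHUB_TOKEN push the reports
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git commit -m "Add daily digest" || echo "nothing to commit"
          git push
```

The `|| echo` guard keeps the job green on days when the scrape produces no changes.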


Design & implementation notes

  • Selenium + webdriver-manager: webdriver-manager removes the need to manually download ChromeDriver; Selenium controls Chrome in headless mode.
  • Old Reddit: using old.reddit.com simplifies selectors (e.g., div.thing, a.title, div.score) — much more stable for scraping.
  • Report generation: the HTML generator is dependency-free (uses Python's html module for safe escaping, simple CSS).
  • Separation of concerns: scraper.py collects data, report_generator.py renders it, main.py orchestrates and handles config + saving.
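The dependency-free report layer can be sketched with nothing but the standard library. `render_report` and the post-dict shape are hypothetical names, but the html-escaping approach matches the project's description:

```python
import html
from datetime import date

def render_report(posts, day=None):
    """Render a standalone HTML digest page (hypothetical helper name)."""
    day = day or date.today().isoformat()
    rows = "\n".join(
        "<tr><td><a href='{}'>{}</a></td><td>{}</td></tr>".format(
            html.escape(p["url"], quote=True),  # quote=True also escapes ' and "
            html.escape(p["title"]),
            int(p["score"]),
        )
        for p in posts
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>Reddit Digest {day}</title></head>"
        f"<body><h1>Reddit Digest {day}</h1>"
        "<table><tr><th>Title</th><th>Score</th></tr>"
        f"{rows}</table></body></html>"
    )
```

Escaping every title and URL through html.escape means a post titled `<script>...` renders as text instead of executing in the report page.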

Improvements & production ideas

Realistic, meaningful upgrades that would strengthen the project:

  • Use Reddit API (PRAW) — more robust and respectful than scraping.
  • Rate limiting & retries — polite scraping with exponential backoff.
  • Unit tests — tests for parsing logic and output formats (use sample HTML files).
  • Static analysis/linting — add ruff or flake8 in CI.
  • Containerization — documented Docker image for reproducible runs.
  • Data pipeline — push results to a database (SQLite / Postgres) and add visualization notebooks.
  • Dashboard — a small static dashboard (or GitHub Pages) to host daily reports.
  • Secrets management — store any API keys in GitHub Secrets (never commit them).
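The rate-limiting-and-retries idea above can be sketched in a few lines; `fetch_with_retries` is a hypothetical helper showing exponential backoff, with an injectable sleep function so the delay logic is testable without waiting:

```python
import time

def fetch_with_retries(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call a flaky zero-arg fetch, backing off exponentially between attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrapping each subreddit fetch this way keeps transient network hiccups from killing a whole daily run.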

Contributing

Contributions are welcome, and small, focused PRs are best. Suggested workflow:

  • Fork the repo
  • Create a topic branch: git checkout -b feat/add-tests
  • Run tests (if added) and make changes
  • Open a PR with an explanation and screenshots if appropriate

License

This project is released under the MIT License — see LICENSE for details.


Acknowledgements

  • Selenium and webdriver-manager for making browser automation straightforward.
  • Inspiration from many small scraping/reporting projects; assembled here to be clear, reproducible, and presentable.
