A compact Python scraper that turns selected subreddits into daily CSV + human-friendly HTML reports, with optional GitHub Actions automation.

REDDIT DAILY DIGEST

A clean, modular Python project that scrapes informative subreddits, produces CSV + HTML daily reports, and (optionally) commits those reports back to the repository using GitHub Actions. Built to be reproducible, testable, and presentable, which makes it well suited for demos, portfolio projects, or lightweight research tooling.

Repository: https://github.com/Ani-404/Reddit-Daily-Digest


Why this project

This project collects posts from selected subreddits, extracts each post's title, URL, score, and content, and generates human-friendly HTML reports plus CSVs for downstream analysis. It demonstrates:

  • robust scraping via Selenium + webdriver-manager,
  • automated scheduling via GitHub Actions,
  • clear project structure and reproducible dependencies,
  • a minimal reporting layer for sharing insights.

It is intentionally small but production-minded: a good vehicle for demonstrating real engineering practices.


Features

  • Scrapes multiple subreddit URLs configured in src/config.json
  • Extracts: title, url, score, content, and source
  • Outputs: daily data/YYYY-MM-DD.csv and data/YYYY-MM-DD.html
  • Automated daily runs with GitHub Actions (.github/workflows/daily-digest.yml)
  • Simple, dependency-free HTML report generator (safe HTML escaping)
  • Pushes generated reports back to repo (via Actions)
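As a sketch of the output side, a minimal writer for the daily CSV, using only the standard library; the field names come from the list above, while `save_csv` itself is a hypothetical helper, not necessarily the project's actual function:

```python
import csv
from datetime import date
from pathlib import Path

def save_csv(posts, output_dir="data"):
    """Write posts to <output_dir>/YYYY-MM-DD.csv (field names per the README)."""
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    path = out / f"{date.today().isoformat()}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["title", "url", "score", "content", "source"]
        )
        writer.writeheader()
        writer.writerows(posts)
    return path
```

`newline=""` is the documented way to avoid blank rows on Windows, and `encoding="utf-8"` keeps non-ASCII post titles intact.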

Sample output

(screenshot: sample-output image, included in the repository)


Repo layout

(screenshot: repo-layout image, included in the repository)


Getting started

  1. Clone the repo:
git clone https://github.com/Ani-404/Reddit-Daily-Digest.git
cd Reddit-Daily-Digest
  2. Create and activate a virtual environment:
python -m venv scraper-env
.\scraper-env\Scripts\activate      (Windows)
source scraper-env/bin/activate     (macOS/Linux)
  3. Install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
  4. Ensure Google Chrome is installed (or change the settings to use a different browser); webdriver-manager will download the matching ChromeDriver automatically.

Configuration

Example config used by main.py:

{
  "output_dir": "data",
  "sites": [
    {
      "name": "MachineLearning",
      "url": "https://old.reddit.com/r/MachineLearning/",
      "posts_to_scrape": 10
    },
    {
      "name": "DataScience",
      "url": "https://old.reddit.com/r/datascience/",
      "posts_to_scrape": 10
    }
  ]
}

Notes

  • Using old.reddit.com provides a stable, simpler DOM for scraping.
  • output_dir should match the folder your main.py expects (default: data).
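Loading and sanity-checking this config takes only a few lines. This is a sketch: `load_config` is a hypothetical helper name, assuming main.py reads src/config.json with the keys shown in the example above:

```python
import json

def load_config(path="src/config.json"):
    """Load the scraper config and fail fast on missing keys (sketch)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Validate up front so a typo fails here rather than mid-scrape.
    if not isinstance(cfg.get("sites"), list):
        raise ValueError("config must contain a 'sites' list")
    for site in cfg["sites"]:
        missing = {"name", "url", "posts_to_scrape"} - site.keys()
        if missing:
            raise ValueError(f"site entry missing keys: {missing}")
    return cfg
```

Validating eagerly keeps the scraper loop itself free of defensive checks.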

Run locally

From repo root (with venv active):

python main.py

Run with Docker (optional)

A simple Dockerfile makes the environment reproducible. Note that python:3.11-slim ships no browser, so Chromium and its driver must be installed for Selenium to work. Example:

FROM python:3.11-slim
# Selenium needs a browser: install Debian's Chromium and matching driver.
RUN apt-get update \
    && apt-get install -y --no-install-recommends chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]

Build with docker build -t reddit-daily-digest . and run with docker run --rm -v "$PWD/data:/app/data" reddit-daily-digest (the volume mount keeps the generated reports on the host).

GitHub Actions

The repo contains .github/workflows/daily-digest.yml. It:

  • runs daily at the cron schedule defined in the workflow file
  • sets up Python and installs requirements.txt
  • runs python main.py
  • commits data/ back to the repository using the built-in GITHUB_TOKEN

Manual runs are available under Actions → Daily Digest → Run workflow.
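A workflow with those steps might look roughly like the sketch below. The cron time, action versions, and commit message are illustrative; the repository's actual daily-digest.yml is authoritative:

```yaml
name: Daily Digest
on:
  schedule:
    - cron: "0 6 * * *"    # illustrative time; the repo's file sets the real schedule
  workflow_dispatch:        # enables the manual "Run workflow" button
jobs:
  digest:
    runs-on: ubuntu-latest
    permissions:
      contents: write       # lets the built-in GITHUB_TOKEN push the reports
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python main.py
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git commit -m "Add daily digest" || echo "nothing to commit"
          git push
```

The `|| echo` guard keeps the job green on days when the scrape produces no changes.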


Design & implementation notes

  • Selenium + webdriver-manager: webdriver-manager removes the need to manually download ChromeDriver; Selenium controls Chrome in headless mode.
  • Old Reddit: using old.reddit.com simplifies selectors (e.g., div.thing, a.title, div.score) — much more stable for scraping.
  • Report generation: the HTML generator is dependency-free (uses Python's html module for safe escaping, simple CSS).
  • Separation of concerns: scraper.py collects data, report_generator.py renders it, main.py orchestrates and handles config + saving.
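The dependency-free report layer can be sketched with nothing but the standard library. `render_report` and the post-dict shape are hypothetical names, but the html-escaping approach matches the project's description:

```python
import html
from datetime import date

def render_report(posts, day=None):
    """Render a standalone HTML digest page (hypothetical helper name)."""
    day = day or date.today().isoformat()
    rows = "\n".join(
        "<tr><td><a href='{}'>{}</a></td><td>{}</td></tr>".format(
            html.escape(p["url"], quote=True),  # quote=True also escapes ' and "
            html.escape(p["title"]),
            int(p["score"]),
        )
        for p in posts
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'>"
        f"<title>Reddit Digest {day}</title></head>"
        f"<body><h1>Reddit Digest {day}</h1>"
        "<table><tr><th>Title</th><th>Score</th></tr>"
        f"{rows}</table></body></html>"
    )
```

Escaping every title and URL through html.escape means a post titled `<script>...` renders as text instead of executing in the report page.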

Improvements & production ideas

Realistic, meaningful upgrades that would strengthen the project:

  • Use Reddit API (PRAW) — more robust and respectful than scraping.
  • Rate limiting & retries — polite scraping with exponential backoff.
  • Unit tests — tests for parsing logic and output formats (use sample HTML files).
  • Static analysis/linting — add ruff or flake8 in CI.
  • Containerization — documented Docker image for reproducible runs.
  • Data pipeline — push results to a database (SQLite / Postgres) and add visualization notebooks.
  • Dashboard — a small static dashboard (or GitHub Pages) to host daily reports.
  • Secrets management — store any API keys in GitHub Secrets (never commit them).
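The rate-limiting-and-retries idea above can be sketched in a few lines; `fetch_with_retries` is a hypothetical helper showing exponential backoff, with an injectable sleep function so the delay logic is testable without waiting:

```python
import time

def fetch_with_retries(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call a flaky zero-arg fetch, backing off exponentially between attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrapping each subreddit fetch this way keeps transient network hiccups from killing a whole daily run.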

Contributing

Contributions are welcome, and small, focused PRs are best. Suggested workflow:

  • Fork the repo
  • Create a topic branch: git checkout -b feat/add-tests
  • Run tests (if added) and make changes
  • Open a PR with an explanation and screenshots if appropriate

License

This project is released under the MIT License — see LICENSE for details.


Acknowledgements

  • Selenium and webdriver-manager for making browser automation straightforward.
  • Inspiration from many small scraping/reporting projects; assembled here to be clear, reproducible, and presentable.
