
🤖 Reddit Scraper


Modular Reddit data collection framework
Scrape subreddits, posts, and users into clean structured JSON.



✨ Overview

A modular Reddit scraping pipeline designed for data collection, analytics, and research workflows.

The project gathers structured data about:

  • πŸ“š Subreddits
  • πŸ“ Posts
  • πŸ‘€ Users

and exports everything as clean JSON datasets ready for:

  • databases
  • machine learning pipelines
  • analytics
  • data exploration

No manual scraping steps required.


🚀 Features

  • Modular scraper architecture
  • Structured JSON output
  • Automated scraping workflow
  • MongoDB import helpers
  • Large dataset handling utilities
  • Environment-based configuration

Collects

| Entity | Data |
| --- | --- |
| Subreddits | metadata & statistics |
| Posts | content, scores, engagement |
| Users | profile & activity info |

🧠 How It Works


run.py
β”‚
β”œβ”€β”€ subreddits.py
β”œβ”€β”€ posts.py
└── users.py
↓
JSON datasets
↓
(optional) MongoDB import

Each scraper is independent and reusable.
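
As a rough illustration of the flow in the diagram, the sketch below chains three placeholder scrapers and writes one JSON file per entity. The function names (`scrape_subreddits`, `scrape_posts`, `scrape_users`, `run_pipeline`), the record fields, and the way each stage feeds the next are assumptions made for this example, not the project's actual code.

```python
import json
from pathlib import Path


# Hypothetical stand-ins for the real scrapers. Each returns a list of dicts.
def scrape_subreddits() -> list[dict]:
    return [{"name": "python"}]


def scrape_posts(subreddits: list[dict]) -> list[dict]:
    # Assumption: posts are collected per subreddit found in the first stage.
    return [
        {"subreddit": sub["name"], "title": "example post", "author": "alice"}
        for sub in subreddits
    ]


def scrape_users(posts: list[dict]) -> list[dict]:
    # Assumption: users are derived from the authors of the scraped posts.
    authors = sorted({post["author"] for post in posts})
    return [{"name": author} for author in authors]


def run_pipeline(output_dir: Path) -> dict[str, Path]:
    """Run the three stages in order and write one JSON file per entity."""
    output_dir.mkdir(parents=True, exist_ok=True)
    subreddits = scrape_subreddits()
    posts = scrape_posts(subreddits)
    users = scrape_users(posts)

    written: dict[str, Path] = {}
    datasets = (("subreddits", subreddits), ("posts", posts), ("users", users))
    for name, records in datasets:
        path = output_dir / f"{name}.json"
        path.write_text(json.dumps(records, indent=2), encoding="utf-8")
        written[name] = path
    return written
```

Because each stage is a plain function that takes and returns lists of dicts, any one of them can be called on its own or swapped out without touching the others.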


📦 Installation

1️⃣ Clone & setup environment

git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper

python -m venv env
source env/bin/activate      # Linux / macOS
# env\Scripts\activate       # Windows

pip install -r requirements.txt

2️⃣ Configure API credentials

Edit env-sample with your credentials and settings, then rename it to .env:

username=<RedditUsername>
password=<RedditPassword>
client_id=<ClientID>
client_secret=<ClientSecret>

TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"

Create Reddit API credentials here:

👉 https://www.reddit.com/prefs/apps
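
To make the scraping settings in .env concrete, here is a minimal sketch of how such a file could be parsed into typed values. The `ScraperSettings` class and both helper functions are hypothetical names for this example. The project may well read its configuration differently, for instance through a dotenv library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperSettings:
    total_subreddits_per_topic: int
    subreddit_sort_filter: str
    posts_per_subreddit: int
    posts_sort_filter: str


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so "hot" and hot are treated the same.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_settings(values: dict[str, str]) -> ScraperSettings:
    """Convert the string values into typed settings; missing keys raise KeyError."""
    return ScraperSettings(
        total_subreddits_per_topic=int(values["TOTAL_SUBREDDITS_PER_TOPICS"]),
        subreddit_sort_filter=values["SUBREDDIT_SORT_FILTER"],
        posts_per_subreddit=int(values["POSTS_PER_SUBREDDIT"]),
        posts_sort_filter=values["POSTS_SORT_FILTER"],
    )
```

Failing loudly on a missing key, rather than falling back to a default, makes a misnamed variable in .env show up immediately instead of silently changing how much data is scraped.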


3️⃣ Run scraper

python run.py

Pipeline execution:

  1. Scrape subreddits
  2. Scrape posts
  3. Scrape users
  4. Export JSON datasets
  5. Optional dataset splitting

📊 Output Examples

The sample JSON files are large (16–25 MB). Download them instead of viewing them in the browser.

Subreddit Document

Sample: https://files.catbox.moe/r7a7um.json


Post Document

Sample: https://files.catbox.moe/5cf2xw.json


User Document

Sample: https://files.catbox.moe/yp506n.json


🗂️ Project Structure

reddit-scraper/
├── subreddits.py
├── posts.py
├── users.py
├── run.py
├── utils/
│   ├── split.py
│   └── import_data_to_mongodb.sh
└── output/

🧩 Utilities

Tool Purpose
run.py Executes full scraping pipeline
utils/split.py Splits large JSON datasets
import_data_to_mongodb.sh Bulk imports into MongoDB
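
For a sense of what splitting a large dataset involves, the following sketch breaks one JSON array file into smaller numbered files. It is not the code in utils/split.py. The function name `split_json_array`, the `_partN` naming scheme, and the assumption that each dataset is a single top-level JSON array are all illustrative.

```python
import json
from pathlib import Path


def split_json_array(source: Path, chunk_size: int) -> list[Path]:
    """Write consecutive slices of a JSON array to numbered files next to the source."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    # Assumption: the whole file fits in memory and holds one top-level array.
    records = json.loads(source.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    parts: list[Path] = []
    for index, start in enumerate(range(0, len(records), chunk_size), start=1):
        part = source.with_name(f"{source.stem}_part{index}{source.suffix}")
        chunk = records[start:start + chunk_size]
        part.write_text(json.dumps(chunk), encoding="utf-8")
        parts.append(part)
    return parts
```

Loading the full file is acceptable at the sizes mentioned in this README; a much larger dataset would call for a streaming parser instead.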

🗄️ MongoDB Import

After scraping:

./utils/import_data_to_mongodb.sh

Ensure MongoDB is running beforehand.
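
The contents of the import script are not shown here. One plausible approach, sketched in Python for illustration, is to call MongoDB's `mongoimport` tool once per JSON file. The database name, collection name, and file path below are placeholders, and the sketch assumes each file holds a single JSON array.

```python
import shutil
import subprocess
from pathlib import Path


def build_mongoimport_command(json_file: Path, database: str, collection: str) -> list[str]:
    """Build the mongoimport argument list for one JSON array file."""
    return [
        "mongoimport",
        "--db", database,
        "--collection", collection,
        "--file", str(json_file),
        "--jsonArray",  # the file is one JSON array, not one document per line
    ]


def import_file(json_file: Path, database: str, collection: str) -> None:
    """Run mongoimport against the default local MongoDB instance."""
    if shutil.which("mongoimport") is None:
        raise RuntimeError("mongoimport was not found on PATH")
    command = build_mongoimport_command(json_file, database, collection)
    # check=True raises CalledProcessError if the import fails.
    subprocess.run(command, check=True)
```

A call such as `import_file(Path("output/posts.json"), "reddit", "posts")` would then load one dataset, with all three names being placeholders.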


⚠️ Notes

  • Reddit API rate limits apply
  • Scraping speed depends on network/API limits
  • Designed for research & data workflows
  • Respect Reddit API terms of service

🀝 Contributing

Contributions, improvements, and issue reports are welcome.

Small focused PRs are preferred.


📄 License

GPL-3.0
