
A pipeline that ingests data from the GitHub API to analyze trends among repositories, built with Apache Airflow, Kafka, Spark, and Cassandra.


Github-Data-Pipeline



The pipeline is built to:

  1. Ingest Data
    Fetch repository data from the GitHub API with Apache Airflow and stream it into Apache Kafka (a sketch of this step follows the list).

  2. Process Data
    Process the stream in real time with Apache Spark Structured Streaming, and run batch analysis over the stored data with Apache Spark SQL (see the streaming sketch after this list and the batch sketch under How to Run).

  3. Store Data
    Store the processed data in Apache Cassandra for efficient querying and retrieval.

  4. Containerize the Pipeline
    Use Docker to containerize the entire pipeline.
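
To make step 1 concrete, here is a minimal sketch (not the repository's actual DAG) of an Airflow task that fetches popular repositories from the GitHub search API and publishes each one to Kafka. The topic name `github_repos`, the broker address `kafka:9092`, the hourly schedule, and the search query are all illustrative assumptions.

```python
# Hypothetical ingestion DAG: assumes the kafka-python package, a broker at
# kafka:9092, and a topic named "github_repos"; the real project may differ.
import json

import pendulum
import requests
from airflow.decorators import dag, task
from kafka import KafkaProducer


@dag(schedule="@hourly", start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def github_ingest():
    @task
    def fetch_and_stream():
        # Pull the most-starred repositories from the public search endpoint.
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": "stars:>1000", "sort": "stars", "per_page": 100},
            timeout=30,
        )
        resp.raise_for_status()

        producer = KafkaProducer(
            bootstrap_servers="kafka:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        for repo in resp.json()["items"]:
            producer.send("github_repos", repo)  # one message per repository
        producer.flush()

    fetch_and_stream()


github_ingest()
```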
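
Step 2's streaming half could then look like the following: read the same hypothetical `github_repos` topic with Spark Structured Streaming, parse a few fields out of the JSON payload, and append each micro-batch to Cassandra through the connector. The schema, the host names, and the `github.repositories` keyspace/table are assumptions, not the project's actual code.

```python
# Hypothetical streaming job; field names, hosts, and the github.repositories
# keyspace/table are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("github-stream")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

# Subset of the GitHub repository payload we care about.
schema = StructType([
    StructField("full_name", StringType()),
    StructField("language", StringType()),
    StructField("stargazers_count", LongType()),
])

repos = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "github_repos")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Append each micro-batch to Cassandra via the Spark Cassandra connector.
query = repos.writeStream.foreachBatch(
    lambda batch, _: batch.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="github", table="repositories")
    .mode("append")
    .save()
).start()
query.awaitTermination()
```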

Airflow dashboard (screenshot)

Kafka data (screenshot)

Cassandra database (screenshot)
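
The table the sketches above write to has to exist before the jobs run. A hypothetical schema matching those sketches, created with the Python cassandra-driver (keyspace and table names remain assumptions):

```python
# Hypothetical schema for the sketches above; the project's actual keyspace
# and table may differ. Requires the cassandra-driver package.
from cassandra.cluster import Cluster

cluster = Cluster(["localhost"])  # Cassandra port published by docker compose
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS github
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS github.repositories (
        full_name text PRIMARY KEY,
        language text,
        stargazers_count bigint
    )
""")
cluster.shutdown()
```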

How to Run

  1. Clone the repository
git clone https://github.com/luluna02/Github-Data-Pipeline
cd Github-Data-Pipeline
  2. Start the containers
docker compose up -d
  3. Run the Spark job
docker exec -it src-spark-master-1 spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,com.datastax.spark:spark-cassandra-connector_2.12:3.5.1 \
  --master spark://localhost:7077 \
  spark_batch.py
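
`spark_batch.py` ships with the repository; the sketch below only illustrates the kind of Spark SQL batch analysis such a job might run, reusing the hypothetical `github.repositories` table from the earlier sketches.

```python
# Illustrative batch analysis; spark_batch.py in the repository may differ.
# Assumes the hypothetical github.repositories table from the sketches above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("github-batch")
    .config("spark.cassandra.connection.host", "cassandra")
    .getOrCreate()
)

(
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="github", table="repositories")
    .load()
    .createOrReplaceTempView("repositories")
)

# Example trend query: which languages dominate among popular repositories?
spark.sql("""
    SELECT language,
           COUNT(*)              AS repo_count,
           AVG(stargazers_count) AS avg_stars
    FROM repositories
    WHERE language IS NOT NULL
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 10
""").show()
```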
