
- Ingest Data: Fetch data from the GitHub API using Apache Airflow and stream it with Apache Kafka (see the producer sketch after this list).
- Process Data: Perform real-time processing and analysis of the data using Apache Spark streaming, and batch analysis using Apache Spark SQL (see the streaming sketch after this list).
- Store Data: Store the processed data in Apache Cassandra for efficient querying and retrieval.
- Containerize the Pipeline: Use Docker to containerize the entire pipeline.
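A minimal sketch of the ingestion step, assuming a Kafka broker reachable at `kafka:9092` and a topic named `github_events` (both names are illustrative, not taken from this repo). An Airflow `PythonOperator` task fetches recent public events from the GitHub API and publishes them to Kafka with the `kafka-python` client; the DAG shipped in this repo may be structured differently.

```python
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer  # kafka-python client


def fetch_and_publish():
    """Fetch recent public GitHub events and publish them to a Kafka topic."""
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",  # assumed broker address inside the compose network
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    resp = requests.get("https://api.github.com/events", timeout=10)
    resp.raise_for_status()
    for event in resp.json():
        producer.send("github_events", event)  # assumed topic name
    producer.flush()


with DAG(
    dag_id="github_ingest",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_and_publish", python_callable=fetch_and_publish)
```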
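A minimal sketch of the processing and storage path, under the same assumptions plus a hypothetical `github.events` keyspace/table in Cassandra: Spark Structured Streaming reads the topic, parses a few fields, and appends each micro-batch to Cassandra with a `foreachBatch` write through the Spark Cassandra connector. Submit it with the same `--packages` shown in the run command below.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("github_stream")
    .config("spark.cassandra.connection.host", "cassandra")  # assumed Cassandra service name
    .getOrCreate()
)

# Keep only a few fields; the real GitHub event payload is much richer.
schema = StructType([
    StructField("id", StringType()),
    StructField("type", StringType()),
    StructField("created_at", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
    .option("subscribe", "github_events")             # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)


def write_batch(batch_df, batch_id):
    """Append each micro-batch to Cassandra via the Spark Cassandra connector."""
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="github", table="events")  # assumed keyspace/table
        .save()
    )


query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/github_events")
    .start()
)
query.awaitTermination()
```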



- Clone the repository
git clone https://github.com/luluna02/Github-Data-Pipeline
cd Github-Data-Pipeline
- Start Containers
docker compose up -d
- Run Spark Job
docker exec -it src-spark-master-1 spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,com.datastax.spark:spark-cassandra-connector_2.12:3.5.1 \
  --master spark://localhost:7077 \
  spark_batch.py
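To spot-check that rows landed in Cassandra, one option is a short script using the DataStax `cassandra-driver`, assuming the compose file publishes port 9042 to the host and that the keyspace and table are `github` and `events` (both are assumptions; adjust to the schema this project actually creates).

```python
from cassandra.cluster import Cluster

# Assumes the compose file publishes Cassandra's 9042 port to the host.
cluster = Cluster(["127.0.0.1"], port=9042)
session = cluster.connect("github")  # assumed keyspace

for row in session.execute("SELECT * FROM events LIMIT 10"):  # assumed table
    print(row)

cluster.shutdown()
```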