Excited to share my latest Data Engineering project! I built a real-time data pipeline that processes live weather data at scale 🌦️. Here's how it works:
- Pulling data from the OpenWeather API (~3000 API calls/minute).
- A Kafka producer publishes the raw data to a topic.
- A Flink consumer processes and aggregates the weather data every minute, then pushes the results to another Kafka topic.
- A second consumer stores the processed data in Postgres (via Supabase).
- Grafana sits on top of Postgres to visualize real-time insights, refreshing every 5 minutes.
- Apache Airflow orchestrates the workflow, scheduling the Flink jobs at 5-minute intervals.

This project helped me explore:
- Building scalable streaming pipelines
- Real-time data aggregation
- Orchestration with Airflow
- Visualization of live data

Tech stack highlights: Kafka, Flink, Airflow, Postgres, Supabase, Grafana

🔗 GitHub Repository: I've open-sourced the entire project here: https://lnkd.in/g2cgcAmz

Would love to hear feedback and ideas from the community to make this even more production-ready 🙌

#DataEngineering #ApacheKafka #ApacheFlink #ApacheAirflow #RealTimeData #DataPipeline #Postgres #Grafana
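For anyone curious, the ingestion step can be sketched roughly like this, assuming the OpenWeather current-weather endpoint and the kafka-python client. The topic name, broker address, city list, and field selection here are my illustrative assumptions, not necessarily what the repo uses:

```python
import json
import time

OPENWEATHER_URL = "https://api.openweathermap.org/data/2.5/weather"


def to_event(raw: dict) -> dict:
    """Flatten the parts of an OpenWeather response the pipeline needs."""
    return {
        "city": raw["name"],
        "temp_c": raw["main"]["temp"],
        "humidity": raw["main"]["humidity"],
        "ts": raw["dt"],  # epoch seconds of the observation
    }


def run(api_key: str, cities: list) -> None:
    # Third-party imports kept local so the pure helper above stays importable.
    import requests                    # pip install requests
    from kafka import KafkaProducer    # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # placeholder broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        for city in cities:
            resp = requests.get(
                OPENWEATHER_URL,
                params={"q": city, "appid": api_key, "units": "metric"},
                timeout=10,
            )
            resp.raise_for_status()
            producer.send("raw-weather", to_event(resp.json()))
        producer.flush()
        time.sleep(1)  # pace the loop; the real pipeline sustains ~3000 calls/min
```

The loop-plus-flush pattern keeps latency low while still batching sends inside each pass over the city list.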
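The per-minute aggregation the Flink job performs can be illustrated in plain Python: the real job would use Flink's one-minute tumbling windows keyed by city, so this is only the windowing logic, with hypothetical field names:

```python
from collections import defaultdict


def minute_bucket(ts: int) -> int:
    """Floor an epoch-seconds timestamp to the start of its minute."""
    return ts - ts % 60


def aggregate(events: list) -> list:
    """Average temperature per (city, minute) window, mimicking a
    1-minute tumbling window keyed by city."""
    sums = defaultdict(lambda: [0.0, 0])  # (city, minute) -> [sum, count]
    for e in events:
        key = (e["city"], minute_bucket(e["ts"]))
        sums[key][0] += e["temp_c"]
        sums[key][1] += 1
    return [
        {"city": city, "window_start": minute, "avg_temp_c": s / n}
        for (city, minute), (s, n) in sorted(sums.items())
    ]
```

In Flink proper this maps to `key_by` on the city field followed by a `TumblingEventTimeWindows.of(Time.minutes(1))` window and an averaging aggregate.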
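The sink consumer that lands aggregated records in Postgres could look like this, assuming kafka-python and psycopg2; the table name, columns, topic, and Supabase connection string are placeholders of mine:

```python
import json

# Hypothetical table: weather_agg(city text, window_start timestamptz, avg_temp_c double precision)
INSERT_SQL = (
    "INSERT INTO weather_agg (city, window_start, avg_temp_c) "
    "VALUES (%s, to_timestamp(%s), %s)"
)


def to_row(message_value: bytes) -> tuple:
    """Turn one aggregated Kafka message into INSERT parameters."""
    rec = json.loads(message_value)
    return (rec["city"], rec["window_start"], rec["avg_temp_c"])


def run() -> None:
    # Third-party imports kept local so to_row stays importable on its own.
    from kafka import KafkaConsumer  # pip install kafka-python
    import psycopg2                  # pip install psycopg2-binary

    consumer = KafkaConsumer("weather-agg", bootstrap_servers="localhost:9092")
    conn = psycopg2.connect("postgresql://user:pass@host:5432/postgres")  # Supabase DSN goes here
    with conn.cursor() as cur:
        for msg in consumer:
            cur.execute(INSERT_SQL, to_row(msg.value))
            conn.commit()  # commit per record; batch commits would cut round-trips
```

Parameterized queries keep the insert safe regardless of what city names come through the topic.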
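And the 5-minute orchestration is essentially a small Airflow DAG. This is an orchestration config sketch, not the repo's actual DAG; the DAG id, start date, and Flink submit command are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: re-submits the Flink job every 5 minutes.
with DAG(
    dag_id="weather_flink_job",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
) as dag:
    submit = BashOperator(
        task_id="submit_flink_job",
        bash_command="flink run -d /opt/jobs/weather_agg.jar",  # path is a placeholder
    )
```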
Fantastic project! Love how you’ve combined real-time ingestion, processing, and visualization into a full end-to-end pipeline. The integration of Flink with Kafka and Airflow for orchestration is particularly impressive.
Data Engineering Intern @MPL | B.E. Computer Science Student | Skilled in Python, SQL, and Data Pipelines
Here's a sneak peek of the Grafana dashboard I built on top of this pipeline, visualizing real-time weather patterns across the globe. The data refreshes every 5 minutes, powered by Postgres + Supabase + Grafana.