Why Spark jobs and zero-copy Kafka won't cut it for building Iceberg tables from real-time data.
• High latency: hours or days between data landing in Kafka and the Iceberg table being updated.
• Complexity: finicky code for transformations, schema migrations, and Spark job management.
• Small file problem: frequent writes produce swarms of small files, which means slow queries and expensive storage.
Takeaway: Traditional approaches to building Iceberg tables from Kafka data are plagued by latency, complexity, and scalability issues.
#icebergdatabase #kafkaintegration #datalakeengineering https://lnkd.in/grAfKQNv
Challenges in building Iceberg tables from Kafka data
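For context, the "traditional approach" criticized above usually looks something like the Spark Structured Streaming job sketched below. This is a minimal sketch, not the author's pipeline: the broker, topic, table, and checkpoint path are placeholders, and it assumes the spark-sql-kafka connector, the Iceberg Spark runtime, and an Iceberg catalog are already configured. Even this skeleton leaves you owning deserialization, schema evolution, checkpointing, and small-file compaction.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a hand-rolled Kafka -> Iceberg pipeline (placeholder names).
# Assumes the spark-sql-kafka connector, the Iceberg Spark runtime jar, and an
# Iceberg catalog named "lake" are configured on the cluster.
spark = SparkSession.builder.appName("kafka-to-iceberg").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                       # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS value",
                "timestamp")
)

# Each micro-batch commits new data files; frequent triggers mean many small
# files, so a separate compaction/maintenance job is still needed.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="10 minutes")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events")  # placeholder
    .toTable("lake.db.events")                                          # placeholder table
)
query.awaitTermination()
```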
Ever wondered why Kafka is insanely fast at moving data? It’s all about zero-copy transfers. Normally, when data moves from disk to the network, it bounces between kernel space and user space multiple times; that’s a lot of wasted movement. Kafka skips all that using the `sendfile` system call, which lets the OS move data directly from disk (or the page cache) to the network socket. Fewer copies → faster throughput. In short, Kafka avoids passing data around like a hot potato. Fun fact: `sendfile` doesn’t help with TLS/SSL, since encryption has to happen in user space, so it only shines for plaintext transfers. I actually used this same zero-copy technique while working on the Remote Shuffle Service for Apache Spark, and the speedup was no joke. Which other system-level optimizations have amazed you lately? Drop your favorites below 👇 #Kafka #ZeroCopy #SystemDesign #Performance #Engineering #LearnWithNayan Credits: https://lnkd.in/dtVdnbGV
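To make the idea concrete, here is a minimal sketch of the same `sendfile` trick in plain Python; the file path, host, and port are made up. The kernel pushes the file's bytes straight from the page cache to the socket, so they never pass through this process's user-space buffers.

```python
import os
import socket

def serve_file(path: str, host: str = "0.0.0.0", port: int = 9000) -> None:
    """Send one file to one client over plain TCP using zero-copy sendfile."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn, open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                # The kernel copies bytes from the file (often already in the
                # page cache) directly to the socket buffer: no read() into
                # user space, no write() back out.
                sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
                if sent == 0:
                    break
                offset += sent

# serve_file("/var/log/example.log")  # hypothetical path
```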
One of the hidden gems in my catalog is my video course on Kafka Connect. I think many people underestimate its power. Kafka Connect lets you stream data between Kafka and almost any external system without writing custom code: a new change in the source database can be reflected in your destination system in near real time. I think this is a super powerful and convenient feature. https://lnkd.in/e6qfGKQy #Kafka #KafkaConnect
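To show how little code "no code" really is: registering a connector boils down to posting one JSON config to the Connect REST API. A hedged sketch follows; the connector class is Confluent's JDBC source plugin (which must be installed separately), and the Connect URL, credentials, and table names are all hypothetical.

```python
import requests

# Hypothetical example: register a JDBC source connector with a Kafka Connect
# cluster listening on localhost:8083. Assumes the Confluent JDBC plugin is installed.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "connection.user": "connect",
        "connection.password": "secret",
        "mode": "incrementing",            # stream only newly inserted rows
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        "topic.prefix": "pg-",             # rows land on topic "pg-orders"
        "tasks.max": "1",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```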
New blog post on why I’m not a fan of zero-copy Iceberg tables for Apache Kafka.
“Zero-copy” integration between Apache Kafka and Iceberg has been discussed a lot online recently. The idea is that Kafka topics live directly as Iceberg tables. It’s an appealing vision: one copy of data shared between stream processing and analytics. But from a systems design perspective, it creates a tightly coupled system that’s harder to operate and less efficient:
🔹 Kafka brokers take on heavy compute work building and reading Parquet files.
🔹 Optimizing for analytics breaks Kafka’s sequential read efficiency, and vice versa.
🔹 Schema evolution becomes a tug-of-war between fidelity and cleanliness.
🔹 Boundaries blur. Who owns the data, Kafka or the lakehouse?
Sometimes, duplication is cheaper than coupling.
👉 Full post: https://lnkd.in/dEt2AfEu
"Zero-copy" between Kafka and Iceberg is one of those ideas which seem intriguing on the surface, but if you think more about it, there's actually quite a few issues with it. Access patterns--which heavily impact how data should be laid out for efficiency--differ substantially between a log (Kafka) and analytics store (Iceberg). Trying to satisfy both use cases with a single instance of your data is going to yield a sub-optimal system. Great post by Jack Vanlightly.
🚀 Demystifying Apache Spark Architecture
Apache Spark remains one of the most powerful frameworks for distributed data processing, but understanding its core architecture helps you truly unlock its performance potential. Here is how it breaks down:
Driver: coordinates the job, builds the DAG, and schedules tasks.
Cluster Manager: allocates resources across the cluster (YARN / Mesos / Standalone / Kubernetes).
Executors: run the actual tasks, cache data, and report results back to the driver.
✨ Together, these components make Spark scalable, fault-tolerant, and fast for large-scale analytics and ML workloads.
💬 How are you using Spark in your data pipelines today: for ETL, ML, or streaming?
#DataEngineering #ApacheSpark #BigData #ETL #PySpark #DataScience #Analytics #CloudComputing #DistributedSystems
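If you want to see those three roles from code, a bare-bones PySpark session makes them visible: the script below is the driver, the master URL picks the cluster manager, and the executor settings describe the workers. All names and values here are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession

# This script runs in the driver; the master URL chooses the cluster manager.
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .master("local[4]")                        # or "yarn", "k8s://...", etc.
    .config("spark.executor.instances", "2")   # illustrative executor sizing
    .config("spark.executor.memory", "2g")     # (only meaningful on a real cluster)
    .getOrCreate()
)

# The driver turns this job into a DAG of stages and schedules tasks on the
# executors; the executors run the tasks and send results back.
counts = (
    spark.range(1_000_000)
    .selectExpr("id % 10 AS bucket")
    .groupBy("bucket")
    .count()
)
counts.show()

spark.stop()
```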
Kafka Series – Part 4: Kafka’s Core Design Principles
Kafka isn’t “just fast” by accident — it’s designed around a few core principles that make it incredibly reliable and scalable. Let’s break them down clearly 👇
🧱 1. Durability
- Kafka appends every message to an on-disk commit log.
- Sequential writes to that log are very fast.
- Data survives restarts and failures — critical for financial and other mission-critical systems.
🌐 2. Scalability
- Topics can be split into many partitions and distributed across multiple brokers.
- You can scale producers, consumers, and brokers independently.
- Horizontal scaling makes it ideal for growing data volumes.
🧭 3. Fault Tolerance
- Data is replicated to multiple brokers.
- If one broker fails, replicas take over automatically.
- No single point of failure when configured correctly.
🚀 4. High Throughput & Low Latency
- Sequential disk I/O, zero-copy transfer, batching, and compression.
- Capable of handling millions of messages per second with millisecond latency.
These principles are why Kafka became the backbone of real-time data platforms worldwide.
💡 If you feel I missed something important or explained it differently than you would, please drop your thoughts in the comments — I’d love to learn from your perspective too.
#Kafka #SystemDesign #ScalableArchitecture #FaultTolerance #RealTimeData #LearnTogether
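Several of these principles map directly onto producer settings. Here is one possible mapping, sketched with the kafka-python client; the broker, topic, and specific values are placeholders rather than recommendations.

```python
from kafka import KafkaProducer

# Placeholder broker and topic; tuning values are illustrative only.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                # durability: wait for the in-sync replicas
    retries=5,                 # fault tolerance: retry transient broker errors
    compression_type="gzip",   # throughput: fewer bytes on the wire
    batch_size=64 * 1024,      # throughput: group records per partition
    linger_ms=20,              # wait briefly so batches can fill up
)

producer.send("payments", key=b"account-42", value=b'{"amount": 99.5}')
producer.flush()
producer.close()
```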
🚀 Open-Source Powerhouses Behind Modern Data Engineering
When people talk about data, they often think of dashboards and insights. But behind every clean dashboard, there’s a set of powerful open-source tools working quietly to move, process, and orchestrate data.
Let’s start with some of the key Apache projects every Data Engineer should know 👇
⚙️ Apache Airflow – orchestrates complex data pipelines.
⚡ Apache Kafka – handles real-time data streaming.
🔥 Apache Spark – processes huge datasets in parallel.
🌊 Apache Flink – brings real-time analytics to life.
🔄 Apache NiFi – automates and manages data flows.
🏗️ Apache Hive – enables SQL-like queries on big data.
These open-source tools are the backbone of modern data platforms. In my next posts, I’ll break them down one by one — showing how they work together to turn raw data into real insight.
#DataEngineering #Apache #Airflow #Kafka #Spark #Flink #NiFi #OpenSource
💡 How does Spark manage memory internally?
Apache Spark’s performance heavily depends on efficient memory management — it determines how data is cached, shuffled, and processed across executors.
Inside each executor JVM, Spark divides memory into four key regions:
* Execution Memory: temporary workspace for joins, aggregations, and shuffles.
* Storage Memory: used for caching and persisting RDDs/DataFrames.
* User Memory: holds user-defined objects and internal metadata.
* Reserved Memory: a small fixed portion set aside to prevent out-of-memory crashes.
Spark’s Unified Memory Model allows Execution and Storage to share space dynamically — unused cache memory can be borrowed for computation and vice versa. A well-tuned memory configuration minimizes spills to disk and speeds up job execution. Mastering this balance is the key to unlocking true Spark performance.
#ApacheSpark #BigData #DataEngineering #SparkOptimization #MemoryManagement #DataProcessing #ETL #PerformanceTuning #Databricks #CloudComputing #DataAnalytics #DistributedComputing #SparkJobs #DataScience #ClusterComputing #DataEngineer #TechArchitecture #DataOps #BigDataFramework #ModernDataStack
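The unified region is controlled by a couple of configuration knobs. A minimal sketch follows, with the two fractions shown at their documented default values just to make the knobs visible; the executor size and workload are illustrative, and none of this should be changed without profiling.

```python
from pyspark.sql import SparkSession

# Sketch of the knobs behind Spark's unified memory model.
# The two fractions below are shown at their documented defaults.
spark = (
    SparkSession.builder
    .appName("memory-tuning-demo")
    .config("spark.executor.memory", "4g")          # illustrative executor heap
    # Fraction of (heap - reserved) shared by execution and storage.
    .config("spark.memory.fraction", "0.6")
    # Share of that unified region where cached blocks are protected from eviction.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)

# Caching uses storage memory; the groupBy's shuffle uses execution memory,
# borrowing from the storage side when cache space sits unused.
df = spark.range(10_000_000).selectExpr("id % 100 AS k", "id AS v").cache()
df.groupBy("k").sum("v").show()

spark.stop()
```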
I recently explored Apache Kafka, and it’s been a fascinating look into how real-time data streaming actually works behind the scenes.
➜ Why Kafka exists - understanding the problems it was built to solve and why it’s such a core piece of modern data infrastructure
➜ Inside the architecture - how topics, producers, consumers, partitions, and brokers all fit together to create a scalable, fault-tolerant system
➜ Making sense of data flow - learning how offsets, message keys, and consumer groups keep messages ordered and systems reliable
➜ Practical demonstrations - explored examples showing real-time message flow, partitioning, multiple consumer groups, and a small Python-based project that sent and received emails through Kafka
➜ Beyond the basics - a glimpse into Kafka Connect and Kafka Streams, and how they extend Kafka’s power into data integration and stream processing
It’s incredible how systems like Kafka make real-time data pipelines possible at scale. This deep dive definitely helped me appreciate the design and thinking behind distributed systems even more.
#ApacheKafka #DataEngineering #BigData #EventStreaming #RealTimeData #KafkaStreams #KafkaConnect #DistributedSystems #TechLearning #CareerGrowth
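For anyone following along, the data-flow part (keys, offsets, consumer groups) fits in a few lines with the kafka-python client. This is a generic sketch, not the email project mentioned above; the broker, topic, and group id are placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # placeholder broker

# Messages with the same key hash to the same partition, so per-key order is kept.
producer = KafkaProducer(bootstrap_servers=BROKER)
for i in range(5):
    producer.send("signups", key=b"user-7", value=f"event-{i}".encode())
producer.flush()

# Consumers sharing a group_id divide the topic's partitions between them,
# and committed offsets let the group resume where it left off after a restart.
consumer = KafkaConsumer(
    "signups",
    bootstrap_servers=BROKER,
    group_id="welcome-email-service",   # placeholder group
    auto_offset_reset="earliest",
    enable_auto_commit=True,
)
for record in consumer:   # blocks, polling for new records
    print(record.partition, record.offset, record.key, record.value)
```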
Mastering Apache Kafka – From Basics to Performance Optimization!
If you’ve ever worked with real-time data, event-driven systems, or streaming pipelines, you’ve probably heard of Apache Kafka. I’ve compiled a complete beginner-to-advanced guide with concepts, examples, and performance tuning tips to help you become Kafka-ready:
🔹 Kafka Basics – Topics, Partitions, Replication, Brokers, Leaders & Consumer Groups
🔹 Example Use Cases – Website tracking, real-time stream processing, log aggregation, event sourcing
🔹 Producers & Consumers – Ack values, batching, compression & client libraries
🔹 Performance Optimization – Tuning brokers, balancing partitions, ISR (In-Sync Replicas), retention policies
🔹 Kafka Architecture Deep Dive – Logs, offsets, ZooKeeper, producer/consumer APIs
🔹 Best Practices – Partition distribution, avoiding hardcoding, scaling strategies, server concepts
💡 Whether you’re just starting with Kafka or looking to optimize production systems, this guide gives you a clear roadmap from basics ➝ advanced performance tuning.
👉 Check it out for complete notes & hands-on practice 😁 🧐 👍 : https://lnkd.in/gyjskYZN
#ApacheKafka #Kafka #EventStreaming #BigData #DataEngineering #RealTimeData #LearningCommunity #HelpingHands #AnshLibrary
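To tie a few of those tuning levers together, here is a hedged sketch of creating a topic with explicit partition count, replication, and retention using kafka-python's admin client. The broker address, topic name, and numbers are placeholders, not guidance.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Placeholder broker, topic, and settings.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

clickstream = NewTopic(
    name="clickstream",
    num_partitions=12,        # upper bound on consumer parallelism per group
    replication_factor=3,     # copies spread across brokers for fault tolerance
    topic_configs={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep data for 7 days
        "min.insync.replicas": "2",                    # pairs with producer acks=all
    },
)

admin.create_topics([clickstream])
admin.close()
```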