From ETL to AI Agents: My Decade-Long Dance with Data
I’ve been building data solutions since 2013 — from scrappy analytics projects to full-blown data transformations across industries. If there’s one thing this decade-long journey has taught me, it’s this:
The implementation is rarely the hard part. Explaining why your approach is right — that’s the real challenge.
Whether I was cleaning spreadsheets for KPIs or orchestrating end-to-end data platforms for real-time insight delivery, the technical shift has always been significant — but the strategic thinking behind it has been the real differentiator.
🧱 The Early Days: Building with Bricks and Cron Jobs
Back in 2013, data meant on-prem relational databases — MSSQL, MySQL, Oracle. Pipelines were built with manual scripts, scheduled cron jobs, and an optimistic attitude.
Dashboards refreshed overnight (if at all). ETL pipelines failed quietly in the background. And every schema change required a war council.
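To ground the nostalgia, here is a minimal sketch of what one of those nightly jobs typically looked like, scheduled by cron and written by hand (the hostnames, tables, and credentials are illustrative, not from a real project):

# Nightly ETL circa 2013, run by cron, e.g.: 0 2 * * * /usr/bin/python /opt/etl/nightly_orders.py
import pymysql

# Pull yesterday's orders from the OLTP database...
source = pymysql.connect(host="oltp-server", user="etl", password="***", database="sales")
# ...and push them into the on-prem warehouse.
target = pymysql.connect(host="dwh-server", user="etl", password="***", database="warehouse")

with source.cursor() as src, target.cursor() as dst:
    src.execute(
        "SELECT order_id, amount, created_at FROM orders "
        "WHERE created_at >= CURDATE() - INTERVAL 1 DAY"
    )
    dst.executemany(
        "INSERT INTO fact_orders (order_id, amount, created_at) VALUES (%s, %s, %s)",
        src.fetchall(),
    )

target.commit()
source.close()
target.close()

If it failed at 2 a.m., you found out from an empty dashboard at 9 a.m.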
🌩️ Enter the Cloud & Big Data Era (2016–2019)
Cloud disrupted everything.
I remember my first Kafka-Python connector. I spent a week tuning partition logic — just to stream data from MySQL to AWS S3.
Back then, real-time analytics started becoming possible — if you were ready to architect like a mad scientist.
🔍 Example: Real-Time with AWS Kinesis + MySQL Logs (CDC)
In one project, we applied the CDC (Change Data Capture) pattern to a production MySQL instance. Changes flowed into Amazon MSK (Kafka), then into AWS Kinesis, where we processed records through Kinesis Data Analytics, enriched them, and stored the outputs in S3 for downstream querying via Athena.
💡 It worked. It scaled. But it required serious DevOps and stream engineering muscle. Every new use case meant provisioning, debugging, and lots of YAML.
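For a sense of the plumbing involved, here is a hedged sketch of the Kafka-to-Kinesis hop: a small Python bridge that reads CDC events from an MSK topic and forwards them to a Kinesis stream (the topic, stream, region, and field names are illustrative, not the actual project's):

import json
import boto3
from kafka import KafkaConsumer

# CDC events (e.g. Debezium-style change records) land on an MSK topic.
consumer = KafkaConsumer(
    "mysql.sales.transactions",
    bootstrap_servers="msk-broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

kinesis = boto3.client("kinesis", region_name="eu-west-1")

for message in consumer:
    event = message.value
    # Forward each change record to Kinesis for enrichment downstream.
    kinesis.put_record(
        StreamName="transactions-cdc",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(message.partition),
    )

Multiply that by every enrichment step, retry policy, and IAM role, and you get the "serious muscle" part.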
🏡 The Lakehouse Era: Where Real-Time Became a Single Command
Fast forward to now — and the landscape feels radically different.
With Lakehouse platforms like Databricks and Snowflake, what used to take days now takes minutes.
Take Databricks Structured Streaming — a modern marvel.
⚡ Example: Real-Time Kafka to Delta with Structured Streaming
Want to stream real-time data from Kafka into a Delta Lake and keep it clean and ready for BI or ML? It takes just a few lines of PySpark:
# `spark` is the SparkSession that Databricks provides in notebooks and jobs.
# Read the Kafka topic as a streaming DataFrame.
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "transactions") \
    .load()

# Cast the raw Kafka payload to a string and write it continuously to a Delta table.
df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/tx") \
    .start("/delta/transactions")
No more complex glue code. No more batch-then-stream architecture. Just one streaming pipeline, unified storage, and instant queryability via SQL, notebooks, or APIs.
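And that queryability is real: the Delta path the stream writes to can be read back immediately, in the same notebook. A minimal sketch, assuming the paths from the snippet above:

# Read the same Delta table the stream is writing to and query it with SQL.
tx = spark.read.format("delta").load("/delta/transactions")
tx.createOrReplaceTempView("transactions")
spark.sql("SELECT COUNT(*) AS tx_count FROM transactions").show()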
This is what the Lakehouse architecture enables — combining the reliability of warehouses with the flexibility of data lakes and the power of streaming engines.
📊 Real-Time Analytics: From Luxury to Standard
What was once a bleeding-edge experiment is now an expected business capability. Stakeholders today ask:
“Can I get this updated in real time?”
And the answer is no longer “well… maybe”. It’s yes — and we already do.
Platforms like Snowflake Snowpipe, Databricks DLT, and Delta Sharing make real-time operational intelligence a first-class citizen.
👻 The Spooky Future: No More Data Engineering?
Let’s talk about the eerie silence creeping in behind the pipelines.
et’s talk about the eerie silence creeping in behind the pipelines.
Back in university, my graduation project was an agent-based intrusion detection system — a distributed network security solution where autonomous agents monitored traffic, identified anomalies, and flagged threats in real time.
It was my first deep dive into agent-driven automation — systems that don’t wait for humans to intervene, but instead learn, adapt, and act.
At the time, I was fascinated by:
- How agents could operate independently, yet collaborate in a network.
- The idea of self-healing, self-scaling, and self-optimizing systems.
I didn’t know then that this same philosophy would one day shape how we manage data.
Today, we’re seeing:
- AI-driven data quality tools (like Anomalo and Monte Carlo) that don’t just detect anomalies — they explain them and suggest fixes.
- Pipeline-as-code generators using LLMs and metadata catalogs to create robust flows without writing a single line of SQL.
- Data observability platforms integrating with GitOps to enforce SLAs without manual monitoring.
Soon, I believe:
Agents will build, monitor, optimize, and even govern our data workflows.
🧠 So What Do We Become?
The future isn’t no humans in data. It’s new humans in new roles:
- Data Engineers will evolve into Data Product Owners, defining business-driven models and lifecycle policies.
- Governance Architects will embed trust into the system instead of chasing it after the fact.
- AI Trainers will oversee autonomous data agents, tuning behavior instead of tuning queries.
We’ll no longer ask “What’s the pipeline doing?”
We’ll ask:
“Is the pipeline aligned with our product goals?”
🎓 Closing Thought: This Journey Isn’t Over
I didn’t write this article to flaunt certifications. But earning credentials in Snowflake and Databricks is symbolic for me — a marker that I’ve adapted to the tech shifts that matter most in today’s data landscape.
These tools — and the philosophy behind them — represent where we’re headed.
If you're on this journey too — learning, unlearning, building — I hope this gave you some ideas, validation, or inspiration.
And if one night, you hear an AI agent quietly optimizing your schema and whispering:
“You can sleep now… I’ve got the data pipeline covered.”
Just smile. Maki saw it coming. 😉
#DataEngineering #Databricks #Snowflake #Lakehouse #RealTimeAnalytics #DataOps #FutureOfData #MakiWasRight