From ETL to AI Agents: My Decade-Long Dance with Data
I’ve been building data solutions since 2013 — from scrappy analytics projects to full-blown data transformations across industries. If there’s one thing this decade-long journey has taught me, it’s this:
The implementation is rarely the hard part. Explaining why your approach is right — that’s the real challenge.
Whether I was cleaning spreadsheets for KPIs or orchestrating end-to-end data platforms for real-time insight delivery, the technical shift has always been significant — but the strategic thinking behind it has been the real differentiator.
🧱 The Early Days: Building with Bricks and Cron Jobs
Back in 2013, data meant on-prem relational databases — MSSQL, MySQL, Oracle. Pipelines were built with manual scripts, scheduled cron jobs, and an optimistic attitude.
Dashboards refreshed overnight (if at all). ETL pipelines failed quietly in the background. And every schema change required a war council.
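To ground the nostalgia, here is a minimal sketch of what one of those nightly jobs typically looked like, scheduled by cron and written by hand (the hostnames, tables, and credentials are illustrative, not from a real project):

# Nightly ETL circa 2013, run by cron, e.g.: 0 2 * * * /usr/bin/python /opt/etl/nightly_orders.py
import pymysql

# Pull yesterday's orders from the OLTP database...
source = pymysql.connect(host="oltp-server", user="etl", password="***", database="sales")
# ...and push them into the on-prem warehouse.
target = pymysql.connect(host="dwh-server", user="etl", password="***", database="warehouse")

with source.cursor() as src, target.cursor() as dst:
    src.execute(
        "SELECT order_id, amount, created_at FROM orders "
        "WHERE created_at >= CURDATE() - INTERVAL 1 DAY"
    )
    dst.executemany(
        "INSERT INTO fact_orders (order_id, amount, created_at) VALUES (%s, %s, %s)",
        src.fetchall(),
    )

target.commit()
source.close()
target.close()

If it failed at 2 a.m., you found out from an empty dashboard at 9 a.m.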
🌩️ Enter the Cloud & Big Data Era (2016–2019)
Cloud disrupted everything.
I remember my first Kafka-Python connector. I spent a week tuning partition logic — just to stream data from MySQL to AWS S3.
Back then, real-time analytics started becoming possible — if you were ready to architect like a mad scientist.
🔍 Example: Real-Time with AWS Kinesis + MySQL Logs (CDC)
In one project, we applied the CDC (Change Data Capture) pattern to a production MySQL instance. Changes flowed into Amazon MSK (Kafka), then into AWS Kinesis, where we processed records through Kinesis Data Analytics, enriched them, and stored the outputs in S3 for downstream querying via Athena.
💡 It worked. It scaled. But it required serious DevOps and stream engineering muscle. Every new use case meant provisioning, debugging, and lots of YAML.
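For a sense of the plumbing involved, here is a hedged sketch of the Kafka-to-Kinesis hop: a small Python bridge that reads CDC events from an MSK topic and forwards them to a Kinesis stream (the topic, stream, region, and field names are illustrative, not the actual project's):

import json
import boto3
from kafka import KafkaConsumer

# CDC events (e.g. Debezium-style change records) land on an MSK topic.
consumer = KafkaConsumer(
    "mysql.sales.transactions",
    bootstrap_servers="msk-broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

kinesis = boto3.client("kinesis", region_name="eu-west-1")

for message in consumer:
    event = message.value
    # Forward each change record to Kinesis for enrichment downstream.
    kinesis.put_record(
        StreamName="transactions-cdc",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(message.partition),
    )

Multiply that by every enrichment step, retry policy, and IAM role, and you get the "serious muscle" part.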
🏡 The Lakehouse Era: Where Real-Time Became a Single Command
Fast forward to now — and the landscape feels radically different.
With Lakehouse platforms like Databricks and Snowflake, what used to take days now takes minutes.
Take Databricks Structured Streaming — a modern marvel.
⚡ Example: Real-Time Kafka to Delta with Structured Streaming
Want to stream real-time data from Kafka into a Delta Lake and keep it clean and ready for BI or ML? It takes just a few lines of PySpark:
# `spark` is the SparkSession that Databricks provides in notebooks and jobs.
# Read the Kafka topic as a streaming DataFrame.
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "transactions") \
    .load()

# Cast the raw Kafka payload to a string and write it continuously to a Delta table.
df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/checkpoints/tx") \
    .start("/delta/transactions")
No more complex glue code. No more batch-then-stream architecture. Just one streaming pipeline, unified storage, and instant queryability via SQL, notebooks, or APIs.
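And that queryability is real: the Delta path the stream writes to can be read back immediately, in the same notebook. A minimal sketch, assuming the paths from the snippet above:

# Read the same Delta table the stream is writing to and query it with SQL.
tx = spark.read.format("delta").load("/delta/transactions")
tx.createOrReplaceTempView("transactions")
spark.sql("SELECT COUNT(*) AS tx_count FROM transactions").show()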
This is what the Lakehouse architecture enables — combining the reliability of warehouses with the flexibility of data lakes and the power of streaming engines.
📊 Real-Time Analytics: From Luxury to Standard
What was once a bleeding-edge experiment is now an expected business capability. Stakeholders today ask:
“Can I get this updated in real time?”
And the answer is no longer “well… maybe”. It’s yes — and we already do.
Platforms like Snowflake Snowpipe, Databricks DLT, and Delta Sharing make real-time operational intelligence a first-class citizen.
👻 The Spooky Future: No More Data Engineering?
Let’s talk about the eerie silence creeping in behind the pipelines.
et’s talk about the eerie silence creeping in behind the pipelines.
Back in university, my graduation project was an agent-based intrusion detection system — a distributed network security solution where autonomous agents monitored traffic, identified anomalies, and flagged threats in real time.
It was my first deep dive into agent-driven automation — systems that don’t wait for humans to intervene, but instead learn, adapt, and act.
At the time, I was fascinated by:
- How agents could operate independently, yet collaborate in a network.
- The idea of self-healing, self-scaling, and self-optimizing systems.
I didn’t know then that this same philosophy would one day shape how we manage data.
Today, we’re seeing:
- AI-driven data quality tools (like Anomalo and Monte Carlo) that don’t just detect anomalies — they explain them and suggest fixes.
- Pipeline-as-code generators using LLMs and metadata catalogs to create robust flows without writing a single line of SQL.
- Data observability platforms integrating with GitOps to enforce SLAs without manual monitoring.
Soon, I believe:
Agents will build, monitor, optimize, and even govern our data workflows.
🧠 So What Do We Become?
The future isn’t no humans in data. It’s new humans in new roles:
- Data Engineers will evolve into Data Product Owners, defining business-driven models and lifecycle policies.
- Governance Architects will embed trust into the system instead of chasing it after the fact.
- AI Trainers will oversee autonomous data agents, tuning behavior instead of tuning queries.
We’ll no longer ask “What’s the pipeline doing?”
We’ll ask:
“Is the pipeline aligned with our product goals?”
🎓 Closing Thought: This Journey Isn’t Over
I didn’t write this article to flaunt certifications. But earning credentials in Snowflake and Databricks is symbolic for me — a marker that I’ve adapted to the tech shifts that matter most in today’s data landscape.
These tools — and the philosophy behind them — represent where we’re headed.
If you're on this journey too — learning, unlearning, building — I hope this gave you some ideas, validation, or inspiration.
And if one night, you hear an AI agent quietly optimizing your schema and whispering:
“You can sleep now… I’ve got the data pipeline covered.”
Just smile. Maki saw it coming. 😉
#DataEngineering #Databricks #Snowflake #Lakehouse #RealTimeAnalytics #DataOps #FutureOfData #MakiWasRight