At Foundations 2025, Andrew Chabot of FinThrive and Eric Tome of Databricks offered a real-world demo of combining a governed CData Software semantic layer with Databricks Genie to accelerate time-to-insight, enable smarter automation, and support rapid experimentation.

“Does Databricks understand what's behind? No. Databricks doesn't care. Databricks just knows assets, locations, and sensors right here. And what's great about some of this Data Shop functionality is if I go in here and you click to see Python, it'll pull up this Python right here and pretty much give you the code that you need to run in Databricks to make your connection and actually pull data out of the endpoint and hydrate your lakehouse.” -- Andrew Chabot, FinThrive

Access their session in FULL, along with other insights from Foundations speakers: https://bit.ly/44K8MbW

#CData #CDataFoundations #Databricks #DatabricksGenie #Lakehouse #SemanticLayer #DataVirtualization
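As a rough illustration of the kind of generated connection code Andrew describes, here is a hedged PySpark sketch that pulls a governed table over JDBC and lands it as a Delta table. The connection string, driver class, and table names are placeholders, not the actual code CData Data Shop produces.

```python
# Hypothetical sketch: pull a governed table from a JDBC endpoint into the lakehouse.
# The URL, driver class, and table names below are placeholders, not CData's generated code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:..."  # placeholder connection string supplied by the semantic layer

assets_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "Sensors")            # hypothetical governed view
    .option("driver", "com.example.Driver")  # placeholder driver class
    .load()
)

# Hydrate the lakehouse: land the pulled data as a Delta table for Genie to query.
assets_df.write.format("delta").mode("overwrite").saveAsTable("iot.sensors_bronze")
```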
Databricks is shipping fast on Azure 🚀 A few highlights I’m excited about:

• Databricks One (Public Preview): a simpler UI that puts AI/BI and apps in one place.
• Lakeflow Pipelines Editor (Public Preview): Python/SQL file-first pipelines and easier debugging.
• New system tables: pipeline update history and data classification results for governance.
• Delta Sharing upgrades: share federated (foreign) tables across workspaces.
• Databricks SQL upgrades: semantic metadata in metric views, UTF8 collation LIKE, the new spatial ST_ExteriorRing function, multi-variable DECLARE, TEMPORARY metric views, and streaming WITH options.
• Heads-up: upcoming time-travel/VACUUM behavior changes in DBR 18.0 (a rough sketch of that interplay below).

#AzureDatabricks #DatabricksSQL #UnityCatalog #DeltaLake #Lakehouse #DataEngineering
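Since that last heads-up touches behavior many jobs rely on, here is a small, hedged sketch of how Delta time travel and VACUUM retention interact today. The table name and retention window are invented, and the exact semantics in DBR 18.0 should be checked against the release notes.

```python
# Rough sketch of the time-travel / VACUUM interplay the DBR 18.0 note is about.
# Table name and retention values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the table's version history.
spark.sql("DESCRIBE HISTORY sales.orders").show(truncate=False)

# Time travel to an earlier version (only possible while its files are still retained).
old_df = spark.sql("SELECT * FROM sales.orders VERSION AS OF 42")

# VACUUM removes files older than the retention window, which limits how far back
# time travel can reach -- the piece whose behavior is slated to change in DBR 18.0.
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
```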
Day 1 of #100DaysOfDataEngineering 🚀

Today, we’re focusing on variables and naming conventions, small things that make a big difference in data engineering. Variables store data, intermediate results, and configuration values, helping you build clean, reusable, and scalable pipelines. But the real magic comes when you name them clearly:

✅ Use descriptive names like customer_count, cleaned_sales_df, or aggregated_orders.
✅ Classes should be in CamelCase: TransactionPipeline, CustomerData.
✅ Constants in uppercase: MAX_RETRIES, DEFAULT_PATH.
✅ Avoid vague names like x or data1; clarity matters for collaboration.

Good naming isn’t just style: it means readability, maintainability, and fewer bugs, and it makes your code far easier for other developers to pick up and build on (see the short sketch below). In large pipelines, clear variable names help your teammates understand your logic instantly, and they make debugging a lot easier. Remember, consistent naming today saves hours of headaches tomorrow! 💡

#Python #DataEngineering #BestPractices #ETL #DataPipelines #CleanCode #LearningJourney #100DaysOfDataEngineering
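A minimal Python sketch of those conventions in one place; the pipeline, table, and column names are made up for illustration.

```python
# Illustrative only: the pipeline, path, and column names are invented.
MAX_RETRIES = 3                      # constants in UPPERCASE
DEFAULT_PATH = "/mnt/raw/sales/"

class TransactionPipeline:           # classes in CamelCase
    def __init__(self, source_path: str = DEFAULT_PATH):
        self.source_path = source_path

    def run(self, raw_sales_df):
        # Descriptive names make each step self-explanatory (raw_sales_df is a pandas DataFrame).
        cleaned_sales_df = raw_sales_df.dropna(subset=["order_id"])
        aggregated_orders = (
            cleaned_sales_df.groupby("customer_id")["amount"].sum().reset_index()
        )
        customer_count = aggregated_orders["customer_id"].nunique()
        return aggregated_orders, customer_count
```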
Pandas or PySpark… which one should you ACTUALLY use?

Every data engineer has asked this question at least once. Let’s break it down in a real, no-fluff way:

𝐏𝐚𝐧𝐝𝐚𝐬 = Fast, simple, perfect for small to medium data
𝐏𝐲𝐒𝐩𝐚𝐫𝐤 = Distributed, scalable, built for BIG data

The smartest teams? They use BOTH strategically (see the hybrid sketch below).

In this carousel, we’ll show you:
➡ When Pandas is the right choice
➡ Where Pandas fails
➡ Why PySpark saves the day
➡ And the BEST hybrid approach used by top companies in 2025!

Want scalable, high-performance data pipelines? That’s exactly what we build at #ShrijanTechnology

Check the slides and tell us in the comments: What are YOU currently using, Pandas or PySpark?

#Pandas #PySpark #BigData #DataEngineering #TechInsights #ShrijanTech #Scalability #Python
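As a hedged sketch of that hybrid approach, one common pattern is to let PySpark do the distributed aggregation and then hand the small result to Pandas for exploration. The table and column names are placeholders.

```python
# Hybrid pattern sketch: Spark does the distributed heavy lifting, Pandas takes the small result.
# Table and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# PySpark: scan and aggregate a large table across the cluster.
daily_revenue = (
    spark.table("sales.transactions")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Pandas: the aggregate is small, so pull it to the driver for quick analysis or plotting.
daily_revenue_pd = daily_revenue.toPandas()
print(daily_revenue_pd.describe())
```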
What Actually Happens Inside a Data Pipeline – a Simple Breakdown

Without clean, standardised data, dashboards don’t provide any value. The real impact starts with a well-structured data pipeline.

When I first started working with data, the term “data pipeline” sounded intimidating, like something only big tech companies handled. After building pipelines for AI automation and BI dashboards, I realized it’s really just a systematic flow of data from source to insight.

Here’s a simple breakdown: 👇

#DataEngineering #DataAnalytics #AIAutomation #DataPipelines #PowerBI #Python #Azure #ETL #AnalyticsEngineering #GenerativeAI
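As a minimal illustration of that source-to-insight flow, here is a hedged Pandas sketch of extract, transform, and load; the file paths and columns are invented.

```python
# Minimal ETL sketch: source file -> cleaned, standardised table for the BI layer.
# Paths and column names are invented for illustration.
import pandas as pd

# Extract: read raw data from a source system export.
raw_orders = pd.read_csv("raw/orders.csv", parse_dates=["order_date"])

# Transform: standardise and clean before anything touches a dashboard.
orders = (
    raw_orders
    .dropna(subset=["order_id", "amount"])
    .assign(amount=lambda df: df["amount"].astype(float))
    .drop_duplicates(subset=["order_id"])
)

# Load: write an analysis-ready table the dashboard can consume.
orders.to_parquet("curated/orders.parquet", index=False)
```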
🧠💭 It’s been a while since I dropped some data talk here… Truth is, I haven’t been building new pipelines lately 😅 But guess what? Sometimes the best way to level up isn’t by doing more, it’s by thinking smarter.

Lately, I’ve been revisiting the basics: data modeling, architecture flow, why a small design choice can make or break an entire system, and strengthening my core database and coding skills. Crazy how revisiting fundamentals can give you new insights, right?

Next on my radar 👇
⚙️ Creating and optimizing data pipelines
⚡ Trying my hand at real-time streaming data
🎯 Building scalable systems (without losing sleep 😴)

So yeah, a little quiet, but definitely cooking something behind the scenes 🍳 How do you usually reset your learning mode when you hit pause?

#DataEngineering #LearningJourney #Python #Azure #CareerGrowth #TechHumor #DataTalk
A Roadmap to the different opportunities in the world of Data Science.

Start with the fundamentals: the basics of math, Python, SQL, and version control. Then pick a path to specialize in: data engineering, data analytics, or machine learning. The map shows how these pieces connect and why crossing between them matters.

Data engineering sets up the data storage and processing, data analytics turns data into actionable insights, and machine learning adds predictive power. Deployment brings models and dashboards into production and real-world use. The goal is to become a true data science expert who can own end-to-end solutions.

Which track are you focusing on this year, and how will you connect the dots across the stack to deliver real impact?

More details: https://lnkd.in/eXWi7s-G

#DataScience #DataEngineering #DataAnalytics #MachineLearning #Deployment #CareerPath #FullStackDataScience
🐼 Pandas vs PySpark: Same Goals, Different Scales! ⚡

Every data engineer or data analyst hits this moment: your Pandas code runs perfectly on small data… but then you try the same on millions of rows 😅 That’s where PySpark steps in: same logic, but built to handle massive-scale data with distributed computing.

Here’s the real deal 👇
📊 Pandas → Best for small to medium datasets, quick exploration, local analysis.
🔥 PySpark → Built for big data, parallel processing, and cluster environments.

In short:
➡ Start with Pandas to understand data (see the side-by-side sketch below).
➡ Move to PySpark when your laptop fan starts sounding like a jet engine 🚀

#Pandas #PySpark #DataEngineering #BigData #DataAnalytics #MachineLearning #Python #Spark #ETL #DataScience #AnalyticsLife #DataFrame #Coding #learning
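To make the "same logic, different scale" point concrete, here is a hedged side-by-side sketch of the same aggregation in both libraries; the file and column names are placeholders.

```python
# Same logic, two scales -- file and column names are placeholders.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Pandas: fine while the CSV fits comfortably in one machine's memory.
sales_pd = pd.read_csv("sales.csv")
top_products_pd = sales_pd.groupby("product")["amount"].sum().nlargest(10)

# PySpark: identical intent, but executed in parallel across a cluster.
spark = SparkSession.builder.getOrCreate()
sales_sp = spark.read.csv("sales.csv", header=True, inferSchema=True)
top_products_sp = (
    sales_sp.groupBy("product")
    .agg(F.sum("amount").alias("total"))
    .orderBy(F.desc("total"))
    .limit(10)
)
```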
𝗪𝗵𝗲𝗻 𝘀𝗵𝗼𝘂𝗹𝗱 𝘆𝗼𝘂 𝘂𝘀𝗲 𝗣𝘆𝗦𝗽𝗮𝗿𝗸 𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝗣𝗮𝗻𝗱𝗮𝘀?

While I was working at 𝗝𝘂𝘀𝘁𝗔𝗱𝘀, we needed to generate millions of ad creative copies from massive XML files, gigabytes in size. We started with Pandas. At first, it worked. But as the data kept growing, our server memory began maxing out. Processing that should have taken minutes was running into hours, and scaling further felt impossible. That’s when the team realized Pandas wasn’t built for this scale.

𝗪𝗶𝘁𝗵 𝗣𝘆𝗦𝗽𝗮𝗿𝗸:
• The workload was distributed across a cluster
• We could process huge XML files without memory bottlenecks
• Generating creatives became much faster and more reliable

𝗞𝗲𝘆 𝗹𝗲𝘀𝘀𝗼𝗻 𝗳𝗿𝗼𝗺 𝘁𝗵𝗮𝘁 𝗽𝗿𝗼𝗷𝗲𝗰𝘁:
• Pandas is best for smaller datasets that fit in memory, useful for exploration and prototyping
• PySpark is built for large-scale, distributed processing of gigabytes to terabytes in production workloads

#𝗣𝘆𝗦𝗽𝗮𝗿𝗸 #𝗣𝘆𝘁𝗵𝗼𝗻
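A hedged sketch of the pattern described above, assuming the spark-xml package is available on the cluster; the paths, row tag, and column names are placeholders, not JustAds code.

```python
# Sketch: read multi-gigabyte XML in parallel with spark-xml instead of loading it
# into one machine's memory. Paths, row tag, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ads_df = (
    spark.read.format("xml")              # requires the spark-xml package on the cluster
    .option("rowTag", "creative")         # hypothetical element that marks one record
    .load("s3://example-feeds/*.xml")     # placeholder input path
)

# Transformations run distributed across the cluster, then land as Parquet.
ads_df.select("id", "headline", "landing_url").write.mode("overwrite").parquet(
    "s3://example-output/creatives/"
)
```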
Predictive Anomaly Detection for Data Center Assets! 🚀

Excited to share a project that proactively identifies hardware anomalies using simulated data center telemetry and powerful machine learning! This solution employs Isolation Forest and a Keras Autoencoder for effective anomaly detection, paving the way for better predictive maintenance. 💡 The project includes robust feature engineering (lag/rolling features) and clear training scripts.

Quick Start:
pip install -r requirements.txt
Run src/generate_synthetic.py, then src/preprocess.py, and finally src/train_and_evaluate.py.

🔗 GitHub Repository: https://lnkd.in/gmr_U9g5
🔗 Live Streamlit App: https://lnkd.in/gtRhuWKB

Check it out and let me know your thoughts! 👇

#AnomalyDetection #MachineLearning #PredictiveMaintenance #DataCenter #Python
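Not the repository's code, just a hedged sketch of the same idea: lag and rolling features on telemetry, then Isolation Forest flags the outliers. The column names and contamination rate are invented.

```python
# Sketch only: lag/rolling feature engineering plus Isolation Forest on telemetry.
# Column names and parameters are invented, not taken from the linked repo.
import pandas as pd
from sklearn.ensemble import IsolationForest

telemetry = pd.read_csv("telemetry.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# Feature engineering: lag and rolling statistics over a temperature signal.
telemetry["temp_lag_1"] = telemetry["temperature"].shift(1)
telemetry["temp_roll_mean"] = telemetry["temperature"].rolling(window=12).mean()
telemetry["temp_roll_std"] = telemetry["temperature"].rolling(window=12).std()

feature_cols = ["temperature", "temp_lag_1", "temp_roll_mean", "temp_roll_std"]
features = telemetry.dropna(subset=feature_cols)[feature_cols].copy()

# Isolation Forest: -1 marks points the model considers anomalous.
model = IsolationForest(contamination=0.01, random_state=42)
features["anomaly"] = model.fit_predict(features[feature_cols])
```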
As year end approaches, many data teams are beginning to decide which projects will make their 2026 roadmaps. If improving on legacy orchestration is on your radar, check out this blog from Eric Thomas.
Eric Thomas took an excellent lakehouse tutorial built with Airflow and rebuilt it with Dagster. The stack is the same: MinIO, Trino, Iceberg, and dbt. The orchestrator is different.

The results were striking:
→ Event-driven sensors replaced time-based scheduling. Pipelines run when data arrives, not on a clock.
→ Smart partitioning enabled backfills and selective reruns. No more all-or-nothing processing.
→ Asset checks created multi-layered quality validation. Data quality became programmatic, not just hoped for.
→ Pure SQL patterns eliminated Python bottlenecks. Trino handles the heavy lifting.

The lakehouse provides the foundation, but the orchestration layer determines how effectively teams can actually use it. The original tutorial teaches lakehouse fundamentals beautifully. This comparison shows how much orchestration choice matters for production readiness.

Check out the full blog today! Link in the comments.
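For readers new to Dagster, here is a hedged, simplified sketch of two of the patterns the post calls out (software-defined assets and asset checks). It is not code from Eric's blog; the asset and check names are invented, and it assumes a recent Dagster version where an asset check can load the asset value by parameter name.

```python
# Simplified sketch of a software-defined asset plus a programmatic quality check.
# In the real stack the asset would query Trino/Iceberg; here a stand-in DataFrame.
import pandas as pd
from dagster import asset, asset_check, AssetCheckResult

@asset
def daily_orders() -> pd.DataFrame:
    # Placeholder data standing in for a Trino/Iceberg query result.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})

@asset_check(asset=daily_orders)
def orders_are_not_empty(daily_orders: pd.DataFrame) -> AssetCheckResult:
    # Data quality as code: the check result is recorded alongside the asset in Dagster.
    return AssetCheckResult(passed=len(daily_orders) > 0)
```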