dbt Labs Open Sources MetricFlow: An Independent Schema for Data Interoperability
Data framework provider dbt Labs recently open-sourced MetricFlow, the SQL generation tool underpinning the dbt semantic layer, under the Apache 2.0 license. The implications of this development extend across the data ecosystem.
It reaffirms dbt Labs’ commitment to the Open Semantic Interchange (OSI) initiative, an effort by like-minded vendors (including Snowflake, Salesforce, Atlan and Alation) to create standards for exchanging data across platforms and tools.
The open-sourcing of MetricFlow also makes available its JSON-based metadata layer, which provides a universal schema independent of the definitions and metrics engine. Thus, even without adopting MetricFlow, the open source community can still use this semantic layer as a common exchange format for understanding data across tools and vendors. Organizations can also continue to access it through MetricFlow.
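For illustration, a single metric entry in such a JSON-based layer might look like the following sketch. The field names here are hypothetical, chosen to convey the idea rather than reproduce the actual specification:

```json
{
  "name": "revenue",
  "description": "Total booked revenue, net of refunds",
  "type": "simple",
  "expression": "SUM(amount)",
  "source_model": "fct_orders",
  "lineage": ["raw_orders", "stg_orders", "fct_orders"],
  "tests": ["not_null", "accepted_range"]
}
```

Because the entry carries lineage and test status alongside the definition itself, any tool or agent reading it can see not just what the metric means, but where it came from and whether it has been validated.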
The open-sourcing of this metadata layer may well deliver the interoperability between data systems that many have sought, yet few have achieved. The principal driver for these developments is the need for transparent trust in statistical AI applications, particularly those involving dynamic agents and the emergent Model Context Protocol (MCP) through which agents interact with tools.
“There’s two ways to think about semantic layers,” said Ryan Segar, chief product officer at dbt Labs. “The old way lets you define something and gives you an answer when you ask for it, but not the traceability under the hood to understand more, like the JOIN paths, where it came from, whether or not it’s trusted, and whether it’s been tested.”
“You can’t afford to do that in the AI era because when you’re using an LLM [large language model], and you’re talking about MCP [Model Context Protocol], the way to get more accurate answers is not just to give the surface-level answer of what revenue means and walk away. You need to be able to give a clearly defined and tested metadata trail to the models.”
Transforming MCP
MetricFlow — and its JSON-based metadata layer — can serve as the starting point for providing such granular information to agents, to the language models powering them and to the humans monitoring and auditing those agents. Although adoption rates of MetricFlow since dbt Labs’ open-sourcing of the tool have not yet been closely scrutinized, its potential impact on MCP’s evolution is very real. Even if the open source community embraces only its universal schema specification without the rest of MetricFlow, it could transform the way MCP itself functions.
At best, it can reshape the protocol from a terminus into a launching point for the understanding and trust enterprises need to get the results they want from agentic deployments. In this ideal, “MCP is not just an endpoint that gives you what you want and then is done,” Segar commented. “It’s the gateway to standardizing how any model thinks about interacting with your data and, more importantly, your metadata.”
The Universal Schema Specification
Realizing this ideal requires more than MetricFlow or its JSON-based semantic layer, which lets tools — including those for business intelligence (BI), AI, data warehousing, databases and more — share metrics, terminology and definitions with one another. It also requires a transformation tool like dbt to supply provenance: where the data behind an answer came from, and what was done to that data to ensure it is the right data for a particular application or query. MetricFlow’s universal schema specification, however, is the launching point for tooling across vendors, whether Databricks and Snowflake, Power BI and Tableau, or anything else, to communicate effectively with each other.
Consequently, regardless of where metrics were created, users can input them into this global schema and still understand their meaning across vendor ecosystems. According to Segar, this global JSON schema or metadata stack functions as “the Rosetta stone that’s in the middle. It’s the common ground so companies don’t have to integrate directly with each other anymore. They can integrate and adopt this metadata spec that’s common across all of us and that’s what’s going to allow them to read and parse.” If users choose to access this metadata stack independently of the rest of MetricFlow, they can rely on metrics they’ve been using in a BI tool for years, for example, and still have other tools and products understand the underlying semantics.
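The “Rosetta stone” pattern can be sketched in a few lines of Python: two hypothetical consumers (a BI tool and an AI agent) read the same vendor-neutral metric spec instead of integrating point-to-point with each other. The JSON field names and both consumer functions are illustrative assumptions, not the actual MetricFlow specification:

```python
import json

# A hypothetical metric entry in a shared, vendor-neutral JSON schema.
SHARED_SPEC = """
{
  "metrics": [
    {
      "name": "revenue",
      "description": "Total booked revenue",
      "expression": "SUM(amount)",
      "source_model": "fct_orders"
    }
  ]
}
"""

def load_metrics(spec_text):
    """Parse the shared schema once; any tool can reuse this step."""
    return {m["name"]: m for m in json.loads(spec_text)["metrics"]}

def bi_tool_label(metric):
    """A BI tool might render the metric as a display label."""
    return f"{metric['name'].title()}: {metric['description']}"

def agent_context(metric):
    """An AI agent might render the same metric as grounding context."""
    return (f"Metric '{metric['name']}' is computed as "
            f"{metric['expression']} over {metric['source_model']}.")

metrics = load_metrics(SHARED_SPEC)
print(bi_tool_label(metrics["revenue"]))
print(agent_context(metrics["revenue"]))
```

Each consumer integrates once with the common spec, so adding an N-th tool does not require N-1 new pairwise integrations.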
Defining Metrics
Now that MetricFlow is accessible to the open source community, anyone can create metrics and their accompanying definitions with its engine. MetricFlow translates those definitions into SQL, with all of that language’s ubiquitous benefits throughout the data space.
For example, “You can define the definition of gross margin, and what MetricFlow does is compile that definition into SQL,” Segar explained. “That SQL is not just there to say ‘you asked for gross margin and here’s the answer.’ It understands that if you talk about gross margin, calendars come up. So, the fiscal calendar, how do you honor that and what’s the logic underneath?” Naturally, organizations can still avail themselves of the global metadata standard that’s part and parcel of MetricFlow, if they like, in addition to being able to access it without the rest of the MetricFlow offering.
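The compile step Segar describes can be illustrated with a toy sketch: a declarative metric definition rendered into SQL. The dict fields and the generated query below are assumptions for illustration, not MetricFlow’s real internals or syntax:

```python
# A hypothetical declarative definition of gross margin.
GROSS_MARGIN = {
    "name": "gross_margin",
    "expression": "SUM(revenue) - SUM(cost_of_goods_sold)",
    "source_model": "fct_orders",
    "time_dimension": "fiscal_quarter",  # honoring the fiscal calendar
}

def compile_metric(metric):
    """Render a metric definition as a grouped SQL query string."""
    return (
        f"SELECT {metric['time_dimension']}, "
        f"{metric['expression']} AS {metric['name']}\n"
        f"FROM {metric['source_model']}\n"
        f"GROUP BY {metric['time_dimension']}"
    )

print(compile_metric(GROSS_MARGIN))
```

The point of the sketch is that the calendar logic lives in the definition, not in each downstream query, so every tool asking for gross margin gets the same fiscal-quarter treatment.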
Interoperability
The use cases for the interoperability that MetricFlow’s JSON schema makes possible are innumerable. Still, the most pressing one at the moment is making statistical AI deployments more trustworthy, reliable and accurate. These benefits are redoubled in agent-based AI deployments, particularly since many are built on LLMs that organizations haven’t trained or fine-tuned.
“In this AI world where everyone is worried about accuracy scores and how the model derived the answer, you need it to be explainable,” Segar mentioned. “If you want trust, you have to have transparency. Transparency needs to not just be human-readable. It needs to be repeatable and portable so that AI can interact with it, and understand, and crawl through how the metrics are defined.”