Data platforms are evolving to be about more than building discrete dashboards and AI/ML models. Customers need to build systems that drive business outcomes and continually learn from operations. Management has always been about building processes. In the past, those processes were mostly captured in tacit knowledge. In the future, those processes will increasingly be embodied in software. Data platforms have to help customers build systems, not analytic artifacts.
Tools from data platform vendors will extract increasingly actionable intelligence from an entire data estate. That data will train models that capture what has happened to inform decisions about what should happen. Customers will architect applications as systems of models.
The data about the business outcomes these systems drive will become a flywheel to improve the entire system. AI/ML models will progressively learn more parts of the business and the business will progressively get more embodied in systems embedding those models.
Data is becoming the API to manage the business.
Data representations of the people, places, things, and activities in a business collectively become like a digital twin. This twin captures the state of the business: what happened, why it happened, what is likely to happen, and what should happen. Organizing and harmonizing today's sprawling data estates ultimately becomes the critical function of corporate IT departments.
Value creation is shifting from tools that harmonize data into data products to toolchains that drive business outcomes
Value Creation Today
Harmonizing & governing data in order to create data “products”
Data platforms and their tools today are mostly about defining or up-leveling data from “strings” to “things” in manually hard-coded pipelines. The next step extends to the creation of discrete analytic artifacts such as BI dashboards and ML models. Catalogs have to govern all these artifacts with permissions, lineage, quality, and observability.
Emerging Value Creation
Harmonizing data by more easily converting “strings” into “things”
Data products today are built on static pipelines. Intelligent applications need to work with a repository that more seamlessly uplevels the "strings" that today's databases care about to the people, places, and "things" that developers and businesses care about. Composing new applications is difficult: developers must manually create a new pipeline every time they define a new business "thing" from underlying data objects.
Toolchains to build systems over standalone artifacts
Data products today, including BI dashboards, ML models, and even frontier GenAI models, can't participate in intelligent applications that orchestrate the business. Customers need vendor toolchains to build systems of multiple models that orchestrate the business, drive outcomes, and then learn and improve from experience. These systems will be the foundation of intelligent applications. As more of these systems are integrated, intelligent applications will start to embody increasing managerial know-how.
In conjunction with this shift in value creation, the control point for vendors is shifting from managing data to defining and governing data.
Changing point of control of the system of truth (SoT) from the database compute engine to the catalog harmonizing the most data
Point of Control Today
The DBMS compute engine owns the SoT
The DBMS compute engine "owns" the data today, and any other tool or engine that wants to read or write that data has to go through that DBMS. The ultimate expression of this is Snowflake Container Services (SCS), which provides access to any Snowflake data via an SDK. With a cloud-native architecture, compute and storage are independently scalable. In the emerging world, customers need to bring many compute engines and tools to work with their data. In other words, customers need to be able to separate any vendor's compute from their data. That calls for a new control point to serve as the system of truth.
Emerging Point of Control
The catalog as foundation for all the toolchains and compute engines
With the emerging focus on harmonizing a data estate, the emerging control point is the catalog. The catalog becomes the mediator whenever anyone reads or writes data. It also contains all the metadata required to uplevel strings into things. That metadata is also the foundation on which all the other toolchains for building systems of models are built. But the analytic artifacts that a catalog governs still rest on tables of data, and we haven't fully resolved the standards battle over table formats.
Iceberg vs Delta Tables: it’s not a solved problem
The table standards war isn't over yet, despite Databricks' purchase of Tabular, the company founded by Iceberg's creators. Table formats matter because customers want the option to use any tool or compute engine to read and write their data. Unfortunately, interoperability won't come as easily as Databricks' announcement of the Uniform format last year suggested.
Databricks' Uniform format, despite promising interoperability between native Delta Tables, Iceberg, and Hudi when it was introduced, still only offers read access in Iceberg format. It will eventually get there. But mapping between formats is not easy even if they're all based on the columnar Parquet file format. The technology industry had to build an entirely new category of analytic DBMSs just to rearrange data from row-oriented operational databases into the new columnar-oriented systems.
If there's any doubt about how hard high-performance interoperability will be, listen to Ryan Blue, co-creator of Iceberg and CEO of Tabular. Here he's speaking with me and Dain Sundstrom, co-creator of Starburst Trino, about how deep the differences are.
Why Databricks needed to buy Tabular
What's significant is that all the expertise of the Tabular team will likely be redirected from building governance technology like a catalog to solving the Delta Tables interoperability problem. The challenge for Snowflake is that Iceberg may now evolve more around Delta interop needs than toward the extra functionality embodied in Snowflake's proprietary Managed Iceberg table format.
What’s possible with Databricks and Uniform format today
Uniform enables any tool to read and write Delta Tables. Technically, there's a Spark execution engine in the read/write path somewhere, but Databricks has packaged and priced that SKU in a way that makes it effectively open. Currently, Iceberg tables are read-only. And currently, Unity is the catalog and source of truth connected to all these tables. Now that Unity is open source, Databricks has made that layer free and open as well.
What’s possible with Snowflake and Iceberg today
Snowflake supports Managed Iceberg (MI) tables alongside its native tables for read/write access. But MI isn't vanilla Iceberg, so tools that support vanilla Iceberg can only read those tables. Write access has to go through the Snowflake SDK and its underlying execution engine. And since there's no SKU with pricing by workload type, that's full-fare access.
Snowflake’s announcement of its open-source Polaris Iceberg catalog partially got lost when Databricks dramatically announced it was open-sourcing Unity along with its Iceberg support. But, again, Unity and Databricks Uniform currently only support read access to Iceberg tables. Polaris is useful for Snowflake customers with a growing Iceberg data estate because it synchronizes the advanced data governance policies from Snowflake’s native Horizon catalog with all the external Iceberg tables. Third-party engines such as Flink, Starburst Trino, Dremio, and even AWS EMR’s Spark can all read and write to those Iceberg tables with full governance centrally managed in Snowflake.
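As a rough sketch of what that looks like in practice, the PySpark configuration below points an external Spark engine at an Iceberg REST catalog endpoint such as Polaris. The package version, endpoint URL, credentials, and table names are placeholders rather than a tested setup.

```python
# Minimal sketch: an external Spark engine reading and writing Iceberg tables
# through an Iceberg REST catalog (e.g., Polaris). Endpoint, credentials, and
# table names are placeholders, not a working configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("external-engine-via-rest-catalog")
    # Pull in the Iceberg Spark runtime (version must match your Spark build).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "polaris" backed by the REST catalog endpoint.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://<account>.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "<catalog-name>")
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")
    .getOrCreate()
)

# Reads and writes go through the catalog, so centrally managed governance applies.
spark.sql("SELECT COUNT(*) FROM polaris.sales.orders").show()
spark.sql("INSERT INTO polaris.sales.orders_summary "
          "SELECT region, SUM(amount) FROM polaris.sales.orders GROUP BY region")
```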
Even though the catalog is the control point, Databricks' open-sourcing of Unity is likely to accelerate the shift of value capture to the tools that build on how the catalog harmonizes the data estate.
Changing Source of Vendor Value Capture
Value Capture Today: DBMS compute owns the data
Why Snowflake is hard to catch as a DBMS
Snowflake had a lead in maturity, integration, and sophistication as a cloud DBMS that no one was likely to catch. It featured low-latency, highly concurrent query workloads; transparent data partitioning; low-latency, high-throughput data sharing without copying within a single region; integration of streaming data with historical queries in end-to-end declarative pipelines that auto-refresh with latency as low as a minute and soon as low as 10 seconds; an execution engine that can provide access to multiple storage engines such as graph or vector stores; and Snowflake Container Services, which is now part of native apps for seamless extensibility and distribution via a marketplace. If that weren't enough, some level of transaction support is imminent.
As Andy Jassy used to say, there’s no compression algorithm for experience.
Databricks, however, is catching up to Snowflake in some critical areas. Serverless availability of all its workloads means admins no longer have to worry about infrastructure. I believe serverless can even run in a customer’s tenant where the customer controls the data. Liquid clustering means admins no longer have to worry about data partitioning. And, most important, Databricks SQL appears to be improving more rapidly in performance. Snowflake touted a stat that showed the same queries running 20-25% faster than roughly two years ago. If I understood Databricks’ claim, they are saying their customers’ queries run roughly 73% faster in the same time period. Performance was always a crown Snowflake wore by itself.
Last year, before conference season, I didn't think Databricks could ever catch Snowflake's DBMS. Even as they get closer, Databricks has changed the axis of competition. They focused on harmonizing and governing the data estate via the Unity catalog. And by open-sourcing Unity, they are shifting value capture to the set of tools and compute engines that create analytic artifacts and systems on that harmonized data.
Emerging Value Capture: Tools that create analytic systems that inform or automate decisions and learn based on the outcome of those decisions
Data platforms have been mostly about creating discrete analytic artifacts such as BI dashboards, ML models, and the pipelines that feed them. Snowflake went somewhat further with a marketplace for applications that interpreted the data they embodied. In the future, vendor platforms and tools will help customers use their harmonized data estates to build more sophisticated systems comprised of multiple analytic artifacts. These systems will help drive outcomes such as nurturing leads or better coordinating sales & operations planning.
Here we’re going to focus on the GenAI tools since they’re evolving so rapidly and because they are the area of most interest. Enterprises have far more data locked away behind their firewalls than frontier model vendors will ever be able to access on the public internet. Snowflake and Databricks’ opportunity is to uplevel the skills of the personas they already serve so they can train GenAI systems to reason over that data.
Snowflake Cortex makes GenAI a seamless extension for developers and business users
Snowflake's ethos has always been simplicity. They deliver that simplicity for GenAI capabilities such as fine-tuning, search, and query. They are using GenAI to expand both the scope of their user community and the breadth of data over which their tools can reason.
Developers can call Cortex GenAI model functionality from within SQL or Snowpark, making it as easy as calling a user-defined function. Access to NVIDIA NIM microservices adds an extremely rich library of specialized models.
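For a sense of how lightweight that is, here is a minimal sketch of calling a Cortex LLM function from Snowpark Python; the connection parameters, model name, table, and prompt are illustrative placeholders.

```python
# Rough sketch: invoking a Cortex LLM function from Snowpark Python much like a
# user-defined function. Connection parameters, table, and model are placeholders.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}).create()

# SNOWFLAKE.CORTEX.COMPLETE runs a hosted LLM over each row, entirely inside SQL.
df = session.sql("""
    SELECT
        ticket_id,
        SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-large',
            CONCAT('Summarize this support ticket in one sentence: ', ticket_text)
        ) AS summary
    FROM support_tickets
""")
df.show()
```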
Cortex Fine-Tuning converts fine-tuning open source models, currently Mistral and Llama 3, into a no/low code exercise where customers only need to supply the data set. Cortex Search brings query functionality to documents. Cortex Analyst extends natural language query to SQL data so all users can interact with data without needing a data analyst to build a dashboard in advance.
Right now, Analyst functionality is somewhat constrained: developers have to prepare a text file that describes the database schema so the model can translate business terminology into the vocabulary the system uses. But Snowflake is working on an editor to improve the experience. In offline discussions I learned that, somewhat further out, they're building a full-blown metric definition and serving layer directly into the DBMS. Before reaching that milestone, they'll merge Search and Analyst so end users can query across all types of information in natural language. Until the metric definitions mature, however, BI tool vendors may be able to provide a better natural-language query experience via LLMs that go through their more robust metric and dimension definitions.
Databricks Mosaic toolchain is designed to power the most sophisticated systems
Databricks is building tools that enable developers to design systems, not just models. Their belief is that models will become like specialized subsystems that collectively drive the behavior of more sophisticated compound systems.
Mosaic evaluation functionality illustrates the sophistication of the tools Databricks is building. While testing has always been a part of traditional software development, evaluation has emerged as the GenAI equivalent. But model evaluation has been so hard to measure that researchers on X/Twitter typically track the very subjective comments about "vibes" when new models are released. Mosaic makes it possible to track performance metrics relative to the objective of the system the models are part of. That makes it possible for developers to optimize the system's behavior, not just a single component. Databricks also extended its MLflow MLOps tooling to encompass model testing and evaluation and is folding that data into the Unity catalog.
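To make the idea of system-level evaluation concrete, here is a simplified sketch using open-source MLflow's evaluate API. The dataset, metric type, and the stand-in answer function are illustrative; this is not Databricks' managed Mosaic evaluation.

```python
# Simplified sketch of GenAI evaluation with open-source MLflow; the dataset and
# the answer function are stand-ins for a real system under test.
import mlflow
import pandas as pd

# A tiny evaluation set: inputs plus the "ground truth" answers we care about.
eval_data = pd.DataFrame({
    "inputs": ["What is our refund window?", "Do you ship internationally?"],
    "ground_truth": ["30 days from delivery.", "Yes, to most countries."],
})

def answer(df: pd.DataFrame) -> pd.Series:
    # Placeholder for the real compound system (retriever + model(s) + prompts).
    return pd.Series(["30 days from delivery.", "We ship worldwide."])

with mlflow.start_run():
    results = mlflow.evaluate(
        model=answer,
        data=eval_data,
        targets="ground_truth",
        model_type="question-answering",  # built-in text metrics (some need optional packages)
    )
    print(results.metrics)  # system-level metrics, tracked run over run
```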
Looking further out, offline conversations revealed that the company has hired the researcher behind the popular DSPy framework, which makes it possible to optimize an entire pipeline of models during training. And the Mosaic CTO as well as Matei Zaharia, Databricks co-founder and CTO, both said in offline conversations that future versions of Unity would harmonize data estates into knowledge graphs. Future versions of Mosaic will be able to train models to be more effective at finding actionable insights within corporate data upleveled by that richer representation.
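For context on why that hire matters, the sketch below shows the general DSPy pattern of declaring a multi-step pipeline and compiling it against a metric. The model, signatures, metric, and training data are placeholders, and API details vary by DSPy version.

```python
# Rough sketch of the DSPy idea: declare a pipeline of model calls, then let an
# optimizer tune its prompts/examples against a metric. Everything here is a
# placeholder; exact APIs vary by DSPy version.
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # LM setup varies by version

class AnswerWithContext(dspy.Module):
    """A two-step pipeline: summarize a document, then answer from the summary."""
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("document -> summary")
        self.answer = dspy.ChainOfThought("summary, question -> answer")

    def forward(self, document, question):
        summary = self.summarize(document=document).summary
        return self.answer(summary=summary, question=question)

def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

trainset = [
    dspy.Example(document="...", question="...", answer="...").with_inputs(
        "document", "question"
    ),
]

# The optimizer bootstraps prompts/examples for the whole pipeline, not one model.
optimizer = BootstrapFewShot(metric=exact_match)
compiled = optimizer.compile(AnswerWithContext(), trainset=trainset)
```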
The growing role of corporate data and systems of models relative to frontier models such as GPT-4
When ChatGPT and then GPT-4 were sweeping the industry, there was some speculation that "frontier models were all you need" to build intelligence into customer applications. The emerging consensus is that "intelligence" is a systems problem. That means customers will engineer systems comprised of multiple specialized models, which sometimes may include a frontier model as a component. But the evolving design of systems with ever-improving specialized components will "liberate" these custom systems from whatever performance curve single-model systems are on.
Frontier models themselves still seem to obey scaling laws. However, the early belief that enterprise systems would just involve a frontier model, an embedding model, a vector database for search, and a retriever is fading as customers realize they or their tool vendors can build far more sophisticated applications by architecting multiple specialized building blocks. Databricks definitely appears to be addressing those more sophisticated needs. Currently, the Databricks tools are for pro-code developers; Microsoft's Power Platform and Salesforce's low-code tools address more mainstream corporate developers.
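As a toy illustration of the compound-system idea, the sketch below routes each request to a specialized model when one can handle it, falls back to a frontier model otherwise, and logs outcomes to feed the improvement flywheel. Every component here is a stand-in, not any vendor's actual toolchain.

```python
# Toy sketch of a compound system: route requests to specialized models where
# possible, fall back to a frontier model, and log outcomes so the system can
# learn from them. All components are stand-ins.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Route:
    name: str
    can_handle: Callable[[str], bool]
    model: Callable[[str], str]

def specialized_sql_model(request: str) -> str:
    return "SELECT ..."          # stand-in for a fine-tuned text-to-SQL model

def frontier_model(request: str) -> str:
    return "general answer"      # stand-in for a hosted frontier model call

ROUTES: List[Route] = [
    Route("text_to_sql", lambda r: "revenue" in r.lower(), specialized_sql_model),
]

OUTCOME_LOG: List[Tuple[str, str, str]] = []  # feeds evaluation and retraining

def handle(request: str) -> str:
    for route in ROUTES:
        if route.can_handle(request):
            answer = route.model(request)
            OUTCOME_LOG.append((route.name, request, answer))
            return answer
    answer = frontier_model(request)
    OUTCOME_LOG.append(("frontier", request, answer))
    return answer

print(handle("What was revenue by region last quarter?"))
```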
What’s Still Missing from the Data Platform Supporting Systems Development
Monitoring and evaluation tools must mature. They need to provide critical feedback about business outcomes that supports system deployment and the data flywheel that improves the models within the systems.
Models, like developers, navigate and reason far more effectively over data and applications harmonized into semantically meaningful people, places, things, and activities. Whether catalogs like Databricks' Unity evolve into knowledge graphs, Salesforce extends its Data Platform's Customer 360 data graphs, Palantir's Ontology spreads, or declarative knowledge graphs from EnterpriseWeb or RelationalAI mature, we are likely to spend the next decade building this new layer. Agents and analytics will be far more powerful if this new layer can provide meaningful context and a palette of tools and actions.
As analytics pervades application functionality more fully, Snowflake and Databricks are becoming more critical to the capabilities of customers’ operational systems. Future articles will explore potential scenarios for this ecosystem.