Data lakehouses, a new kind of data store that combines the flexibility of data lakes with the structure and performance of data warehouses, are on track to co-opt data warehouses, though they will not supplant data lakes or purpose-built data marts, predicts Tony Baer, a longtime database analyst and founder of the research firm dbInsight.
In a new report posted today, Baer argues that although lakehouses lack some of the more sophisticated features of their mature predecessors, the gaps are quickly being closed and will be largely addressed over the next 12 to 18 months. “The data lakehouse is about delivering the best of both worlds: the scale and flexibility of the data lake with the [service-level agreements], repeatability, and mature governance of the data warehouse,” he writes.
There will likely be some winnowing of the market, which is currently led by three open-source platforms: Databricks Inc.’s Delta Lake, Apache Hudi and Apache Iceberg. In the same way that the mobile device market settled on two standards, Apple Inc.’s iOS and the open-source Android, enterprise buyers will want a limited range of options backed by robust ecosystems.
Delta Lake, Iceberg lead
Delta Lake and Iceberg hold the pole positions, but major enterprise technology players such as IBM Corp. and SAP SE have yet to place their bets, and their endorsements could raise Hudi’s profile. Onehouse, a startup launched by the principal developer of Hudi, announced $25 million in new funding less than two weeks ago.
Lakehouses bring many of the same advantages as data warehouses at lower cost, along with support for a combination of structured and unstructured data, Baer writes. Today’s platforms sport warehouse-like features such as atomicity, consistency, isolation and durability, or ACID, compliance, which ensures that transactions are processed reliably. They provide schema-on-read capabilities and data transformation powered by open-source engines such as Apache Spark, Apache Drill and Trino.
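The schema-on-read idea can be illustrated with a toy sketch in plain Python (not the API of any engine named above): raw records land in storage as untyped text, and a schema is applied only when the data is queried, so different consumers can read the same bytes with different schemas.

```python
import csv
import io

# Raw data lands in the lake as plain text; nothing is enforced on write.
RAW_CSV = """order_id,amount,region
1001,19.99,us-east
1002,5.25,eu-west
1003,42.00,us-east
"""

def read_with_schema(raw: str, schema: dict) -> list:
    """Apply a column -> type mapping at read time (schema-on-read)."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw)):
        # Cast only the columns this consumer's schema cares about.
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows

# Two consumers read the same raw bytes with different schemas.
orders = read_with_schema(RAW_CSV, {"order_id": int, "amount": float, "region": str})
amounts = read_with_schema(RAW_CSV, {"amount": float})

print(orders[0])   # {'order_id': 1001, 'amount': 19.99, 'region': 'us-east'}
print(len(amounts))  # 3
```

The design point is that the write path stays cheap and flexible, as in a data lake, while structure is a property of each read, which is what lets a lakehouse serve both raw and curated views of the same files.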
Modern lakehouses can handle multipetabyte analytic and machine learning workloads at performance levels that rival data warehouses. They do this while supporting relational table structures on top of open file formats such as Parquet and CSV running on low-cost object storage. As a bonus, they support “time travel” queries against data at different points in time, enabling users to query the data as it existed when a decision was made.
Gaps to fill
That said, there are a few gaps lakehouses still must address, Baer writes. Most early implementations don’t manage cloud storage automatically. Multitable transactions and joins are enabled through proprietary functionality, and tables work on an append-only basis, meaning that older data must be periodically pruned.
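That pruning obligation follows directly from the append-only design: since nothing is rewritten in place, superseded versions accumulate until a maintenance pass removes them. A rough sketch of such a retention-based pass, in plain Python with a hypothetical version log (the file names and timestamps are invented for illustration):

```python
# Toy version log: (commit_timestamp, data files making up that version).
version_log = [
    (1_000, ["part-000.parquet"]),
    (2_000, ["part-000.parquet", "part-001.parquet"]),
    (3_000, ["part-001.parquet", "part-002.parquet"]),  # part-000 superseded
]

def prune(log: list, retention: int, now: int) -> list:
    """Drop versions older than the retention window, always keeping the latest."""
    kept = [entry for entry in log if now - entry[0] <= retention]
    return kept if kept else [log[-1]]

recent = prune(version_log, retention=1_500, now=3_200)
print([ts for ts, _ in recent])  # [2000, 3000] -- the oldest version is gone
```

Once a version falls out of the log, any data file no longer referenced by a surviving version can be deleted from object storage, which is the storage-management chore the report notes most early implementations still leave to the user.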
Some providers, including Amazon Web Services Inc., Oracle Corp. and Teradata Corp., still use proprietary table formats, but Baer believes open source will win out in the long run. A consistent table structure “has always been table stakes, not the differentiator, among data warehouses, and that won’t change with data lakehouses,” he writes.
Market ecosystems, not technology differences, will define winners and losers, Baer believes. For example, Databricks supports read-and-write capabilities through its partner ecosystem, and Iceberg is being bundled with a handful of analytics platforms.
Data lakes, purpose-built data warehouses and data marts won’t disappear, Baer predicts. Lakehouses will be overkill for small data marts and single-purpose workloads and are not yet robust enough to handle multiple outer joins and high concurrency. However, open-source software improves steadily and will likely address these deficiencies over time, just as relational databases overcame their early performance disadvantages.