I think the blog post should point out very early that Onehouse is a Hudi company. There are other recent benchmarks, published at CIDR by Databricks, that might paint a different picture: https://petereliaskraft.net/res/cidr_lakehouse.pdf
Thanks for the link. I'd be interested to see a perf comparison using a popular processing engine other than Spark, given the obvious potential for Delta Lake to be better tuned for Spark workloads by default.
In Databricks' published benchmark, of course Delta is the fastest. I have also seen Iceberg-using companies publish benchmarks showing how Iceberg is the fastest.
I think vendor published benchmarks are fine if the dataset is open / accessible, the benchmark code is published, all software versions are disclosed, and the exact hardware is specified. I definitely wouldn't consider an audited TPC benchmark that's based on industry standard datasets / queries worthless in the data space. Disclosure: I work for Databricks.
It looks like the benchmarks used the latest versions of Delta and Iceberg but chose a version of Hudi that is over 6 months old. Hudi v0.12.2 is more advanced than v0.12.0, which the benchmark did not consider. As the Databricks CIDR paper states, and as mentioned in the Onehouse article, Hudi is optimized by default for UPSERTs rather than INSERTs, and switching is a 1-line config change that is appropriate for a true apples-to-apples comparison. See both: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-trans... and https://github.com/brooklyn-data/delta/pull/2
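To make that concrete, here is a minimal PySpark sketch of the write-operation knob. The paths, table name, and column names are made up for illustration; the option keys follow the Hudi datasource docs, but check them against the Hudi version you actually run.

    # Minimal sketch of the Hudi write-operation knob (requires the
    # hudi-spark bundle on the Spark classpath). Paths, table name, and
    # columns are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-write-demo").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical input

    (df.write.format("hudi")
        .option("hoodie.table.name", "events")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.precombine.field", "ts")
        # Hudi defaults to upsert; for an insert-only, apples-to-apples
        # benchmark this is the "1-line config change" mentioned above:
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .mode("append")
        .save("s3://my-bucket/lake/events/"))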
Hah, I could tell because in the "feature matrix" the Hudi column was mostly green compared to the others. That immediately made me suspicious, so I looked it up and, sure enough, it's not exactly an unbiased source.
Feature matrices are extremely easy to game depending on your choice of rows.
I recently evaluated these frameworks and went through all the links they provide for each of those rows when it was first published a few months ago. FWIW, I did not find any inaccuracies or wrong pointers.
Some high level context for those less familiar with the Lakehouse storage system space. For various reasons, several companies moved from data warehouses to data lakes starting around 7-10 years ago.
Data lakes are better for ML / AI workloads, cheaper, more flexible, and separate compute from storage. With a data warehouse, you need to share compute with other users. With data lakes you can attach an arbitrary number of computational clusters to the data.
Data lakes were limited in many regards. They were easily corrupted (no schema enforcement), required slow file listings when reading data, and didn't support ACID transactions.
I'm on the Delta Lake team and will speak to some of the benefits of Delta Lake compared to data lakes:
* Delta Lake supports ACID transactions, so Delta tables are harder to corrupt. The transaction log makes it easy to time travel, version datasets, and roll back to earlier versions of your data.
* Delta Lake allows for schema enforcement & evolution
* Delta Lake makes it easy to compact small files (big data systems don't like an excessive number of small files)
* Delta Lake lets readers get files and skip files via the transaction log (much faster than a file listing). Z-ordering the data makes reads even faster. (See the sketch after this list.)
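Here is a minimal PySpark sketch of those bullet points, assuming Delta Lake 2.x via the delta-spark package; the table path and column name are invented for illustration:

    # Minimal sketch of the features above (Delta Lake 2.x assumed; the
    # path and column name are invented for illustration).
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    path = "/tmp/delta/events"

    # ACID append: the commit either lands in the transaction log or it doesn't.
    df = spark.range(1000).withColumnRenamed("id", "event_id")
    df.write.format("delta").mode("append").save(path)

    # Schema enforcement: appending a DataFrame with a mismatched schema is
    # rejected with an error instead of silently corrupting the table.

    # Time travel: read the table as of an earlier version in the log.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Small-file compaction plus Z-ordering.
    dt = DeltaTable.forPath(spark, path)
    dt.optimize().executeZOrderBy("event_id")

    # Roll back to an earlier version if a bad write slips through.
    dt.restoreToVersion(0)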
The Delta Lake protocol is implemented in a Scala library and exposed via PySpark, Scala Spark, and Java Spark bindings. This is the library most people think of when conceptualizing Delta Lake.
There is also a Delta Lake Java Standalone library that's used to build other readers like the Trino & Hive readers.
The Delta Rust project (delta-rs) is another implementation of the Delta Lake protocol, accessible via Rust or Python bindings. Polars just added a Delta Lake reader built on delta-rs, and the library can also be used to easily read Delta tables into other DataFrame libraries like pandas or Dask.
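For the non-Spark path, a minimal sketch with the delta-rs Python bindings (the `deltalake` package) and Polars; the table path is invented, and exact method names may vary a bit between versions:

    # Minimal sketch of reading a Delta table without Spark, via the
    # delta-rs Python bindings and Polars. The path is invented; check the
    # APIs against the versions you have installed.
    import polars as pl
    from deltalake import DeltaTable

    path = "/tmp/delta/events"

    # delta-rs: read the transaction log, then materialize the data.
    dt = DeltaTable(path)
    print(dt.version())          # current table version
    print(dt.files())            # data files taken from the log, no file listing
    pandas_df = dt.to_pandas()   # or dt.to_pyarrow_table()

    # Polars: the new reader built on top of delta-rs.
    polars_df = pl.read_delta(path)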
Lots of DataFrame users are struggling with data lakes / single data files. They don't have any data skipping capabilities (unless Parquet file footers are read), their datasets are easily corruptible, and they don't have any schema enforcement / schema evolution / data versioning / etc. I expect the data community to accelerate the shift to Lakehouse storage systems as they learn about all of these advantages.
Not that I know what anything means in "big data lake OLAP database" anymore, but I always thought a data lake implied a lot of hybrid sources/formats/structures for the data. The advocacy here implies that the data is all ingested and reformatted, which to me is a data warehouse.
But then again, data lake may simply be what a data warehouse is now called in marketspeak.
Also, I stopped paying attention when the treadmill of new frameworks became unbearable to track. Is Spark now settled as the standard for distributed "processing", as in MapReduce / distributed query / distributed batch / etc.?
I get that performance can improve by unifying on a file format like Parquet, but again that seems like a data warehouse. A data lake should be something over heterogeneous sources with "drivers" or "adaptors" IMO, in particular because the restoration of the data inputs stays in the knowledge domain of the source production database maintainers.
You're understandably confused by the industry terminology that's ambiguous and morphing over time.
Data lakes are typically CSV/JSON/ORC/Avro/Parquet files stored in a storage system (cloud storage like AWS S3 or HDFS). Data lakes are schema on read (the query engine gets the schema when reading the data).
A data warehouse is something like Redshift that bundles storage and compute. You have to buy the storage and compute as a single package. Data warehouses are schema on write. The schema is defined when the table is created.
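As a rough sketch of the schema-on-read vs schema-on-write distinction (bucket, table, and column names are invented, and the DDL is generic warehouse SQL):

    # Rough sketch of schema-on-read vs schema-on-write. Bucket, table,
    # and column names are invented for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    # Schema on read (data lake): the files are just JSON sitting in object
    # storage; the schema is inferred (or supplied) at query time.
    events = spark.read.json("s3://my-bucket/raw/events/")
    events.printSchema()

    # Schema on write (warehouse): the schema is fixed when the table is
    # created, and every load has to conform to it. Roughly:
    warehouse_ddl = """
    CREATE TABLE events (
        event_id BIGINT,
        user_id  BIGINT,
        ts       TIMESTAMP,
        payload  VARCHAR(65535)
    )
    """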
And yes, I'd say that Spark is generally considered the "standard" distributed data processing engine these days although there are alternatives.
You're making a lot of assertions I am not sure I agree with:
> Data lakes are better for ML / AI workloads, cheaper, more flexible, and separate compute from storage. With a data warehouse, you need to share compute with other users. With data lakes you can attach an arbitrary number of computational clusters to the data.
- I am not sure it's any cheaper than BQ or Snowflake storage.
- Modern CDW separates compute from storage.
- I am not sure what you mean by "you need to share compute with others". Why?
- You can attach an arbitrary number of "clusters" in BQ and Snowflake as well.
Additionally, modern CDWs provide a very high level of abstraction and manageability. Their time travel and compaction actually work, and their storage systems are continuously optimized for performance.
These formats look like an attempt to get a halfway solution: you want to get something like a real MPP analytic DBMS (e.g., ClickHouse) but have to use a data lake for some reason.
It resembles previous trendy technologies that are mostly forgotten now, such as:
- Lambda architecture (based on the wrong assumption that you cannot have real-time and historical layers in the same system);
- Multidimensional OLAP (based on a wrong assumption that you cannot do analytic queries directly on non-aggregated data);
- Big data (based on a wrong assumption that map-reduce is better than relational DBMS).
I'm exaggerating a little.
Disclaimer: I work on ClickHouse, and I'm a follower of every technology in the data processing area.
The formats are kind of a halfway solution, because trying to build something with MPP semantics on object stores is difficult.
The difference between an MPP system and something like Databricks or Trino working against an object store is that while the MPP system can likely get much better performance, and especially latency, out of the same hardware, operating it is much harder.
You don't "backup" Databricks - the data is stored in object storage and that is it. You don't have to plan storage sizing quarters upfront, and you never get in trouble because there is unexpected data spike. Compute resizes are trivial, there is no rebalancing. Upgrades are easy, because you're just upgrading the compute and you can't break data that way. You can give each user group (like batch one and interactive one, or each team) their dedicated compute over common data and it works. That compute can spin up and down and autoscale to save some money. You don't have to think about how to replicate my table across a cluster or anything. And so on, and so forth.
Running a big data and analytics platform - a place where tens of teams, tens of applications, and hundreds or thousands of analysts come for data and build their solutions - is already enough of a challenge without all this operations work, and that is why Snowflake and Databricks are worth that crazy money.
If someone could solve the challenge of having MPP that is as easy to manage as Snowflake or a Lakehouse, that would be quite the differentiator. And maybe you people already did and I just didn't notice, I don't know :)
It would be good if you labeled your posts so as to reveal your bias.
I understand why folks want options. At the end of the day, folks want an easy-to-use, ALWAYS CORRECT, stable database, with minimal, well-documented, predictable knobs, a correct distributed execution plan, no OOMs, separation of storage and compute, and standard SQL, and ClickHouse struggles with all of the above. (co-founder of MotherDuck)
Could you please elaborate on your comments and possible misconceptions about ClickHouse? Proven stability, massive scale, predictability, native SQL, and industry-best performance are all well-recognized characteristics of ClickHouse, so your comments here seem a bit biased.
I am interested to learn more about your point of view, as well as tangentially the strategic vision of MotherDuck as a company.
Naming this concept a 'data lake' bugs me. Turns out when you get old, you read things sometimes and you're thinking "ah, so that's where my line was. Interesting."
It’s marketing BS designed to look good as a whiteboard diagram. “See all your disparate data sources here? That’s your problem. Your competitors have all their data in a data warehouse. You like warehouses? No? Neither do I. Stuffy, stale old places. You need all this data to flow into a central living breathing data lake. You like lakes? You do? Well now one thing to be aware of is to really appreciate the data lake, there’s nothing like a data lakehouse…”
Data Lakehouse is where you store your data on object stores and spin up a bunch of instances (cpu/memory) as needed to crunch the data on the timeline you desire. It’s incredible to me how “solutions” are continually invented that give cloud providers plenty of stages to charge for not only storage but movement and processing of your data too.
No, that's a data lake. A data lakehouse is a data lake where your objects are wrapped in a "table format" like Iceberg that allows you to query (update, etc.) them as if they were stored in a traditional data warehouse.
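To make the "table format" layer concrete, here is a rough PySpark sketch of running warehouse-style DML against an Iceberg table that lives in object storage. The catalog name, warehouse path, and table are invented, and it assumes the Iceberg Spark runtime package is on the classpath:

    # Rough sketch of "data lake + table format = lakehouse": the data is
    # still Parquet files in object storage, but the Iceberg metadata layer
    # lets the engine run transactional, warehouse-style DML on top of them.
    # Catalog/warehouse/table names are invented; requires the
    # iceberg-spark-runtime package.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("iceberg-demo")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
        .getOrCreate()
    )

    # Plain files would only support "rewrite everything"; the table format
    # gives row-level UPDATE/DELETE/MERGE with snapshot isolation.
    spark.sql("UPDATE lake.db.events SET status = 'closed' WHERE ts < '2023-01-01'")
    spark.sql("SELECT count(*) FROM lake.db.events WHERE status = 'closed'").show()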
Why would you be charged for movement in this case, I thought intra datacenter traffic was free? Or you mean you get charged to update/query the object store?
I guess a setup such as HDFS, where storage and compute are colocated and not disaggregated. But that also offers similar transactional semantics to lakehouses.
Vendor published benchmarks are worthless.
I can't paste images here, but IMO this table comparing the 3 formats is the big takeaway: https://assets-global.website-files.com/6064b31ff49a2d31e049... (explained inline; we do cite Onehouse heavily, but we are independent of them)
We have recently added support for Hudi and Delta Lake; you can check here: https://clickhouse.com/docs/en/engines/table-engines/integra...
It is a read-only implementation: ClickHouse can read and process the external data in the Hudi or Delta Lake format.
Apache Iceberg is pending. There is no good C++ library for it, but at least the overall structure is simple, so it is not hard to implement.
The overall principle: whatever the data format, ClickHouse should support it in a fast, stable, and ALWAYS CORRECT manner.
If you have more experience to share, please do it.
(VP Support at ClickHouse)
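To make that concrete, here is a rough Python sketch of querying a Delta Lake path from ClickHouse via the clickhouse-connect client. The bucket, credentials, and column names are invented, and the `deltaLake` table function name and signature should be checked against the docs linked above for your ClickHouse version:

    # Rough sketch of reading a Delta Lake table from ClickHouse with the
    # clickhouse-connect client. Bucket, credentials, and column names are
    # invented; verify the table-function name and signature against the
    # docs linked above for the ClickHouse version you run.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost", port=8123)

    query = """
    SELECT event_id, count() AS n
    FROM deltaLake('https://my-bucket.s3.amazonaws.com/lake/events/',
                   'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
    GROUP BY event_id
    ORDER BY n DESC
    LIMIT 10
    """
    result = client.query(query)
    print(result.result_rows)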
The root of the problem is using object storage improperly.
Object stores are a terribly inefficient way to access data and to store changes to it.