Hey everyone, I'm a software engineer at Eventual, the team behind Daft! Huge thanks to the op for the benchmark, we're a huge fan of your blog posts and this gave us some really useful insights. For context, Daft is a high-performance data processing engine for AI workloads that works both on single-node and distributed setups.
We're actively looking into the results of the benchmark and hope to share some of our findings soon. From initial results, we found a lot of potential optimizations we could make to our deltalake reader to improve parallelism and our groupby operator to improve pipelining for count aggregations. We're hoping to roll our these improvements over the next couple of releases.
This isn’t true anymore we are way beyond 2014 Hadoop (what the blog post is about) at this point.
Go try doing an aggregation of 650gb of json data using normal CLI tools vs duckdb or clickhouse. These tools are pipelining and parallelizing in a way that isn’t easy to do with just GNU Parallel (trust me, I’ve tried).
What if it was 650TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DBD would make a difference to the overall architecture. You need a data catalog, and you need a clusters of jobs at scale, regardless of a data format library, or libraries.
1. Assume date is 8 bytes
2. Assume 64bit counters
So for each date in the dataset we need 16 bytes to accumulate the result.
That's ~180 years worth of daily post counts per gb ram - but the dataset in the post was just 1 year.
This problem should be mostly network limited in the OP's context, decompressing snappy compressed parquet should be circa 1gb/sec. The "work" of parsing a string to a date and accumulating isn't expensive compared to snappy decompression.
I don't have a handle on the 33% longer runtime difference between duckdb and polars here.
I think the entire point of the article (reading forward a bit through the linked redshift files posts) is that almost nobody in the world uses datasets bigger than 100Tb, that when they do, they use a small subset anyway, and that 650Gb is a pretty reasonable approximation of the entire dataset most companies are even working with. Certainly in my experience as a data engineer, they're not often in the many terabytes. It's good to know that OOTB duckdb can replace snowflake et all in these situations, especially with how expensive they are.
I often crunch 'biggish data' on a single node using duckdb (because I love using the modern style of painless and efficient SQL engines).
I don't use delta or iceberg (because I haven't needed to; I'm describing what I do, not what you can do :)), but rather just iterate over the underlying parquet files using filename listing or wildcarding. I often run queries on BigQuery and suck down the results to a bunch of ~1GB local parquet files - way bigger than RAM - that I can then mine in duckdb using wildcarding. Works great!
I'm in a world where I get into the weeds of 'this kind of aggregation works much faster on Bigquery than duckdb, or vice versa, so I'll split my job into this part of sql running on Bigquery then feeding into this part running in duckdb'. It's the fun end of data engineering.
Honestly this benchmark feels completely dominated by the instance's NIC capacity.
They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
Absolutely. I recently reworked a bunch of tests and found my desktop to outcompete our (larger, custom) Github Action runner by roughly 5x. And I expect this delta to increase a lot as you lean on the local I/O harder.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a timescaledb PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
The query being tested wouldn't scan the full files and in reality the query in most sane engines would be processing much less than 650GB of data (exploiting S3 byte-range reads): i.e. just 1 column: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about the distribution of file size, due to API calls + skew; or if the query is totally different to the common query access patterns that skips the metadata/columnar nature of the underlying parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
Yep I think the value of the experiment is not clear.
You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3. CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory for a large dataset with multiple stages, and use cluster-internal network bandwidth to do shuffling instead of storage.
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
10Gbps only? At Google where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".
The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
c5 is such a bad instance type, m6a would be so much better and even cheaper,
I would love to see this on an m8a.2xlarge (7th and 8th generations don’t use SMT) and that is even cheaper and has up to 15 Gbps
Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.
In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.
1. tested column pruning and the dataset you access would have been 2 columns + metadata for the parquet files so probably fit in memory even without streaming.
2. Most of the processing time would be IO bound on S3 and the access patterns/simultaneous connection limits etc. would have more of an impact than any processing code.
Love that you went through the pain of trying the different systems but I'd like to see an actual larger than memory query.
1. Important points that the query is a projection that only returns a fraction of the 650GB that fits in memory. DuckDB is good at streaming larger than memory queries, Polars less mature there. That would show in the results.
2. S3 defaults shouldn't prevent all available threads/cpus from reading the files in parallel, so I would assume that the network bandwidth of the VM (or container) would be the bottleneck.
> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing 100s of ~5GB files of JSON data which turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
I used pySpark some time ago when it was introduced to my company at the time and I realized that it was slow when you used python libraries in the UDFs rather than pySpark's own functions.
I think Spark was the best tool out there when data engineering started taking off, and it just works (provided you don't have to deal with jar dependency hell) so there's not a huge incentive to move away from it.
I had to do something like this for a few TB of json recently. The unique thing about this workload was it was a ton of small 10-20mb files.
I found that clickhouse was the fastest, but duckdb was the simplest to work with it usually just works. DuckDB was close enough to the max performance from clickhouse.
I tried flink & pyspark but they were way slower (like 3-5x) than clickhouse and the code was kind of annoying. Dask and Ray were also way too slow, but dask’s parallelism was easy to code but it was just too slow. I also tried Datafusion and polars but clickhouse ended up being faster.
These days I would recommend starting with DuckDB or Clickhouse for most workloads just cause it’s the easiest to work with AND has good performance. Personally I switched to using DuckDB instead of polars for most things where pandas is too slow.
We're actively looking into the results of the benchmark and hope to share some of our findings soon. From initial results, we found a lot of potential optimizations we could make to our deltalake reader to improve parallelism and our groupby operator to improve pipelining for count aggregations. We're hoping to roll our these improvements over the next couple of releases.
If you're interested to learn more about our findings, check out our GitHub (https://github.com/Eventual-Inc/Daft) or follow us on Twitter (https://x.com/daftengine) and LinkedIn (https://www.linkedin.com/showcase/daftengine) for updates. Also if Daft sounds interesting to you, give us a try via pip install daft!
That would be the best way to smoothly test it out and transition workloads from other engines for codebases in my teams.
Here's an oldie on the topic: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Go try doing an aggregation of 650gb of json data using normal CLI tools vs duckdb or clickhouse. These tools are pipelining and parallelizing in a way that isn’t easy to do with just GNU Parallel (trust me, I’ve tried).
Deleted Comment
working memory requirements
So for each date in the dataset we need 16 bytes to accumulate the result.That's ~180 years worth of daily post counts per gb ram - but the dataset in the post was just 1 year.
This problem should be mostly network limited in the OP's context, decompressing snappy compressed parquet should be circa 1gb/sec. The "work" of parsing a string to a date and accumulating isn't expensive compared to snappy decompression.
I don't have a handle on the 33% longer runtime difference between duckdb and polars here.
https://www.youtube.com/watch?v=3t6L-FlfeaI
I don't use delta or iceberg (because I haven't needed to; I'm describing what I do, not what you can do :)), but rather just iterate over the underlying parquet files using filename listing or wildcarding. I often run queries on BigQuery and suck down the results to a bunch of ~1GB local parquet files - way bigger than RAM - that I can then mine in duckdb using wildcarding. Works great!
I'm in a world where I get into the weeds of 'this kind of aggregation works much faster on Bigquery than duckdb, or vice versa, so I'll split my job into this part of sql running on Bigquery then feeding into this part running in duckdb'. It's the fun end of data engineering.
They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
The query being tested wouldn't scan the full files and in reality the query in most sane engines would be processing much less than 650GB of data (exploiting S3 byte-range reads): i.e. just 1 column: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about the distribution of file size, due to API calls + skew; or if the query is totally different to the common query access patterns that skips the metadata/columnar nature of the underlying parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
That's absolutely right.
You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3. CPU memory bandwidth is 100-200GB/s for a multi-stage job. Spark is a way to pool memory for a large dataset with multiple stages, and use cluster-internal network bandwidth to do shuffling instead of storage.
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.
In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.
1. tested column pruning and the dataset you access would have been 2 columns + metadata for the parquet files so probably fit in memory even without streaming.
2. Most of the processing time would be IO bound on S3 and the access patterns/simultaneous connection limits etc. would have more of an impact than any processing code.
Love that you went through the pain of trying the different systems but I'd like to see an actual larger than memory query.
2. S3 defaults shouldn't prevent all available threads/cpus from reading the files in parallel, so I would assume that the network bandwidth of the VM (or container) would be the bottleneck.
It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing 100s of ~5GB files of JSON data which turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
Right cool for the right job, people.
Wrangling multiprocess is still annoying tho
I found that clickhouse was the fastest, but duckdb was the simplest to work with it usually just works. DuckDB was close enough to the max performance from clickhouse.
I tried flink & pyspark but they were way slower (like 3-5x) than clickhouse and the code was kind of annoying. Dask and Ray were also way too slow, but dask’s parallelism was easy to code but it was just too slow. I also tried Datafusion and polars but clickhouse ended up being faster.
These days I would recommend starting with DuckDB or Clickhouse for most workloads just cause it’s the easiest to work with AND has good performance. Personally I switched to using DuckDB instead of polars for most things where pandas is too slow.