Readit News
buremba · a year ago
Great post, but it seems like you still rely on Fabric to run Spark NEE. If you're on AWS or GCP, you should probably not ditch Spark but combine both. DuckDB's gotcha is that it can't scale horizontally (multi-node), unlike Databricks. A single node can get you quite far: you can rent 2TB of memory + 20TB of NVMe in AWS, and if you use PySpark, you can run DuckDB through its Spark integration (https://duckdb.org/docs/api/python/spark_api.html) until it no longer scales, then switch to Databricks if you need to scale out. That way, you get the best of both worlds.
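For reference, the swap-in pattern with DuckDB's experimental Spark API looks roughly like this (column names are invented, and the API is marked experimental, so details may shift):

    import pandas as pd
    # DuckDB's PySpark-compatible API: the same DataFrame code runs on
    # DuckDB single-node; swapping these imports for pyspark.sql moves
    # the identical code to a real Spark cluster.
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "c"]}))
    small = df.filter(col("id") > 1).collect()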

With its native file format, DuckDB on AWS EC2 has roughly 10x the price-performance of Databricks and Snowflake, so it's a better deal if you're not processing petabyte-level data. That's unsurprising, given that DuckDB operates on a single node (no need for distributed shuffles) and works primarily with NVMe (no use of object stores such as S3 for intermediate data). Thus, it can optimize the workloads much better than the other data warehouses.

If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon / Databricks Unity Catalog thanks to Apache Iceberg, which enables multi-engine support on the same catalog. I'm experimenting with this multi-stack idea with DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql

hamandcheese · a year ago
> A single node can get you quite far: you can rent 2TB of memory + 20TB of NVMe in AWS

What I'm a little curious about with these "single node" solutions - is redundancy not a concern with setups like this? Is it assumed that you can just rebuild your data warehouse from some form of "cold" storage if you lose your nvme data?

buremba · a year ago
Exactly. Let's say you have a data pipeline in dbt or your favourite transformation tool. All the source data will be in your cold storage, AWS S3, all the intermediate data (TEMP tables) is written to NVME, and then the final tables are created in S3 & registered in your catalog. AWS recently released S3 Tables, which aims to make the maintenance easier with Apache Iceberg as well: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
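A minimal local sketch of that layout (paths and table names invented; in production the COPY target would be an s3:// URI via the httpfs extension, and the scratch directory would sit on NVMe):

    import duckdb, os, tempfile

    scratch = tempfile.mkdtemp()  # stand-in for an NVMe mount
    con = duckdb.connect(os.path.join(scratch, "scratch.duckdb"))
    con.execute(f"SET temp_directory = '{scratch}'")  # spill/TEMP data stays local

    # Intermediate data: TEMP table, lives only in local memory/disk
    con.execute("CREATE TEMP TABLE staged AS SELECT range AS user_id FROM range(100)")

    # Final table: exported as Parquet (an s3:// path in production)
    out = os.path.join(scratch, "user_counts.parquet")
    con.execute(f"""
        COPY (SELECT user_id % 10 AS bucket, count(*) AS n
              FROM staged GROUP BY bucket)
        TO '{out}' (FORMAT parquet)
    """)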

You can use spot instances if you can tolerate more latency, or prefer on-demand instances kept warm if you need low latency. Databricks (compute) and Snowflake (warehouse) do that automatically for you, at a premium price.

mrbungie · a year ago
In these setups compute is ephemeral and decoupled from storage (you would normally use an object storage offering that is redundant out of the box), so the 2TB is for working memory + OS + whatever, and the 20TB NVMe is purely for local spilling and a local cache so you can save on storage reads.

If a node fails while running a process (e.g. for an external reason unrelated to your own code or data, like your spot EC2 instance terminating due to high demand), you just run it again. When you're done running your processes, the processing node is normally terminated completely.

tl;dr: you treat them like cattle with a very short lifecycle around data processes; the specifics of resource/process scheduling depend on your data needs.

hipadev23 · a year ago
Or just use Clickhouse.
buremba · a year ago
Clickhouse has excellent performance and it does scale out. Unfortunately, it's hard to deploy and maintain a Clickhouse cluster, and its catalog features are pretty basic compared to Databricks and Snowflake.
brunoqc · a year ago
Clickhouse is open core. Which is a dealbreaker for me.
riku_iki · a year ago
in my testing, clickhouse dies on various queries with OOM on large datasets, because its algorithms are inefficient.
kianN · a year ago
I went through this trade-off at my last job. I started off migrating my ad hoc queries to duckdb directly from Delta tables. Over time, I used duckdb enough to do some performance tuning. I found that migrating from Delta to duckdb's native file format provided substantial speed wins.

The author focuses on read/write performance on Delta (makes sense for the scope of the comparison). I think if an engineer is considering switching from Spark to duckdb/polars for their data warehouse, they would likely be open to data formats other than Delta, which is tightly coupled to Spark (and even more so to the closed-source Databricks implementation). In my use case, we saw enough speed wins and cost savings that it made sense to fully migrate our data warehouse to a self-managed duckdb warehouse using duckdb's native file format.
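The migration itself is mundane; a toy sketch (plain Parquet stands in for Delta so the example is self-contained, and all paths are invented; `delta_scan` from DuckDB's delta extension would play the same role for a real Delta table):

    import duckdb, os, tempfile

    tmp = tempfile.mkdtemp()
    src = os.path.join(tmp, "events.parquet")
    # Stand-in source data (this would be the existing lake table)
    duckdb.execute(f"COPY (SELECT range AS id FROM range(1000)) TO '{src}' (FORMAT parquet)")

    # Materialize into DuckDB's native file format
    con = duckdb.connect(os.path.join(tmp, "warehouse.duckdb"))
    con.execute(f"CREATE TABLE events AS SELECT * FROM read_parquet('{src}')")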

thenaturalist · a year ago
Thanks for sharing, very interesting!

I'm thinking the same wrt dropping Parquet.

I don't need concurrent writes, which seems to me about the only true caveat DuckDB would have.

Two other questions I am asking myself:

1) Is there an upper file size limit in duckdb's native format where performance might degrade?

2) Are there significant performance degradations or hard limits if I want to consolidate from X DuckDBs into a single one by programmatically attaching them all and pulling data in via a large `UNION ALL` query?

Or would you use something like Polars to query over N DuckDB's?
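For context, the consolidation I have in mind is roughly this (file and table names hypothetical):

    import duckdb, os, tempfile

    tmp = tempfile.mkdtemp()
    paths = []
    for i in range(3):  # build three toy shard databases
        p = os.path.join(tmp, f"shard_{i}.duckdb")
        c = duckdb.connect(p)
        c.execute(f"CREATE TABLE events AS SELECT {i} AS shard, range AS id FROM range(5)")
        c.close()
        paths.append(p)

    # Attach every shard read-only and pull rows into one table
    con = duckdb.connect(os.path.join(tmp, "combined.duckdb"))
    for i, p in enumerate(paths):
        con.execute(f"ATTACH '{p}' AS shard_{i} (READ_ONLY)")
    union_sql = " UNION ALL ".join(f"SELECT * FROM shard_{i}.events" for i in range(3))
    con.execute(f"CREATE TABLE all_events AS {union_sql}")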

kianN · a year ago
I can offer my experience with respect to your two questions, though my use case is likely atypical.

1) I haven't personally run into upper size limits to the point of non-linear performance degradation. However, some caveats to that are (a) most of my files are in the range of 2-10 GB with a few topping out near 100 GB, and (b) I am running a single r6gd.metal as the primary interface, which has 512 GB of RAM. So, essentially, any one of my files can fit into RAM.

Even given that setup, I will mention that I find myself hand-tuning queries a lot more than I did with Spark. Since duckdb is meant to be more lightweight, its query optimizer is less robust.

2) I can't speak too much to this use case. I haven't had any occasion to query across duckdb files. However, I experimented on top of Delta Lake between duckdb and polars and never really found a true performance case for polars in my (again, atypical) set of tests. But definitely worth doing your own benchmarking on your specific use case :)

benjaminwootton · a year ago
To state the obvious, Delta is an open standards format which should be widely supported.

Databricks have also bought into Iceberg and will probably lead with that or unify the two in future.

kianN · a year ago
There is an open-source Delta library that is very good. This is not the same as Databricks' implementation, and there are at times compatibility issues. For example, by default, if you write a Delta table using Databricks' DBR runtime, that table is not readable by the open-source Delta library, due to the "deletion vectors" optimization that is only accessible within Databricks.

That aside, I was more pointing out that Delta, particularly via a commercial offering, is a data format biased towards Spark in terms of performance, since it is developed primarily by Databricks as part of the Spark ecosystem. If you plan to use Delta regardless of your compute engine, it makes perfect sense as a benchmark. However, in certain circumstances, the performance wins can be (and in my case were) worth switching data formats.

RobinL · a year ago
Interesting. So what does that look like on disk? Possibly slightly naively I'm imagining a single massive file?
serjester · a year ago
Polars is much more useful if you’re doing complex transformations instead of basic ETL.

Something underappreciated about polars is how easy it is to build a plugin. I recently took a Rust crate that reimplements the H3 geospatial coordinate system, exposed it as a polars plugin, and achieved performance 5x faster than the DuckDB version.

Knowing zero Rust, and with some help from AI, it only took me 2-ish days - I can't imagine doing this in C++ (DuckDB).

[1] https://github.com/Filimoa/polars-h3

datadeft · a year ago
Polars is a life saver. We used it in a fairly complex project and it worked very well.
Ostatnigrosh · a year ago
Neat! Really curious how you managed to outperform DuckDB 5x. Do you see yourself maintaining this long term? would love to use polars as my one stop shop + plugins rather than pulling in additional tooling.
moandcompany · a year ago
My opinion: the high prevalence of implementations using Spark, Pandas, etc. is mostly driven by (1) people's tendency to work with tools whose APIs they already know, (2) resume-driven development, and/or, to a much lesser degree, (3) sustainability with regard to future maintainers, rather than what may be technically sensible with regard to performance. A decade ago we saw similar articles referencing misapplications of Hadoop/MapReduce, and today it is Spark as its successor.

Pandas' use of the dataframe concepts and APIs were informed by R and a desire to provide something familiar and accessible to R users (i.e. ease of user adoption).

Likewise, when the Spark development community somewhere around the version 0.11 days began implementing the dataframe abstraction over its original native RDD abstractions, it understood the need to provide a robust Python API similar to the Pandas APIs for accessibility (i.e. ease of user adoption).

At some point those familiar APIs also became a burden, or were not-great to begin with, in several ways and we see new tools emerge like DuckDB and Polars.

However, we now have a non-unique issue where people are learning and applying specific tools versus general problem-solving skills and tradecraft in the related domain (i.e. the common pattern of people with hammers seeing everything as nails). Note all of the "learn these -n- tools/packages to become a great ____ engineer and make xyz dollars" type tutorials and starter-packs on the internet today.

esafak · a year ago
You can use Fugue as a translation layer. https://github.com/fugue-project/fugue
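A minimal sketch of the idea (function and column names made up): `fugue.transform` takes a plain Python function plus an engine, so the same logic runs on pandas locally or on Spark/DuckDB by changing the `engine` argument.

    import pandas as pd
    from fugue import transform

    def add_one(df: pd.DataFrame) -> pd.DataFrame:
        # ordinary pandas logic, no engine-specific API
        return df.assign(x=df["x"] + 1)

    # engine=None executes natively on pandas; engine="spark" or a DuckDB
    # connection would run the same function on those backends instead.
    out = transform(pd.DataFrame({"x": [1, 2]}), add_one, schema="*", engine=None)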
threeseed · a year ago
It's always easy to ignorantly criticise technology choices.

But from my experience, in almost all cases it is misguided requirements (e.g. "we want to support 100x the data in 5 years") that drive choices which look bad in hindsight. Not resume-driven development.

And at least in enterprise space having a vendor who can support the technology is just as important as the merits of the technology itself. And vendors tend to spring up from popular, trendy technologies.

bdcravens · a year ago
It's not binary. Valid uses of a technology don't mean there aren't others using that same technology in a resume-driven manner.
pnut · a year ago
Maybe this says more about the company I work for, but it is just incomprehensible to me to casually contemplate dumping a generally comparable, installed production capability.

All I think when I read this is, standing up new environments, observability, dev/QA training, change control, data migration, mitigating risks to business continuity, integrating with data sources and sinks, and on and on...

I've got enough headaches already without another one of those projects.

RobinL · a year ago
I submitted this because I thought it was a good, high effort post, but I must admit I was surprised by the conclusion. In my experience, admittedly on different workloads, duckdb is both faster and easier to use than spark, and requires significantly less tuning and less complex infrastructure. I've been trying to transition as much as possible over to duckdb.

There are also some interesting points in the following podcast about ease of use and transactional capabilities of duckdb which are easy to overlook (you can skip the first 10 mins): https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb

Of course, if you have truly massive data, you probably still need spark

diroussel · a year ago
Thanks, I'll give that a listen. Here is the Apple Podcast link to the same episode: https://podcasts.apple.com/gb/podcast/the-joe-reis-show/id16...

I've also experimented with duckdb whilst on a Databricks project, and did also think "we could do this whole thing with duckdb and a large EC2 instance spun up for a few hours a week".

But of course duckdb was new then, and you can't re-architect on a hunch. Thanks for the article.

IshKebab · a year ago
This guy says "I live and breathe Spark"... I would take the conclusions with a grain of salt.
mwc360 · a year ago
Author of the blog here: fair point. Pretty much every published benchmark has an agenda that ultimately skews the conclusion. I did my best here to be impartial, i.e. I fully designed the benchmark and each test prior to running code on any engine, to mimic typical ELT demands without the opportunity to optimize for Spark just because I know it well.
thenaturalist · a year ago
Do you still use file formats at all in your work?

I'm currently thinking of ditching Parquet altogether and going all-in on DuckDB files.

I don't need concurrent writes, my data would rarely exceed 1TB and if it were, I could still offload to Parquet.

Conceptually I can't see a reason for this not working, but given the novelty of the tech I'm wondering if it'll hold up.

RobinL · a year ago
We're still using parquet. We use the native duckdb format for intermediate processing (during pipeline execution), but the end results are saved out as parquet. This is partly because customers often read the data from other tools (e.g. AWS Athena).

I'd be interested in hearing about experiences of using duckdb files though; I can see instances where it could be useful to us.

rapatel0 · a year ago
The author disparages ibis, but I really think that this is short-sighted. Ibis does a great job of mixing SQL with dataframes to perform complex queries; it abstracts away a lot of the underlying logic and allows for query optimization.

Example:

    df = (
        df.mutate(new_column=df.old_column.dosomething())
          .alias('temp_table')
          .sql('SELECT db_only_function(new_column) AS newer_column FROM temp_table')
          .mutate(other_new_column=_.newer_column.do_other_stuff())
    )

It's super flexible, and duckdb makes it very performant. The general vice I run into is creating overly complex transforms, but otherwise it's super useful and really easy to mix dataframes and SQL. Finally, it supports pretty much every backend, including pyspark and polars.
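A runnable variant of the same shape, with real functions substituted for the placeholders (DuckDB backend; `sqrt` stands in for the db-only function, and the column names are still invented):

    import pandas as pd
    import ibis
    from ibis import _  # the deferred expression helper

    con = ibis.duckdb.connect()  # in-memory DuckDB backend
    t = con.create_table("t", pd.DataFrame({"old_column": [1.0, 4.0, 9.0]}))
    expr = (
        t.mutate(new_column=_.old_column * 2)
         .alias("temp_table")
         .sql("SELECT sqrt(new_column) AS newer_column FROM temp_table")
         .mutate(other_new_column=_.newer_column + 1)
    )
    df = expr.to_pandas()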

memhole · a year ago
Nice write-up. I don't think the comments about duckdb spilling to disk are correct: I believe if you create a temp or persistent db, duckdb will spill to disk.

I might have missed it, but the integration of duckdb and the arrow library makes mixing and matching dataframes and sql syntax fairly seamless.

I’m convinced the simplicity of duckdb is worth a performance penalty compared to spark for most workloads. Ime, people struggle with fully utilizing spark.

marshmellman · a year ago
About spilling to disk, in DuckDB’s docs I see:

> Both persistent and in-memory databases use spilling to disk to facilitate larger-than-memory workloads (i.e., out-of-core-processing).

I don’t have personal experience with it though.

https://duckdb.org/docs/connect/overview.html
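For what it's worth, the knobs involved look like this (values arbitrary); with an in-memory database you do need to point `temp_directory` somewhere for spilling to kick in:

    import duckdb, tempfile

    spill_dir = tempfile.mkdtemp()
    con = duckdb.connect()  # in-memory database
    con.execute(f"SET temp_directory = '{spill_dir}'")  # where spill files go
    con.execute("SET memory_limit = '100MB'")  # operators exceeding this spill to disk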

mwc360 · a year ago
Miles Cole here… thx for the correction, another reader just noted this as well. I'll get this corrected tomorrow and possibly retest after verifying I have spill set up. Thx!
m_ke · a year ago
Yeah, I'd never go for Spark again; all of its use cases are better handled with either DuckDB or Ray (or a combination of both).
anonymousDan · a year ago
I thought Ray was a reinforcement learning platform? Can you elaborate on how it is a replacement for Spark?