Readit News
jamesblonde · 6 months ago
We are seeing more and more specialized query engines. This is a query engine specialized for training pipelines. It is not general purpose - it is for serving batches of training data to workers. It uses Ray for parallelization. The kinds of operations you need are random reads (to implement shuffling across epochs), Arrow support (zero-copy to Pandas DataFrames), and efficient checkpointing.
nyrikki · 6 months ago
Some of what they are doing is simply what was lost due to the ubiquitous nature of the relational model.

The hierarchical model is applicable to many problems, and it's actually part of why moving off mainframes is challenging: IMS is so much more efficient than the relational model for applications like airline ticketing.

There have been several efforts I'm aware of to leverage object stores the way they did, but it was a hard sell.

The hierarchical model really only works for many-to-one relationships, and its integrity model differs and is not as DRY.
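A toy illustration of that many-to-one restriction, with Python dicts standing in for hierarchical segments (hypothetical data, not IMS itself):

```python
# Many-to-one fits a hierarchy naturally: each child segment lives under
# exactly one parent, so the data is stored together.
flight = {
    "flight": "XY123",
    "bookings": [  # child segments stored with their parent
        {"passenger": "Ada", "seat": "12A"},
        {"passenger": "Alan", "seat": "12B"},
    ],
}

# Reading a flight's bookings is one local traversal, no join needed:
seats = [b["seat"] for b in flight["bookings"]]

# Many-to-many (e.g. a passenger on many flights) breaks the model: you must
# either duplicate passenger records under each flight (not DRY) or fall back
# to relational-style references plus a join.
```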

There are lessons to learn here but it requires some relearning.

When you have a shopping cart, having data local to the server handling the transaction is also a benefit.

Codd's relational model has advantages, but it has held back some efforts: we are so used to dealing with the painful parts that we often don't consider other options.

HackerThemAll · 6 months ago
DuckDB is specialized in efficient storage and fast queries for analytics (OLAP), using columnar storage (in contrast to the row storage used by typical RDBMSes doing OLTP). That's nothing new; columnar stores have been around for a couple of decades already. But this "distributed" DuckDB can indeed be beneficial for training.
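The row-vs-column distinction can be sketched in plain Python (toy data; real engines add compression, vectorized execution, and on-disk layout on top of this idea):

```python
# Row store: one record per tuple; an aggregate must touch every field of
# every record, even the ones it doesn't need.
rows = [{"id": i, "region": "eu" if i % 2 else "us", "amount": i * 10}
        for i in range(1000)]

# Column store: one array per field; an aggregate reads only the column(s)
# it actually needs, which is why OLAP engines prefer this layout.
cols = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

def total_amount_rowstore(rows):
    return sum(r["amount"] for r in rows)  # scans whole records

def total_amount_colstore(cols):
    return sum(cols["amount"])  # scans a single contiguous column
```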
auxten · 6 months ago
Data operations are increasingly happening near the GPU to boost efficiency, especially for compute-heavy workflows. Arrow file processing and zero-copy queries on DataFrames are becoming crucial for modern data pipelines. I think another option worth considering is chdb, which supports these features and fits well with this shift. (author of chdb here)
agilob · 6 months ago
I'm super impressed by how much effort DeepSeek put in and how much of it they open-sourced.
orlp · 6 months ago
One thing I found peculiar is that for the GraySort benchmark it dispatches to Polars by default to do the actual sorting, not DuckDB: https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0....
tomnipotent · 6 months ago
The function argument defaults to polars, but the actual benchmark code sets duckdb by default.

https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0...

orlp · 6 months ago
I see, confusing multiple layers of defaults :)
dang · 6 months ago
Related ongoing thread:

Understanding Smallpond and 3FS - https://news.ycombinator.com/item?id=43232410

also:

DuckDB goes distributed? DeepSeek's smallpond takes on Big Data - https://news.ycombinator.com/item?id=43206964 (no comments there, but some people have been recommending that article)

rubenvanwyk · 6 months ago
May data engineering content keep on hitting the HN front page!
HackerThemAll · 6 months ago
DuckDB itself is cool enough, especially when combined with SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!
dcreater · 6 months ago
How is duckdb combined with SQLite? Aren't they alternatives to each other?
jitl · 6 months ago
Not sure what the poster meant, but DuckDB is an analytics DB; it doesn't have a btree index - at least not last time I looked. You could consider it the OLAP embedded DB to SQLite's OLTP embedded DB.

DuckDB can read SQLite so you can even imagine using them side by side in the same system, serving point reads and writes from SQLite and using DuckDB for stuff like aggregates and searches that SQLite is slower at.

HackerThemAll · 6 months ago
They are complementary to each other. There's an SQLite extension for use within DuckDB [1], which gives you the great transactional capabilities of SQLite and the speed of analytical queries on DuckDB's columnar storage engine, all within a single database.

[1] https://duckdb.org/docs/stable/extensions/sqlite.html
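A minimal sketch of that split using Python's stdlib sqlite3 for the transactional side (the file name and table are made up; the ATTACH line only paraphrases the extension docs linked above):

```python
import os
import sqlite3
import tempfile

# SQLite handles the transactional (OLTP) side: point writes and reads.
db_path = tempfile.mktemp(suffix=".db")  # hypothetical shared database file
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
with conn:  # one atomic transaction
    conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                     [(9.99,), (25.0,)])

# DuckDB's sqlite extension can then open the same file for analytics,
# roughly: ATTACH 'orders.db' (TYPE sqlite); SELECT sum(amount) FROM orders;
# Here we just run the aggregate in SQLite to show the shared data:
total = conn.execute("SELECT sum(amount) FROM orders").fetchone()[0]
conn.close()
```

The point is that both engines see one database file: SQLite owns the writes, DuckDB runs the heavy scans.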

dcreater · 6 months ago
I'm confused by the example in the repo. What is the use case for this? Is it a replacement for Dask, Ray, etc.? (Not a professional SWE)
fastasucan · 6 months ago
What does this do, and what is the benefit over DuckDB, Polars, etc.?
articsputnik · 6 months ago
Mehdi just wrote about this. It's mainly DAG parallelism using Ray Core plus their filesystem, 3FS. See https://mehdio.substack.com/p/duckdb-goes-distributed-deepse....
mritchie712 · 6 months ago
I don't think you get any real benefits over DuckDB unless your data is 10 TB+ or you spin up 3FS (which seems challenging).
ilove196884 · 6 months ago
Any benchmark and comparisons?