gunapologist99 · a year ago
> DuckDB Labs, the company that employs DuckDB’s core contributors, has not had any outside investments, and as a result, the company is fully owned by the team. Labs’ business model is to provide consulting and support services for DuckDB, and we’re happy to report that this is going well. With the revenue from contracts, we fund long-term and strategic DuckDB development with a team of almost 20 people. At the same time, the intellectual property in the project is guarded by the independent DuckDB Foundation. This non-profit foundation ensures that DuckDB will be around long-term under the MIT license.

This seems like an excellent structure for long-term protection of the open source project. What other projects have taken this approach?

mritchie712 · a year ago
I thought this exact thing after reading the post. I can't imagine a more practical structure:

DuckDB Labs: The core contributors. Instead of developing features that will be behind a paywall, they provide support and consulting. Awesome.

DuckDB Foundation: A non-profit that ensures DuckDB remains MIT licensed. Perfect.

We actually just published a post on how to replace your warehouse with DuckDB. It's certainly not a good move for every company using something like Snowflake, but it was the right move for us.

https://www.definite.app/blog/duckdb-datawarehouse

geokon · a year ago
How do you ensure shares don't get diluted as people leave and new ones get hired?

I remember this was an issue at an ESOP I worked for. They had to pressure former employees to sell their shares back. If you have too many shareholders in the US, your corporate status changes.

Granted, this is a nonprofit, so maybe the rules are different, but the fundamental problem remains.

gunapologist99 · a year ago
A 10% option pool, set aside at the beginning, should help resolve this, but employees also need to be educated that dilution is just part of life in a startup.
nomilk · a year ago
Do any data scientists here use duckdb daily? Keen to hear your experiences and comparisons to other tools you used before it.

I love tools that make life simpler. I've been toying with the idea of storing 1TB of data in S3 and querying it using duckdb on an EC2. That's really old/boring infrastructure, but is hugely appealing to me, since it's so much simpler than what I currently use.

Would love to hear of others' experiences with duckdb.

mritchie712 · a year ago
We just wrote a post[0] very similar to what you're thinking. Let me know if you have any questions.

0 - https://www.definite.app/blog/duckdb-datawarehouse

cbrozefsky · a year ago
Storing as parquet, and using hive path partitioning, you can get passable batch performance, assuming your queries are not mostly scans and aggregates across large portions of the data. On the order of ten seconds for regex matching on columns, for example.

I thought it was a great way to give analysts direct access to data to do ad hoc queries while they were getting familiar with the data.
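
Rough sketch of what that looks like (paths and column names are made up):

    import duckdb

    # Hypothetical layout: data/events/year=2024/month=05/part-0.parquet, ...
    con = duckdb.connect()

    # hive_partitioning=true turns the year=/month= path segments into columns,
    # so filtering on them prunes whole directories instead of scanning every file.
    con.sql("""
        SELECT user_id, count(*) AS checkout_hits
        FROM read_parquet('data/events/*/*/*.parquet', hive_partitioning = true)
        WHERE year = 2024 AND month = 5
          AND regexp_matches(url, 'checkout')
        GROUP BY user_id
    """).show()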

focused_mestorf · a year ago
We use it for simple use cases and experimentation, especially because it works so well with various data formats and polars. Personally, I prefer to maintain proper SQL queries against duckdb, including re-usable views, and then use polars for the remaining fiddling/exploration.

In combination with sqlalchemy, it is trivial to lift an app to other OLAP systems.
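
Rough sketch of that split, with invented file and column names (the SQLAlchemy part is omitted; it just needs the duckdb dialect from the third-party duckdb-engine package):

    import duckdb
    import polars as pl

    con = duckdb.connect("analytics.duckdb")

    # Re-usable view, kept in the .duckdb file alongside the data.
    con.sql("""
        CREATE OR REPLACE VIEW daily_orders AS
        SELECT order_date, count(*) AS n_orders, sum(amount) AS revenue
        FROM read_csv_auto('orders.csv')
        GROUP BY order_date
    """)

    # .pl() hands the result over as a polars DataFrame for the remaining fiddling.
    daily = con.sql("SELECT * FROM daily_orders ORDER BY order_date").pl()
    print(daily.filter(pl.col("revenue") > 1000))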

gizzlon · a year ago
Wouldn't that be pretty slow? :) Why not store it on the EC2 instance directly? Or locally on your computer?

1 TB is... not a lot of data. I rent a server with 15 TB for ~$50 per month, and a new 2 TB disk costs less than $100.

szarnyasg · a year ago
DuckDB supports partial reading of Parquet files (also via HTTPS and S3) [1], so it can limit the scans to the required columns in the Parquet file. It can also perform filter pushdown, so querying data in S3 can be quite efficient.
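
For the 1 TB-in-S3 idea above, it looks roughly like this (bucket and columns are made up; credential setup is omitted):

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs")  # provides the s3:// and https:// readers
    con.sql("LOAD httpfs")

    # Only the referenced columns (ts, status) are fetched, and the WHERE
    # filter is pushed down to the Parquet row-group metadata, so most of
    # the data never leaves S3.
    con.sql("""
        SELECT date_trunc('day', ts) AS day, count(*) AS errors
        FROM read_parquet('s3://my-bucket/logs/*.parquet')
        WHERE status >= 500
        GROUP BY day
        ORDER BY day
    """).show()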

Disclaimer: I work at DuckDB Labs.

[1] https://duckdb.org/docs/data/parquet/overview#partial-readin...

nomilk · a year ago
> Why not store it on the EC2 instance directly?

Since the data in S3 receives updates every few hours, querying it directly ensures queries run on up-to-date data, whereas keeping a copy of the dataset on the EC2 would mean periodically checking for updates and moving the deltas across (not insurmountable, but complexity worth avoiding if possible).

vgt · a year ago
You should give us at MotherDuck a try for this use case.

(Co-founder)

HermitX · a year ago
I suggest you try using StarRocks to query the data lake directly. I know many large companies, including Tencent and Pinterest, are doing this. StarRocks has a truly mature vectorized execution engine and a robust distributed architecture. It can provide you with impressive query performance.
mgt19937 · a year ago
One cool feature of duckdb is that you can directly run sql against a pandas dataframe/arrow table.[1] The seamless integration is amazing.

[1]: https://duckdb.org/docs/api/python/overview.html#dataframes
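
For anyone who hasn't tried it, it really is just this (toy DataFrame):

    import duckdb
    import pandas as pd

    trips = pd.DataFrame({
        "city": ["NYC", "NYC", "SF"],
        "fare": [12.5, 7.0, 21.0],
    })

    # DuckDB resolves the table name against DataFrames in the calling scope
    # (a "replacement scan"), so `trips` is queryable as if it were a table.
    duckdb.sql("SELECT city, avg(fare) AS avg_fare FROM trips GROUP BY city").show()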

losvedir · a year ago
Congrats to the team! I feel like I see lots of posts here on HN and go "wow, I didn't know DuckDB could do that". It seems like a very powerful tool, which I haven't had the pleasure of using yet.

Due to policies at work it's unlikely we would use this in production, but as I understand it, it's still pretty useful for exploring and poking around local data. Is that right? Does anyone have examples of problems they've used it on to digest local files or logs or something?

bufferoverflow · a year ago
Have they fixed the incredibly slow queries on indexed columns?

https://www.lukas-barth.net/blog/sqlite-duckdb-benchmark/

1egg0myegg0 · a year ago
Howdy! Thanks for your benchmarking!

Your blog does a great job contrasting the two use cases. I don't think too much has changed on your main use case, but here are a few ideas to test out!

DuckDB can read SQLite files now! So if you like DuckDB syntax or query optimization, but want to use the SQLite format / indexes, that may work well.

Since DuckDB is columnar (and compressed), it frequently needs to read a big chunk of rows (~100K) just to get 1 row out and decompressed. Mind trying to store your data uncompressed? Might help in your case! (PRAGMA force_compression='uncompressed')
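
Roughly, both ideas look like this (file and table names are invented):

    import duckdb

    con = duckdb.connect("local.duckdb")

    # Idea 1: query the existing SQLite file in place via the sqlite extension.
    con.sql("INSTALL sqlite")
    con.sql("LOAD sqlite")
    con.sql("ATTACH 'measurements.sqlite' AS src (TYPE sqlite)")
    con.sql("SELECT * FROM src.readings WHERE sensor_id = 42").show()

    # Idea 2: keep the DuckDB copy uncompressed, so a point lookup doesn't pay
    # for decompressing a whole row group; set this before the data is written.
    con.sql("PRAGMA force_compression='uncompressed'")
    con.sql("CREATE TABLE readings AS SELECT * FROM src.readings")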

Links: https://duckdb.org/docs/extensions/sqlite

jiehong · a year ago
> The differences between DuckDB and SQLite are just always so large that plotting them together completely hides all other detail.

(From your blog post.) When values span orders of magnitude, that’s when log plots are useful.

jiehong · a year ago
They say they have lots of benchmarks running for DuckDB, so it might be a good idea to add a similar benchmark to that suite so this case gets tracked over time?

By the way, I like your blog's style! Even the html of an article is clean and readable.

aranw · a year ago
I've been wanting to explore using DuckDB for in-process aggregation and windowing in stream processing with Golang, as I think it would be a great solution.

Curious if anyone else is using DuckDB for something similar? Does anyone have an example?

netcraft · a year ago
I haven't had a chance to really use it yet, but I know duckdb is in my future. Being able to connect it to all the different data sources to run analytical queries is a big draw, plus the support for parquet.
amath · a year ago
This seems like a good model for sustaining open source, but raises some questions.

Does anybody know how the DuckDB foundation works? The sponsors are MotherDuck, Voltron, and Posit, which are heavily venture-funded. Do DuckDB Labs employees work on sponsored projects for the foundation?

I am also curious if anyone can shed light on what kind of contract work DuckDB Labs does and how it keeps that work aligned with the open source project. This has always seemed like the holy grail, but it is difficult to do in practice.