Readit News logoReadit News
archiewood · a year ago
DuckDB is really having a moment

The ecosystem is very active, and they have recently opened up "community extensions" to bring your own functions, data types and connections. A barrier at the moment is that extensions are written in C++, though this limitation should be removed soon.

I've been building a lot on top of DuckDB, two of the projects I'm working on are linked in the article:

- Evidence (https://evidence.dev): Build data apps with SQL + Markdown

- DuckDB GSheets (https://duckdb-gsheets.com): Read/Write Google Sheets via DuckDB

thehours · a year ago
I'm using DuckDB for the first time for this year's Advent of Code and it's been a delightful experience so far. I was looking for something simple to set up and had more advanced functionality than what SQLite supports.
spratzt · a year ago
Extraordinary!

I can’t think of any of the advent of code questions this year where a database would have been of any use.

Do tell us more.

dilawar · a year ago
Duckdb is really great. DX is awesome.

I cloned the CVE repository and ingested it into duckdb without worrying about table structures. Build a fastapi wrapper to query CVEs like SQL. A little bit of caching and it was pretty fast. Was done in a few hours.

Shame that their Rust crate wasn't as performant as the python module.

drewbitt · a year ago
I really enjoy using DuckDB for data visualization. It’s so easy to work with that it’s even come in handy during a few Twitter arguments—where I might have given up before, I was able to back myself up with data in just a few minutes. That's not even evidence.dev, just built-ins. https://duckdb.org/docs/sql/functions/char.html#barx-min-max.... We also use it at work for working with Parquet. Huge fan
hipjiveguy · a year ago
this sound interesting - do you have any more samples on how you win twitter arguments with this?
tmpz22 · a year ago
Are the testimonials on your website real or fake? Apple, Flexport, BMW, etc are using evidence.dev?
ironchef · a year ago
Lots of folks out there use evidence.dev. It's a simple way to get some BI up and running without needing to deal with licensing / corporate IT, etc.

Remember all it takes is 1 employee to put that claim up there (although I do like evidence.dev).

Scubabear68 · a year ago
I got into DuckDb thanks to HN comments like these in general, it really shines for doing "desktop" analysis that goes beyond the usual Excel/Numbers shenanigans.
tlavoie · a year ago
The Developer Voices podcast episode with a couple authors of a DuckDB book was quite good as well: https://www.youtube.com/watch?v=_nA3uDx1rlg
LeftHandPath · a year ago
Wow. Evidence is the sort of thing I've been dreaming of for a long time. Awesome work!
Jtsummers · a year ago
https://pragprog.com/titles/pwrdata/seven-databases-in-seven... - A book by the same name. Instead of giving you a brief blurb on each database, the authors attempt to give you more context and exercises with them. Last updated in 2018 it covers PostgreSQL, HBase, MongoDB, CouchDB, Neo4J, DynamoDB, and Redis. The first edition covered Riak instead of DynamDB.
nindalf · a year ago
The only one in common in both lists is Postgres. Insane longevity.
alex7o · a year ago
I like to joke and say that it is the Postgres Cinematic universe
cpursley · a year ago
Yep, obligatory Postgres Is Enough listicle:

https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...

worksonmymach · a year ago
SQL Server and Oracle too but not being open source is a big minus.

Deleted Comment

steveBK123 · a year ago
Yeah and only 6 years ago, which is nothing in the span life databases historically. It's not like DBs are a new tech.

Making a decision on DB is somewhat foundational for a lot of system architecture, in that you expect to be able to use it for years. It is not like some UI framework you toss out every 18 months.

So the benefits of the new hyped thing may be outweighed by the unknowns & risk of flaming out.

This is the kind of hype cycle that gives me pause when a new hot thing like DuckDB which actually ticks a TON of boxes for me, but has attracted some of the usual suspects in my work network that I consider to be contra signals.

fuzztester · a year ago
iirc, there was a similarly named book, maybe by pragprog or by o'reilly, titled something like 7 languages in 7 weeks. I forget the name of the author.
Jtsummers · a year ago
Yes. They turned it into a series of books after that one. They are published by pragprog, which is what I linked to.

https://pragprog.com/categories/seven-in-seven/

breadwinner · a year ago
ClickHouse is awesome, but there's a newer OLAP database in town: Apache Pinot, and it is significantly better: https://pinot.apache.org/

Here's why it is better:

1. User-facing analytics vs. business analytics. Pinot was designed for user-facing analytics (meaning analytics result is used by end-user (for example, "what is the expected delivery time for this restaurant?"). The demands are much higher, including latency, freshness, concurrency and uptime.

2. Better architecture. To scale out ClickHouse uses sharding. Which means if you want to add a node you have to bring down the database, re-partition the database and reload the data, then bring it back up. Expect downtime of 1 or 2 days at least. Pinot on the other hand uses segments, which is smaller (but self-contained) pieces of data, and there are lots of segments on each node. When you add a node, Pinot just moves around segments, no downtime needed. Furthermore, for high availability ClickHouse uses replicas. Each shard needs 1 or 2 replicas for HA. Pinot does not have shards vs replica nodes. Instead each segment is replicated to 2 to 3 nodes. This is better for hardware utilization.

3. Pre-aggregation. OLAP cubes became popular in the 1990s. They pre-aggregate data to make queries significantly faster, but the downside is high storage cost. ClickHouse doesn't have the equivalent of OLAP cubes at all. Pinot has something better than OLAP cubes: Star trees. Like cubes, star trees pre-aggregate data along multiple dimensions, but don't need as much storage.

hodgesrm · a year ago
> 2. Better architecture. To scale out ClickHouse uses sharding. Which means if you want to add a node you have to bring down the database, re-partition the database and reload the data, then bring it back up.

This is not true in practice. In a properly designed implementation you can add new shards without downtime. ClickHouse compression is so efficient that most installations don't need to shard in the first place--everything fits on a single server with a few TB of storage. (We've put up to 50Tb on single servers.)

breadwinner · a year ago
If you want to scale out you have to add new nodes, and that means re-partitioning the same data (not new shards) across the new nodes. For example if you have 3 servers, each has 33% of the data, but if you add 1 more server, then each server has 25% of the data. If one server meets your scale requirements, then sure ClickHouse is a good choice.
memset · a year ago
I’d like to learn more! Clickhouse benchmarks don’t show Pinot favorably by comparison.

Also, do you have thoughts on Starrocks?

shadow28 · a year ago
> Clickhouse benchmarks don’t show Pinot favorably by comparison

Looks like they don't configure any indexes for Pinot in their benchmarks, which is one of Pinot's main selling points on the performance front - https://github.com/ClickHouse/ClickBench/issues/37.

maxloh · a year ago
Apache Pinot is written in Java, which is a garbage-collected language and also a rare choice for mainstream databases.

Any idea if that would affect its performance?

The last time I checked there are several databases written in Go, which is also garbage-collected, but never saw one in Java except Apache Derby.

gunnarmorling · a year ago
Concurrent garbage collectors of recent JVM versions (ZGC, Shenandoah) can give you sub-millisecond pause times, i.e. GC pauses are not really an issue any more for typical analytics query workloads. The price to pay is reduced throughput, but a scale-out architecture like Pinot makes it easy to make up for that by adding another node, if needed.
vram22 · a year ago
I know of at least three databases written in Java:

HSQLDB:

https://en.m.wikipedia.org/wiki/HSQLDB

H2:

https://en.m.wikipedia.org/wiki/H2_(database)

PointBase:

https://en.m.wikipedia.org/wiki/PointBase

I have briefly used either one of the first two, I think it was HSQLDB, with Java, years ago.

PointBase was created by Bruce Scott, a co-founder of Oracle, and a few others, including an architect of Sybase. This was in the early days of Java.

All three still seem to exist, according to the links above.

mble_ · a year ago
Apache Cassandra would probably be the most notable one (outside of Kafka etc).
evnix · a year ago
Whenever I see a DB written Java, I run, very fast and very far away.

I've never had good experience with these implementations. If you say, the DB isn't performing well, the only answer you get is, well you forgot to tune these 1000 knobs and good luck if your data doesn't fit in RAM.

For some reason I feel your DB in a text file, would fare just as good as these given the same memory and resources that these Apache/java DB servers demand.

LunaSea · a year ago
Yes, I'm having ElasticSearch shard failures and Out Of Memory error fever every time I have to think about Java / Scala databases.
mble_ · a year ago
Pinot is something I haven't had any personal experience with, so that's why it wasn't on the list - same with StarRocks, or Druid.

Something for me to look into next year, clearly.

mble_ · a year ago
Author here.

Thanks for sharing! My choices are pretty coloured by personal experience, and I didn't want to re-tread anything from the book (Redis/Valkey, Neo4j etc) other than Postgres - mostly due to Postgres changing _a lot_ over the years.

I had considered an OSS Dynamo-like (Cassandra, ScyllaDB, kinda), or a Calvin-like (FaunaDB), but went with FoundationDB instead because to me, that was much more interesting.

After a decade of running DBaaS at massive scale, I'm also pretty biased towards easy-to-run.

leeoniya · a year ago
was hoping to see https://github.com/GreptimeTeam/greptimedb on the list

maybe 2026, or some bonus content for 2025 :)

https://www.greptime.com/blogs/2024-09-09-report-summary

biggestlou · a year ago
As a co-author of the book of the same name, I’m disappointed that you didn’t see fit to provide any kind of attribution
mble_ · a year ago
It was not intentional. I've corrected this oversight, and attribution is now provided - my apologies.
timbotron · a year ago
I'm curious why you said you don't find MongoDB interesting?
mble_ · a year ago
I lived through the MongoDB hype cycle.

For document databases, I'm more interested in things like PoloDB and SurrealDB.

pradeepchhetri · a year ago
> If I had to only pick two databases to deal with, I’d be quite happy with just Postgres and ClickHouse - the former for OLTP, the latter for OLAP.

As the author mentioned, I completely agree with this statement. In fact, many companies like Cloudflare are built with exactly this approach and it has scaled them pretty well without the need of any third database.

> Another reason I suggest checking out ClickHouse is that it is a joy to operate - deployment, scaling, backups and so on are well documented - even down to setting the right CPU governor is covered.

Another point mentioned by author which is worth highlighting is the ease of deployment. Most distributed databases aren't so easy to run at scale, ClickHouse is much much easier and it has become even more easier with efficient storage-compute separation.

saisrirampur · a year ago
Sai from ClickHouse here. Have been living and breathing past year helping customers integrating Postgres and ClickHouse together. Totally agreed with this statement - there are numerous production grade workloads solving most of their data problems using these 2 purpose-built Open Source databases.

My team at ClickHouse has been working hard to make the integration even seamless. We work on PeerDB, an open source tool enabling seamless Postgres replication to ClickHouse https://github.com/PeerDB-io/peerdb/ This integration is now also natively available in the Cloud through ClickPipes. The private preview was released just last week https://clickhouse.com/cloud/clickpipes/postgres-cdc-connect...

yas_hmaheshwari · a year ago
Out of curiosity: Why not mysql? I am also surprised that no one has even mentioned mysql in any of the comments so far -- so looks like the verdict is very clear on that one

PS: I am also a fan of Postgres, and we are using that for our startup. But I don't know the answer if someone asks me, why not Mysql. Hence asking

pradeepchhetri · a year ago
To my knowledge, both Postgres and MySQL has their own strengths and weaknesses. Example: mvcc implementation, data replication, connection pooling and difficulty of upgrades were the major weaknesses of Postgres which are much improved over time. Similarly mysql query optimizer is consider lesser developed than that of Postgres's.

Overall I think Postgres adoption and integrations and thus community is much more wider than MySQL which gives it major advantage over MySQL. Also looking at the number of database-as-a-service companies of Postgres vs those of MySQL we can immediately acknowledges that Postgres is much widely adopted.

codazoda · a year ago
I think this might come down to… Oracle.

Obviously there are alternatives like MariaDB but Postgres is a quality long standing open source solution.

rho4 · a year ago
The article mentions the TigerBeetle Style Guide: https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TI...

I agree so much with the paragraphs about "Dependencies" and "Tooling".

atonse · a year ago
I didn't realize this [1] was a thing. I've been informally referring to our Postgres/Elixir stack as "boring, but in the best way possible, it just works with no drama whatsoever" for years.

1: https://boringtechnology.club

maximus93 · a year ago
DuckDB really seems to be having its moment—projects like Evidence and DuckDB GSheets are super cool examples of its potential. And yeah, Postgres’s longevity is insane, it just keeps adapting.

On the AI front, vector databases like Pinecone and pgvector are exciting, but I’d love to see something even more integrated with AI workflows. The possibilities are huge. Curious to hear what others think!

vjerancrnjak · a year ago
I like Clickhouse more.

Unrelated, not sure if it is just me, but ever since LLMs became popular, I've been seeing an enormous amounts of special utf8 characters no one used regularly, like this em dash you used.

How is this simple to type? If you're on a phone keyboard, you have to switch to special characters, and then have to long-hold the dash and then slide to em dash.

arglebarnacle · a year ago
On a full keyboard it’s not too bad—just hold alt and tap 0151 on the numpad. Honestly I wish it was harder to type for stylistic reasons—it would help cure me of my em-dash addiction
airstrike · a year ago
- — are built into Mac OS with Opt+- and Opt+Shift+-. I use them all the time

there are also bindings for bullets • and probably other things I'm forgetting (or that may be conflicting with other bindings I have setup)

conductr · a year ago
I noticed that certain browsers started auto converting a double hyphen to an emdash as I type, no LLM needed, I think that’s just a timing coincidence
cfiggers · a year ago
Hi, not the person you asked, but I have an answer to the question.

I have an AutoHotkey script that auto-replaces "--" with "—" (and auto-disables it in contexts where it's likely intended, like SQL editors and my terminal).

I also auto-replace "`-" with "–" so I can conveniently do number ranges with the (objectively correct) n-dash to indicate a range.

elcritch · a year ago
On macs you can do `alt` and `-` to get a –. Even on a phone a proper emdash can add effect over a regular dash.
alwa · a year ago
My iPhone autocorrects two consecutive hyphens to an em dash. I’m fairly sure it’s not something I configured manually, so I assume it is or was a default. Possibly a component of the “smart punctuation” setting for the keyboard.

Deleted Comment

Deleted Comment

tejaskumar_ · a year ago
> I’d love to see something even more integrated with AI workflows

Do you mean a database still? Or something like Langflow or Dify? Curious what "something even more integrated" would look like as just a DB.

aravindputrevu · a year ago
Oh I think there are better vector stores than pinecone. For ex Marqo or extreme case Elasticsearch
tejaskumar_ · a year ago
What makes them better? Have we tried Astra DB or Milvus? Curious where that stands in relation to the others.
boundlessdreamz · a year ago
This comment was most likely generated using AI It is reusing phrases from previous comments -

https://news.ycombinator.com/item?id=42330710 and https://news.ycombinator.com/item?id=42330639