gunnarmorling (u/gunnarmorling)

gunnarmorling commented on The equality delete problem in Apache Iceberg blog.dataengineerthings.o... · Posted by u/dkgs

hodgesrm · 13 days ago

I don't really understand what problem replication of mutable transaction data to Iceberg or any data lake for that matter solves for most PostgreSQL users.

Iceberg is optimized for fact data in very large tables with relatively rare changes and likewise rare changes to the schema. It does that well and will continue to do so for the foreseeable future.

PostgreSQL databases typically don't generate huge amounts of data; that data can also be highly mutable in many cases. Not only that, the schema can change substantially. Both types of changes are hard to manage in replication, especially if the target is a system, like Iceberg, that does not handle change very well in the first place.

So that leaves the case where you have a lot of data in PostgreSQL that's creating bad economics. In that case, why not just skip PostgreSQL and put it in an analytic database to begin with?

p.s., I'm pretty familiar with trading systems that do archive transaction data to data lakes using Parquet for long-term analytics and compliance. That is a different problem. The data is for all intents and purposes immutable.

Edit: clarity

gunnarmorling · 13 days ago

> PostgreSQL databases typically don't generate huge amounts of data

The live data set may not be huge, but the entire trail of all changes of all current and all previously existing data may easily exceed the volume of data you can reasonably process with Postgres.

In addition, Postgres' row-based storage format isn't an ideal fit for typical analytical queries on large amounts of data.

Replicating the data from Postgres to Iceberg addresses these issues. But, of course, it's not without its own challenges, as demonstrated by the article.

gunnarmorling commented on The borrowchecker is what I like the least about Rust viralinstruction.com/post... · Posted by u/jakobnissen

gunnarmorling · a month ago

> GC also has its downsides:

> - Marking and sweeping cause latency spikes which may be unacceptable if your program must have millisecond responsiveness.

> - GC happens intermittently, which means garbage accumulates until each collection, and so your program is overall less memory efficient.

With modern concurrent collectors like Java's ZGC, that's not the case any longer. They show sub-millisecond pause times and run concurrently. The trade-off is a higher CPU utilization and thus reduced overall throughput, which if and when it is a problem can oftentimes be mitigated by scaling out to more compute nodes.

gunnarmorling commented on Postgres LISTEN/NOTIFY does not scale recall.ai/blog/postgres-l... · Posted by u/davidgu

brightball · a month ago

You know, this would be a great talk at the 2026 Carolina Code Conference...

gunnarmorling · a month ago

Ha, that's interesting :) Do you have any more details on that one?

gunnarmorling commented on Postgres LISTEN/NOTIFY does not scale recall.ai/blog/postgres-l... · Posted by u/davidgu

williamdclt · 2 months ago

I think replication is the way to go, it’s kinda what it’s for.

Might be a bit tricky to get debezium to decode the logical event, not sure

gunnarmorling · 2 months ago

Debezium handles logical decoding messages OOTB. There's also an SMT (single message transform) for decoding the binary payload: https://debezium.io/documentation/reference/stable/transform....
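For consumers that receive such an event without the SMT applied, decoding by hand could look roughly like the sketch below. The event shape is an assumption for illustration: an `op` of `"m"` and a `message` field with a `prefix` and a base64-encoded `content`, as a JSON-serialized change event would typically carry binary data. The function name is hypothetical, and it assumes the application emitted JSON as the message content.

```python
import base64
import json


def decode_logical_message(event: dict) -> dict:
    """Decode the payload of a logical decoding message event.

    Assumes (not taken from the thread) that the event carries a
    "message" object with a text "prefix" and base64-encoded "content"
    holding a JSON document.
    """
    message = event["message"]
    raw = base64.b64decode(message["content"])
    return {"prefix": message["prefix"], "content": json.loads(raw)}


# Hypothetical sample event, built here so the sketch is self-contained.
sample_event = {
    "op": "m",
    "message": {
        "prefix": "outbox",
        "content": base64.b64encode(
            json.dumps({"order_id": 42}).encode("utf-8")
        ).decode("ascii"),
    },
}

decoded = decode_logical_message(sample_event)
```

With the SMT in place this step would happen inside the connector pipeline instead of in the consumer.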

gunnarmorling commented on Postgres LISTEN/NOTIFY does not scale recall.ai/blog/postgres-l... · Posted by u/davidgu

williamdclt · 2 months ago

I found recently that you can write directly to the WAL with transactional guarantees, without writing to an actual table. This sounds like it would be amazing for queue/outbox purposes, as the normal approaches of actually inserting data in a table cause a lot of resource usage (autovacuum is a major concern for these use cases).

Can’t find the function that does that, and I’ve not seen it used in the wild yet, idk if there’s gotchas

Edit: found it, it’s pg_logical_emit_message

gunnarmorling · 2 months ago

pg_logical_emit_message() is how I recommend Postgres users implement the outbox pattern [1]. No table overhead as you say, no need for housekeeping, etc. It has some other cool applications, e.g. providing application-specific metadata for CDC streams or transactional logging; I wrote about it at [2] a while ago. Another one is making sure replication slots can advance even if there's no traffic in the database they monitor [3].

[1] https://speakerdeck.com/gunnarmorling/ins-and-outs-of-the-ou...

[2] https://www.infoq.com/articles/wonders-of-postgres-logical-d...

[3] https://www.morling.dev/blog/mastering-postgres-replication-...
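As a minimal sketch of the outbox idea, assuming a DB-API cursor with `%s` placeholders (as in psycopg): the helper below writes an event into the WAL via `pg_logical_emit_message(transactional, prefix, content)`. The function name, the `outbox` prefix, and the payload fields are hypothetical and only serve the illustration.

```python
import json

# transactional = true: the message is part of the surrounding transaction,
# so it is only decoded downstream if that transaction commits.
EMIT_SQL = "SELECT pg_logical_emit_message(true, %s, %s)"


def emit_outbox_event(cursor, prefix: str, payload: dict) -> str:
    """Emit an outbox event through the given DB-API cursor.

    No outbox table is involved; the event exists only in the WAL.
    Returns the serialized content that was sent.
    """
    content = json.dumps(payload, sort_keys=True)
    cursor.execute(EMIT_SQL, (prefix, content))
    return content
```

Called in the same transaction as the business write, e.g. `emit_outbox_event(cur, "outbox", {"order_id": 42, "type": "OrderCreated"})`, the event and the data change commit or roll back together.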

gunnarmorling commented on BYU study: Why some people choose not to use artificial intelligence news.byu.edu/intellect/by... · Posted by u/computator

gunnarmorling · 2 months ago

The other day, I came across a blog post by someone I really value and it sounded very much like it was written by AI. So I decided to be very explicit and transparent about my own ways of using (and not using) AI for my blog: https://www.morling.dev/ai/.

TL;DR: I don't use it for writing (I want to say something original in my own voice), but I do use it for copy editing (improving wording, helping with title ideas, etc.).

gunnarmorling commented on Building agents using streaming SQL queries morling.dev/blog/this-ai-... · Posted by u/rmoff

simonw · 2 months ago

This combines a bunch of different things, but the key idea appears to be having a Flink (stream processing server) SQL query set up that effectively means that new documents added to the data store trigger a query that then uses an LLM (via a custom SQL function) to e.g. calculate a summary of a paper, then feeds that on to something that sends an alert to Slack or similar.

So this is about running LLM prompts as part of an existing streaming data processing setup.

I guess you could call a trigger-based SQL query an "agent", since that term is wide open to being defined however you want to use it!

gunnarmorling · 2 months ago

Author here, thanks for reading and commenting! Indeed my conclusion is that "agent" means different things to different people. The idea for this post was to explore what's there already and what may be missing for using SQL in this context. When following Anthropic's taxonomy, as of today, SQL lets you get quite far in building workflows. For agents in their terminology, some more work is needed to integrate things like MCP, but I don't see any fundamental reason why this couldn't be done.

gunnarmorling commented on Datalog in Rust github.com/frankmcsherry/... · Posted by u/brson

rebanevapustus · 2 months ago

They did, and their product is great.

It is the only database/query engine that allows you to use the same SQL for both batch and streaming (with UDFs).

I have made an accessible version of a subset of Differential Dataflow (DBSP) in Python right here: https://github.com/brurucy/pydbsp

DBSP is so expressive that I have implemented a fully incremental dynamic datalog engine as a DBSP program.

Think of SQL/Datalog where the query can change in runtime, and the changes themselves (program diffs) are incrementally computed: https://github.com/brurucy/pydbsp/blob/master/notebooks/data...

gunnarmorling · 2 months ago

> It is the only database/query engine that allows you to use the same SQL for both batch and streaming (with UDFs).

Flink SQL also checks that box.

gunnarmorling commented on Archil: From a file system, to a data company archil.com/post/archil-fi... · Posted by u/huntaub

huntaub · 3 months ago

Hey, I'm Hunter -- the founder of Archil. I'll be around in the comments to answer any questions that people have about the platform, or how things have changed since the Fall.

gunnarmorling · 3 months ago

Congrats on the launch! Could you clarify perhaps what Archil volumes actually are from a technical perspective? Like, is it EC2 instances with locally attached storage, your own bare metal machines in a CoLo space, something in between? It's not quite clear to me after reading the announcement.