Agree with others saying HN needs more content like this!
After reading I don’t get how locks held in memory affect WAL shipping. The WAL reader reads it in a single thread and updates in-memory data structures, periodically dumping them to disk. Perhaps you want to read one big instruction from the WAL and apply it to many buffers using multiple threads?
>Adapting algorithms to work atomically at the block level is table stakes for physical replication
Why? To me the only thing you have to do atomically is the WAL write. WAL readers can read and write however they want, given that they can detect partial writes and replay the WAL.
>If a VACUUM is running on the primary at the same time that a query hits a read replica, it's possible for Postgres to abort the read.
The situation you're referring to is:
1. Record inserted
2. Standby long query started
3. Record removed
4. Primary vacuum started
5. Vacuum replicated
6. Vacuum on standby cannot remove record because it is being read by the long query.
7. PG cancels the query to let the vacuum proceed (the standby-side settings sketched below govern this behavior).
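For reference, the standby-side settings that govern this trade-off look roughly like this (assuming streaming physical replication; the values are purely illustrative):

```sql
-- Illustrative values; set on the standby (via ALTER SYSTEM or
-- postgresql.conf), then reload.

-- Let a conflicting query hold off WAL replay for up to 5 minutes
-- before it gets cancelled:
ALTER SYSTEM SET max_standby_streaming_delay = '5min';

-- Or report the standby's oldest snapshot back to the primary, so VACUUM
-- keeps the rows the standby still needs (at the cost of bloat there):
ALTER SYSTEM SET hot_standby_feedback = on;

SELECT pg_reload_conf();
```

The first trades replay lag on the standby; the second trades bloat on the primary.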
I guess your implementation generates a lot of dead tuples during compaction. You're clearly fighting PG here. Could a custom storage engine be a better option?
> After reading I don’t get how locks held in memory affect WAL shipping. The WAL reader reads it in a single thread and updates in-memory data structures, periodically dumping them to disk. Perhaps you want to read one big instruction from the WAL and apply it to many buffers using multiple threads?
We currently use an un-modified/generic WAL entry, and don't implement our own replay. That means we don't control the order of locks acquired/released during replay; the default is to acquire exactly one lock to update a buffer.
But as far as I know, even with a custom WAL entry implementation, the maximum in one entry would still be ~8k, which might not be sufficient for a multi-block atomic operation. So the data structure needs to support block-at-a-time atomic updates.
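If it helps, the per-record, per-block structure is visible with pg_walinspect (PostgreSQL 15+); a rough sketch, with a placeholder start LSN:

```sql
-- Rough sketch: requires the pg_walinspect extension (PostgreSQL 15+),
-- run on a primary; the start LSN is a placeholder.
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

-- One row per WAL record; block_ref lists the buffers that record touches,
-- and generic replay locks and updates those buffers one record at a time.
SELECT start_lsn, resource_manager, record_length, block_ref
FROM pg_get_wal_records_info('0/1000000', pg_current_wal_lsn());
```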
> I guess your implementation generates a lot of dead tuples during compaction. You're clearly fighting PG here. Could a custom storage engine be a better option?
`pg_search`'s LSM tree is effectively a custom storage engine, but it is an index (Index Access Method and Custom Scan) rather than a table. See more on it here: https://www.paradedb.com/blog/block_storage_part_one
LSM compaction does not generate any dead tuples on its own: what counts as "dead" is controlled by the heap/table, via deletes/updates. Instead, the LSM is cycling blocks into and out of a custom free space map (which we implemented to reduce WAL traffic).
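To make the "index, not a table" distinction concrete, usage looks roughly like this (illustrative table and column names; see the docs for the full syntax):

```sql
-- Hypothetical table; `bm25` is the index access method provided by pg_search.
CREATE TABLE items (id serial PRIMARY KEY, description text);

CREATE INDEX items_search_idx ON items
    USING bm25 (id, description)
    WITH (key_field = 'id');

-- The query is answered through the index, but row visibility (and thus
-- what is "dead") is still owned by the heap table.
SELECT id, description
FROM items
WHERE description @@@ 'kitty pictures';
```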
Why not just use OpenSearch or ElasticSearch? The tool is already in the inventory; why use a screwdriver when a chisel is needed and available?

Because it's a whole other architectural component for data that is already there, in a tool that is made for exactly this and is missing only a `SELECT full_text_search('kitty pictures');`.
Running elasticsearch is a miserable experience. If you can run one tool that you already have with slightly more effort, amazing. And you never need to think about rebuilding indexes or tuning the garbage collector or planning an ES major version migration.
There can be multiple reasons; one I can think of right away would be to keep the stack as simple as possible for as long as you can. Realistically speaking, most companies do not operate at a scale where they would need the specialized tools.
There is a cost associated with adopting and integrating another tool like ElasticSearch. For some orgs, the ROI might not be there. And if their existing database provides some additional capabilities in this space, that might be preferable.
> This is another one of those “when you have a hammer, everything looks like your thumb” stories.
Are you referring to people who think that every reporting problem must be solved by a dedicated OLAP database?