Animats · 8 months ago
This is a basic concept in accounting. The general ledger is an immutable log of transactions. Other accounting documents are constructed from the general ledger, and can, if necessary, be rebuilt from it. This is the accepted way to do money-related things.

Synchronization is called "reconciliation" in accounting terminology.

The usual computer concept is that we have a current state, and changes to it come in; the database holding the current state is authoritative. This is not suitable for handling money.

The real question is, do you really care what happened last month? Last year? If yes, a log-based approach is appropriate.

inopinatus · 8 months ago
I’ve always concurred with the Helland/Kleppmann observation mentioned here, viz. that the transaction log of a typical RDBMS is the canonical form and all the rows & tables are merely projections.

It’s curious that over those projections, we then build event stores for CQRS/ES systems, ledgers etc, with their own projections mediated by application code.

But look underneath too. The journaled filesystem on which the database resides also has a log representation, and under that, a modern SSD is using an adaptive log structure to balance block writes.

It’s been a long time since we wrote an application event stream linearly straight to media, and although I appreciate the separate concerns that each of these layers addresses, I’d probably struggle to justify them all from first principles to even a slightly more Socratic version of myself.

formerly_proven · 8 months ago
This is similar to the observation that memory semantics in most any powerful machine since the late 60s are implemented using messaging, and then applications go ahead and build messaging out of memory semantics. Or the more general observation that every layer of information exchange tends towards implementing packet switching if there's sufficient budget (power/performance/cost) to support doing so.
globular-toast · 8 months ago
> It’s curious that over those projections, we then build event stores for CQRS/ES systems, ledgers etc, with their own projections mediated by application code.

The database only supports CRUD. So while the CDC stream is the truth, it's very low level. We build higher-level event types (as in event sourcing) for the same reason we build any higher-level abstraction: it gives us a language in which to talk about business rules. Kleppmann makes this point in his book and it was something of an aha moment for me.
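
To make the contrast concrete, a rough TypeScript sketch (names made up): the CDC stream speaks in rows, while the event store speaks in business facts.

  // Low level: a CRUD-shaped change event, as a CDC stream would emit it.
  interface RowChange {
    op: "insert" | "update" | "delete";
    table: string;
    before: Record<string, unknown> | null; // row image before the change
    after: Record<string, unknown> | null;  // row image after the change
  }

  // High level: domain events that speak the language of business rules.
  type OrderEvent =
    | { type: "OrderPlaced"; orderId: string; customerId: string; total: number }
    | { type: "OrderCancelled"; orderId: string; reason: string };

  // The same fact at two levels of abstraction:
  const cdc: RowChange = {
    op: "update",
    table: "orders",
    before: { id: "o1", status: "placed" },
    after: { id: "o1", status: "cancelled" },
  };
  const domain: OrderEvent = {
    type: "OrderCancelled",
    orderId: "o1",
    reason: "customer request",
  };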

namtab00 · 8 months ago
we're a very short step away from a "hardware" database
ianburrell · 8 months ago
The table data in the database is the canonical form. You can delete the transaction logs and temporarily lose some reliability; it is very common to delete the transaction logs when not needed. When databases are backed up, they either dump the logical data or take a snapshot of the data. They can then take a stream of transaction logs for syncing or backup until the next checkpoint.

I'm pretty sure journaled filesystems recycle the journal. There are log-structured filesystems, but they aren't used much beyond low-level flash.

LeanOnSheena · 8 months ago
You're correct on all points. Some additional refining points regarding accounting concepts:

- General ledgers are formed by way of transactions recorded as journal entries. Journal entries are where two or more accounts from the general ledger are debited & credited such that total debits equal total credits. For example, a sale will involve a journal entry which debits cash or accounts receivable, and credits revenue.

- The concept of debits always needing to equal credits is the most important and fundamental control in accounting. It is the core idea around which all of double-entry bookkeeping is built.

- Temporally ordered journal entries are what form a log from which a general ledger can be derived. That log of journal entries is append-only and immutable. If you make a mistake with a journal entry, you typically don't delete it; you just make another adjusting (i.e. correcting) entry.

Having a traditional background in accounting as a CPA, as a programmer I have written systems built around a log of temporally ordered transactions that can be used to reconstruct state across time. Colleagues who didn't have that background found it interesting but very strange as an idea (it led to a lot of really interesting discussions!). It was totally strange to me that they found it odd, because it was the most comfortable & natural way for me to think about many problems.
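
As a rough sketch of the pattern (hypothetical TypeScript, single currency, amounts in integer cents): the journal is the append-only log, and the general ledger is just a fold over it.

  interface JournalEntry {
    timestamp: string; // ISO-8601; entries are temporally ordered
    memo: string;
    lines: { account: string; debit: number; credit: number }[];
  }

  function append(journal: JournalEntry[], entry: JournalEntry): void {
    const debits = entry.lines.reduce((s, l) => s + l.debit, 0);
    const credits = entry.lines.reduce((s, l) => s + l.credit, 0);
    if (debits !== credits) throw new Error("unbalanced entry"); // the core control
    journal.push(entry); // append-only: mistakes get adjusting entries, not deletes
  }

  // Derive ledger balances (debit-positive) from the journal at any time.
  function ledger(journal: JournalEntry[]): Map<string, number> {
    const balances = new Map<string, number>();
    for (const e of journal)
      for (const l of e.lines)
        balances.set(l.account, (balances.get(l.account) ?? 0) + l.debit - l.credit);
    return balances;
  }

  const journal: JournalEntry[] = [];
  append(journal, {
    timestamp: "2025-01-15T10:00:00Z",
    memo: "cash sale",
    lines: [
      { account: "cash", debit: 10_000, credit: 0 },
      { account: "revenue", debit: 0, credit: 10_000 },
    ],
  });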

funcDropShadow · 8 months ago
Could you recommend some resource to understand this view of accounting better?
fellowniusmonk · 8 months ago
This is why EG-Walker is so important; Diamond Types adoption and a solid TS port can't come soon enough for distributed systems.
calvinmorrison · 8 months ago
You overestimate ERP and accounting systems.
baq · 8 months ago
That was basic accounting when computers were people with pencils and paper.
shikhar · 8 months ago
This post makes a great case for how universal logs are in data systems. It was strange to me that there was no log-as-service with the qualities that make it suitable for building higher-level systems like durable execution: conditional appends (as called out by the post!), support for very large numbers of logs, high throughput with strict ordering, and generally a simple serverless experience like object storage. This led to https://s2.dev/ which is now available in preview.

It was interesting to learn how Restate links events for a key, with key-level logical logs multiplexed over partitioned physical logs. I imagine this is implemented with a leader per physical log, so you can consistently maintain an index. A log service supporting conditional appends allows such a leader to act like the log is local to it, despite offering replicated durability.

Leadership can be an important optimization for most systems, but shared logs also allow for multi-writer systems pretty easily. We blogged about this pattern https://s2.dev/blog/kv-store
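
To sketch the conditional-append primitive (hypothetical client interface, not our actual API): the writer states the sequence number it expects the tail to be at, and the append fails if another writer got there first.

  interface LogClient {
    tail(log: string): Promise<number>; // next sequence number to be assigned
    append(
      log: string,
      records: Uint8Array[],
      expectedSeqNum?: number, // append only if the tail still matches
    ): Promise<number>;
  }

  // Optimistic write: read state derived from the log, decide, then append
  // conditionally. On conflict, re-read and retry -- like CAS, but durable.
  async function appendOnce(
    client: LogClient,
    log: string,
    record: Uint8Array,
  ): Promise<boolean> {
    const expected = await client.tail(log);
    try {
      await client.append(log, [record], expected);
      return true; // we won the race; our view of the log was current
    } catch {
      return false; // someone else appended first: re-read and retry
    }
  }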

xuancanh · 8 months ago
> It was strange to me that there was no log-as-service with the qualities that make it suitable for building higher-level systems like durable execution

There are several services like that, but they are mostly kept behind the scenes as a competitive advantage when building distributed systems. AWS uses one behind the scenes for many services, as mentioned by Marc Brooker here: https://brooker.co.za/blog/2024/04/25/memorydb.html. Facebook has similar systems like LogDevice https://logdevice.io/ and, more recently, Delos https://research.facebook.com/publications/log-structured-pr...

shikhar · 8 months ago
Indeed. We are trying to democratize that secret sauce. Since it is backed by object storage, the latencies are not what AWS enjoys with its internal Journal service, but we intend to get there with an NVMe-based tier later. In the meantime, there is an existing large market for event streaming where a "truly serverless" (https://erikbern.com/2021/04/19/software-infrastructure-2.0-...) API has been missing.
logsr · 8 months ago
> log as a service

Very exciting. This is the future. I am working on a very similar concept. Every database is a log at its core, so the log, which is the highest-performance part of the system, ends up buried behind many layers of much lower-performing cruft. Edge persistence with log-per-user application patterns opens up so many possibilities.

hinkley · 8 months ago
I just want a recognized standard format for write ahead logs. Start with replicating data between OLTP and OLAP databases with minimal glue code, and start moving other systems to a similar structure, like Kafka, then new things we haven’t thought of yet.
ianburrell · 8 months ago
The structure of the write-ahead log needs to differ between systems. For Postgres, the WAL is a record of writes with new blocks. It can't be used without knowing the Postgres disk format. I don't think it can be used to construct logical changes.

Using a standard format, converting things into logical data, would be significantly slower. It is important that the WAL be fast because it is the bottleneck in transactions. It would make more sense to have a separate change-streaming service.

thesz · 8 months ago

> It was strange to me that there was no log-as-service...

Actually, there are plenty of them. The best-known is Ethereum 2.0 - it is a distributed log of distributed logs.

Any blockchain built upon a PBFT derivative is such a system.

gavindean90 · 8 months ago
What about journalctl?
shikhar · 8 months ago
This is why we didn't actually call it logs as a service, but streams :P I meant to refer to the log abstraction this post talks about, see links therein. Observability events are but one kind of data you may want as a stream of durable records.
p_l · 8 months ago
Surprisingly and disgustingly, it did not actually use a proper log structure on disk. So much pointer walking...
sewen · 8 months ago
A short summary:

Complex distributed coordination and orchestration is at the root of what makes many apps brittle and prone to inconsistencies.

But we can mitigate much of that complexity with a neat trick, building on the fact that every system (database, queue, state machine) is effectively a log under the hood. By implementing interactions with those systems as (conditional) events on a shared log, we can build amazingly robust apps.

If you have come across “Turning the Database Inside Out” (https://martin.kleppmann.com/2015/11/05/database-inside-out-...), you can think of this a bit like “Turning the Microservice Inside Out”

The post also looks at how this can be used in practice, given that our DBs and queues aren't built like this, and how to strike a sweet-spot balance between this model, with its great consistency, and maintaining healthy decoupling and separation of concerns.

teddyh · 8 months ago
Is this summary AI generated?
sewen · 8 months ago
Haha, no, but maybe all the AI-generated content out there is starting to train me to write in a similar style...
EGreg · 8 months ago
Since we’re on the subject of logs and embarrassingly parallel distributed systems: I know someone, also in NYC, who’s been building a project exactly along these lines. It’s called gossiplog and it uses Prolly trees to achieve some interesting results.

https://www.npmjs.com/package/@canvas-js/gossiplog

Joel Gustafson started this work at MIT and used to work at Protocol Labs. It’s very straightforward. By any chance, sewen, do you know him?

I first became aware of his work when he posted “Merklizing the key/value store for fun and profit” or something like that. Afterwards I looked at log protocols, including the SLEEP protocol for Dat/Hypercore/Pear, and time-travel DBs that track diffs, including Dolt and even Quadrable.

https://news.ycombinator.com/item?id=36265429

Gossiplog’s README says exactly what this article says: everything is a log underneath, and if you can sync that (using Prolly tree techniques), people can just focus on business logic and get sync for free!

sewen · 8 months ago
Never encountered it before, but it looks cool.

I think they are trying to solve a related problem. "We can consolidate the work by making a generic log that has networking and syncing built-in. This can be used by developers to make automatically-decentralized apps without writing a single line of networking code."

At first glance, I would say that Gossiplog is a bit more low-level, targeting developers of databases and queues to save them from re-building a log every time. But then there are elements of sharing the log between components. Worth a deeper look, but it seems like a lower-level abstraction.

EGreg · 8 months ago
It’s part of his higher-level framework called Canvas.

Check this out: https://joelgustafson.com/posts/2024-09-30/introduction-to-c...

And this: https://github.com/canvasxyz/canvas

hem777 · 8 months ago
There’s also OrbitDB https://github.com/orbitdb/orbitdb which to my understanding has been a pioneer for p2p logs, databases and CRDTs.
vdm · 8 months ago
Thank you @EGreg for sharing this.
EGreg · 8 months ago
Def. I geek out on this stuff, as I am building my own distributed systems. I have had discussions with a lot of people in the space, like Leslie Lamport, Petar Maymounkov etc.

You might like this interview: https://www.youtube.com/watch?v=JWrRqUkJpMQ

This is what I’m working on now: https://intercoin.org/intercloud.pdf

sewen · 8 months ago
Some clarification on what "one log" means here:

- It means using one log across different concerns, like state a, communication with b, and lock c. Often that is in the scope of a single entity (payment, user, session, etc.), and thus the scope for the one log is still small. You would still have a lot of independent logs, e.g. for separate payments.

- It does _not_ mean that one should share the same log (and partition) for all the entities in your app, like necessarily funneling all users, payments, etc. through the same log. That actually goes beyond the proposal here - it has some benefits of its own, but would have a hard time scaling.
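
In code terms, roughly (a hypothetical sketch, in-memory for brevity): the log is keyed per entity, so two payments never contend on one log, while everything about a single payment stays totally ordered.

  type Event =
    | { kind: "state"; key: string; value: unknown }     // state change
    | { kind: "send"; target: string; message: unknown } // communication
    | { kind: "lock"; resource: string };                // lock acquisition

  const logs = new Map<string, Event[]>(); // one logical log per entity

  function appendFor(entityId: string, event: Event): void {
    const log = logs.get(entityId) ?? [];
    log.push(event); // total order within this entity's log
    logs.set(entityId, log);
  }

  appendFor("payment/p42", { kind: "lock", resource: "account/alice" });
  appendFor("payment/p42", { kind: "state", key: "status", value: "authorized" });
  appendFor("payment/p43", { kind: "send", target: "email", message: "receipt" }); // independent log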

magicalhippo · 8 months ago
Interesting read, not my area but I think I got the gist of it.

In your Restate example of the "processPayment" function, how do you handle errors of the "accountService" call? Like, what if it times out or returns a server error?

Do you store the error result and the caller of "processPayment" has to re-trigger the payment, in order to generate a new log?

stsffap · 8 months ago
By default, failing ctx.run() calls (like the accountService call) will be retried indefinitely until they succeed, unless you have configured a retry policy for them. If a retry policy is configured and the retry attempts are exhausted, Restate will mark the call as terminally failed, record it as such in its log, and return the failure to the caller.
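
Roughly, in the TypeScript SDK it looks like this (a sketch; the retry-policy option names are from memory, so check the docs, and accountService is a placeholder standing in for the example above):

  import * as restate from "@restatedev/restate-sdk";

  // Placeholder for the example's account service.
  declare const accountService: {
    charge(paymentId: string, amount: number): Promise<void>;
  };

  const payment = restate.service({
    name: "payment",
    handlers: {
      process: async (ctx: restate.Context, req: { paymentId: string; amount: number }) => {
        try {
          await ctx.run(
            "charge account",
            () => accountService.charge(req.paymentId, req.amount),
            { maxRetryAttempts: 5 }, // without a policy, retries continue indefinitely
          );
        } catch (e) {
          if (e instanceof restate.TerminalError) {
            // Retries exhausted: the terminal failure is recorded in the
            // log and surfaced to the caller here.
          }
          throw e;
        }
      },
    },
  });
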
daxfohl · 8 months ago
Haven't formed thoughts on the content yet, but happy to see a company launching something non-AI for a change.
gjtorikian · 8 months ago
My startup, Yetto (http://www.yetto.app) is building a better way for support professionals to do their job. (Shameless plug but we always gotta hustle.)

We, too, are weighed down by how much space AI-focused companies are taking.

hansonkd · 8 months ago
TBH, looking at helpdesk software in 2025, I would expect new ones to be built AI-first. It would be hard for me to consider one without at least some sort of LLM helping with triage or classification of tickets, etc.
jaseemabid · 8 months ago
A notable example of a large-scale app built with a very similar architecture is ATproto/Bluesky[1].

"ATProto for Distributed Systems Engineers" describes how updates from the users end up in their own small databases (called PDS) and then a replicated log. What we traditionally think of as an API server (called a view server in ATProto) is simply one among the many materializations of this log.

I personally find this model of thinking about dataflow in large-scale apps pretty neat and easy to understand. The parallels are unsurprising since both the Restate blog and ATProto docs link to the same blog post by Martin Kleppmann.

This arch seems to be working really well for Bluesky, as they clearly aced through multiple 10x events very recently.

[1]: https://atproto.com/articles/atproto-for-distsys-engineers

sewen · 8 months ago
That blog post is a great read as well. Truly, the log abstraction [1] and "Turning the DB inside out" [2] have been hugely influential.

In a way, this article suggests extending that:

(1) from a log that represents data (upserts, CDC, etc.) to a log of coordination commands (update this, acquire that lock, journal that step)

(2) have a way to link the events related to a broader operation (handler execution) together

(3) make the log aware of handler execution (better yet, put it in charge), so you can automatically fence off outdated executions

[1] https://engineering.linkedin.com/distributed-systems/log-wha...

[2] https://martin.kleppmann.com/2015/11/05/database-inside-out-...

grahamj · 8 months ago
Table/log duality goes back further than Kleppmann though. An earlier article that really influenced me was

https://engineering.linkedin.com/distributed-systems/log-wha...

zellyn · 8 months ago
Martin Kleppmann was also directly involved with Bluesky as a consultant.
trollbridge · 8 months ago
I’ve been doing a similar thing, although I called it “append only transaction ledgers”. Same idea as a log. A few principles:

- The order of log entries does not matter.

- Users of the log are peers. No client / server distinction.

- When appending a log entry, you can send a copy of the append to all your peers.

- You can ask your peers to refresh the latest log entries.

- When creating a new entry, it is a very good idea to have a nonce field. (I use nano IDs for this purpose along with a timestamp, which is probabilistically unique.)

- If you want to do database style queries of the data, load all the log entries into an in memory database and query away.

- You can append a log entry containing a summary of all log entries you have so far. For example: you’ve been given 10 new customer entries. You can create a log entry of “We have 10 customers as of this date.”

- When creating new entries, prepare the entry or list of entries in memory, allow the user to edit/revise them as a draft, then when they click “Save”, they are in the permanent record.

- To fix a mistake in an entry, create a new entry that “negates” that entry.

A lot of parallelism / concurrency problems just go away with this design.
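
The core of it is tiny. A minimal sketch in TypeScript (nanoid for the nonce, everything in memory):

  import { nanoid } from "nanoid";

  interface LedgerEntry {
    id: string;        // nonce: probabilistically unique across peers
    timestamp: string; // used for sorting; arrival order does not matter
    negates?: string;  // id of an earlier entry this one cancels out
    data: Record<string, unknown>;
  }

  const log: LedgerEntry[] = []; // append-only; appends are also sent to peers

  function append(data: Record<string, unknown>, negates?: string): LedgerEntry {
    const entry: LedgerEntry = {
      id: nanoid(),
      timestamp: new Date().toISOString(),
      negates,
      data,
    };
    log.push(entry);
    return entry;
  }

  // "Query" = load entries into memory, drop negated ones, sort, and go.
  function effectiveEntries(): LedgerEntry[] {
    const negated = new Set(log.filter(e => e.negates).map(e => e.negates!));
    return log
      .filter(e => !negated.has(e.id) && !e.negates)
      .sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  }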

XorNot · 8 months ago
How do you know summary entries are valid if order doesn't matter?

I.e. "we have 10 customers as of this date" can become immediately invalid if a new entry is appended afterwards with a date before that summary entry (i.e. because it was on a peer which hadn't yet sent it)

withinboredom · 8 months ago
Realistically, you never store summaries in the log. Instead, you store what it took to calculate them. So you won't store "we have 10 customers with this date range" but instead "we found these 10 customers with this date range". This assumes you can store arbitrarily large lists in your log, but realistically this is never a concern if you can keep your time windows small enough. Then you periodically do a reconciliation and log corrections over a longer period, looking for entries not yet summarized (easily done via a Bloom filter, which can tell you which entries are definitely NOT in your set).

For example, we had a 28-day reconciliation period at one company I worked at (which handled over 120 million events per day). If an event was appended with a timestamp more than 28 days in the past, it was simply ignored. This very rarely happened, but it allowed us to fix bugs with events for up to 28 days.
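
The window check itself is trivial; sketched in TypeScript (the 28 days is from that system, the rest is hypothetical):

  const WINDOW_MS = 28 * 24 * 60 * 60 * 1000; // 28-day reconciliation period

  // Events older than the window are simply ignored; late arrivals inside
  // it are caught by the periodic reconciliation pass, which uses a Bloom
  // filter over summarized ids to find entries definitely not yet covered.
  function accept(eventTimestampMs: number, nowMs: number = Date.now()): boolean {
    return nowMs - eventTimestampMs <= WINDOW_MS;
  }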

clayg · 8 months ago
IME you have to be willing to recalculate the summaries up to some kind of consistency window.

Yes, you may be changing history, and you may have a business reason not to address that revision immediately (you've already billed them?) - but the system can still learn it made a mistake and fix it (add activity from the evening of Jan 30 that comes in late to the Feb bill?).

Kinrany · 8 months ago
> The order of log entries does not matter.

This is surprising; Kafka-like logs are all strictly ordered.

trollbridge · 8 months ago
The reason I made ordering not matter is so that multiple peers don’t have to worry about keeping the exact same order when appending to each other’s logs.

The log entries do have timestamps on them. You can sort by timestamp, but a peer has the right to append a new entry that’s older than the latest timestamp.

cduzz · 8 months ago
* within a partition
grahamj · 8 months ago
The lack of ordering is surprising. Without that you can’t stream without a buffer.
trollbridge · 8 months ago
The idea is that all peers eventually have the same collection of log entries, with an agreed-upon sort method. Currently that sort method is timestamps.

When it’s time to have identical replicas, a log entry must be appended that says the log is closed. Once all peers have this, the log is immutable and can be treated as such. An example of this (for a payroll system) is a collection of entries for time-clock punches in and out, payroll taxes withheld, direct deposits to employees, etc., when it’s time to run payroll. The log is closed and that data can never be altered, but now the data can be relied on by someone else.

Conceptually this is surprisingly similar to the type of internal logs a typical database like PostgreSQL keeps.

log4shell · 8 months ago
Calling a WAL a ledger, why? Ledger sounds fancier but why would it be a ledger in this case?
trollbridge · 8 months ago
We called it a ledger since we stored financial data and basically used the “ledger” format from plain text accounting, initially.
hcarvalhoalves · 8 months ago
I believe "ledger" implies commutative property (order does not matter).
glitchc · 8 months ago
How do you manage log size for high-transaction systems?
trollbridge · 8 months ago
It’s not terribly high-transaction, yet. If it becomes that way, I would partition the log by entry type so all the high-transaction stuff gets stuffed into a particular log, and then ensure it can be summarised regularly.
random3 · 8 months ago
if order does not matter how do you implement deletes?
trollbridge · 8 months ago
You don’t do deletes. The log is append only.

If you want to get rid of an entry, you create a new entry that negates the existing entry.