Readit News
tomhow · 3 months ago
See also:

Fuzzer Blind Spots (Meet Jepsen!): https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...

nindalf · 3 months ago
Very impressed with this report. Whenever I read TigerBeetle's claims on reliability and scalability, I'd think "ok, let's wait for the Jepsen report".

This report found a number of issues, which might be a cause for concern. But I think it's a positive: they didn't just fix the issues, they also expanded their internal test suite to catch similar bugs in the future. With such an approach to engineering, I feel like in 10 years TigerBeetle will have achieved the "just use Postgres" level of default database in its niche of financial applications.

Also great work aphyr! I feel like I learned a lot reading this report.

jorangreef · 3 months ago
Thanks!

Yes, we have 6,000+ assertions in TigerBeetle. A few of these were overtight, hence some of the crashes. But those were the assertions doing their job, alerting us that we needed to adjust our mental model, which we did.

Otherwise, apart from a small correctness bug in an internal testing feature we added (only in our Java client, and only for Jepsen, to facilitate the audit), there was only one correctness bug found by Jepsen, and it didn’t affect durability. We’ve written about it here: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...

Finally, to be fair, TigerBeetle can survive (and is tested to survive) more faults than Postgres can, since it was designed with an explicit storage fault model, using research that was not available when Postgres was released in ’96. TB’s fault models are further tested with Deterministic Simulation Testing, and we use techniques such as static memory allocation, following NASA’s Power of Ten Rules for Safety-Critical Code. There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

For more on this, see the section in Kyle’s report on helical fault injection (most Raft and Paxos implementations were not designed to survive this) as well as a talk we gave at QCon London: https://m.youtube.com/watch?v=_jfOk4L7CiY

jrpelkonen · 3 months ago
Hi Joran,

I have followed TigerBeetle with interest for a while, and thank you for your inspirational work and informative presentations.

However, you have stated on several occasions that the lack of memory safety in Zig is not a concern, since you don't dynamically allocate memory post startup. Yet one of the defects uncovered here (#2435) was caused by dereferencing an uninitialized pointer. I find this pretty concerning, so I wonder: is there something you will be doing differently to eliminate all similar bugs going forward?

nickpsecurity · 3 months ago
The correctness bug was due to a combination of features. I'm curious if you've looked into combinatorial testing, which NIST claimed knocks out almost all bugs when 6-way testing is used.

https://csrc.nist.gov/projects/automated-combinatorial-testi...

My intro to other categories of test generation was usually this paper:

https://cs.stanford.edu/people/saswat/research/ASTJSS.pdf

Maybe see if your team can build combinatorial- or path-based testing in Zig next.
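To illustrate the idea, here is a minimal sketch of t-way coverage checking: enumerate every t-way combination of parameter values and report which combinations no existing test exercises. The feature flags below are hypothetical examples, not TigerBeetle's actual test parameters.

```python
from itertools import combinations, product

def uncovered_tway(params, suite, t):
    """Return t-way value combinations not exercised by any test in the suite.

    params: dict of parameter name -> list of possible values
    suite:  list of dicts, each a full parameter assignment (one test case)
    t:      interaction strength (NIST found 6-way catches nearly all bugs)
    """
    missing = []
    for names in combinations(sorted(params), t):
        for values in product(*(params[n] for n in names)):
            combo = dict(zip(names, values))
            if not any(all(test[n] == v for n, v in combo.items())
                       for test in suite):
                missing.append(combo)
    return missing

# Hypothetical feature flags for illustration only:
params = {
    "linked": [True, False],
    "pending": [True, False],
    "balancing_debit": [True, False],
}
suite = [
    {"linked": True, "pending": False, "balancing_debit": False},
    {"linked": False, "pending": True, "balancing_debit": True},
]
print(uncovered_tway(params, suite, t=2))  # pairs no test exercises yet
```

A covering-array generator would then add test cases until this list is empty; real tools like NIST's ACTS do this far more compactly than brute force.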

schubart · 3 months ago
> we have around 6,000+ assertions in TigerBeetle.

Are they enabled in production? Are there some expensive ones that aren’t?

anarazel · 3 months ago
> There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.

What are you referencing here?

SOLAR_FIELDS · 3 months ago
I always get excited to read Kyle’s write ups. I feel like I level up my distributed systems knowledge every time he puts something out.
jitl · 3 months ago
Really happy to see TigerBeetle live up to its claims as verified by aphyr - because it's good to see that when you take the right approach, you get the right results.

Question about how people end up using TigerBeetle. There's presumably a lot of external systems and other databases around a TigerBeetle install for everything that isn't an Account or Transfer. What's the typical pattern for those less reliable systems to square up to TigerBeetle, especially to recover from consistency issues between the two?

jorangreef · 3 months ago
Joran from TigerBeetle here! Thanks! Really happy to see the report published too.

The typical pattern in integrating TigerBeetle is to differentiate between control plane (Postgres for general purpose or OLGP) and data plane (TigerBeetle for transaction processing or OLTP).

All your users (names, addresses, passwords etc.) and products (descriptions, prices etc.) then go into OLGP as your "filing cabinet".

And then all the Black Friday transactions these users (or entities) make, to move products from inventory accounts to shopping cart accounts, and from there to checkout and delivery accounts—all these go into OLTP as your "bank vault". TigerBeetle lets you store up to 3 user data identifiers per account or transfer to link events (between entities) back to your OLGP database which describes these entities.

This architecture [1] gives you a clean "separation of concerns", allowing you to scale and manage the different workloads independently. For example, if you're a bank, it's probably a good idea not to keep all your cash in the filing cabinet with the customer records, but rather to keep the cash in the bank vault, since the information has different performance/compliance/retention characteristics.

This pattern makes sense because users change their name or email address (OLGP) far less frequently than they transact (OLTP).

Finally, to preserve consistency on the write path, you treat TigerBeetle, the OLTP data plane, as your "system of record". When a "move to shopping cart" or "checkout" transaction comes in, you first write all your data dependencies to OLGP, if any (and to, say, S3 if you have related blob data), and then finally you commit your transaction by writing to TigerBeetle. On the read path, you query your system of record first, preserving strict serializability.

Does that make sense? Let me know if there's anything here we can drill into further!

[1] https://docs.tigerbeetle.com/coding/system-architecture/
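That write-path ordering (dependencies first, system of record last) could be sketched roughly as follows. The `olgp`, `blob_store`, and `tb` clients and their methods are hypothetical placeholders, not TigerBeetle's actual client API:

```python
import uuid

def checkout(olgp, blob_store, tb, order):
    """Commit an order: write dependencies first, TigerBeetle last."""
    transfer_id = uuid.uuid4().int & ((1 << 128) - 1)  # 128-bit id, TB-style

    # 1. Blob dependencies (e.g. an invoice PDF) go to object storage first.
    if order.get("invoice_pdf"):
        blob_store.put(f"invoices/{transfer_id}", order["invoice_pdf"])

    # 2. Relational dependencies (order metadata) go to the OLGP database.
    olgp.execute(
        "INSERT INTO orders (id, user_id, status) VALUES (?, ?, 'pending')",
        (transfer_id, order["user_id"]),
    )

    # 3. Only now commit the money movement to the system of record. If this
    #    step fails, steps 1-2 left harmless orphans to garbage-collect; if
    #    it succeeds, OLGP rows can always be reconciled against the transfer.
    tb.create_transfers([{
        "id": transfer_id,
        "debit_account_id": order["cart_account"],
        "credit_account_id": order["checkout_account"],
        "amount": order["amount"],
        "user_data_128": order["user_id"],  # link back to the OLGP entity
    }])
    return transfer_id
```

The key design choice is that a crash between any two steps never leaves the system of record claiming a transfer that its dependencies don't support.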

DetroitThrow · 3 months ago
This is a particularly fun Jepsen report after reading their fuzzer blind spots post.

It looks like the segfaults on the JNI side would not have been prevented even if Rust or some other memory-safe language were being used. The lack of memory-safety bugs gives some decent proof that TigerBeetle's approach to Zig programming (TigerStyle, iirc, lol) does what it sets out to do.

matklad · 3 months ago
See https://news.ycombinator.com/item?id=44201189. We did have one bug where Rust would've saved our bacon (instead, the bacon was saved by an assertion, so it was just slightly crispy, not charred).

EDIT: But, yeah, totally, if not for TigerStyle, we'd die to nasal demons!

FlyingSnake · 3 months ago
Love the wonderfully detailed report. Getting it tested and signed off by Jepsen is such a huge endorsement for TigerBeetle. It hasn't even reached v1.0, and I can't wait to see it hit new milestones in the future.

Special kudos to the founders who are sharing great insights in this thread.

jorangreef · 3 months ago
Yes, Kyle did an incredible job and I also love the detail he put into the report. I kept saying to myself: “this is like a work of art”, the craftsmanship and precision.

Appreciate your kind words too, and look forward also to sharing something new in our talks at SD25 in Amsterdam soon!

12_throw_away · 3 months ago
A small appreciation for the section entitled "Panic! At the Disk 0": <golf clap>
ryeats · 3 months ago
I think it is interesting, but obvious in hindsight, that accurate validation requires the distributed system under test to report the time/order in which things actually happened, rather than relying on wall-clock time, so that it can be checked against an external model of the system.
matklad · 3 months ago
Note that this works because we have strict serializability. With weaker consistency guarantees, there isn't necessarily a single global consistent timeline.

This is an interesting meta pattern where doing something _harder_ actually simplifies the system.

Another example is that, because we assume that the disk can fail and need to include a repair protocol, we get state synchronization for a lagging replica "for free", because it is precisely the same situation as when the entire disk gets corrupted!

aphyr · 3 months ago
To build on this--this is something of a novel technique in Jepsen testing! We've done arbitrary state machine verification before, but usually that requires playing forward lots of alternate timelines: one for each possible ordering of concurrent operations. That search (see the Knossos linearizability checker) is an exponential nightmare.

In TigerBeetle, we take advantage of some special properties to make the state machine checking part linear-time. We let TigerBeetle tell us exactly which transactions happen. We can do this because it's a.) strong serializable, b.) immutable (in that we can inspect DB state to determine whether an op took place), and c.) exposes a totally ordered timestamp for every operation. Then we check that that timestamp order is consistent with real-time order, using a linear-time cycle detection approach called Elle. Having established that TigerBeetle's claims about the timestamp order are valid, we can apply those operations to a simulated version of the state machine to check semantic correctness!

I'd like to generalize this to other systems, but it's surprisingly tricky to find all three of those properties in one database. Maybe an avenue for future research!
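The real-time part of that check can be illustrated with a toy single pass (my own sketch, not Elle itself): sort completed operations by their database timestamp, then verify that no operation began in real time only after a later-timestamped operation had already finished.

```python
def realtime_consistent(ops):
    """Check DB timestamp order against real-time order in O(n log n).

    ops: list of (db_timestamp, real_start, real_end) for completed ops.
    Violation: an op A with a smaller db_timestamp started in real time
    after an op B with a larger db_timestamp had already ended, i.e. the
    database claims A happened first, but the client saw B finish first.
    """
    max_start = float("-inf")
    for _, start, end in sorted(ops):  # ascending db_timestamp
        if end < max_start:
            # Some earlier-timestamped op started after this op ended.
            return False
        max_start = max(max_start, start)
    return True
```

Once this holds, replaying the operations against a model state machine in timestamp order checks semantic correctness without the exponential search over alternate timelines.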

shepherdjerred · 3 months ago
I think this is a classic approach, e.g. https://lamport.azurewebsites.net/pubs/time-clocks.pdf
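The core rule from that paper is small enough to sketch (a toy illustration, not anyone's production code): each process increments its counter on local events, and on receiving a message takes the max of its own counter and the message's timestamp, plus one, so the resulting order respects causality.

```python
class LamportClock:
    """Minimal Lamport logical clock (Lamport, 1978)."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Sending is itself an event; attach the new timestamp to the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump past both our own history and the sender's.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a.time becomes 1
t_recv = b.receive(t_send)  # b.time becomes 2 > t_send: causality preserved
```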
cmrdporcupine · 3 months ago
The articles link to the paper about "Viewstamped Replication" is unfortunately broken (https://pmg.csail.mit.edu/papers/vr-revisited.pdf connection refused).

I think it should be http://pmg.csail.mit.edu/papers/vr-revisited.pdf (http scheme not https) ?

And now I have some Friday evening reading material.

jorangreef · 3 months ago
It should be fixed soon!

The VSR 2012 paper is one of my favorites as is “Protocol-Aware Recovery for Consensus-Based Storage”, which is so powerful.

Hope you enjoy the read!