_vvhw · 4 years ago
“Paxos is one of the first and most celebrated ideas that laid the foundation for how different things come to an agreement asynchronously”

“Most—if not all—consensus algorithms fundamentally derive concepts from Paxos”

It's not well known (even in the field of distributed systems) that Viewstamped Replication in fact pioneered the field of consensus in 1988, a year before Paxos, with an intuitive algorithm essentially identical to Raft. Raft is thus more within the family and tradition of Viewstamped Replication (as is Multi-Paxos state machine replication) than the other way round.

Viewstamped Replication is significantly easier to implement correctly in the presence of real-world storage faults, which formal proofs often omit from their fault model. For example, Paxos and Raft proofs typically assume only pristine stable storage, and correctness unfortunately breaks down rapidly when physical disks are used, since these may misdirect reads and writes.
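To make the misdirection problem concrete (this is a standard defense, not something described in the thread): one mitigation is to make every block on disk self-identifying, so that a read returning the wrong sector's contents is caught by the storage layer instead of silently corrupting replicated state. A minimal sketch in Python, with hypothetical names:

```python
import hashlib
import struct

# Hypothetical on-disk layout: 8-byte logical block number + 32-byte SHA-256.
HEADER = struct.Struct(">Q32s")

def encode_block(block_number: int, payload: bytes) -> bytes:
    """Prefix the payload with its logical block number and a checksum,
    so a misdirected read or write can be detected on the read path."""
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(block_number, digest) + payload

def decode_block(expected_block_number: int, raw: bytes) -> bytes:
    """Raise if the block came from (or was written to) the wrong sector,
    or if the payload was corrupted in place."""
    block_number, digest = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if block_number != expected_block_number:
        raise IOError(f"misdirected I/O: expected block "
                      f"{expected_block_number}, got {block_number}")
    if hashlib.sha256(payload).digest() != digest:
        raise IOError(f"corrupt payload in block {block_number}")
    return payload
```

A checksum alone catches corruption; embedding the block's own address is what catches misdirection, since a misdirected read hands back a block whose checksum is perfectly valid for the wrong data.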

UW-Madison has done some excellent research in this particular area, the intersection of consensus protocol correctness and storage faults, and I would recommend reading "Protocol-Aware Recovery for Consensus-Based Storage" if you'd like to dive into some of this: https://www.usenix.org/conference/fast18/presentation/alagap...

Aleksey Charapko's DistSys Reading Group also recently covered Viewstamped Replication as a foundational paper, focusing on the crystal-clear 2012 revision by Barbara Liskov and James Cowling. I was able to present, covering some of our experience implementing the protocol for TigerBeetle, a distributed database with a strict storage fault model: http://charap.co/reading-group-viewstamped-replication-revis...

_benedict · 4 years ago
What people think of as Paxos is not what Lamport originally intended to call Paxos: it is the leader election phase of what is now called Multi-Paxos (which is what he intended the name Paxos for). So I don't think there's a difference in algorithmic heritage between Paxos and Multi-Paxos (perhaps you know this, but it might help clarify for those reading your comment).

I think it is anyway reasonable to say that most consensus algorithms derive concepts from Paxos, since (despite how terribly it was presented by Lamport) it is the consensus protocol that captured most attention. Most recent advances in distributed consensus derive from Paxos, not Viewstamped Replication. As far as I know all leaderless distributed consensus protocols derive from Paxos, and most protocol optimisations that have been developed apply to Paxos or one of its derivatives.

I also happen to think Paxos is pretty easy to implement correctly, particularly by comparison to the other protocols, in large part due to its active replication semantics, which permit commands to be processed by replicas in any order. This means failover is much less complicated to negotiate, as the new leader does not expect to have a complete view of the log. Of course, membership changes remain complicated, and it may be beneficial for the leader to be able to assume it has a complete view of the log - but this is an optimisation rather than a property required for a correct implementation.

> correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes

What do you mean by "misdirect" here?

_vvhw · 4 years ago
I would say that the majority of leader-based state machine replication protocols today are based on the more popular Raft, which is itself a derivative of Viewstamped Replication.

At least, most leader-based consensus implementations that I know of seem to be derived from Raft these days? I would also guess that leaderless distributed protocols derived from Paxos tend to be more niche than leader-based state machine replication?

While Viewstamped Replication, as contributed by Brian M. Oki, certainly established all the foundational elements of consensus that Paxos would later reiterate and generalize, it also immediately showed how to use consensus for leader-based state machine replication, and how to do this simply and practically, as seen in Raft's massive success in industry.

> What do you mean by "misdirect" here?

A misdirected read/write I/O: a kind of storage fault, documented in the literature, where the disk device redirects the I/O to a different sector than the one addressed. It's rare, but it happens.

> I also happen to think Paxos is pretty easy to implement correctly

What was especially interesting about the PAR paper from UW-Madison was how even fairly common single-sector faults, such as latent sector errors or corruption, could cause global cluster data loss for implementations of Paxos and Raft as specified, since they depend on pristine, fault-free stable storage for correctness. The PAR paper further motivated why simple checksums are not enough and why local storage and the global protocol need to be aware of each other if cluster availability is to be maximized.

Implementing Paxos correctly in the presence of a realistic storage fault model is certainly non-trivial.
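For illustration only (a heavily simplified sketch, not the PAR protocol itself): as I understand it, one of PAR's key observations is that a checksum failure at the tail of the log may just be a torn write from a crash and can safely be truncated, whereas a failure inside the committed prefix is real corruption that must be repaired from other replicas. A naive local scan cannot be checksum-only; it must classify:

```python
import hashlib

def make_entry(data: bytes) -> dict:
    """A toy log entry: payload plus its checksum."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def entry_ok(entry: dict) -> bool:
    return hashlib.sha256(entry["data"]).hexdigest() == entry["checksum"]

def classify_log(entries: list) -> tuple:
    """Scan a consensus log on startup.

    Returns ("clean", n) if all n entries verify; ("torn-tail", i) if only
    the final entry fails (plausibly a crash mid-write, safe to truncate
    since it was never acknowledged); or ("corrupt", i) if a bad entry is
    followed by good ones (definitely corruption of committed state, so
    entry i must be recovered from other replicas, not truncated away).
    """
    bad = [i for i, e in enumerate(entries) if not entry_ok(e)]
    if not bad:
        return ("clean", len(entries))
    first_bad = bad[0]
    if first_bad == len(entries) - 1:
        return ("torn-tail", first_bad)
    return ("corrupt", first_bad)
```

Truncating in the "corrupt" case is exactly the bug class the paper describes: a replica silently discards committed entries, and a later leader election can then propagate the loss cluster-wide.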

I believe Viewstamped Replication (VSR), and in particular the 2012 revision by Liskov and Cowling, is a much better place to start when implementing a state machine replication protocol, since it places absolutely no demands on the disk for correctness of the protocol.

In fact, this was the key reason we picked VSR for TigerBeetle: it's not only elegant, but also one of the very few protocols that actually make sense in a production context where a realistic storage fault model is at play. For example, the VSR view change is entirely correct without any reliance on disk, whereas most other protocols would need significant deviations (and proofs) to make the same guarantee.

hawk_ · 4 years ago
> correctness unfortunately rapidly breaks down when physical disks are used, since these may misdirect reads and writes.

Isn't the whole point of consensus to get around discrepancies due to faults in individual devices? With a high enough quorum in Raft, for example, you can lower the probability of data loss to an arbitrarily low point, so that it doesn't matter at all in practice.

_vvhw · 4 years ago
"Isn't the whole point of consensus to get around discrepancies due to faults in individual devices"

Yes, but only with respect to the fault models actually adopted, proven, implemented, and tested. Again, most formal proofs, papers, and implementations do not incorporate any kind of storage fault model, yet this is essential for correctness in a production deployment.

If you look closely at Paxos and Raft, they attempt correctness only for the network fault model {packets may be reordered/lost/delayed/replayed/redirected} and the process fault model {crash failure}, nothing else. For example, there's no mention of what to do if your disk forgets some writes, whether because of a misdirected write, a latent sector error, or corruption in the middle of the committed consensus log.

Unfortunately, simply increasing the quorum will not actually lower the probability of data loss. In fact, there are trivial cases where even a single local disk sector failure on one replica can cause an entire Paxos or Raft cluster to experience global data loss.

Protocol-Aware Recovery for Consensus-Based Storage from UW-Madison goes into this in detail, showing where well-known distributed systems are not correct, and also showing how this can be fixed. It won Best Paper at FAST '18 but is still catching on in industry.

We implemented Protocol-Aware Recovery for TigerBeetle [1], and use deterministic simulation testing to inject all kinds of storage faults on all replicas in a cluster, even as high as 20% simultaneously on all 5 replicas, including the leader. It was pretty magical for us to watch TigerBeetle just keep on running the first time, with high availability and no loss of data, despite radioactively high levels of corruption across the cluster.

[1] https://www.tigerbeetle.com
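A toy version of that testing style (assuming nothing about TigerBeetle's actual implementation): a seeded PRNG drives the injected faults, so any run that fails can be replayed deterministically from its seed, and reads fall back to other replicas until a checksum matches.

```python
import hashlib
import random

class SimulatedDisk:
    """A toy deterministic disk that corrupts a fraction of reads,
    driven by a seeded PRNG so every failing run is reproducible."""

    def __init__(self, seed: int, corruption_rate: float = 0.2):
        self.rng = random.Random(seed)
        self.corruption_rate = corruption_rate
        self.sectors = {}

    def write(self, sector: int, data: bytes) -> None:
        self.sectors[sector] = data

    def read(self, sector: int) -> bytes:
        data = self.sectors[sector]
        if self.rng.random() < self.corruption_rate:
            # Flip one byte to simulate a latent sector error / bit rot.
            i = self.rng.randrange(len(data))
            data = data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]
        return data

def read_verified(disks: list, sector: int, checksum: str) -> bytes:
    """Read a sector, falling back to other replicas' disks until the
    checksum matches: the essence of protocol-aware repair."""
    for disk in disks:
        data = disk.read(sector)
        if hashlib.sha256(data).hexdigest() == checksum:
            return data
    raise IOError(f"sector {sector} unrecoverable on all replicas")
```

With a 20% per-read corruption rate and 5 replicas, all copies of a sector are corrupt on a given read only about 0.03% of the time, which is why the cluster as a whole can keep running through fault levels that would destroy any single node.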

hawk_ · 4 years ago
do you happen to have a mirror for the vr-revisited paper? This link doesn't work

https://pmg.csail.mit.edu/papers/vr-revisited.pdf

_vvhw · 4 years ago
I think the issue is with the scheme, this works:

http://pmg.csail.mit.edu/papers/vr-revisited.pdf

throw10920 · 4 years ago
Assuming it's possible for most proofs to become automatically generated (which I don't think is a given), then it seems like verification work will shift from manually writing proofs into writing the specification from which proofs will automatically be generated - in which case, verification engineers will be able to look forward to an exciting future of figuring out how to tell the computer what is meant by "is".

Or doing system implementation directly - it doesn't seem like automated program synthesis is going anywhere fast...

(I jest, while still growing to be more of a fan of type systems, borrow checkers, and formal proofs)

lmm · 4 years ago
I mean, that's always been the key skill of programming. Telling the computer how to do the thing isn't the hard part; figuring out exactly what needs to be done is.

johnbender · 4 years ago
More likely we’re looking at codesign/correct-by-construction for fully general purpose verification.

In many cases it’s possible to refine a state-machine-based specification to an imperative implementation (and thereby carry safety properties down to the implementation), but at present the implementation usually looks like the state machine (thus codesign)

ncmncm · 4 years ago
A proof about a state machine is more useful than a proof about a program, or an automatically generated program. Proving that a program correctly implements a state machine is a much simpler exercise than teaching a proof system about your programming language.

threatofrain · 4 years ago
Towards an Automatic Proof of Lamport's Paxos (2021)

https://arxiv.org/pdf/2108.08796.pdf

sethammons · 4 years ago
> The tool the researchers designed for this proof is called IC3PO, a model checking system that looks through every state a program can enter and determines whether it matches a description of safe behavior. If the protocol is correct, IC3PO produces what’s termed an inductive invariant—a proof by induction that the property holds in all cases. If instead a bug is found in the protocol, it will produce a counter-example and execution trace, showing step by step how the bug manifests.

That's really cool.
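To give a feel for what "inductive invariant" means here (a toy example, vastly simpler than what IC3PO handles): for a two-process spinlock, the mutual exclusion property alone is not inductive; it has to be strengthened with a fact about the lock before every transition preserves it. Finding that strengthening automatically is exactly what tools in the IC3 family do.

```python
from itertools import product

PCS = ("idle", "trying", "critical")
STATES = list(product(PCS, PCS, (0, 1)))  # every state, reachable or not

def initial(state):
    return state == ("idle", "idle", 0)

def moves(pc, lock):
    """Enabled transitions for one process, given its pc and the shared lock."""
    if pc == "idle":
        yield "trying", lock
    elif pc == "trying" and lock == 0:
        yield "critical", 1              # acquire the lock
    elif pc == "critical":
        yield "idle", 0                  # release the lock

def successors(state):
    pc0, pc1, lock = state
    for n, l in moves(pc0, lock):
        yield (n, pc1, l)
    for n, l in moves(pc1, lock):
        yield (pc0, n, l)

def safety(state):
    pc0, pc1, _ = state
    return not (pc0 == "critical" and pc1 == "critical")

def strengthened(state):
    # Safety plus: the lock counts how many processes are critical.
    pc0, pc1, lock = state
    return safety(state) and lock == (pc0, pc1).count("critical")

def check_inductive(inv):
    """Init => Inv, and Inv is preserved by every transition. Returns a
    counterexample transition (s, t) if not inductive, else None."""
    for s in STATES:
        if initial(s) and not inv(s):
            return (None, s)
        if inv(s):
            for t in successors(s):
                if not inv(t):
                    return (s, t)
    return None
```

Here `check_inductive(safety)` returns a counterexample (e.g., from a state where one process is critical but the lock reads 0, the other process can also enter), while `check_inductive(strengthened)` returns None: the strengthened invariant is the machine-checkable proof artifact.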

thrtythreeforty · 4 years ago
You can also find this in hardware test benches - SymbiYosys (FOSS) and JasperGold (very proprietary) can prove via induction that assertions about your system on chip hold true for all time.

meetups323 · 4 years ago
> The inductive invariant IC3PO produced for Paxos in under an hour identically matches the human-written one previously derived with significant manual effort using a technique called interactive theorem proving. On top of speeding the process up, it also produces a proof with very succinct and digestible documentation.

Is there anything actually different between "interactive theorem proving" in a sufficiently advanced language (say Coq) and what they've done here? Honest question, I had to drop out of my college Coq class on account of a broken leg (and lack of willpower).

johnbender · 4 years ago
In Coq the user generally writes metaprograms (Ltac scripts) to generate proofs as programs, at varying levels of automation. Here they used IC3 to automatically produce a proof that their state-based model of Paxos satisfies the invariant.

In principle one could build an automated proof system for a state machine-esque formalism written in Coq using Ltac but that would be a nightmare in practice.

generationP · 4 years ago
Wondering where the word "secure" in the title is coming from. These are purely safety results; the setting doesn't even allow for Byzantine failures.

lowbloodsugar · 4 years ago
I thought that Lamport already did that with TLA+? What is new about this one?

MatteoFrigo · 4 years ago
TLA+ is a specification language. Its goal is to say exactly what the algorithm is, because an informal description in English or even pseudo-code is usually too ambiguous and not amenable to mechanical proofs. TLA+ by itself does not prove anything. You can write incorrect algorithms in TLA+, in the same way that you can write incorrect C programs.

Once you have a TLA+ specification, lazy people like me run the specification through a tool called TLC that exhaustively explores all possible behaviors of the algorithm in a finite search space. For example, the specification may say that a property is valid for all N, but TLC checks it for N=1, 2, and 3. This step is not a "proof" (it's more like a test suite), but people like me say "good enough" and ship at this point.
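A toy illustration of that workflow (nothing to do with TLC's actual implementation): exhaustively enumerate every reachable state of a small model at a fixed instance size N, and test the property in each. The model below is a hypothetical N-process spinlock, checked for mutual exclusion.

```python
from collections import deque

def successors(state):
    """N-process spinlock: each pc is 'idle' | 'crit', one shared lock."""
    pcs, lock = state
    for i, pc in enumerate(pcs):
        if pc == "idle" and lock == 0:
            yield (pcs[:i] + ("crit",) + pcs[i + 1:], 1)   # acquire
        elif pc == "crit":
            yield (pcs[:i] + ("idle",) + pcs[i + 1:], 0)   # release

def mutual_exclusion(state):
    pcs, _ = state
    return pcs.count("crit") <= 1

def check(n, prop):
    """TLC-style check: breadth-first enumeration of every reachable
    state for instance size n, testing prop in each. Returns a
    counterexample state, or None if the property holds for this n."""
    init = (("idle",) * n, 0)
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if not prop(s):
            return s
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return None
```

As with TLC, passing for n = 1, 2, 3 proves nothing about all n, but in practice it catches the vast majority of protocol bugs, which is why "good enough, ship it" is a defensible stopping point.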

Lamport and colleagues have a tool called TLAPS where you can write a proof yourself (e.g., valid for all N), and the tool checks that the proof proves what it claims to prove.

The next step, which is what this paper is all about, is to derive the proof automatically.

aman-goel · 4 years ago
Here is a link to a YouTube video on this work: https://youtu.be/nPwlj6w6EXU

The tool IC3PO is open source on GitHub: https://github.com/aman-goel/ic3po