Readit News
rystsov commented on Why `fsync()`: Losing unsynced data on a single node leads to global data loss   redpanda.com/blog/why-fsy... · Posted by u/avinassh
ruuda · 2 years ago
> A system must use cutting-edge Byzantine fault-tolerant (BFT) replication protocols, which neither of these systems currently employ.

Cutting-edge? pBFT (Practical Byzantine Fault Tolerance) was published in 1999. The first Tendermint release was in 2015. With few exceptions, almost all big proof of stake blockchains are powered by variations of pBFT and have been for many years.

rystsov · 2 years ago
Yep, I still consider them cutting edge. Paxos was written in 1990 but the industry adopted it only in the 2010s. For example, I've looked through pBFT and it doesn't mention a reconfiguration protocol, which is essential for industrial use. I found one from 2012, so it should be getting ripe by now.
rystsov commented on Why `fsync()`: Losing unsynced data on a single node leads to global data loss   redpanda.com/blog/why-fsy... · Posted by u/avinassh
judofyr · 2 years ago
It just feels like two widely different scenarios we're talking about here.

https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-... talks about the case of a single failure and it shows how (a) Raft without fsync() loses ACK-ed messages and (b) Kafka without fsync() handles it fine.

This post on the other hand talks about a case where we have (a) one node being network partitioned, (b) the leader crashing, losing data, and coming back up again, all while (c) ZooKeeper doesn't catch that the leader crashed and elects another leader.

I definitely think the title/blurb should be updated to clarify that this only applies in the "exceptional" case of >f failures.

I mean, the following paragraph seems completely misleading:

> Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.

The next section (and the Kafka example) is talking about loss of power on a single node combined with another node being isolated. That's very different from just "loss of power on a single node".

rystsov · 2 years ago
We can't ignore network partitioning or pretend it doesn't happen. When people talk about choosing two out of CAP, the real question is C or A, because P is out of our control.

When we combine network partitioning with the loss of a single node's local data suffix, it either leads to a consistency violation or to the system being unavailable despite the majority of the nodes being up. At the moment Kafka chooses availability over consistency.

Also, I read the Kafka source and the role of network partitioning doesn't seem to be crucial. I suspect that it's also possible to cause a similar problem with a single-node power outage and unfortunate timing: https://twitter.com/rystsov/status/1641166637356417027
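
To make that concrete, here's a tiny standalone sketch (a made-up Node class in Python, not Kafka or Redpanda code) of how a write acked by a majority can vanish after a single power outage when the acked suffix was never fsynced:

  # Hypothetical model: "disk" survives power loss, "memory" doesn't.
  class Node:
      def __init__(self, name):
          self.name = name
          self.disk = []      # fsynced records
          self.memory = []    # acked but unsynced suffix

      def append(self, record, fsync=False):
          (self.disk if fsync else self.memory).append(record)

      def power_loss(self):
          self.memory = []    # the unsynced suffix is gone after restart

      def log(self):
          return self.disk + self.memory

  n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")

  # n1 is partitioned away; leader n3 replicates to n2, which is a majority,
  # so every record below gets acked to the client without any fsync.
  for i in range(10):
      n3.append(f"r{i}")
      n2.append(f"r{i}")

  n3.power_loss()   # the single power outage: leader restarts without the suffix

  # n1 rejoins; {n1, n3} is a quorum, and neither of them has the acked
  # records, so a leader can be elected whose log silently misses them.
  print(n3.log(), n1.log())   # [] [] -> acked data is globally lost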

rystsov commented on Why `fsync()`: Losing unsynced data on a single node leads to global data loss   redpanda.com/blog/why-fsy... · Posted by u/avinassh
judofyr · 2 years ago
I'm confused here. If I'm reading this correctly, the following is happening:

- Node 3 is the leader.

- Node 1 is isolated (failure #1).

- 10 records are written. They are successful because a majority is still alive (persisted to Node 2 and 3).

- Node 3 loses data (failure #2).

- Node 1 comes back up again.

However, isn't this two failures at the same time? Kafka with three nodes can only guarantee a single failure, no?

rystsov · 2 years ago
Strongly consistent protocols such as Paxos and Raft always choose consistency over availability, and when consistency isn't certain they refuse to answer.

Raft & Paxos: any number of nodes may be down; as soon as the majority is available, the replicated system is available and doesn't lie.

Kafka as it's described in the post (*): any number of nodes may be down and at most one power outage (loss of unsynced data) is allowed; as soon as the majority is available, the replicated system is available and doesn't lie.

The counter-example simulates a single power outage.

(*) https://jack-vanlightly.com/blog/2023/4/24/why-apache-kafka-...

rystsov commented on Redpanda’s official Jepsen What we fixed, and what we shouldn’t   redpanda.com/blog/redpand... · Posted by u/rystsov
metadat · 3 years ago
I realize running it through Mr. Kingsbury's Jepsen is a solid PR move; 10/10 top nerds love what submitting to Jepsen signals: confidence and commitment to correct behavior. I doubt anyone would fault you for dropping a hundred grand; it's more or less the ticket price to enter the arena of "proper" distributed systems.

I'm curious though, what _new_ bugs or integrity violations did you learn about from the Jepsen runs? In your post, it mentions you were already aware of most or all problems through in-house chaos monkey testing. Did I read correctly?

rystsov · 3 years ago
I've been following Jepsen results since Kyle's first post, and it's amazing that the blog post series became a well-respected company.

The report revealed the following previously unknown consistency issues, which we had to fix:

  - duplicated writes by default
  - aborted read with InvalidTxnState
  - lost transactional writes
The first issue was caused by a client: starting with recent versions the client has idempotency on by default, but when the server side doesn't support it (we had idempotency behind a feature flag) the client doesn't complain. We will enable idempotency by default in 21.1.1, so it shouldn't be an issue. Also, it's possible to turn the feature flag on for the older versions.
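
For reference, this is roughly what opting into idempotence explicitly looks like on the client side; a minimal sketch with the confluent-kafka Python client (the broker address and topic name are made up):

  from confluent_kafka import Producer

  # Request idempotent produce explicitly instead of relying on client defaults.
  producer = Producer({
      "bootstrap.servers": "localhost:9092",   # made-up broker address
      "enable.idempotence": True,
  })
  producer.produce("example-topic", value=b"hello")  # made-up topic
  producer.flush()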

The other two issues were related to transactions; we hadn't chaos-tested the abort operation since it's very similar to commit, but even the tiny difference in logic was enough to hide the bug. It's fixed now too.

rystsov commented on Redpanda’s official Jepsen What we fixed, and what we shouldn’t   redpanda.com/blog/redpand... · Posted by u/rystsov
rystsov · 3 years ago
Hey folks, I wrote this post and I'm happy to answer questions
rystsov commented on Jepsen: Redpanda 21.10.1   jepsen.io/analyses/redpan... · Posted by u/aphyr
doommius · 3 years ago
Always great to read this. I performed a Jepsen test on Microsoft internal infra and it's a huge insight. From an academic side it's just as interesting to look into the lack of standards around consistency and its definitions.
rystsov · 3 years ago
Cool! What did you test? I played with Jepsen and Cosmos DB when I was at Microsoft, but we had to ditch ssh, write a custom agent, and inject faults with PowerShell cmdlets.
rystsov commented on Jepsen: Redpanda 21.10.1   jepsen.io/analyses/redpan... · Posted by u/aphyr
antonmry · 3 years ago
This report seems to have some wrong insights. Auto-committing offsets doesn't imply data loss if records are processed synchronously. This is the safest way to test Kafka, rather than committing offsets manually.
rystsov · 3 years ago
Can you clarify what you mean? AFAIK with manual commit you have the most control over when the commit happens

Look at this blog post describing a data loss caused by auto-commit: https://newrelic.com/blog/best-practices/kafka-consumer-conf...

There may also be more subtle issues with auto-commit: https://github.com/edenhill/librdkafka/issues/2782
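
For comparison, here's a minimal consumer sketch (confluent-kafka Python client; broker, group, and topic names are made up) that processes a record synchronously and only then commits its offset, which closes the window that auto-commit leaves open:

  from confluent_kafka import Consumer

  def handle(value):          # stand-in for the real synchronous processing
      print(value)

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",  # made-up broker address
      "group.id": "example-group",
      "enable.auto.commit": False,            # we decide when the offset moves
      "auto.offset.reset": "earliest",
  })
  consumer.subscribe(["example-topic"])

  try:
      while True:
          msg = consumer.poll(1.0)
          if msg is None or msg.error():
              continue
          handle(msg.value())
          # Commit only after processing succeeded; a crash before this line
          # re-delivers the record instead of losing it.
          consumer.commit(message=msg, asynchronous=False)
  finally:
      consumer.close()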

rystsov commented on Jepsen: Redpanda 21.10.1   jepsen.io/analyses/redpan... · Posted by u/aphyr
excuses_ · 3 years ago
I wonder if Redpanda thinks about or offers some alternative protocol that would be better defined in terms of transaction guarantees. At this point it looks like Kafka’s protocol was a nice try but it needs a major refactoring.
rystsov · 3 years ago
The documentation is a bit confusing: the protocol evolved over time (new KIPs) and there is a mismatch between the database model and the Kafka model. But we see a lot of potential in the Kafka transactional protocol.

At Redpanda we were able to push up to 5k distributed transactions across replicated shards. It would be mind-blowing for a database to achieve the same result.

Also, the Kafka transactional protocol works at a low level, so it's very easy to build systems on top of it. For example, it's very easy to build a Calvin-inspired system: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...
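
As an illustration of that low-level building block, here's a minimal sketch (confluent-kafka Python client; the transactional id and topic names are made up) of an atomic write spanning two topics, the kind of primitive a Calvin-style layer could be built on:

  from confluent_kafka import Producer

  producer = Producer({
      "bootstrap.servers": "localhost:9092",        # made-up broker address
      "transactional.id": "example-txn-producer",   # made-up transactional id
  })
  producer.init_transactions()

  producer.begin_transaction()
  producer.produce("accounts-a", key=b"acc-1", value=b"debit 10")
  producer.produce("accounts-b", key=b"acc-2", value=b"credit 10")
  # Either both records become visible to read_committed consumers or neither does.
  producer.commit_transaction()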

rystsov commented on Jepsen: Redpanda 21.10.1   jepsen.io/analyses/redpan... · Posted by u/aphyr
cgaebel · 3 years ago
Thanks for working with Jepsen. Being willing to subject your product to their testing is a huge boon for Redpanda's credibility.

I have two questions:

1. How surprising were the bugs that Jepsen found?

2. Besides the obvious regression tests for bugs that Jepsen found, how did this report change Redpanda's overall approach to testing? Were there classes of tests missing?

rystsov · 3 years ago
It wasn't a big surprise for us. Redpanda is a complex distributed system with multiple components even at the core level: consensus, idempotency, transactions. So we were prepared for something to be off (but we were pleased to find that all the safety issues were with things that were behind feature flags at the time).

Also, we have internal chaos tests, and by the time the partnership with Kyle started we had already identified half of the consistency issues and sent PRs with fixes. The issues got into the report because the changes weren't released yet when we started. But it is acknowledged in the report:

> The Redpanda team already had an extensive test suite—including fault injection—prior to our collaboration. Their work found several serious issues including duplicate writes (#3039), inconsistent offsets (#3003), and aborted reads/circular information flow (#3036) before Jepsen encountered them

We missed the other issues because we hadn't exercised some scenarios. As soon as Kyle found the issues we were able to reproduce them with the in-house chaos tests and fix them. This dual testing approach (Jepsen + the existing chaos harness) was very beneficial: we were able to check the results and give Kyle feedback on whether he'd found a real issue or whether it looked more like expected behavior.

We fixed all the consistency (safety) issues, but there are several unresolved availability dips. We'll stick with Jepsen (the framework) until we're sure we've fixed them too, but then we'll probably rely just on the in-house tests.

Clojure is a very powerful language and I was truly amazed how fast Kyle was able to adjust his tests to new information, but we don't have Clojure expertise and even simple tasks take time. So it's probably wiser to use what we already know, even if it's a bit more verbose.
