Paper Reading -- The Byzantine Generals Problem

11 May 2016

We care about reliable communication in distributed systems. One important aspect is how the system effectively cope with failures. There exists many well-studied failures, like crash failure which failed component simply stops communicating messages. Yet some other failures may send out conflicting messages to different system components. Lamport et al. investigated this failure, and they model it as the Byzantine Generals Problem, which made this special type of failure well known as the ``byzantine failure’’ model. Their simple conclusion is that, using only oral messages (which implied the message is forgeable), the problem is solvable iff more than two thirds of the generals are loyal. With unforgeable (written) messages, the problem is solvable for any number of generals and possbile traitors. This result indeed is very interesting and let us see how the authors unfold this story little by little.

1. Introduction

The authors motivated the problem as the decision making process of the Byzantine Generals, among who may exists traitors. They made goals for the Byzantine Generals that:

A. All loyal generals decide upon the same plan of action;
B. A small number of traitors cannot cause the loyal generals to adopt a bad plan.

These two reasonable goals say about the outcome of the decision, yet they are hard to formalize (especially goal B). Instead, authors took another angle by considering how the actions (decisions) were taken. Given all the messages the generals communicate: \( v_1, v_2, …, v_n \), our goal is to find a combining scheme to generate a single plan of action out of all these values. Applying this strategy to convert our goals into a formal definition, it becomes:

1. Any two loyal generals use the same value of v_i
2. If the i^{th} general is loyal, then the calue that he sends must be used by every loyal general as the value of v_i

These conditions are both on the single value sent by the \(i^{th}\) general. Based on such observation, the original problem can be further rephrased into the following one:

Byzantine Generals Problem. A commanding genral must send an order to his \(n-1\) lieutenant generals such that

IC1. All loyal lieutenants obey the same order.
IC2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends.

Where condition IC stands for the interactive consistency conditions. So really, we are now considering a single order that a single commanding general sends out. Amazing. Now this problem seems (deceptively) simple, right?

2. Impossibility Results

The authors show using a very simple example – three generals – to illustrate the impossibility result that we all know:

No solution with fewer than 3m + 1 generals can cope with m traitors.

The main idea is to prove by contradiction: if a solution exists for \(m\) traitors among fewer than \(3m+1\) generals, then a solution exists for ONE traitor among three generals, which is proven to be impossible.

However, it is very interesting that they point out this is NOT a rigorous mathematical proof and refer readers to their proof in the paper published in JACM’80 by claiming that

"We strongly advise the reader to be very suspicious of such nonrigorous reasoning. Although this result is indeed correct, we have seen equally plausible 'proofs' of invalid results. We know of no area in computer science or mathematics in which informal reasoning is more likely to lead to errors than in the study of this type of algorithm."

Indeed, they are right. I find these sentences candid and thought provoking, and I totally agree with them. We have seen many ‘bogus’ results due to the nonrigorous reasoning and it should alert us. Lamport indeed has strong mathematical background (he did both his MS and PhD in Mathematics), but it should not be the reason why we should not be rigorous and critical.

Going on, the authors pointed out that such impossibility is not posed by the strictness of reaching exact agreement. They further demonstrated that reaching approximate agreement is just as hard as reaching exact agreement. Similarly, they first prove by contructing a three generals agreeing on an approximate time of attack. Then they proved by contradiction for the \(3m+1\) generals.

So far, the proof is complete.

3. A Solution with Oral Messages

After proving the previous impossibility results, naturally it is time for a solution. The authors started from the formal definition of the ``oral’’ message system that:

A1. Every message that is sent is delivered correctly.
A2. The receiver of a message knows who sent it.
A3. The absence of a message can be detected.

The first two assumptions makes impossible to interfere the communication between two generals. The third assumption eliminates the ambiguity that the message may be delayed or due to that the traitor does not send the message. Moreover, the default action of liutenant is RETREAT if A3 is true.

The authors provided the Oral Message algorithm (OM) for this scenario. Given one commander and \(n-1\) lieutenants, for all nonnegative intergers \(m\), we have recursive algorithms

Algorithm OM(0).

(1) The commander send his value to every lieutenant.
(2) Each lieutenant uses the value he receives grom the commander, or uses the value RETREAT if he receives no value.

Algorithm OM(m), \(m>0\).

(1) The commander sends his value to every lieutenant.
(2) For each i, let v_i be the value lieutenant i receives from the commander, or else be RETREAT if he receives no value. Lrutenant i acts as the commander in Algorithm OM(m-1) to send the value v_i to each of the n-2 other lieutenants.
(3) For each i, and each j ≠ i, let v_j be the value lieutenant i received from lieutenant j in step (2), or else RETREAT if he received no such value. Lieutenant i uses the value majority(v_1, ..., v_{n-1}).

Finally, the following theorem asserts that Algorithm \(OM(m)\) solves the Byzantine Generals Problem:

THEOREM 1. For any m, Algorithm OM(m) satisfies conditions IC1 and IC2 if htere are more than 3m generals and at most m traitors.

The prooves are omitted here. It is not hard to see that such algorithms involve sending up to \((n-1)(n-2)\cdots(n-m-1)\) messages, which complexity is exponential. Though such algorithm is expensive, it is optimal given Fisher and Lynch’s proof. It may be possible to reduce the messages transfered, but the authors still expect that a large number of messages will still be required. Interesting though, they did not study it in detail either.

I will stop here for the sake of simplicity, leaving the cases of (a) how signed messages can relax the requirement for number of traitors, and (b) how missing communication paths changes the picture.

The key take-away point from this classical article is that, reaching consensus in a faulty environment is expensive. In the Byzantine failure model assumed in this paper, without proper message authentication, you need at least \(3m+1\) processes to combat \(m\) faulty processes.