Consensus Algorithms: Paxos and Raft

Posted in distributed-systems by Christopher R. Wirz on Tue Aug 27 2024

Consensus algorithms are crucial for building reliable distributed systems. While Paxos has been the go-to solution for many years, Raft has gained popularity due to its simpler design and easier implementation. Understanding these algorithms is essential for anyone working with distributed systems or building fault-tolerant applications.

The Need for Consensus

Consensus algorithms allow a group of processes in a distributed system to agree on a shared state. This is crucial for maintaining consistency across the system, especially in the face of failures or network partitions.

Paxos: The Classic Consensus Algorithm

Paxos, developed by Leslie Lamport in the late 1980s, is one of the most well-known consensus algorithms. Despite its proven effectiveness, Paxos has a reputation for being difficult to understand and implement.

Key Components of Paxos:

  • Proposers: Nodes that propose values
  • Acceptors: Nodes that evaluate and decide on proposals
  • Learners: Nodes that need to learn the agreed-upon value

Paxos Protocol Phases:

  1. Prepare Phase: A proposer sends a prepare request with a proposal number, asking acceptors to promise not to accept lower-numbered proposals.
  2. Accept Phase: If a majority of acceptors promise, the proposer sends an accept request carrying the value of the highest-numbered proposal any acceptor reported, or its own value if none was reported (see the sketch after this list).
  3. Learn Phase: Acceptors inform learners of the accepted value.
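
To make the acceptors' role concrete, here is a minimal sketch of single-decree Paxos acceptor logic in Python. The class and method names are illustrative assumptions, not from any particular library.

    class Acceptor:
        """Single-decree Paxos acceptor (illustrative sketch)."""

        def __init__(self):
            self.promised_n = -1        # highest proposal number promised
            self.accepted_n = -1        # number of the proposal accepted so far
            self.accepted_value = None  # value accepted so far, if any

        def handle_prepare(self, n):
            # Promise not to accept proposals numbered below n, and report
            # any value this acceptor has already accepted.
            if n > self.promised_n:
                self.promised_n = n
                return ("promise", self.accepted_n, self.accepted_value)
            return ("reject", self.promised_n)

        def handle_accept(self, n, value):
            # Accept unless a higher-numbered prepare has been promised.
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n = n
                self.accepted_value = value
                return ("accepted", n)
            return ("reject", self.promised_n)

In a real deployment, promised_n, accepted_n, and accepted_value must be written to stable storage before replying, since Paxos assumes agents can crash and restart without forgetting their promises.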

Paxos guarantees safety (only proposed values are chosen, and only a single value is agreed upon) but does not guarantee liveness (progress is not always guaranteed).

Raft: A More Understandable Consensus Algorithm

Raft was introduced in 2014 by Diego Ongaro and John Ousterhout as a more understandable alternative to Paxos.

Key Features of Raft:

  • Leader Election: A distinct phase for electing a leader
  • Log Replication: The leader manages log replication among followers
  • Safety: Guarantees log consistency across nodes

Raft Protocol Phases:

  1. Leader Election: Nodes vote to elect a leader
  2. Normal Operation (Log Replication): The leader accepts client requests and replicates log entries

Raft simplifies the consensus process by clearly separating concerns and introducing the concept of terms (similar to views in other protocols).
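
To illustrate, here is a hedged sketch in Python of how a node's election timeout and term handling might look. The class shape, the request_vote call, and the timeout values are assumptions for illustration; real implementations run this logic asynchronously.

    import random
    import time

    class RaftNode:
        def __init__(self, node_id, peers):
            self.node_id = node_id
            self.peers = peers            # other nodes in the cluster
            self.state = "follower"       # follower | candidate | leader
            self.current_term = 0
            self.voted_for = None
            self.reset_election_timeout()

        def reset_election_timeout(self):
            # Randomized timeouts make simultaneous candidacies (split votes) rare.
            self.election_deadline = time.monotonic() + random.uniform(0.15, 0.30)

        def on_election_timeout(self):
            # No heartbeat heard in time: start an election for a new term.
            self.state = "candidate"
            self.current_term += 1
            self.voted_for = self.node_id
            votes = 1                     # a candidate votes for itself
            for peer in self.peers:
                # request_vote stands in for Raft's RequestVote RPC
                if peer.request_vote(self.current_term, self.node_id):
                    votes += 1
            if votes > (len(self.peers) + 1) // 2:
                self.state = "leader"     # a majority of the cluster agreed
            else:
                self.reset_election_timeout()  # try again after another timeout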

Comparing Paxos and Raft

While both algorithms solve the consensus problem, they differ in their approach:

  • Complexity: Raft is designed to be more understandable and easier to implement
  • Leadership: Raft has a strong leader model, while Paxos allows for multiple proposers
  • Understandability: A user study reported in the Raft paper found that students answered questions about Raft more accurately than questions about Paxos

Real-world Applications

Both Paxos and Raft have been implemented in various production systems. Multi-Paxos variants underpin Google's Chubby lock service and Spanner, while Raft powers etcd, HashiCorp Consul, and CockroachDB.

Key Concepts

Consensus: The process by which a group of nodes in a distributed system agree on a single value or state.

Distributed System: A network of independent computers that appear to users as a single coherent system.

Safety: The property that ensures nothing bad ever happens in the system (e.g., two nodes never disagree on a chosen value).

Liveness: The property that ensures something good eventually happens (e.g., the system eventually reaches a decision).

Quorum: The minimum number of nodes that must participate in a vote for the vote to be considered valid, typically a majority.
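
A quick check of why a majority quorum works: two majorities of the same cluster must overlap, so two conflicting values can never both be chosen.

    def majority(n):
        # Smallest number of nodes that forms a majority of n.
        return n // 2 + 1

    # In a 5-node cluster the quorum size is 3. Two disjoint quorums would
    # need 3 + 3 = 6 nodes, more than exist, so any two quorums intersect.
    assert majority(5) == 3
    assert 2 * majority(5) > 5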

State Machine Replication: A technique where each node in a distributed system is modeled as a state machine, and all nodes process the same sequence of commands to maintain consistency.
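
A minimal sketch of the idea: replicas that start from the same state and apply the same deterministic commands in the same order must end in the same state. The command format here is a hypothetical key-value example.

    class KVStateMachine:
        """Deterministic state machine: same command log => same final state."""

        def __init__(self):
            self.data = {}

        def apply(self, command):
            op, key, value = command
            if op == "set":
                self.data[key] = value
            elif op == "delete":
                self.data.pop(key, None)

    log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]
    a, b = KVStateMachine(), KVStateMachine()
    for cmd in log:
        a.apply(cmd)
        b.apply(cmd)
    assert a.data == b.data  # replicas agree after replaying the same log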

View: In Viewstamped Replication, a configuration of the system with an assigned leader.

Snapshot: A point-in-time copy of the system's state, used to optimize log management and node recovery.

Log Compaction: The process of reducing the size of the replicated log by discarding old entries that are no longer needed.

Byzantine Fault Tolerance: The ability of a system to function correctly even when some nodes fail in arbitrary or malicious ways (not addressed by Paxos or Raft).

Paxos-specific Concepts

Proposer: A node in Paxos that proposes values to be agreed upon.

Acceptor: A node in Paxos that decides whether to accept proposed values.

Learner: A node in Paxos that learns and stores the agreed-upon values.

Proposal Number: A unique identifier for each proposal in Paxos, used to order proposals.
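
One common way to generate proposal numbers that are globally unique and totally ordered is to interleave a local round counter with the node's ID. This scheme is an illustration, not the only option.

    def next_proposal_number(round_counter, node_id, num_nodes):
        # Distinct nodes always land on distinct numbers, and a node can
        # outbid any number it has seen by incrementing its round counter.
        return round_counter * num_nodes + node_id

    # Node 2 of a 5-node cluster, on its third round:
    assert next_proposal_number(3, 2, 5) == 17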

Prepare Phase: The first phase of Paxos where a proposer asks acceptors to promise not to accept lower-numbered proposals.

Accept Phase: The second phase of Paxos where a proposer asks acceptors to accept a specific value.

Multi-Paxos: An optimization of Paxos for agreeing on a sequence of values, often using a stable leader.
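
A rough sketch of the Multi-Paxos optimization: a stable leader runs the prepare phase once to establish its ballot, then needs only one accept round per log slot. The acceptor interface here (handle_accept_slot) is a hypothetical name.

    class MultiPaxosLeader:
        def __init__(self, acceptors, ballot):
            self.acceptors = acceptors
            self.ballot = ballot   # established by a single earlier prepare round
            self.next_slot = 0

        def propose(self, command):
            # Each command gets its own consensus instance (log slot), but with
            # a stable leader only Phase 2 (accept) runs per command.
            slot = self.next_slot
            self.next_slot += 1
            acks = sum(1 for a in self.acceptors
                       if a.handle_accept_slot(slot, self.ballot, command))
            return acks > len(self.acceptors) // 2  # chosen once a majority accepts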

Raft-specific Concepts

Leader: The node responsible for handling all client requests and managing replication in Raft.

Follower: A passive node in Raft that responds to requests from leaders and candidates.

Candidate: A node in Raft that is competing to become a leader during an election.

Term: A logical clock in Raft; each term begins with an election and has at most one leader.

Log Entry: A record in Raft containing a client command and the term when the entry was received by the leader.

Log Replication: The process in Raft where the leader replicates its log entries to followers.

Leader Election: The process in Raft where nodes vote to select a new leader.

Heartbeat: A periodic message sent by the Raft leader to maintain its authority and prevent new elections.
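
Heartbeats are just AppendEntries RPCs with no entries. Below is a simplified Python sketch of a follower's handler, written as a method of an assumed node class, showing the term check, the log-matching consistency check, and conflict resolution; argument names loosely follow the Raft paper.

    def append_entries(self, term, prev_index, prev_term, entries, leader_commit):
        # An empty entries list makes this call a heartbeat.
        if term < self.current_term:
            return False                   # reject a stale leader
        self.current_term = term
        self.state = "follower"
        self.reset_election_timeout()      # hearing from the leader suppresses elections

        # Log matching: the entry just before the new ones must exist and match.
        if prev_index >= 0 and (prev_index >= len(self.log)
                                or self.log[prev_index].term != prev_term):
            return False

        # Truncate at the first conflicting entry, then append anything new.
        idx = prev_index + 1
        for entry in entries:
            if idx < len(self.log) and self.log[idx].term != entry.term:
                del self.log[idx:]         # discard the conflicting suffix
            if idx >= len(self.log):
                self.log.append(entry)
            idx += 1

        # Advance the commit index as far as the leader allows.
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
        return True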

Review Questions

What are the main ideas behind 2PC and 3PC?

2PC (Two-Phase Commit):

  • Uses a coordinator
  • Assumes the coordinator won't fail
  • The coordinator proposes a value, participants vote, and the coordinator tallies the votes and communicates the decision
  • Simple, but blocks when failures occur and does not guarantee liveness (see the sketch after this list)
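
A minimal sketch of the 2PC coordinator's decision rule in Python. The participant objects and their vote/commit/abort methods are assumptions for illustration.

    def two_phase_commit(value, participants):
        # Phase 1 (voting): every participant must vote yes.
        votes = [p.vote(value) for p in participants]

        # Phase 2 (decision): unanimous yes means commit; otherwise abort.
        if all(votes):
            for p in participants:
                p.commit(value)
            return "committed"
        for p in participants:
            p.abort()
        return "aborted"

If the coordinator crashes between the two phases, a participant that has already voted yes cannot safely decide on its own; this is the blocking behavior 3PC tries to remove.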

3PC (Three-Phase Commit):

  • Addresses the blocking problem of 2PC
  • Splits the commit into can-commit, pre-commit, and do-commit (decision) phases
  • Guarantees liveness, but only under a fail-stop failure model
  • Loses safety if nodes restart or if delayed messages are delivered later

Provide some history behind Paxos

  1. The paper, "The Part-Time Parliament," was originally written by Leslie Lamport around 1990
  2. It was not published until 1998, partly because reviewers did not appreciate its humorous presentation
  3. The paper described the algorithm through the fiction of a part-time parliament on the Greek island of Paxos
  4. In 2001, Lamport wrote "Paxos Made Simple" to make the algorithm more approachable

Describe Paxos. What is expected from the system (the system model) for the protocol to hold?

  • Designed for systems with asynchronous communication and non-Byzantine process failures
  • Agents operate at arbitrary speed and may fail by stopping and later restarting
  • Participants have persistent memory, so they remember their state after restarting
  • Messages can take arbitrarily long to be delivered and may be duplicated, lost, or reordered, but cannot be corrupted

What is hard about determining whom to follow?

Determining leadership is hard because, in an asynchronous system, a crashed node cannot be distinguished from one that is merely slow or partitioned; nodes can therefore disagree about who the leader is, and resolving that disagreement is itself a consensus problem.

Main ideas that Paxos assumes: state-machine replication, majority quorum, …

  • State machine replication: Each node is a replica of a state machine following the same rules for state updates
  • Majority quorum: Decisions based on majority agreement, ensuring two quorums always intersect
  • No reliance on synchronized clocks: proposal numbers (much like Lamport logical clocks) order competing proposals, so safety holds despite arbitrary message delays

Goal and description of the phases in Paxos

Goal: Reach consensus on a single value among distributed nodes.

Phases:

  • Prepare phase: Proposer sends prepare request with proposal number to acceptors
  • Accept phase: If proposer receives majority agreement, it sends accept request to acceptors
  • Learn phase: Acceptors communicate the decided value to learners

What functionality does Paxos provide? Why do we need Multi-Paxos?

Paxos provides consensus on a single value.

Multi-Paxos is needed for agreeing on a sequence of values, which is what real-world applications require (e.g., database updates, application state changes).

Motivation for Raft, its key idea, and differences from Paxos

Motivation: Improve understandability of consensus algorithms

Key idea: Separate leader election from log replication phases

Differences from Paxos:

  • Distinct leader election phase
  • More straightforward log comparison and management
  • Designed for better understandability and easier implementation

Log management in Raft: how is information used to update and discard entries, and to trim/garbage-collect the log?

Logs are append-only:

  • Log matching property: If two entries in different logs share the same index and term, all previous entries are the same
  • New leaders force other servers to use their log
  • Uncommitted entries from previous terms may be discarded
  • Periodic log truncation using snapshots for efficiency (see the sketch after this list)
  • Leaders can send snapshots to help lagging nodes catch up quickly
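
A rough Python sketch of snapshot-based compaction under these rules, again as a method of an assumed node class; save_to_disk and state_machine.serialize are placeholder names.

    def take_snapshot(self):
        # Capture the state machine as of the last committed entry.
        snapshot = {
            "last_included_index": self.commit_index,
            "last_included_term": self.log[self.commit_index].term,
            "state": self.state_machine.serialize(),
        }
        self.save_to_disk(snapshot)
        # The covered log prefix is now redundant and can be trimmed. Real
        # implementations keep last_included_index so later log positions can
        # still be translated, and a leader can ship this same snapshot to a
        # follower that has fallen too far behind the trimmed log.
        self.log = self.log[self.commit_index + 1:]
        return snapshot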