Add a design for leader schedule rotation and genesis. (#2714)

Leader schedule rotation.
2019-02-15 16:34:34 -08:00 · 2019-02-15 16:34:34 -08:00 · c74b8b6df3
parent 573116e259
commit c74b8b6df3
1 changed files with 124 additions and 9 deletions
--- a/book/src/leader-rotation.md
+++ b/book/src/leader-rotation.md
@@ -3,17 +3,102 @@
 At any given moment, a cluster expects only one fullnode to produce ledger
 entries. By having only one leader at a time, all validators are able to replay
 identical copies of the ledger. The drawback of only one leader at a time,
-however, is that a malicious leader is cabable of censoring votes and
+however, is that a malicious leader is capable of censoring votes and
 transactions. Since censoring cannot be distinguished from the network dropping
 packets, the cluster cannot simply elect a single node to hold the leader role
-indefinitely. Instead, the cluster minimizes the influence of a malcioius
+indefinitely. Instead, the cluster minimizes the influence of a malicious
 leader by rotating which node takes the lead.

 Each validator selects the expected leader using the same algorithm, described
 below. When the validator receives a new signed ledger entry, it can be certain
-that entry was produced by the expected leader.
+that entry was produced by the expected leader.  The order in which leaders
+are assigned slots is called a *leader schedule*.

-## Leader Schedule Generation
+## Leader Schedule Rotation
+
+A validator rejects blocks that are not signed by the *slot leader*.  The list
+of identities of all slot leaders is called a *leader schedule*. The leader
+schedule is recomputed locally and periodically. It assigns slot leaders for a
+duration of time called an *epoch*. The schedule must be computed far in advance
+of the slots it assigns, such that the ledger state it uses to compute the
+schedule is finalized. That lead time is called the *leader schedule offset*.
+Solana sets the offset to the duration of slots until the next epoch. That is,
+the leader schedule for an epoch is calculated from the ledger state at the
+start of the previous epoch. The offset of one epoch is fairly arbitrary and
+assumed to be sufficiently long such that all validators will have finalized
+their ledger state before the next schedule is generated. A cluster may choose
+to shorten the offset to reduce the time between stake changes and leader
+schedule updates.
+
+While operating without partitions lasting longer than an epoch, the schedule
+only needs to be generated when the root fork crosses the epoch boundary.  Since
+the schedule is for the next epoch, any new stakes committed to the root fork
+will not be active until the next epoch.  The block used for generating the
+leader schedule is the first block to cross the epoch boundary.
+
+Without a partition lasting longer than an epoch, the cluster will work as
+follows:
+
+1. A validator continuously updates its own root fork as it votes.
+
+2. The validator updates its leader schedule each time the slot height crosses
+an epoch boundary.
+
+For example:
+
+The epoch duration is 100 slots. The root fork is updated from a fork computed at
+slot height 99 to a fork computed at slot height 102. Forks with slots at heights
+100 and 101 were skipped because of failures.  The new leader schedule is computed
+using the fork at slot height 102.  It is active from slot 200 until it is updated
+again.
+
+No inconsistency can exist because every validator that is voting with the
+cluster has skipped 100 and 101 when its root passes 102.  All validators,
+regardless of voting pattern, would be committing to a root that is either 102,
+or a descendant of 102.
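A minimal sketch of the example above, assuming epochs are fixed-length and the offset is exactly one epoch. The names `epoch_of` and `schedule_update` are hypothetical illustrations and are not part of this change.

```python
def epoch_of(slot: int, slots_per_epoch: int) -> int:
    """Epoch index that contains the given slot height."""
    return slot // slots_per_epoch


def schedule_update(old_root: int, new_root: int, slots_per_epoch: int):
    """Return (seed_slot, first_active_slot) when the root crosses an epoch
    boundary, or None when the root stays inside the same epoch.

    Assumption: the leader schedule offset is one full epoch, so a schedule
    computed from a root in epoch N takes effect at the start of epoch N + 1.
    """
    if epoch_of(new_root, slots_per_epoch) == epoch_of(old_root, slots_per_epoch):
        return None
    next_epoch = epoch_of(new_root, slots_per_epoch) + 1
    return new_root, next_epoch * slots_per_epoch


# Root moves from slot 99 to slot 102; slots 100 and 101 were skipped.
print(schedule_update(99, 102, 100))  # (102, 200)
```

Because the check compares epochs rather than looking for an exact boundary slot, skipped slots such as 100 and 101 do not prevent the update from firing.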
+
+### Leader Schedule Rotation with Epoch Sized Partitions
+
+The duration of the leader schedule offset has a direct relationship to the
+likelihood of a cluster having an inconsistent view of the correct leader
+schedule.
+
+Consider the following scenario:
+
+Two partitions are each generating half of the blocks.  Neither is coming
+to a definitive supermajority fork.  Both will cross the epoch boundaries at
+slots 100 and 200 without actually committing to a root, and therefore without
+a cluster-wide commitment to a new leader schedule.
+
+In this unstable scenario, multiple valid leader schedules exist.
+
+* A leader schedule is generated for every fork whose direct parent is in the
+previous epoch.
+
+* The leader schedule is valid after the start of the next epoch for descendant
+forks until it is updated.
+
+Each partition's schedule will diverge after the partition lasts more than an
+epoch.  For this reason, the epoch duration should be selected to be much
+larger than the slot time and the expected time for a fork to be committed to
+root.
+
+After observing the cluster for a sufficient amount of time, the leader schedule
+offset can be selected based on the median partition duration and its standard
+deviation.  For example, an offset longer than the median partition duration
+plus six standard deviations would reduce the likelihood of an inconsistent
+leader schedule in the cluster to roughly 1 in 1 million.
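The rule of thumb above can be sketched as follows, assuming partition durations have been recorded in slots. The function name `leader_schedule_offset`, the sample data, and the choice of the population standard deviation are all assumptions for illustration and are not part of this change.

```python
import math
import statistics


def leader_schedule_offset(partition_durations, sigmas=6.0):
    """Smallest whole-slot offset covering median + sigmas * stdev.

    partition_durations: observed partition lengths, in slots (hypothetical data).
    """
    median = statistics.median(partition_durations)
    stdev = statistics.pstdev(partition_durations)  # population standard deviation
    return math.ceil(median + sigmas * stdev)


observed = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample: median 4.5, stdev 2.0
print(leader_schedule_offset(observed))  # 17
```

How rare a longer partition actually is depends on the real distribution of partition durations, which this sketch does not model.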
+ 
+## Leader Schedule Generation at Genesis
+
+The genesis block declares the first leader for the first epoch.  This leader
+ends up scheduled for the first two epochs because the leader schedule is also
+generated at slot 0 for the next epoch.  The length of the first two epochs can
+be specified in the genesis block as well.  The length of the first
+epochs must be greater than or equal to the maximum rollback depth as defined in
+[fork selection](fork-selection.md).
+
+## Leader Schedule Generation Algorithm

 Leader schedule is generated using a predefined seed.  The process is as follows:

@@ -27,12 +112,42 @@ Leader schedule is generated using a predefined seed.  The process is as follows
   stake-weighted ordering.
 5. This ordering becomes valid after a cluster-configured number of ticks.
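
One way to picture a seeded, stake-weighted ordering is the sketch below. It is an illustration under stated assumptions, not the algorithm this document specifies: the name `stake_weighted_ordering`, the use of Python's `random.Random` as the pseudo-random source, and sampling without replacement are all hypothetical choices, and every stake is assumed to be positive.

```python
import random


def stake_weighted_ordering(stakes, seed):
    """Order node ids so that higher-staked nodes tend to appear earlier.

    stakes: dict mapping node id -> positive stake (hypothetical active set).
    seed:   shared seed, so every validator derives the same ordering.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on dict insertion order.
    remaining = sorted(stakes.items())
    ordering = []
    while remaining:
        nodes, weights = zip(*remaining)
        # Pick one index with probability proportional to its stake.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        ordering.append(nodes[pick])
        del remaining[pick]
    return ordering


active_set = {"node-a": 50, "node-b": 30, "node-c": 20}
print(stake_weighted_ordering(active_set, seed=42))
```

Two validators that agree on the seed and the active set compute the same ordering without exchanging any messages, which is the property the steps above rely on.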

+## Schedule Attack Vectors
+
+### Seed
+
 The seed that is selected is predictable but unbiasable.  There is no grinding
-attack to influence its outcome. The active set, however, can be biased by a
-leader by censoring validator votes. To reduce the likelihood of censorship,
-the active set is sampled many slots in advance, such that votes will have been
-collected by multiple leaders. If even one node is honest, the malicious
-leaders will not be able to use censorship to influence the leader schedule.
+attack to influence its outcome. 
+
+### Active Set
+
+A leader can bias the active set by censoring validator votes.  Two possible
+ways exist for leaders to censor the active set:
+
+* Ignore votes from validators 
+* Refuse to vote for blocks with votes from validators
+
+To reduce the likelihood of censorship, the active set is calculated at the
+leader schedule offset boundary over an *active set sampling duration*. The
+active set sampling duration is long enough such that votes will have been
+collected by multiple leaders.
+
+### Staking
+
+Leaders can censor new staking transactions or refuse to validate blocks with
+new stakes.  This attack is similar to censorship of validator votes.
+
+### Validator Operational Key Loss
+
+Leaders and validators are expected to use ephemeral keys for operation, and
+stake owners authorize the validators to do work with their stake via
+delegation.
+
+The cluster should be able to recover from the loss of all the ephemeral keys
+used by leaders and validators, which could occur through a common software
+vulnerability shared by all the nodes.  Stake owners should be able to vote
+directly by co-signing a validator vote even though the stake is currently
+delegated to a validator.

 ## Appending Entries