Add a design for leader schedule rotation and genesis. (#2714)

Leader schedule rotation.
2019-02-15 16:34:34 -08:00 · 2019-02-15 16:34:34 -08:00 · c74b8b6df3
parent 573116e259
commit c74b8b6df3
1 changed files with 124 additions and 9 deletions
--- a/book/src/leader-rotation.md
+++ b/book/src/leader-rotation.md
@@ -3,17 +3,102 @@
 At any given moment, a cluster expects only one fullnode to produce ledger
 entries. By having only one leader at a time, all validators are able to replay
 identical copies of the ledger. The drawback of only one leader at a time,
-however, is that a malicious leader is cabable of censoring votes and
+however, is that a malicious leader is capable of censoring votes and
 transactions. Since censoring cannot be distinguished from the network dropping
 packets, the cluster cannot simply elect a single node to hold the leader role
-indefinitely. Instead, the cluster minimizes the influence of a malcioius
+indefinitely. Instead, the cluster minimizes the influence of a malicious
 leader by rotating which node takes the lead.
 Each validator selects the expected leader using the same algorithm, described
 below. When the validator receives a new signed ledger entry, it can be certain
-that entry was produced by the expected leader.
+that entry was produced by the expected leader.  The order in which leaders are
 assigned slots is called a *leader schedule*.
-## Leader Schedule Generation
+## Leader Schedule Rotation
 A validator rejects blocks that are not signed by the *slot leader*.  The list
 of identities of all slot leaders is called a *leader schedule*. The leader
 schedule is recomputed locally and periodically. It assigns slot leaders for a
 duration of time called an *epoch*. The schedule must be computed far in advance
 of the slots it assigns, such that the ledger state it uses to compute the
 schedule is finalized. That duration is called the *leader schedule offset*.
 Solana sets the offset to the duration of slots until the next epoch. That is,
 the leader schedule for an epoch is calculated from the ledger state at the
 start of the previous epoch. The offset of one epoch is fairly arbitrary and
 assumed to be sufficiently long such that all validators will have finalized
 their ledger state before the next schedule is generated. A cluster may choose
 to shorten the offset to reduce the time between stake changes and leader
 schedule updates.
 While operating without partitions lasting longer than an epoch, the schedule
 only needs to be generated when the root fork crosses the epoch boundary.  Since
 the schedule is for the next epoch, any new stakes committed to the root fork
 will not be active until the next epoch.  The block used for generating the
 leader schedule is the first block to cross the epoch boundary.
 Without a partition lasting longer than an epoch, the cluster will work as
 follows:
 1. A validator continuously updates its own root fork as it votes.
 2. The validator updates its leader schedule each time the slot height crosses
 an epoch boundary.
 For example:
 The epoch duration is 100 slots. The root fork is updated from a fork computed at
 slot height 99 to a fork computed at slot height 102. Forks with slots at heights
 100 and 101 were skipped because of failures.  The new leader schedule is computed
 using the fork at slot height 102.  It is active from slot 200 until it is updated
 again.
 No inconsistency can exist because every validator that is voting with the
 cluster has skipped 100 and 101 when its root passes 102.  All validators,
 regardless of voting pattern, would be committing to a root that is either 102,
 or a descendant of 102.
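The example above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the names `epoch_of` and `schedule_update` are hypothetical, and the sketch assumes fixed-length epochs and a leader schedule offset of exactly one epoch.

```python
def epoch_of(slot: int, slots_per_epoch: int) -> int:
    """Epoch index that contains the given slot (assumes fixed-length epochs)."""
    return slot // slots_per_epoch


def schedule_update(old_root: int, new_root: int, slots_per_epoch: int):
    """Return (source_slot, first_active_slot) when the root crosses an epoch
    boundary, or None when no new schedule is needed.

    Assumption: the offset is one epoch, so a schedule computed from a root in
    epoch N becomes active at the first slot of epoch N + 1.
    """
    old_epoch = epoch_of(old_root, slots_per_epoch)
    new_epoch = epoch_of(new_root, slots_per_epoch)
    if new_epoch <= old_epoch:
        return None  # root is still in the same epoch
    first_active_slot = (new_epoch + 1) * slots_per_epoch
    return new_root, first_active_slot


# Root moves from slot 99 to slot 102; slots 100 and 101 were skipped.
print(schedule_update(99, 102, 100))   # (102, 200)
# Root moves within the same epoch: nothing to recompute.
print(schedule_update(102, 150, 100))  # None
```

Because the decision depends only on the old and new root, skipped slots such as 100 and 101 do not affect which block is used to generate the schedule.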
 ### Leader Schedule Rotation with Epoch-Sized Partitions
 The duration of the leader schedule offset has a direct relationship to the
 likelihood of a cluster having an inconsistent view of the correct leader
 schedule.
 Consider the following scenario:
 Two partitions are each generating half of the blocks.  Neither is converging on
 a definitive supermajority fork.  Both will cross the epoch boundaries at slots
 100 and 200 without actually committing to a root, and therefore without a
 cluster-wide commitment to a new leader schedule.
 In this unstable scenario, multiple valid leader schedules exist.
 * A leader schedule is generated for every fork whose direct parent is in the
 previous epoch.
 * The leader schedule is valid after the start of the next epoch for descendant
 forks until it is updated.
 Each partition's schedule will diverge after the partition lasts more than an
 epoch.  For this reason, the epoch duration should be selected to be much
 larger than the slot time and the expected time for a fork to be committed to
 root.
 After observing the cluster for a sufficient amount of time, the leader schedule
 offset can be selected based on the median partition duration and its standard
 deviation.  For example, an offset longer than the median partition duration
 plus six standard deviations would reduce the likelihood of an inconsistent
 leader schedule in the cluster to 1 in 1 million.
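As a rough sketch of that selection rule, the offset can be computed from a list of observed partition durations. The function name `suggested_offset`, the sample data, and the choice of the population standard deviation are assumptions made for illustration; the source does not specify how the deviation is estimated.

```python
import math
import statistics


def suggested_offset(partition_durations, sigmas=6.0):
    """Smallest whole number of slots that is at least
    median + sigmas * standard deviation of the observed durations.

    Assumption: durations are measured in slots and the population standard
    deviation (statistics.pstdev) is an acceptable estimate.
    """
    median = statistics.median(partition_durations)
    deviation = statistics.pstdev(partition_durations)
    return math.ceil(median + sigmas * deviation)


# Hypothetical observations of partition durations, in slots.
observed = [2, 4, 4, 4, 5, 5, 7, 9]
print(suggested_offset(observed))  # 17 (median 4.5 + 6 * 2.0 = 16.5, rounded up)
```

Rounding up keeps the offset on the conservative side, since the rule asks for an offset longer than the computed threshold.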
 ## Leader Schedule Generation at Genesis
 The genesis block declares the first leader for the first epoch.  This leader
 ends up scheduled for the first two epochs because the leader schedule is also
 generated at slot 0 for the next epoch.  The length of the first two epochs can
 be specified in the genesis block as well.  The length of each of the first two
 epochs must be greater than or equal to the maximum rollback depth as defined in
 [fork selection](fork-selection.md).
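A minimal sketch of the resulting lookup follows. It assumes, for simplicity, that all epochs have the same length, and the names `leader_for_slot`, `genesis_leader`, and `schedules` are hypothetical rather than taken from the codebase.

```python
def leader_for_slot(slot, slots_per_epoch, genesis_leader, schedules):
    """Return the leader identity for a slot.

    Assumptions: epochs have a fixed length, and `schedules` maps an epoch
    index (2 or later) to a list holding one leader identity per slot.
    """
    epoch = slot // slots_per_epoch
    if epoch < 2:
        # The schedules for epochs 0 and 1 both come from genesis state,
        # so the genesis leader holds every slot in them.
        return genesis_leader
    return schedules[epoch][slot % slots_per_epoch]


schedules = {2: ["alice", "bob", "alice", "carol"]}  # hypothetical epoch 2 schedule
print(leader_for_slot(3, 4, "genesis", schedules))   # genesis (epoch 0)
print(leader_for_slot(7, 4, "genesis", schedules))   # genesis (epoch 1)
print(leader_for_slot(9, 4, "genesis", schedules))   # bob (epoch 2, index 1)
```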
 ## Leader Schedule Generation Algorithm
 The leader schedule is generated using a predefined seed.  The process is as follows:
@@ -27,12 +112,42 @@ Leader schedule is generated using a predefined seed.  The process is as follows
   stake-weighted ordering.
 5. This ordering becomes valid after a cluster-configured number of ticks.
 ## Schedule Attack Vectors
 ### Seed
 The seed that is selected is predictable but unbiasable.  There is no grinding
-attack to influence its outcome. The active set, however, can be biased by a
+attack to influence its outcome. 
-leader by censoring validator votes. To reduce the likelihood of censorship,
+
-the active set is sampled many slots in advance, such that votes will have been
+### Active Set
-collected by multiple leaders. If even one node is honest, the malicious
+
-leaders will not be able to use censorship to influence the leader schedule.
+A leader can bias the active set by censoring validator votes.  Two possible
 ways exist for leaders to censor the active set:
 * Ignore votes from validators 
 * Refuse to vote for blocks with votes from validators
 To reduce the likelihood of censorship, the active set is calculated at the
 leader schedule offset boundary over an *active set sampling duration*. The
 active set sampling duration is long enough such that votes will have been
 collected by multiple leaders.
 ### Staking
 Leaders can censor new staking transactions or refuse to validate blocks with
 new stakes.  This attack is similar to censorship of validator votes.
 ### Validator Operational Key Loss
 Leaders and validators are expected to use ephemeral keys for operation, and
 stake owners authorize the validators to do work with their stake via
 delegation.
 The cluster should be able to recover from the loss of all the ephemeral keys
 used by leaders and validators, which could occur through a common software
 vulnerability shared by all the nodes.  Stake owners should be able to vote
 directly by co-signing a validator vote even though the stake is currently
 delegated to a validator.
 ## Appending Entries