# Reliable Vote Transmission

Validator votes are messages that are critical to consensus and to the
continuous operation of the network. They must therefore be reliably delivered
and encoded into the ledger.

## Challenges

1. Leader rotation is triggered by PoH, which is a clock with high drift.  Many
nodes are therefore likely to have an incorrect view of whether the next leader
is active in real time.

2. The next leader may easily be flooded.  A DDoS would then prevent delivery
not only of regular transactions, but also of consensus messages.

3. UDP is unreliable, and our asynchronous protocol requires any transmitted
message to be retransmitted until it is observed in the ledger.
Retransmission could cause an unintentional *thundering herd* against the
leader when there is a large number of validators.  The worst-case flood would
be `(num_nodes * num_retransmits)`.

4. Tracking whether a vote has been transmitted by watching the ledger does not
guarantee that it will appear in a confirmed block.  The currently observed
block may be unrolled, so validators would need to maintain state for each vote
and fork.


## Design

1. Send votes as a push message through gossip.  This ensures delivery of the
vote to all upcoming leaders, not just the next one.

2. Leaders will read the Crds table for new votes and encode any newly received
votes into the blocks they propose.  This allows validator votes to be
included in rollback forks by all future leaders.

3. Validators that receive votes in the ledger will add them directly to their
local Crds table rather than sending them as a push request.  This shortcuts
the push message protocol, so the validation messages do not need to be
retransmitted twice around the network.

4. The CrdsValue for a vote should look like this: `Votes(Vec<Transaction>)`

Each vote transaction should maintain a `wallclock` in its data.  The merge
strategy for Votes will keep the set of the last N votes, as configured by the
local client.  For push/pull, the vector is traversed recursively and each
Transaction is treated as an individual CrdsValue with its own local wallclock
and signature.
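
The merge strategy described above could be sketched roughly as follows. This
is a minimal illustration, not the actual Crds implementation:
`VoteTransaction`, its integer `signature` field, and the `merge` method are
hypothetical stand-ins, and deduplicating by signature is an assumption.

```rust
use std::cmp::Reverse;

/// Hypothetical stand-in for a signed vote transaction.
#[derive(Clone, Debug, PartialEq)]
pub struct VoteTransaction {
    pub signature: u64, // placeholder; a real signature is a byte array
    pub wallclock: u64, // set by the voting validator
}

/// Simplified model of the `Votes(Vec<Transaction>)` value.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Votes(pub Vec<VoteTransaction>);

impl Votes {
    /// Merge `incoming` into `self`, skipping votes already present
    /// (matched by signature), then keep only the `max_votes` entries
    /// with the newest wallclock.
    pub fn merge(&mut self, incoming: &Votes, max_votes: usize) {
        for vote in &incoming.0 {
            if !self.0.iter().any(|v| v.signature == vote.signature) {
                self.0.push(vote.clone());
            }
        }
        // Newest first, so truncation drops the oldest votes.
        self.0.sort_by_key(|v| Reverse(v.wallclock));
        self.0.truncate(max_votes);
    }
}
```

Here `max_votes` plays the role of the locally configured N. Each entry keeps
its own wallclock and signature, so it can be compared on its own.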

Gossip is designed for efficient propagation of state.  Messages that are sent
through gossip-push are batched and propagated with a minimum spanning tree to
the rest of the network. Any partial failures in the tree are actively repaired
with the gossip-pull protocol while minimizing the amount of data transferred
between nodes.


## How this design solves the Challenges

1. Because there is no easy way for validators to be in sync with leaders on the
leader's "active" state, gossip allows for eventual delivery regardless of that
state.

2. Gossip will deliver the messages to all the subsequent leaders, so if the
current leader is flooded, the next leader will already have received these
votes and can encode them.

3. Gossip minimizes the number of requests through the network by maintaining an
efficient spanning tree, and using bloom filters to repair state.  So retransmit
back-off is not necessary and messages are batched.

4. Leaders that read the Crds table for votes will encode all the new valid
votes that appear in the table.  Even if this leader's block is unrolled, the
next leader will try to add the same votes without any additional work by
the validator.  This ensures not only eventual delivery, but also eventual
encoding into the ledger.
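
The last point can be illustrated with a small sketch. It assumes a leader
compares its local table against the set of votes already encoded on the fork
it is extending. The names `VoteTable` and `votes_to_encode`, and the integer
identifiers, are hypothetical and chosen only for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identifiers; real signatures and pubkeys are byte arrays.
pub type Signature = u64;
pub type ValidatorId = u32;

/// Simplified local gossip table: recent vote signatures per validator.
pub struct VoteTable {
    pub votes: HashMap<ValidatorId, Vec<Signature>>,
}

/// Return the votes in the table that are not yet encoded on the fork the
/// leader is extending. Votes from an unrolled block are absent from
/// `encoded_on_fork`, so a later leader picks them up again.
pub fn votes_to_encode(
    table: &VoteTable,
    encoded_on_fork: &HashSet<Signature>,
) -> Vec<Signature> {
    let mut pending: Vec<Signature> = table
        .votes
        .values()
        .flatten()
        .copied()
        .filter(|sig| !encoded_on_fork.contains(sig))
        .collect();
    pending.sort_unstable(); // deterministic order for this sketch only
    pending
}
```

Because the decision depends only on the table and the fork being extended,
the validator does not have to track per-fork state or resend anything.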


## Performance

1. Worst-case propagation time to the next leader is log(N) hops, with the base
of the logarithm depending on the fanout.  With our current default fanout of
6, it is about 6 hops to 20k nodes.

2. The leader should receive 20k validation votes aggregated by gossip-push into
64 KB blobs, which, at 256 bytes per vote, would reduce the number of packets
for a 20k network to 80 blobs.

3. Each validator's votes are replicated across the entire network.  To maintain
a queue of 5 previous votes, the Crds table would grow by about 25 megabytes:
`(20,000 nodes * 256 bytes * 5)`.
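
The figures above can be reproduced with a short back-of-the-envelope sketch.
It assumes an idealized tree in which every hop multiplies the number of
reached nodes by the fanout, 256-byte votes, and blobs of 64,000 bytes. The
function names are illustrative only.

```rust
/// Smallest number of hops `h` such that `fanout^h >= nodes`,
/// assuming every node forwards to `fanout` new nodes at each hop.
pub fn worst_case_hops(nodes: u64, fanout: u64) -> u32 {
    assert!(fanout >= 2, "fanout must be at least 2");
    let mut hops = 0;
    let mut reached: u64 = 1;
    while reached < nodes {
        reached = reached.saturating_mul(fanout);
        hops += 1;
    }
    hops
}

/// Number of blobs needed to carry one vote per node (ceiling division).
pub fn blobs_for_votes(nodes: u64, vote_bytes: u64, blob_bytes: u64) -> u64 {
    (nodes * vote_bytes + blob_bytes - 1) / blob_bytes
}

/// Bytes the table grows by when keeping `votes_kept` votes per node.
pub fn table_bytes(nodes: u64, vote_bytes: u64, votes_kept: u64) -> u64 {
    nodes * vote_bytes * votes_kept
}
```

With these assumptions, `worst_case_hops(20_000, 6)` gives 6,
`blobs_for_votes(20_000, 256, 64_000)` gives 80, and
`table_bytes(20_000, 256, 5)` gives 25,600,000 bytes.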

## Two-step implementation rollout

Initially, the network can perform reliably with just 1 vote transmitted and
maintained through the network with the current Vote implementation.  For small
networks a fanout of 6 is sufficient.  With a small network the memory and push
overhead is minor.

### Sub 1k validator network

1. Crds just maintains each validator's latest vote.

2. Votes are pushed and retransmitted regardless of whether they appear in the
ledger.

3. Fanout of 6.

* Worst case 256 KB memory overhead per node.
* Worst case 4 hops to propagate to every node.
* Leader should receive the entire validator vote set in 4 push message blobs.

### Sub 20k network

Everything above plus the following:

1. The Crds table maintains a vector of the 5 latest votes per validator.

2. Votes encode a wallclock.  CrdsValue::Votes is a type that recurses into the
transaction vector for all the gossip protocols.

3. Increase fanout to 20.

* Worst case 25 MB memory overhead per node.
* Worst case under 4 hops to deliver to the entire network.
* 80 blobs received by the leader for all the validator messages.
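
The two phases could be captured as configuration along the following lines.
This is only a sketch: the `RolloutPhase` struct, its field names, and the
`phase_for` helper are hypothetical, and the 256-byte vote size is passed in as
an assumption.

```rust
/// Hypothetical parameters for one rollout phase.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RolloutPhase {
    pub validator_limit: u64, // phase applies to networks below this size
    pub votes_kept: u64,      // votes retained per validator in the table
    pub fanout: u64,          // gossip push fanout
}

pub const SUB_1K: RolloutPhase = RolloutPhase {
    validator_limit: 1_000,
    votes_kept: 1,
    fanout: 6,
};

pub const SUB_20K: RolloutPhase = RolloutPhase {
    validator_limit: 20_000,
    votes_kept: 5,
    fanout: 20,
};

/// Pick the first phase whose limit is above the given network size.
pub fn phase_for(validators: u64) -> Option<RolloutPhase> {
    [SUB_1K, SUB_20K]
        .iter()
        .copied()
        .find(|phase| validators < phase.validator_limit)
}

/// Worst-case per-node memory for stored votes, in bytes.
pub fn vote_memory_bytes(phase: &RolloutPhase, vote_bytes: u64) -> u64 {
    phase.validator_limit * vote_bytes * phase.votes_kept
}
```

With 256-byte votes, `vote_memory_bytes` yields 256,000 bytes for the first
phase and 25,600,000 bytes for the second, in line with the bullets above.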