Brush up data-plane-fanout to read less like a proposal

parent b1a648113f
commit 8e273caf7d
@@ -96,4 +96,5 @@ that header information becomes the primary consumer of network bandwidth. At
 the time of this writing, the approach is scaling well up to about 150
 validators. To scale up to hundreds of thousands of validators, each node can
 apply the same technique as the leader node to another set of nodes of equal
-size. We call the technique *data plane fanout*, but it is not yet implemented.
+size. We call the technique *data plane fanout*; learn more in the [data plane
+fanout](data-plane-fanout.md) section.
@@ -1,86 +1,59 @@
 # Data Plane Fanout
 
-The the cluster organizes itself by stake and divides into a collection
-of nodes, called `neighborhoods`. The leader broadcasts its blobs to the
-layer-1 (neighborhood 0) nodes exactly like it does without this mechanism. The
-main difference being the number of nodes in layer-1 is capped via the
-configurable `DATA_PLANE_FANOUT`. If the fanout is smaller than the nodes in
-the cluster then the mechanism will add layers below layer-1. Subsequent layers
-(beyond layer-1) follow the following constraints to determine layer-capacity.
-Each neighborhood has `NEIGHBORHOOD_SIZE` nodes and `fanout/2` neighborhoods
-are allowed per layer.
+A Solana cluster uses a multi-layer mechanism called *data plane fanout* to
+broadcast transaction blobs to all nodes in a very quick and efficient manner.
+In order to establish the fanout, the cluster divides itself into small
+collections of nodes, called *neighborhoods*. Each node is responsible for
+sharing any data it receives with the other nodes in its neighborhood, as well
+as propagating the data on to a small set of nodes in other neighborhoods.
 
-Nodes in a layer will broadcast their blobs to exactly 1 node in each
-next-layer neighborhood and each node in a neighborhood will perform
-retransmits amongst its neighbors (just like layer-1 does with its layer-1
-peers). This means any node has to only send its data to its neighbors and
-each neighborhood in the layer below instead of every single TVU peer it has.
-The retransmit mechanism also supports a second, `grow`, mode of operation
-that squares the number of neighborhoods allowed per layer which dramatically
-reduces the number of layers needed to support a large cluster but can also
-have a negative impact on the network pressure each node in the lower layers
-has to deal with. A good way to think of the default mode (when `grow` is
-disabled) is to imagine it as `chain` of layers where the leader sends blobs to
-layer-1 and then layer-1 to layer-2 and so on, but instead of growing layer-3
-to the square of number of nodes in layer-2, we keep the `layer capacities`
-constant, so all layers past layer-2 will have the same number of nodes until
-the whole cluster is covered. When `grow` is enabled, this quickly turns into a
-traditional fanout where layer-3 will have the square of the number of nodes in
-layer-2 and so on.
-
-Below is an example of a two layer cluster. Note - this example doesn't
-describe the same `fanout/2` limit for lower layer neighborhoods.
+During its slot, the leader node distributes blobs between the validator nodes
+in one neighborhood (layer 1). Each validator shares its data within its
+neighborhood, but also retransmits the blobs to one node in each of multiple
+neighborhoods in the next layer (layer 2). The layer-2 nodes each share their
+data with their neighborhood peers, and retransmit to nodes in the next layer,
+etc, until all nodes in the cluster have received all the blobs.
 
 <img alt="Two layer cluster" src="img/data-plane.svg" class="center"/>
 
-#### Neighborhoods
+## Neighborhood Assignment - Weighted Selection
 
-The following diagram shows how two neighborhoods in different layers interact.
-What this diagram doesn't capture is that each `neighbor` actually receives
-blobs from 1 one validator _per_ neighborhood above it. This means that, to
-cripple a neighborhood, enough nodes (erasure codes +1 _per_ neighborhood) from
-the layer above need to fail. Since multiple neighborhoods exist in the upper
-layer and a node will receive blobs from a node in each of those neighborhoods,
-we'd need a big network failure in the upper layers to end up with incomplete
-data.
+In order for data plane fanout to work, the entire cluster must agree on how the
+cluster is divided into neighborhoods. To achieve this, all the recognized
+validator nodes (the TVU peers) are sorted by stake and stored in a list. This
+list is then indexed in different ways to figure out neighborhood boundaries and
+retransmit peers. For example, the leader will simply select the first nodes to
+make up layer 1. These will automatically be the highest stake holders, allowing
+the heaviest votes to come back to the leader first. Layer-1 and lower-layer
+nodes use the same logic to find their neighbors and lower layer peers.
 
-<img alt="Inner workings of a neighborhood"
-src="img/data-plane-neighborhood.svg" class="center"/>
+## Layer and Neighborhood Structure
 
-#### A Weighted Selection Mechanism
+The current leader makes its initial broadcasts to at most `DATA_PLANE_FANOUT`
+nodes. If this layer 1 is smaller than the number of nodes in the cluster, then
+the data plane fanout mechanism adds layers below. Subsequent layers follow
+these constraints to determine layer-capacity: Each neighborhood contains
+`NEIGHBORHOOD_SIZE` nodes and each layer may have up to `DATA_PLANE_FANOUT/2`
+neighborhoods.
 
-To support this mechanism, there needs to be a agreed upon way of dividing the
-cluster amongst the nodes. To achieve this the `tvu_peers` are sorted by stake
-and stored in a list. This list can then be indexed in different ways to figure
-out neighborhood boundaries and retransmit peers. For example, the leader will
-simply select the first `DATA_PLANE_FANOUT` nodes as its layer 1 nodes. These
-will automatically be the highest stake holders allowing the heaviest votes to
-come back to the leader first. The same logic determines which nodes each node
-in layer needs to retransmit its blobs to. This involves finding its neighbors
-and lower layer peers.
+As mentioned above, each node in a layer only has to broadcast its blobs to its
+neighbors and to exactly 1 node in each next-layer neighborhood, instead of to
+every TVU peer in the cluster. In the default mode, each layer contains
+`DATA_PLANE_FANOUT/2` neighborhoods. The retransmit mechanism also supports a
+second, `grow`, mode of operation that squares the number of neighborhoods
+allowed per layer. This dramatically reduces the number of layers needed to
+support a large cluster, but can also have a negative impact on the network
+pressure on each node in the lower layers. A good way to think of the default
+mode (when `grow` is disabled) is to imagine it as a chain of layers, where the
+leader sends blobs to layer-1 and then layer-1 to layer-2 and so on; the `layer
+capacities` remain constant, so all layers past layer-2 will have the same
+number of nodes until the whole cluster is covered. When `grow` is enabled, this
+becomes a traditional fanout where layer-3 will have the square of the number of
+nodes in layer-2 and so on.
 
-#### Broadcast Service
-
-Broadcast service uses a bank to figure out stakes and
-hands this off to ClusterInfo to figure out the top `DATA_PLANE_FANOUT` stake
-holders. These top stake holders will be considered Layer 1. For the leader
-this is pretty straightforward and can be achieved with a `truncate` call on a
-sorted list of peers.
+#### Configuration Values
 
-#### Retransmit Stage
-
-The biggest challenge in updating to this mechanism is to
-update the retransmit stage and make it "layer aware"; i.e using the bank each
-node can figure out which layer it belongs in and which lower layer peers nodes
-to send blobs to with minimal overlap across neighborhood boundaries. Overlaps
-will be minimized based on `((node.layer_index) % (layer_neighborhood_size) *
-cur_neighborhood_index)` where `cur_neighborhood_index` is the loop index in
-`num_neighborhoods` so that a node only forwards blobs to a single node in a
-lower layer neighborhood.
-
-Each node can receive blobs froms its peer in the layer above as well as its
-neighbors. As long as the failure rate is less than the number of erasure
-codes, blobs can be repaired without the cluster failing.
-
-#### Constraints
 
-`DATA_PLANE_FANOUT` - The size of layer 1 is determined by this. Subsequent
+`DATA_PLANE_FANOUT` - Determines the size of layer 1. Subsequent
 layers have `DATA_PLANE_FANOUT/2` neighborhoods when `grow` is inactive.
 
 `NEIGHBORHOOD_SIZE` - The number of nodes allowed in a neighborhood.
 
@@ -89,9 +62,23 @@ neighborhood isn't full, it _must_ be the last one.
 
 `GROW_LAYER_CAPACITY` - Whether or not retransmit should behave like a
 _traditional fanout_, i.e. if each additional layer should have growing
-capacities. When this mode is disabled (default) all layers after layer 1 have
-the same capacity to keep the network pressure on all nodes equal.
+capacities. When this mode is disabled (default), all layers after layer 1 have
+the same capacity, keeping the network pressure on all nodes equal.
 
-Future work would involve moving these parameters to on chain configuration
-since it might be beneficial tune these on the fly as the cluster sizes change.
+Currently, configuration is set when the cluster is launched. In the future,
+these parameters may be hosted on-chain, allowing modification on the fly as the
+cluster sizes change.
+
+## Neighborhoods
+
+The following diagram shows how two neighborhoods in different layers interact.
+What this diagram doesn't capture is that each neighbor actually receives
+blobs from one validator per neighborhood above it. This means that, to
+cripple a neighborhood, enough nodes (erasure codes +1 per neighborhood) from
+the layer above need to fail. Since multiple neighborhoods exist in the upper
+layer and a node will receive blobs from a node in each of those neighborhoods,
+we'd need a big network failure in the upper layers to end up with incomplete
+data.
+
+<img alt="Inner workings of a neighborhood"
+src="img/data-plane-neighborhood.svg" class="center"/>