Brush up data-plane-fanout to read less like a proposal

Tyera Eulberg 2019-03-01 15:45:08 -07:00 committed by Tyera Eulberg
parent b1a648113f
commit 8e273caf7d
2 changed files with 64 additions and 76 deletions


@ -96,4 +96,5 @@ that header information becomes the primary consumer of network bandwidth. At
the time of this writing, the approach is scaling well up to about 150
validators. To scale up to hundreds of thousands of validators, each node can
apply the same technique as the leader node to another set of nodes of equal
size. We call the technique *data plane fanout*, but it is not yet implemented.
size. We call the technique *data plane fanout*; learn more in the [data plane
fanout](data-plane-fanout.md) section.


@ -1,86 +1,59 @@
# Data Plane Fanout
The cluster organizes itself by stake and divides into a collection
of nodes, called `neighborhoods`. The leader broadcasts its blobs to the
layer-1 (neighborhood 0) nodes exactly like it does without this mechanism. The
main difference is that the number of nodes in layer-1 is capped via the
configurable `DATA_PLANE_FANOUT`. If the fanout is smaller than the number of
nodes in the cluster, then the mechanism will add layers below layer-1.
Subsequent layers (beyond layer-1) use the following constraints to determine
layer capacity: each neighborhood has `NEIGHBORHOOD_SIZE` nodes and `fanout/2`
neighborhoods are allowed per layer.
A Solana cluster uses a multi-layer mechanism called *data plane fanout* to
broadcast transaction blobs to all nodes quickly and efficiently.
In order to establish the fanout, the cluster divides itself into small
collections of nodes, called *neighborhoods*. Each node is responsible for
sharing any data it receives with the other nodes in its neighborhood, as well
as propagating the data on to a small set of nodes in other neighborhoods.
Nodes in a layer will broadcast their blobs to exactly 1 node in each
next-layer neighborhood and each node in a neighborhood will perform
retransmits amongst its neighbors (just like layer-1 does with its layer-1
peers). This means any node has to only send its data to its neighbors and
each neighborhood in the layer below instead of every single TVU peer it has.
The retransmit mechanism also supports a second, `grow`, mode of operation
that squares the number of neighborhoods allowed per layer which dramatically
reduces the number of layers needed to support a large cluster but can also
have a negative impact on the network pressure each node in the lower layers
has to deal with. A good way to think of the default mode (when `grow` is
disabled) is to imagine it as a `chain` of layers where the leader sends blobs to
layer-1 and then layer-1 to layer-2 and so on, but instead of growing layer-3
to the square of number of nodes in layer-2, we keep the `layer capacities`
constant, so all layers past layer-2 will have the same number of nodes until
the whole cluster is covered. When `grow` is enabled, this quickly turns into a
traditional fanout where layer-3 will have the square of the number of nodes in
layer-2 and so on.
Below is an example of a two layer cluster. Note - this example doesn't
describe the same `fanout/2` limit for lower layer neighborhoods.
During its slot, the leader node distributes blobs between the validator nodes
in one neighborhood (layer 1). Each validator shares its data within its
neighborhood, but also retransmits the blobs to one node in each of multiple
neighborhoods in the next layer (layer 2). The layer-2 nodes each share their
data with their neighborhood peers, and retransmit to nodes in the next layer,
and so on, until all nodes in the cluster have received all the blobs.
<img alt="Two layer cluster" src="img/data-plane.svg" class="center"/>
#### Neighborhoods
## Neighborhood Assignment - Weighted Selection
The following diagram shows how two neighborhoods in different layers interact.
What this diagram doesn't capture is that each `neighbor` actually receives
blobs from one validator _per_ neighborhood above it. This means that, to
cripple a neighborhood, enough nodes (erasure codes +1 _per_ neighborhood) from
the layer above need to fail. Since multiple neighborhoods exist in the upper
layer and a node will receive blobs from a node in each of those neighborhoods,
we'd need a big network failure in the upper layers to end up with incomplete
data.
In order for data plane fanout to work, the entire cluster must agree on how the
cluster is divided into neighborhoods. To achieve this, all the recognized
validator nodes (the TVU peers) are sorted by stake and stored in a list. This
list is then indexed in different ways to figure out neighborhood boundaries and
retransmit peers. For example, the leader will simply select the first nodes to
make up layer 1. These will automatically be the highest stake holders, allowing
the heaviest votes to come back to the leader first. Layer-1 and lower-layer
nodes use the same logic to find their neighbors and lower layer peers.
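The stake-sorted list described above can be sketched roughly as follows. This is only an illustration: the `Peer` struct, its fields, the helper names, and the tie-break by `id` are all assumptions made for the example, not the cluster's actual types or rules.

```rust
// Hypothetical peer record; the field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Peer {
    id: u32,
    stake: u64,
}

/// Sort peers by descending stake. Ties are broken by ascending id so that
/// every node derives the same order (the tie-break rule is an assumption).
fn sort_by_stake(mut peers: Vec<Peer>) -> Vec<Peer> {
    peers.sort_by(|a, b| b.stake.cmp(&a.stake).then(a.id.cmp(&b.id)));
    peers
}

/// Layer 1 is simply the first `fanout` entries of the sorted list,
/// or the whole list when the cluster is smaller than `fanout`.
fn layer_1(sorted: &[Peer], fanout: usize) -> &[Peer] {
    &sorted[..fanout.min(sorted.len())]
}
```

Because every node sorts the same set of peers with the same deterministic rule, each node can compute the same layer-1 membership locally without extra coordination.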
<img alt="Inner workings of a neighborhood"
src="img/data-plane-neighborhood.svg" class="center"/>
## Layer and Neighborhood Structure
#### A Weighted Selection Mechanism
The current leader makes its initial broadcasts to at most `DATA_PLANE_FANOUT`
nodes. If this layer 1 is smaller than the number of nodes in the cluster, then
the data plane fanout mechanism adds layers below. Subsequent layers follow
these constraints to determine layer-capacity: Each neighborhood contains
`NEIGHBORHOOD_SIZE` nodes and each layer may have up to `DATA_PLANE_FANOUT/2`
neighborhoods.
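As a rough sketch of the capacity rule in default mode, the function below splits a cluster of `num_nodes` validators into layers: up to `fanout` nodes in layer 1, then up to `neighborhood_size * (fanout / 2)` nodes in each later layer. The function name, the parameter names, and the small values used in the usage note are hypothetical and chosen only to keep the arithmetic easy to follow.

```rust
/// Number of nodes placed in each layer in default (non-`grow`) mode.
/// `fanout` stands in for DATA_PLANE_FANOUT and `neighborhood_size` for
/// NEIGHBORHOOD_SIZE; both are plain parameters here for illustration.
fn layer_sizes(num_nodes: usize, fanout: usize, neighborhood_size: usize) -> Vec<usize> {
    assert!(fanout >= 2 && neighborhood_size >= 1);
    let mut sizes = Vec::new();
    let mut remaining = num_nodes;

    // Layer 1 holds at most `fanout` nodes.
    let first = remaining.min(fanout);
    if first > 0 {
        sizes.push(first);
        remaining -= first;
    }

    // Every later layer holds at most fanout/2 neighborhoods of
    // `neighborhood_size` nodes each.
    let lower_capacity = neighborhood_size * (fanout / 2);
    while remaining > 0 {
        let n = remaining.min(lower_capacity);
        sizes.push(n);
        remaining -= n;
    }
    sizes
}
```

For example, with a fanout of 4 and a neighborhood size of 3, a 20-node cluster would be split into layers of 4, 6, 6, and 4 nodes under these assumptions.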
To support this mechanism, there needs to be an agreed-upon way of dividing the
cluster amongst the nodes. To achieve this, the `tvu_peers` are sorted by stake
and stored in a list. This list can then be indexed in different ways to figure
out neighborhood boundaries and retransmit peers. For example, the leader will
simply select the first `DATA_PLANE_FANOUT` nodes as its layer 1 nodes. These
will automatically be the highest stake holders, allowing the heaviest votes to
come back to the leader first. The same logic determines which nodes each node
in a layer needs to retransmit its blobs to. This involves finding its neighbors
and lower layer peers.
As mentioned above, each node in a layer only has to broadcast its blobs to its
neighbors and to exactly 1 node in each next-layer neighborhood, instead of to
every TVU peer in the cluster. In the default mode, each layer contains
`DATA_PLANE_FANOUT/2` neighborhoods. The retransmit mechanism also supports a
second, `grow`, mode of operation that squares the number of neighborhoods
allowed in each layer. This dramatically reduces the number of layers needed to
support a large cluster, but can also have a negative impact on the network
pressure on each node in the lower layers. A good way to think of the default
mode (when `grow` is disabled) is to imagine it as a chain of layers, where the
leader sends blobs to layer-1 and then layer-1 to layer-2 and so on. The `layer
capacities` remain constant, so all layers past layer-2 will have the same
number of nodes until the whole cluster is covered. When `grow` is enabled, this
becomes a traditional fanout where layer-3 will have the square of the number of
nodes in layer-2 and so on.
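To make the trade-off between the two modes concrete, here is a minimal sketch that counts how many layers a cluster would need. It assumes one particular reading of `grow`: the number of neighborhoods allowed in a layer is the square of the number allowed in the layer above, starting from `fanout / 2` at layer 2. That reading, the function name, and the parameters are assumptions for illustration and may not match the real retransmit code.

```rust
/// Count the layers needed to cover `num_nodes` validators.
/// Assumption: with `grow` enabled, each layer allows the square of the
/// previous layer's neighborhood count; otherwise the count stays constant.
fn layers_needed(num_nodes: usize, fanout: usize, neighborhood_size: usize, grow: bool) -> usize {
    assert!(fanout >= 2 && neighborhood_size >= 1);
    if num_nodes == 0 {
        return 0;
    }

    // Layer 1 holds up to `fanout` nodes.
    let mut layers = 1;
    let mut remaining = num_nodes.saturating_sub(fanout);

    // Layer 2 starts with fanout/2 neighborhoods.
    let mut neighborhoods = fanout / 2;
    while remaining > 0 {
        let capacity = neighborhoods.saturating_mul(neighborhood_size);
        remaining = remaining.saturating_sub(capacity);
        layers += 1;
        if grow {
            // Saturating arithmetic avoids overflow for very large squares.
            neighborhoods = neighborhoods.saturating_mul(neighborhoods);
        }
    }
    layers
}
```

With a hypothetical fanout of 4 and neighborhood size of 3, a 100-node cluster needs 17 layers in default mode but only 5 under this reading of `grow`, at the cost of much wider lower layers.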
#### Broadcast Service
Broadcast service uses a bank to figure out stakes and
hands this off to ClusterInfo to figure out the top `DATA_PLANE_FANOUT` stake
holders. These top stake holders will be considered Layer 1. For the leader
this is pretty straightforward and can be achieved with a `truncate` call on a
sorted list of peers.
#### Configuration Values
#### Retransmit Stage
The biggest challenge in updating to this mechanism is to
update the retransmit stage and make it "layer aware"; i.e., using the bank, each
node can figure out which layer it belongs in and which lower-layer peer nodes
to send blobs to with minimal overlap across neighborhood boundaries. Overlaps
will be minimized based on `((node.layer_index) % (layer_neighborhood_size) *
cur_neighborhood_index)` where `cur_neighborhood_index` is the loop index in
`num_neighborhoods` so that a node only forwards blobs to a single node in a
lower layer neighborhood.
Each node can receive blobs from its peer in the layer above as well as its
neighbors. As long as the failure rate is less than the number of erasure
codes, blobs can be repaired without the cluster failing.
#### Constraints
`DATA_PLANE_FANOUT` - The size of layer 1 is determined by this. Subsequent
`DATA_PLANE_FANOUT` - Determines the size of layer 1. Subsequent
layers have `DATA_PLANE_FANOUT/2` neighborhoods when `grow` is inactive.
`NEIGHBORHOOD_SIZE` - The number of nodes allowed in a neighborhood.
@ -89,9 +62,23 @@ neighborhood isn't full, it _must_ be the last one.
`GROW_LAYER_CAPACITY` - Whether or not retransmit should behave like a
_traditional fanout_, i.e., if each additional layer should have growing
capacities. When this mode is disabled (default) all layers after layer 1 have
the same capacity to keep the network pressure on all nodes equal.
capacities. When this mode is disabled (default), all layers after layer 1 have
the same capacity, keeping the network pressure on all nodes equal.
Future work would involve moving these parameters to on chain configuration
since it might be beneficial to tune these on the fly as the cluster sizes change.
Currently, configuration is set when the cluster is launched. In the future,
these parameters may be hosted on-chain, allowing modification on the fly as the
cluster sizes change.
## Neighborhoods
The following diagram shows how two neighborhoods in different layers interact.
What this diagram doesn't capture is that each neighbor actually receives
blobs from one validator per neighborhood above it. This means that, to
cripple a neighborhood, enough nodes (erasure codes +1 per neighborhood) from
the layer above need to fail. Since multiple neighborhoods exist in the upper
layer and a node will receive blobs from a node in each of those neighborhoods,
we'd need a big network failure in the upper layers to end up with incomplete
data.
<img alt="Inner workings of a neighborhood"
src="img/data-plane-neighborhood.svg" class="center"/>