Brush up data-plane-fanout to read less like a proposal

parent b1a648113f
commit 8e273caf7d
@@ -96,4 +96,5 @@ that header information becomes the primary consumer of network bandwidth. At
 the time of this writing, the approach is scaling well up to about 150
 validators. To scale up to hundreds of thousands of validators, each node can
 apply the same technique as the leader node to another set of nodes of equal
-size. We call the technique *data plane fanout*, but it is not yet implemented.
+size. We call the technique *data plane fanout*; learn more in the [data plane
+fanout](data-plane-fanout.md) section.
@@ -1,86 +1,59 @@
 # Data Plane Fanout
 
-The the cluster organizes itself by stake and divides into a collection
-of nodes, called `neighborhoods`. The leader broadcasts its blobs to the
-layer-1 (neighborhood 0) nodes exactly like it does without this mechanism. The
-main difference being the number of nodes in layer-1 is capped via the
-configurable `DATA_PLANE_FANOUT`. If the fanout is smaller than the nodes in
-the cluster then the mechanism will add layers below layer-1. Subsequent layers
-(beyond layer-1) follow the following constraints to determine layer-capacity.
-Each neighborhood has `NEIGHBORHOOD_SIZE` nodes and `fanout/2` neighborhoods
-are allowed per layer.
+A Solana cluster uses a multi-layer mechanism called *data plane fanout* to
+broadcast transaction blobs to all nodes in a very quick and efficient manner.
+In order to establish the fanout, the cluster divides itself into small
+collections of nodes, called *neighborhoods*. Each node is responsible for
+sharing any data it receives with the other nodes in its neighborhood, as well
+as propagating the data on to a small set of nodes in other neighborhoods.
 
-Nodes in a layer will broadcast their blobs to exactly 1 node in each
-next-layer neighborhood and each node in a neighborhood will perform
-retransmits amongst its neighbors (just like layer-1 does with its layer-1
-peers). This means any node has to only send its data to its neighbors and
-each neighborhood in the layer below instead of every single TVU peer it has.
-The retransmit mechanism also supports a second, `grow`, mode of operation
-that squares the number of neighborhoods allowed per layer which dramatically
-reduces the number of layers needed to support a large cluster but can also
-have a negative impact on the network pressure each node in the lower layers
-has to deal with. A good way to think of the default mode (when `grow` is
-disabled) is to imagine it as `chain` of layers where the leader sends blobs to
-layer-1 and then layer-1 to layer-2 and so on, but instead of growing layer-3
-to the square of number of nodes in layer-2, we keep the `layer capacities`
-constant, so all layers past layer-2 will have the same number of nodes until
-the whole cluster is covered. When `grow` is enabled, this quickly turns into a
-traditional fanout where layer-3 will have the square of the number of nodes in
-layer-2 and so on.
-
-Below is an example of a two layer cluster. Note - this example doesn't
-describe the same `fanout/2` limit for lower layer neighborhoods.
+During its slot, the leader node distributes blobs between the validator nodes
+in one neighborhood (layer 1). Each validator shares its data within its
+neighborhood, but also retransmits the blobs to one node in each of multiple
+neighborhoods in the next layer (layer 2). The layer-2 nodes each share their
+data with their neighborhood peers, and retransmit to nodes in the next layer,
+etc, until all nodes in the cluster have received all the blobs.
 
 <img alt="Two layer cluster" src="img/data-plane.svg" class="center"/>
 
-#### Neighborhoods
+## Neighborhood Assignment - Weighted Selection
 
-The following diagram shows how two neighborhoods in different layers interact.
-What this diagram doesn't capture is that each `neighbor` actually receives
-blobs from 1 one validator _per_ neighborhood above it. This means that, to
-cripple a neighborhood, enough nodes (erasure codes +1 _per_ neighborhood) from
-the layer above need to fail. Since multiple neighborhoods exist in the upper
-layer and a node will receive blobs from a node in each of those neighborhoods,
-we'd need a big network failure in the upper layers to end up with incomplete
-data.
+In order for data plane fanout to work, the entire cluster must agree on how the
+cluster is divided into neighborhoods. To achieve this, all the recognized
+validator nodes (the TVU peers) are sorted by stake and stored in a list. This
+list is then indexed in different ways to figure out neighborhood boundaries and
+retransmit peers. For example, the leader will simply select the first nodes to
+make up layer 1. These will automatically be the highest stake holders, allowing
+the heaviest votes to come back to the leader first. Layer-1 and lower-layer
+nodes use the same logic to find their neighbors and lower layer peers.
 
-<img alt="Inner workings of a neighborhood"
-src="img/data-plane-neighborhood.svg" class="center"/>
+## Layer and Neighborhood Structure
 
-#### A Weighted Selection Mechanism
+The current leader makes its initial broadcasts to at most `DATA_PLANE_FANOUT`
+nodes. If this layer 1 is smaller than the number of nodes in the cluster, then
+the data plane fanout mechanism adds layers below. Subsequent layers follow
+these constraints to determine layer-capacity: Each neighborhood contains
+`NEIGHBORHOOD_SIZE` nodes and each layer may have up to `DATA_PLANE_FANOUT/2`
+neighborhoods.
 
-To support this mechanism, there needs to be a agreed upon way of dividing the
-cluster amongst the nodes. To achieve this the `tvu_peers` are sorted by stake
-and stored in a list. This list can then be indexed in different ways to figure
-out neighborhood boundaries and retransmit peers. For example, the leader will
-simply select the first `DATA_PLANE_FANOUT` nodes as its layer 1 nodes. These
-will automatically be the highest stake holders allowing the heaviest votes to
-come back to the leader first. The same logic determines which nodes each node
-in layer needs to retransmit its blobs to. This involves finding its neighbors
-and lower layer peers.
+As mentioned above, each node in a layer only has to broadcast its blobs to its
+neighbors and to exactly 1 node in each next-layer neighborhood, instead of to
+every TVU peer in the cluster. In the default mode, each layer contains
+`DATA_PLANE_FANOUT/2` neighborhoods. The retransmit mechanism also supports a
+second, `grow`, mode of operation that squares the number of neighborhoods
+allowed per layer. This dramatically reduces the number of layers needed to
+support a large cluster, but can also have a negative impact on the network
+pressure on each node in the lower layers. A good way to think of the default
+mode (when `grow` is disabled) is to imagine it as a chain of layers, where the
+leader sends blobs to layer-1 and then layer-1 to layer-2 and so on; the `layer
+capacities` remain constant, so all layers past layer-2 will have the same
+number of nodes until the whole cluster is covered. When `grow` is enabled, this
+becomes a traditional fanout where layer-3 will have the square of the number of
+nodes in layer-2 and so on.
 
-#### Broadcast Service
-
-Broadcast service uses a bank to figure out stakes and
-hands this off to ClusterInfo to figure out the top `DATA_PLANE_FANOUT` stake
-holders. These top stake holders will be considered Layer 1. For the leader
-this is pretty straightforward and can be achieved with a `truncate` call on a
-sorted list of peers.
+#### Configuration Values
 
-#### Retransmit Stage
-
-The biggest challenge in updating to this mechanism is to
-update the retransmit stage and make it "layer aware"; i.e using the bank each
-node can figure out which layer it belongs in and which lower layer peers nodes
-to send blobs to with minimal overlap across neighborhood boundaries. Overlaps
-will be minimized based on `((node.layer_index) % (layer_neighborhood_size) *
-cur_neighborhood_index)` where `cur_neighborhood_index` is the loop index in
-`num_neighborhoods` so that a node only forwards blobs to a single node in a
-lower layer neighborhood.
-
-Each node can receive blobs froms its peer in the layer above as well as its
-neighbors. As long as the failure rate is less than the number of erasure
-codes, blobs can be repaired without the cluster failing.
-
-#### Constraints
 
-`DATA_PLANE_FANOUT` - The size of layer 1 is determined by this. Subsequent
+`DATA_PLANE_FANOUT` - Determines the size of layer 1. Subsequent
 layers have `DATA_PLANE_FANOUT/2` neighborhoods when `grow` is inactive.
 
 `NEIGHBORHOOD_SIZE` - The number of nodes allowed in a neighborhood.
 
@@ -89,9 +62,23 @@ neighborhood isn't full, it _must_ be the last one.
 
 `GROW_LAYER_CAPACITY` - Whether or not retransmit should behave like a
 _traditional fanout_, i.e. if each additional layer should have growing
-capacities. When this mode is disabled (default) all layers after layer 1 have
-the same capacity to keep the network pressure on all nodes equal.
+capacities. When this mode is disabled (default), all layers after layer 1 have
+the same capacity, keeping the network pressure on all nodes equal.
 
-Future work would involve moving these parameters to on chain configuration
-since it might be beneficial tune these on the fly as the cluster sizes change.
+Currently, configuration is set when the cluster is launched. In the future,
+these parameters may be hosted on-chain, allowing modification on the fly as the
+cluster sizes change.
+
+## Neighborhoods
+
+The following diagram shows how two neighborhoods in different layers interact.
+What this diagram doesn't capture is that each neighbor actually receives
+blobs from one validator per neighborhood above it. This means that, to
+cripple a neighborhood, enough nodes (erasure codes +1 per neighborhood) from
+the layer above need to fail. Since multiple neighborhoods exist in the upper
+layer and a node will receive blobs from a node in each of those neighborhoods,
+we'd need a big network failure in the upper layers to end up with incomplete
+data.
+
+<img alt="Inner workings of a neighborhood"
+src="img/data-plane-neighborhood.svg" class="center"/>