Fully deserializing shreds in window-service before sending them to
retransmit stage adds latency to shreds propagation.
This commit instead channels through the payload and relies on only
partial deserialization of a few required fields: slot, shred-index,
shred-type.
Following commits will skip shreds deserializaton before retransmit, and
so we will only have a ShredId and not a fully deserialized shred to
obtain the shuffling seed from.
Indices for code and data shreds of the same slot overlap; and so they
will have the same random number generator seed when shuffling cluster
nodes for turbine broadcast.
This results in the same propagation path for code and data shreds of
the same index and effectively smaller sample size for re-transmitter
nodes. For example a 32:32 batch (32 code + 32 data shreds), is
retransmitted through _at most_ 32 unique nodes, whereas ideally we want
~64 unique re-transmitters.
This commit adds shred-type to seed function so that code and data
sherds of the same (slot, index) will (most likely) have different
propagation paths.
Most nodes in the cluster receive the same shred from two different
nodes: parent, and the first node of their neighborhood:
https://github.com/solana-labs/solana/blob/a8c695ba5/core/src/cluster_nodes.rs#L178-L197
Because of the erasure codings, half of the shreds are already
redundant. So this redundant propagation path will only add extra
overhead.
Additionally the very first node of the broadcast tree has 2x fanout
(i.e. 400 nodes) which adds too much load at one node.
This commit simplifies the broadcast tree by dropping the redundant
propagation path and removing the 2x fanout at root node.
Turbine randomly shuffles cluster nodes on a broadcast tree for each
shred. This requires knowing the stakes and nodes' contact-infos (from
gossip).
However gossip is subject to partitioning and propogation delays.
Additionally unstaked nodes may join and leave the cluster at any
moment, changing the cluster view from one node to another.
This commit:
* Always arranges the unstaked nodes at the bottom of turbine broadcast
tree.
* Staked nodes are always included regardless of if their contact-info
is available in gossip or not.
* Uses the unbiased WeightedShuffle construct for shuffling nodes.
bank.get_leader_schedule_epoch(shred_slot)
is one epoch after epoch_schedule.get_epoch(shred_slot).
At epoch boundaries, shred is already one epoch after the root-slot. So
we need epoch-stakes 2 epochs ahead of the root. But the root bank only
has epoch-stakes for one epoch ahead, and as a result looking up epoch
staked-nodes from the root-bank fails.
To be backward compatible with the current master code, this commit
implements a fallback on working-bank if epoch staked-nodes obtained
from the root-bank is none.
If two threads simultaneously call into ClusterNodesCache::get for the
same epoch, and the cache entry is outdated, then both threads recompute
cluster-nodes for the epoch and redundantly overwrite each other.
This commit wraps ClusterNodesCache entries in Arc<Mutex<...>>, so that
when needed only one thread does the computations to update the entry.
Cluster nodes are cached keyed by the respective epoch from which stakes
are obtained, and so if epoch changes cluster-nodes will be recomputed.
A time-to-live eviction policy is enforced to refresh entries in case
gossip contact-infos are updated.
Broadcast stage and retransmit stage should arrange nodes on turbine
broadcast tree in exactly same order. Additionally any changes to this
ordering (e.g. updating how unstaked nodes are handled) requires feature
gating to keep the cluster in sync.
Current implementation is scattered out over several public methods and
exposes too much of implementation details (e.g. usize indices into
peers vector) which makes code changes and checking for feature
activations more difficult.
This commit encapsulates turbine peer computations into a new struct,
and only exposes two public methods, get_broadcast_peer and
get_retransmit_peers, for call-sites.