Shreds recovered from erasure codes have not been received from turbine
and have not been retransmitted to other nodes downstream. This results
in more repairs across the cluster which is slower.
This commit channels through recovered shreds to retransmit stage in
order to further broadcast the shreds to downstream nodes in the tree.
Working towards sending shreds (instead of packets) to retransmit stage
so that shreds recovered from erasure codes are as well retransmitted.
Following commit will add these metrics back to window-service, earlier
in the pipeline.
Cross cluster gossip contamination is causing cluster-slots hash map to
contain a lot of bogus values and consume too much memory:
https://github.com/solana-labs/solana/issues/17789
If a node is using the same identity key across clusters, then these
erroneous values might not be filtered out by shred-versions check,
because one of the variants of the contact-info will have matching
shred-version:
https://github.com/solana-labs/solana/issues/17789#issuecomment-896304969
The cluster-slots hash-map is bounded and trimmed at the lower end by
the current root. This commit also discards slots epochs ahead of the
root.
Renaming these types to better communicate their usages, which will
further diverge as incremental snapshot support is added.
With the new names, AccountsPacakge now refers to the type between
AccountsBackgroundProcess and AccountsHashVerifier, and SnapshotPackage
refers to the type between AccountsHashVerifier and
SnapshotPackagerService.
* added realtime cost checking logic to reject block that would exceed max limit:
- defines max limits at block_cost_limits.rs
- right after each bath's execution, accumulate its cost and check again
limit, return error if limit is exceeded
* update abi that changed due to adding additional TransactionError
* To avoid counting stats mltiple times, only accumulate execute-timing when a bank is completed
* gate it by a feature
* move cost const def into block_cost_limits.rs
* redefine the cost for signature and account access, removed signer part as it is not well defined for now
* check if per_program_timings of execute_timings before sending
Add a test for snapshots that spins up AccountsBackgroundService,
AccountsHashVerifier, and SnapshotPackagerService.
Currently there is not a test for snapshots that spins up the background
services fully. This means that there's not a current test that I can
use when adding incremental snapshot support to these three services.
Fixes#19014
Filtering out storages for incremental snapshots will be needed by the
background services for incremental snapshot support, but there is not a
Bank at that point. Since the filtering doesn't apply only to Bank, and
more to snapshots, move the functionality into snapshot_utils.
AccountsHashVerifier will need access to both the full and incremental
snapshot archive interval slots config values, which is in the
SnapshotConfig.
Also, cleanup some `Option<>` params and their references.
While reviewing PR #18565, as issue was brought up to refactor some code
around verifying the bank after rebuilding from snapshots. A new
top-level function has been added to get the latest snapshot archives
and load the bank then verify. Additionally, new tests have been
written and existing tests have been updated to use this new function.
Fixes#18973
While resolving the issue, it became clear there was some additional
low-hanging fruit this change enabled. Specifically, the functions
`bank_to_xxx_snapshot_archive()` now return their respective
`SnapshotArchiveInfo`. And on the flip side,
`bank_from_snapshot_archives()` now takes `SnapshotArchiveInfo`s instead
of separate paths and archive formats. This bundling simplifies bank
rebuilding.
bank.get_leader_schedule_epoch(shred_slot)
is one epoch after epoch_schedule.get_epoch(shred_slot).
At epoch boundaries, shred is already one epoch after the root-slot. So
we need epoch-stakes 2 epochs ahead of the root. But the root bank only
has epoch-stakes for one epoch ahead, and as a result looking up epoch
staked-nodes from the root-bank fails.
To be backward compatible with the current master code, this commit
implements a fallback on working-bank if epoch staked-nodes obtained
from the root-bank is none.
If two threads simultaneously call into ClusterNodesCache::get for the
same epoch, and the cache entry is outdated, then both threads recompute
cluster-nodes for the epoch and redundantly overwrite each other.
This commit wraps ClusterNodesCache entries in Arc<Mutex<...>>, so that
when needed only one thread does the computations to update the entry.
* Current caching mechanism does not update cluster-nodes when the epoch
(and so epoch staked nodes) changes:
https://github.com/solana-labs/solana/blob/19bd30262/core/src/broadcast_stage/standard_broadcast_run.rs#L332-L344
* Additionally, the cache update has a concurrency bug in which the
thread which does compare_and_swap may be blocked when it tries to
obtain the write-lock on cache, while other threads will keep running
ahead with the outdated cache (since the atomic timestamp is already
updated).
In the new ClusterNodesCache, entries are keyed by epoch, and so if
epoch changes cluster-nodes will be recalculated. The time-to-live
eviction policy is also encapsulated and rigidly enforced.
The new cluster-nodes cache will:
* ensure cluster-nodes are recalculated if the epoch (and so the epoch
staked nodes) changes.
* encapsulate time-to-live eviction policy.
Cluster nodes are cached keyed by the respective epoch from which stakes
are obtained, and so if epoch changes cluster-nodes will be recomputed.
A time-to-live eviction policy is enforced to refresh entries in case
gossip contact-infos are updated.