This change sets the receive_window for non-staked nodes to 1 * PACKET_DATA_SIZE, and maps staked nodes' connections to a receive_window between 1.2 * PACKET_DATA_SIZE and 10 * PACKET_DATA_SIZE based on their stake.
The change is based on a Quinn library change that supports per-connection receive_window tuning on the server side: quinn-rs/quinn#1393
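A minimal sketch of the stake-to-window mapping, assuming simple linear interpolation (the helper name and exact scaling are illustrative, not the actual implementation):

```rust
// Minimal sketch (not the actual implementation): map a peer's stake to a
// per-connection receive window between 1.2x and 10x PACKET_DATA_SIZE.
const PACKET_DATA_SIZE: u64 = 1232; // value assumed for illustration

fn compute_receive_window(peer_stake: u64, total_stake: u64) -> u64 {
    if peer_stake == 0 || total_stake == 0 {
        // Non-staked peers get the minimum window.
        return PACKET_DATA_SIZE;
    }
    // Linear interpolation between 1.2x and 10x based on the stake fraction.
    let fraction = peer_stake as f64 / total_stake as f64;
    let multiplier = 1.2 + (10.0 - 1.2) * fraction;
    (multiplier * PACKET_DATA_SIZE as f64) as u64
}
```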
#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.
#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot. This allows us to reclaim disk space more often with
fewer IOps. Experimental results on mainnet validators show that the PR reduces
ledger disk usage by 33% to 40%.
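A hedged sketch of the file-level deletion idea, assuming rust-rocksdb's delete_file_in_range_cf API, big-endian u64 slot keys, and an illustrative column-family name:

```rust
// Hedged sketch: drop whole SST files whose slot range is entirely below
// lowest_cleanup_slot instead of waiting for compaction to reclaim the space.
use rocksdb::{Error, DB};

fn delete_files_below(db: &DB, lowest_cleanup_slot: u64) -> Result<(), Error> {
    // Column-family name is a placeholder for illustration.
    let cf = db.cf_handle("data_shred").expect("column family exists");
    // Only SST files fully contained in [0, lowest_cleanup_slot) are deleted
    // outright; boundary files are left for the normal purge/compaction path.
    db.delete_file_in_range_cf(cf, 0u64.to_be_bytes(), lowest_cleanup_slot.to_be_bytes())
}
```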
* Keypair: implement clone()
This was not implemented upstream in ed25519-dalek to force everyone to
think twice before creating another copy of a potentially sensitive
private key in memory.
See https://github.com/dalek-cryptography/ed25519-dalek/issues/76
However, there are now 9 instances of
Keypair::from_bytes(&keypair.to_bytes())
in the solana codebase and it would be preferable to have a function.
In particular since this also comes up when writing programs and can
cause users to either start messing with lifetimes or discover the
from_bytes() workaround themselves.
This patch opts to not implement the Clone trait. This avoids automatic
use in order to preserve some of the original "let developers think
twice about this" intention.
* Use Keypair::clone
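A minimal sketch of the explicit-copy approach described above, shown here as a standalone helper built on the existing from_bytes(to_bytes()) round-trip (the actual change adds an inherent method and deliberately does not implement the Clone trait):

```rust
use solana_sdk::signature::Keypair;

// Hedged sketch: an explicit, deliberate copy of a keypair, instead of a
// Clone impl that could be invoked implicitly.
fn clone_keypair(keypair: &Keypair) -> Keypair {
    Keypair::from_bytes(&keypair.to_bytes()).expect("round-trip of valid keypair bytes")
}
```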
Prior to this change, long-running commands like `solana-ledger-tool
verify` would OOM because AccountsDb cleanup was not happening.
Co-authored-by: Michael Vines <mvines@gmail.com>
* Concurrent replay slots
* Split out concurrent and single bank replay paths
* Sub function processing of replay results for readability
* Add feature switch for concurrent replay
* Only take the latest vote for each validator in gossip
Since the new vote updates are no longer incremental, there
is no value in storing intermediate votes.
* Address pr feedback
* Handle potential downgrade path, FullTowerVote -> Incremental
* Rename sent to bank -> gossip slot
* Handle downgrade case properly
* Only downgrade for newer votes and feature flag, ignore incremental votes otherwise
* Update test
* Use client certs in QUIC to get peer's stake
* fixes to cert processing
* integrate the code
* clippy
* more cleanup
* sort cargo deps
* test fixes
* info -> debug
Shreds have a different workload and traffic pattern from TPU vote and
transaction packets. Some of the recent changes to SigVerifyStage are not
suitable, or at least not optimal, for shreds sig-verify; e.g. random discard,
dedup with false positives, discarding excess by IP address, ...
The SigVerifier trait is meant to abstract out the distinctions between the
two pipelines, but in practice it has led to more verbose and convoluted
code.
This commit discards SigVerifier implementation for shreds sig-verify
and instead provides a standalone stage for verifying shreds signatures.
With recent patches, window-service recv-window does not do much other
than redirecting packets/shreds to downstream channels.
The commit removes window-service recv-window and instead sends
packets/shreds directly from sigverify to retransmit-stage and
window-service insert thread.
- Forward packets by prioritization in desc order
- Add support for cost-tracking by transaction-requested compute units
- Hook up account buckets to forwarder
- Add metrics for forwardable batches count
- Remove redundant invalid-packet filtering at end of slot, since the forwarder does the same when batching forwardable packets
- Add bench test for forwarding
* Remove UseQuic type
Move to storing the UdpSocket on ConnectionCache and accepting a bool
* Remove use_quic from ConnectionCache constructor
Replace with separate with_udp constructor to force callers to choose
Fully deserializing shreds in window-service before sending them to
retransmit stage adds latency to shreds propagation.
This commit instead channels through the payload and relies on only
partial deserialization of a few required fields: slot, shred-index,
shred-type.
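A minimal sketch of the partial-deserialization idea; the field offsets below are illustrative placeholders, not necessarily the real shred wire layout:

```rust
// Hedged sketch of partial deserialization: read only slot, shred-index and
// shred-type straight from the payload, without deserializing the full shred.
const OFFSET_SHRED_TYPE: usize = 64; // assumed: right after the 64-byte signature
const OFFSET_SLOT: usize = 65;
const OFFSET_INDEX: usize = 73;

fn get_shred_type(payload: &[u8]) -> Option<u8> {
    payload.get(OFFSET_SHRED_TYPE).copied()
}

fn get_slot(payload: &[u8]) -> Option<u64> {
    let bytes = payload.get(OFFSET_SLOT..OFFSET_SLOT + 8)?;
    Some(u64::from_le_bytes(bytes.try_into().ok()?))
}

fn get_index(payload: &[u8]) -> Option<u32> {
    let bytes = payload.get(OFFSET_INDEX..OFFSET_INDEX + 4)?;
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}
```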
* allow initial hash calc to occur in bg
* validator_initialized -> startup_verification_complete
* add infos for leader and vote
* rework snapshot for startup verification
* change to assert
slot_stats are submitted at a different cadence from the rest of
RetransmitStats. Current code erroneously clears slot_stats before
submitting any metrics.
Shred slot and parent are not verified until window-service where
resources are already wasted to sig-verify and deserialize shreds.
This commit moves above verification to earlier in the pipeline in fetch
stage.
Shreds are dropped in window-service if the slot leader is the node
itself:
https://github.com/solana-labs/solana/blob/cd2878acf/core/src/window_service.rs#L181-L185
However this is done after wasting resources verifying signature on
these shreds, and requires a redundant 2nd lookup of the slot leader.
This commit instead discards such shreds in sigverify stage where we
already know the leader for the slot.
Following commits will skip shred deserialization before retransmit, and
so we will only have a ShredId and not a fully deserialized shred to
obtain the shuffling seed from.
* Make sure to root local slots even with hard fork
* Address review comments
* Cleanup a bit
* Further clean up
* Further clean up a bit
* Add comment
* Tweak hard fork reconciliation code placement
In order to preserve current behavior, the threshold is set to the
current value of the argument to IndexedParallelIterator::with_min_len.
Follow-up commits will recalibrate this threshold to optimize
performance on mainnet-beta.
Shreds arriving at a node for retransmit tend to belong to the same slot
(or just a couple of different slots). The slot leader and cluster nodes
are the same for all shreds of a slot, so the common work to look up these
values can be factored out.
This commit first groups shreds by slot to factor out that common lookup
work.
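A minimal sketch of the group-by step, with shred handling simplified to a (slot, payload) pair for illustration:

```rust
use std::collections::HashMap;

type Slot = u64;

// Hedged sketch: bucket shreds by slot so the slot-leader and cluster-nodes
// lookups are done once per slot rather than once per shred.
fn group_by_slot(shreds: Vec<(Slot, Vec<u8>)>) -> HashMap<Slot, Vec<Vec<u8>>> {
    let mut grouped: HashMap<Slot, Vec<Vec<u8>>> = HashMap::new();
    for (slot, payload) in shreds {
        grouped.entry(slot).or_default().push(payload);
    }
    grouped
}
```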
* Define shuffle to prep using same shuffle for multiple slices
* Determine transaction indexes and plumb to execute_batch
* Pair transaction_index with transaction in TransactionStatusService
* Add new ReplicaTransactionInfoVersion
* Plumb transaction_indexes through BankingStage
* Prepare BankingStage to receive transaction indexes from PohRecorder
* Determine transaction indexes in PohRecorder; add field to WorkingBank
* Add PohRecorder::record unit test
* Only pass starting_transaction_index around PohRecorder
* Add helper structs to simplify test DashMap
* Pass entry and starting-index into process_entries_with_callback together
* Add tx-index checks to test_rebatch_transactions
* Revert shuffle definition and use zip/unzip
* Only zip/unzip if randomize
* Add confirm_slot_entries test
* Review nits
* Add type alias to make sender docs more clear
In preparation for
https://github.com/solana-labs/solana/pull/25807
which reworks erasure batch sizes, this commit:
* adds a helper function mapping the number of data shreds to the
erasure batch size.
* adds ProcessShredsStats to Shredder::entries_to_shreds in order to
replace and remove entries_to_data_shreds from the public interface.
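A hedged sketch of the helper's shape; the policy below is a placeholder, not the calibrated mapping used by the actual code:

```rust
// Hedged sketch: map data-shred count to total erasure batch size
// (data + coding shreds). Values here are illustrative only.
fn get_erasure_batch_size(num_data_shreds: usize) -> usize {
    // Illustrative policy: at least as many coding shreds as data shreds,
    // with a small floor for tiny batches.
    let num_coding_shreds = num_data_shreds.max(2);
    num_data_shreds + num_coding_shreds
}
```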
Shred versions are not verified until window-service where resources are
already wasted to sig-verify and deserialize shreds.
The commit verifies shred-version earlier in the pipeline in fetch stage.
RetransmitSlotStats can already be utilized to track when the first
shred for a slot was received; therefore
first_shreds_received: &Mutex<BTreeSet<Slot>>
is redundant. Sending update notifications after shreds retransmit will
also bypass the need for a mutex.
* To include forwarding counters in leader slot metrics
* Capture slot_end_detected time when checking leader slots, to be used in reporting later
* Simplify banking stage loop to report leader slot metrics
Co-authored-by: carllin <carl@solana.com>
* Connection pool in connection cache and handle connection errors
1. The connection cache now has a pool of connections per address, configurable, default 4
2. The connections per address share a lazily initialized endpoint
3. Handle connection issues better, avoid race conditions
4. Various log improvements to help debug connection issues
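A minimal sketch of the pooling idea under illustrative names (`QuicConnection`, the per-address pool, and round-robin selection are assumptions, not the actual types):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Arc;

struct QuicConnection; // stands in for the real connection type

// Hedged sketch: several connections per peer address (default pool size 4),
// handed out round-robin.
struct ConnectionPool {
    connections: Vec<Arc<QuicConnection>>,
    next: usize,
}

impl ConnectionPool {
    fn get(&mut self) -> Arc<QuicConnection> {
        let conn = self.connections[self.next % self.connections.len()].clone();
        self.next = self.next.wrapping_add(1);
        conn
    }
}

struct ConnectionCache {
    // One pool per destination address, created lazily on first use.
    pools: HashMap<SocketAddr, ConnectionPool>,
}
```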
#### Problem
Blockstore clean and compact during the wait-for-supermajority purge is quite slow and can
take 20-30 minutes, as described in #25710.
#### Summary of Changes
This PR removes the compaction logic in backup_and_clear_blockstore, as the
actual restoration from a bad fork is handled by `blockstore.purge_slots`
(which issues a rocksdb range-delete that makes the bad fork unavailable).
Compaction is irrelevant to the shred version: its main job in this context
is to reclaim disk storage from the deleted slots, which rocksdb's automatic
background compaction can handle.
Fixes #25710
* client: Remove static connection cache, plumb it instead
* Add TpuClient::new_with_connection_cache to not break downstream
* Refactor get_connection and RwLock into ConnectionCache
* Fix merge conflicts from new async TpuClient
* Remove `ConnectionCache::set_use_quic`
* Move DEFAULT_TPU_USE_QUIC to client, use ConnectionCache::default()
Add some CPU utilization metrics: number of vCPUs, clock frequency, average load across different time intervals, and total number of threads
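A minimal sketch of collecting a couple of these metrics with the standard library only (Linux /proc paths assumed; the actual change may gather them differently):

```rust
use std::fs;
use std::thread;

// Number of vCPUs available to the process.
fn num_vcpus() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

// 1-, 5- and 15-minute load averages, parsed from /proc/loadavg (Linux only).
fn load_average() -> Option<(f64, f64, f64)> {
    // /proc/loadavg looks like: "0.52 0.58 0.59 1/2030 12345"
    let contents = fs::read_to_string("/proc/loadavg").ok()?;
    let mut fields = contents.split_whitespace();
    Some((
        fields.next()?.parse().ok()?, // 1-minute average
        fields.next()?.parse().ok()?, // 5-minute average
        fields.next()?.parse().ok()?, // 15-minute average
    ))
}
```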
* Remove the args param from Measure::this since we don't ever use it
* banking_stage.rs: convert to measure!
* poh_recorder.rs: convert to measure!
* cost_update_service.rs: convert to measure!
* poh_service.rs: convert to measure!
* bank.rs: convert to measure!
* measure.rs: Remove Measure::this now that all have been converted to measure!
Packets are at the boundary of the system where, the vast majority of the
time, they are received from an untrusted source. Raw indexing into the
data buffer can open attack vectors if the offsets are invalid.
Validating offsets beforehand is verbose and error-prone.
The commit updates the Packet::data() API to take a SliceIndex and always
return an Option. Call-sites are thus forced to explicitly handle the
case where the offsets are invalid.
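A minimal sketch of the accessor's shape with simplified field names (the real Packet keeps the valid length in Packet.meta.size):

```rust
use std::slice::SliceIndex;

// Hedged sketch: indexing is bounds-checked against the valid size and always
// returns an Option instead of panicking or reading stale bytes.
struct Packet {
    buffer: [u8; 1232], // illustrative size
    size: usize,        // stands in for Packet.meta.size
}

impl Packet {
    fn data<I>(&self, index: I) -> Option<&I::Output>
    where
        I: SliceIndex<[u8]>,
    {
        // Only the first `size` bytes are valid to read.
        self.buffer.get(..self.size)?.get(index)
    }
}
```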
* Spawn QUIC server to receive forwarded txs
* Update validator port range
* forward votes using UDP
* no forwarding from unstaked nodes
* forwarding stats in banking stage
* fix test builds
* fix lifetime of forward sender
It used to report the number of packets with successful signature
validations but was accidentally changed to count packets passed into
the verifier by e4409a87fe.
This restores the previous meaning.
#### Problem
blockstore_db.rs and blockstore_metrics.rs have a mutual dependency.
#### Summary of Changes
This PR removes the mutual dependency by moving the options-related code
out of blockstore_db.rs into its new home, blockstore_options.rs.
This also makes the code cleaner.
* FindPacketSenderStake: Remove parallelism to improve performance
The work unit sizes were so small that using the thread pool
slowed down this stage significantly.
* fix checks
Co-authored-by: Justin Starry <justin@solana.com>
Indices for code and data shreds of the same slot overlap; and so they
will have the same random number generator seed when shuffling cluster
nodes for turbine broadcast.
This results in the same propagation path for code and data shreds of
the same index and an effectively smaller sample size of re-transmitter
nodes. For example, a 32:32 batch (32 code + 32 data shreds) is
retransmitted through _at most_ 32 unique nodes, whereas ideally we want
~64 unique re-transmitters.
This commit adds shred-type to the seed function so that code and data
shreds of the same (slot, index) will (most likely) have different
propagation paths.
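A minimal sketch of folding the shred type into the shuffle seed; the exact byte layout fed to the hash is an illustrative assumption:

```rust
use solana_sdk::hash::hashv;
use solana_sdk::pubkey::Pubkey;

// Hedged sketch: include the shred type in the turbine shuffle seed so coding
// and data shreds with the same (slot, index) get different retransmit paths.
fn retransmit_seed(slot: u64, index: u32, shred_type: u8, leader: &Pubkey) -> [u8; 32] {
    hashv(&[
        &slot.to_le_bytes(),
        &[shred_type],
        &index.to_le_bytes(),
        leader.as_ref(),
    ])
    .to_bytes()
}
```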
Bytes past Packet.meta.size are not valid to read from.
The commit makes the buffer field private and instead provides two
methods:
* Packet::data() which returns an immutable reference to the underlying
buffer up to Packet.meta.size. The rest of the buffer is not valid to
read from.
* Packet::buffer_mut() which returns a mutable reference to the entirety
of the underlying buffer to write into. The caller is responsible for
updating Packet.meta.size after writing to the buffer.
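A hedged usage sketch of the two accessors described above (assuming meta is still a public field, as described in this change):

```rust
use solana_sdk::packet::Packet;

// Write through buffer_mut(), then record how many bytes are valid in
// meta.size; subsequent reads via data() stop at that size.
fn write_payload(packet: &mut Packet, payload: &[u8]) -> Option<()> {
    packet
        .buffer_mut()
        .get_mut(..payload.len())?
        .copy_from_slice(payload);
    packet.meta.size = payload.len();
    Some(())
}
```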
Upcoming changes to PacketBatch to support variable sized packets will
modify the internals of PacketBatch. So, this change removes usage of
the internal packet struct and instead uses accessors (which are
currently just thin wrappers around Vec functions but will change down the
road).
* - get prioritization fee from compute_budget instruction;
- update compute_budget::process_instruction function to take instruction iter to support sanitized versioned message;
- updated runtime.md
* update transaction fee calculation to treat the prioritization fee rate as lamports per 10K CUs (see the sketch after this list)
* review changes
* fix test
* fix a bpf test
* fix bpf test
* patch feedback
* fix clippy
* fix bpf test
* feedback
* rename prioritization fee rate to compute unit price
* feedback
Co-authored-by: Justin Starry <justin@solana.com>
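A hedged sketch of the fee arithmetic described in the bullets above: a rate quoted in lamports per 10K compute units (later renamed to compute unit price), applied to the requested compute units. The constant name and the rounding choice are assumptions:

```rust
const COMPUTE_UNITS_PER_FEE_INCREMENT: u64 = 10_000;

// Hedged sketch of the prioritization-fee arithmetic, not the actual code.
fn prioritization_fee(fee_rate: u64, requested_compute_units: u64) -> u64 {
    // Round up so a partial 10K-CU increment still pays the full rate.
    fee_rate
        .saturating_mul(requested_compute_units)
        .saturating_add(COMPUTE_UNITS_PER_FEE_INCREMENT - 1)
        / COMPUTE_UNITS_PER_FEE_INCREMENT
}
```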
A VoteAccount may only wrap an account if the account owner is
solana_vote_program:id or equivalently this check returns true:
solana_vote_program::check_id(account.owner())
In addition to the thread_local -> lazy_static change, a number of thread pools are
initialized with get_max_thread_count to achieve parity with the older code in
terms of the number of validator threads.
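A minimal sketch of the lazy_static pattern, with `get_max_thread_count` and the pool name standing in for the actual helpers:

```rust
use lazy_static::lazy_static;
use rayon::ThreadPool;

// Hedged sketch: a single shared rayon pool sized by a max-thread-count
// helper, instead of a thread_local pool.
fn get_max_thread_count() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

lazy_static! {
    static ref PAR_THREAD_POOL: ThreadPool = rayon::ThreadPoolBuilder::new()
        .num_threads(get_max_thread_count())
        .thread_name(|i| format!("solParThread{i:02}"))
        .build()
        .expect("new rayon threadpool");
}
```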
* SigVerify: Add total time metrics for dedup/discard/verify
Previously it was impossible to determine the total time the stage spent
on these activities within a measurement window.
* SigVerify: Add _us postfix to time metrics
Shred::new_empty_data_shred returns an invalid shred (i.e.
shred.sanitize() returns error). The method is only used in tests and
can be easily replaced with Shred::new_from_data. To keep the shred API
surface small, this commit removes this method.