#### Problem
Blockstore operations such as get_slots_since() issue multiple individual
rocksdb::get() calls, which is not optimal for performance.
#### Summary of Changes
This PR adds LedgerColumn::multi_get() based on rocksdb::batched_multi_get(),
the optimized version of multi_get() where get requests are processed in
batches to minimize read I/O.
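For illustration, a minimal sketch of such a batched read using the
rust-rocksdb API (the key encoding and helper shape are assumptions, not the
actual LedgerColumn internals):
```
use rocksdb::{AsColumnFamilyRef, DBPinnableSlice, Error, DB};

// Hypothetical helper: fetch many slots' values in a single batched read.
// Assumes big-endian u64 keys; sorted_input = true tells RocksDB the keys
// are already ordered so it can batch more aggressively.
fn multi_get<'a>(
    db: &'a DB,
    cf: &impl AsColumnFamilyRef,
    slots: &[u64],
) -> Vec<Result<Option<DBPinnableSlice<'a>>, Error>> {
    let keys: Vec<[u8; 8]> = slots.iter().map(|s| s.to_be_bytes()).collect();
    db.batched_multi_get_cf(cf, &keys, /*sorted_input:*/ true)
}
```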
These methods are only used in tests, but when invoked on a Merkle shred they
will always invalidate the shred because the Merkle proof will no longer
verify. As a result the shred will not sanitize and the blockstore will
avoid inserting it. Their use in tests results in spurious test
coverage because the shreds will not be ingested.
The commit removes the implementation of these methods for Merkle shreds.
Follow-up commits will entirely remove these methods from the shreds API.
Add ledger-tool command print-file-metadata
#### Summary of Changes
This PR adds a ledger tool subcommand print-file-metadata.
```
USAGE:
solana-ledger-tool print-file-metadata [FLAGS] [OPTIONS] [SST_FILE_NAME]
Prints the metadata of the specified ledger-store file.
If no file name is specified, it will print the metadata of all ledger files
```
#### Summary of Changes
Add code comments for lowest_cleanup_slot related functions to improve
readability and to document the consistency between blockstore purge logic
and the read side.
#### Problem
RocksDB's delete_range applies to [from, to) while delete_file_in_range
applies to [from, to] by default, and the rust-rocksdb api does not include
the option to make delete_file_in_range apply to [from, to). Such inconsistency
might cause `blockstore::run_purge` to produce an inconsistent result as it
invokes both delete_range and delete_file_in_range.
#### Summary of Changes
This PR makes all our purge / delete related functions inclusive of both
starting and ending slots.
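A sketch of what normalizing both primitives to an inclusive [from, to] slot
range can look like (big-endian key encoding and the helper name are
assumptions):
```
use rocksdb::{AsColumnFamilyRef, Error, DB};

// Hypothetical helper: purge slots from..=to in one column family.
fn purge_range_inclusive(
    db: &DB,
    cf: &impl AsColumnFamilyRef,
    from: u64,
    to: u64, // assumed < u64::MAX
) -> Result<(), Error> {
    // delete_range_cf excludes the end key, so bump `to` by one slot.
    db.delete_range_cf(cf, from.to_be_bytes(), (to + 1).to_be_bytes())?;
    // delete_file_in_range_cf is inclusive on both ends by default.
    db.delete_file_in_range_cf(cf, from.to_be_bytes(), to.to_be_bytes())
}
```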
Tenets:
1. Limit thread names to 15 characters
2. Prefix all Solana-controlled threads with "sol"
3. Use Camel case. It's more character dense than Snake or Kebab case
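For example, a thread spawned under these tenets might look like this (the
name is illustrative):
```
use std::thread;

fn spawn_named() -> thread::JoinHandle<()> {
    // "sol" prefix + CamelCase; the whole name fits within 15 characters.
    thread::Builder::new()
        .name("solReplayShred".to_string()) // 14 chars
        .spawn(|| { /* thread body */ })
        .unwrap()
}
```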
The commit
* Identifies Merkle shreds when recovering from erasure codes and
dispatches specialized code to reconstruct shreds.
* Adds coding shred headers to recovered erasure shards.
* Reconstructs the Merkle tree for the erasure batch and adds it to
recovered shreds.
* Attaches the common signature (for the root of the Merkle tree) to all
recovered shreds.
#### Problem
get_entries_in_data_block() panics when there's an inconsistency between
slot_meta and data_shred.
However, as we don't lock on reads, reading across multiple column families is
not atomic (especially for older slots) and thus does not guarantee consistency
as the background cleanup service could purge the slot in the middle. Such
panic was reported in #26980 when the validator serves a high load of RPC calls.
#### Summary of Changes
This PR makes get_entries_in_data_block() panic only when the inconsistency
between slot-meta and data-shred happens on a slot newer than lowest_cleanup_slot.
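A sketch of the intended control flow (the error variant and exact guard are
assumptions):
```
enum BlockstoreError { SlotCleanedUp } // variant name is an assumption

fn on_inconsistency(slot: u64, lowest_cleanup_slot: u64) -> Result<(), BlockstoreError> {
    if slot <= lowest_cleanup_slot {
        // The slot may have been purged concurrently; return an error
        // instead of panicking.
        Err(BlockstoreError::SlotCleanedUp)
    } else {
        // Inconsistency on a live slot is a real invariant violation.
        panic!("Missing shred for slot {slot}")
    }
}
```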
As a consequence of removing buffering when generating coding shreds:
https://github.com/solana-labs/solana/pull/25807
more coding shreds are generated than data shreds, and so
MAX_CODE_SHREDS_PER_SLOT needs to be adjusted accordingly.
The respective value is tied to ERASURE_BATCH_SIZE.
Given the 32:32 erasure recovery schema, the current implementation requires
exactly 32 data shreds to generate coding shreds for the batch (except
for the final erasure batch in each slot).
As a result, when serializing ledger entries to data shreds, if the
number of data shreds is not a multiple of 32, the coding shreds for the
last batch cannot be generated until there are more data shreds to
complete the batch to 32 data shreds. This adds latency in generating
and broadcasting coding shreds.
In addition, with Merkle variants for shreds, data shreds cannot be
signed and broadcasted until coding shreds are also generated. As a
result *both* code and data shreds will be delayed before broadcast if
we still require exactly 32 data shreds for each batch.
This commit instead always generates and broadcasts coding shreds as soon
as any number of data shreds is available. When serializing entries
to shreds:
* if the number of resulting data shreds is less than 32, then more
coding shreds will be generated so that the resulting erasure batch
has the same recovery probabilities as a 32:32 batch.
* if the number of data shreds is more than 32, then the data shreds are
split uniformly into erasure batches with _at least_ 32 data shreds in
each batch. Each erasure batch will have the same number of code and
data shreds.
For example:
* If there are 19 data shreds, 27 coding shreds are generated. The
resulting 19(data):27(code) erasure batch has the same recovery
probabilities as a 32:32 batch.
* If there are 107 data shreds, they are split into 3 batches of 36:36,
36:36 and 35:35 data:code shreds each.
A consequence of this change is that code and data shred indices will
no longer align, as there will be more coding shreds than data shreds
(not only in the last batch of each slot but also in the intermediate
ones).
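A sketch of the uniform split described above (the helper name is
hypothetical; 32 data shreds per batch is the minimum):
```
const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

// Hypothetical helper: split num_data_shreds into erasure batches of at
// least 32 data shreds each, as uniformly as possible.
fn data_shreds_per_batch(num_data_shreds: usize) -> Vec<usize> {
    if num_data_shreds <= DATA_SHREDS_PER_FEC_BLOCK {
        return vec![num_data_shreds];
    }
    let num_batches = num_data_shreds / DATA_SHREDS_PER_FEC_BLOCK;
    let base = num_data_shreds / num_batches;
    let rem = num_data_shreds % num_batches;
    // The first `rem` batches carry one extra data shred.
    (0..num_batches).map(|i| base + usize::from(i < rem)).collect()
}

// data_shreds_per_batch(107) == [36, 36, 35], matching the example above.
```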
#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.
#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot. This allows us to reclaim disk space more often with
fewer IOps. Experimental results on mainnet validators show that the PR can effectively
reduce ledger disk size by 33% to 40%.
Prior to this change, long running commands like `solana-ledger-tool
verify` would OOM due to AccountsDb cleanup not happening.
Co-authored-by: Michael Vines <mvines@gmail.com>
* Concurrent replay slots
* Split out concurrent and single bank replay paths
* Sub function processing of replay results for readability
* Add feature switch for concurrent replay
#### Problem
Ledger-tool doesn't support any shred-compaction-type other than the default RocksDB level compaction.
#### Summary of Changes
This PR enables ledger-tool to automatically detect the shred-compaction-type of the specified ledger.
#### Test Plan
New ledger-tool tests are added for both level and fifo compactions.
* Adjusts test cases for stricter requirements.
* Removes account reset in deserialization test.
* Removes verify related test cases.
* Replicates account modification verification logic of PreAccount in BorrowedAccount.
* Adds TransactionContext::account_touched_flags.
* Adds account modification verification to the BPF ABIv0 and ABIv1 deserialization, CPI syscall and program-test.
* Replicates the total sum of all lamports verification of PreAccounts in InstructionContext
* Check that the caller's instruction balance is maintained during a call / push.
* Replicates PreAccount statistics in TransactionContext.
* Disable verify() and verify_and_update() if the feature enable_early_verification_of_account_modifications is enabled.
* Moves Option<Rent> of enable_early_verification_of_account_modifications into TransactionContext::new().
* Relaxes AccountDataMeter related test cases.
* Don't touch the account if nothing changes.
* Adds two tests to trigger InstructionError::UnbalancedInstruction.
Co-authored-by: Justin Starry <justin@solana.com>
#### Summary of Changes
The Blockstore::blockstore_directory() function takes a ShredStorageType and returns a path.
As it is a ShredStorageType-related function, moving it under ShredStorageType as a
member function is a better fit.
In addition, as we start doing automatic shred-compaction-type selection under ledger-tool,
it becomes more convenient to reuse the function and const under the blockstore_options
namespace instead of blockstore.
#### Summary of Changes
Define PERF_METRIC_OP_NAME_PUT and PERF_METRIC_OP_NAME_WRITE_BATCH
to replace repetitive / hard-coded operation names for report_rocksdb_write_perf.
#### Problem
report_rocksdb_read_perf() always uses the hard-coded operation name "get"
#### Summary of Changes
As we will add a new read operation -- multi_get(), report_rocksdb_read_perf()
needs to have an input parameter for operation name.
* slots_connected check used by ledger-tool should not require a full slot for snapshot slot
* Cleaner Result<Option<>> unwrap/default
* return false if no meta for starting slot
* Add clarifying comments
Fully deserializing shreds in window-service before sending them to
retransmit stage adds latency to shred propagation.
This commit instead channels through the payload and relies on only
partial deserialization of a few required fields: slot, shred-index,
shred-type.
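A sketch of such partial deserialization, assuming the common-header layout
(64-byte signature, 1-byte shred type/variant at offset 64, then
little-endian u64 slot and u32 index):
```
// Hypothetical accessors reading fixed offsets straight from the payload.
fn get_slot(payload: &[u8]) -> Option<u64> {
    Some(u64::from_le_bytes(payload.get(65..73)?.try_into().ok()?))
}

fn get_index(payload: &[u8]) -> Option<u32> {
    Some(u32::from_le_bytes(payload.get(73..77)?.try_into().ok()?))
}
```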
Shred slot and parent are not verified until window-service, where
resources have already been wasted to sig-verify and deserialize shreds.
This commit moves the above verification earlier in the pipeline, into
fetch stage.
Following commits will skip shred deserialization before retransmit, and
so we will only have a ShredId, and not a fully deserialized shred, to
obtain the shuffling seed from.
* Make sure to root local slots even with hard fork
* Address review comments
* Cleanup a bit
* Further clean up
* Further clean up a bit
* Add comment
* Tweak hard fork reconciliation code placement
* Define shuffle to prep using same shuffle for multiple slices
* Determine transaction indexes and plumb to execute_batch
* Pair transaction_index with transaction in TransactionStatusService
* Add new ReplicaTransactionInfoVersion
* Plumb transaction_indexes through BankingStage
* Prepare BankingStage to receive transaction indexes from PohRecorder
* Determine transaction indexes in PohRecorder; add field to WorkingBank
* Add PohRecorder::record unit test
* Only pass starting_transaction_index around PohRecorder
* Add helper structs to simplify test DashMap
* Pass entry and starting-index into process_entries_with_callback together
* Add tx-index checks to test_rebatch_transactions
* Revert shuffle definition and use zip/unzip
* Only zip/unzip if randomize
* Add confirm_slot_entries test
* Review nits
* Add type alias to make sender docs more clear
In preparation for
https://github.com/solana-labs/solana/pull/25807
which reworks erasure batch sizes, this commit:
* adds a helper function mapping the number of data shreds to the
erasure batch size.
* adds ProcessShredsStats to Shredder::entries_to_shreds in order to
replace and remove entries_to_data_shreds from the public interface.
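A sketch of the shape of such a helper; the table values below are
illustrative, chosen to be consistent with the examples in this log
(e.g. 19 data shreds -> a 46-shred batch of 19 data + 27 code):
```
// Hypothetical lookup: total erasure batch size (data + code shreds) for a
// given number of data shreds, preserving ~32:32 recovery probability.
const ERASURE_BATCH_SIZE: [usize; 33] = [
    0, 18, 20, 22, 23, 25, 27, 28, 30, 32, 33, 35, 36, 38, 39, 41, 42, 43,
    45, 46, 48, 49, 51, 52, 53, 55, 56, 58, 59, 60, 62, 63, 64,
];

fn get_erasure_batch_size(num_data_shreds: usize) -> usize {
    ERASURE_BATCH_SIZE
        .get(num_data_shreds)
        .copied()
        .unwrap_or(2 * num_data_shreds)
}
```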
* working on local snapshot
* Parallelization for slot storage minimization
* Additional clean-up and fixes
* make --minimize an option of create-snapshot
* remove now unnecessary function
* Parallelize parts of minimized account set generation
* clippy fixes
* Add rent collection accounts and voting node_pubkeys
* Simplify programdata_accounts generation
* Loop over storages to get slot set
* Parallelize minimized slot set generation
* Parallelize adding owners and programdata_accounts
* Remove some now unnecessary checks on the blockstore
* Add a warning for minimized snapshots across epoch boundary
* Simplify ledger-tool minimize
* Clarify names of bank's minimization helper functions
* Remove unnecessary function, fix line spacing
* Use DashSets instead of HashSets for minimized account and slot sets
* Filter storages uses all threads instead of thread_pool
* Add some additional comments on functions for minimization
* Moved more into bank and parallelized
* Update programs/bpf/Cargo.lock for dashmap in ledger
* Clippy fix
* ledger-tool: convert minimize_bank_for_snapshot Measure into measure!
* bank.rs: convert minimize_bank_for_snapshot Measure into measure!
* accounts_db.rs: convert minimize_accounts_db Measure into measure!
* accounts_db.rs: add comment about use of minimize_accounts_db
* ledger-tool: CLI argument clarification
* minimization functions: make infos unique
* bank.rs: Add test_get_rent_collection_accounts_between_slots
* bank.rs: Add test_minimization_add_vote_accounts
* bank.rs: Add test_minimization_add_stake_accounts
* bank.rs: Add test_minimization_add_owner_accounts
* bank.rs: Add test_minimization_add_programdata_accounts
* accounts_db.rs: Add test_minimize_accounts_db
* bank.rs: Add negative case and comments in test_get_rent_collection_accounts_between_slots
* bank.rs: Negative test in test_minimization_add_programdata_accounts
* use new static runtime and sdk ids
* bank comments to doc comments
* Only need to insert the maximum slot a key is found in
* rename remove_pubkeys to purge_pubkeys
* add comment on builtins::get_pubkeys
* prevent excessive logging of removed dead slots
* don't need to remove slot from shrink slot candidates
* blockstore.rs: get_accounts_used_in_range shouldn't return Result
* blockstore.rs: get_accounts_used_in_range: parallelize slot loop
* report filtering progress on time instead of count
* parallelize loop over snapshot storages
* WIP: move some bank minimization functionality into a new class
* WIP: move some accounts_db minimization functionality into SnapshotMinimizer
* WIP: Use new SnapshotMinimizer
* SnapshotMinimizer: fix use statements
* remove bank and accounts_db minimization code, where possible
* measure! doesn't take a closure
* fix use statement in blockstore
* log_dead_slots does not need pub(crate)
* get_unique_accounts_from_storages does not need pub(crate)
* different way to get stake accounts/nodes
* fix tests
* move rent collection account functionality to snapshot minimizer
* move accounts_db minimize behavior to snapshot minimizer
* clean up
* Use bank reference instead of Arc. Additional comments
* Add a comment to blockstore function
* Additional clarifying comments
* Moved all non-transaction account accumulation into the SnapshotMinimizer.
* transaction_account_set does not need to be mutable now
* Add comment about load_to_collect_rent_eagerly
* Update log_dead_slots comment
* remove duplicate measure/print of get_minimized_slot_set
Shred versions are not verified until window-service, where resources have
already been wasted to sig-verify and deserialize shreds.
The commit verifies shred-version earlier in the pipeline, in fetch stage.
A slot may be purged from the blockstore with clear_unconfirmed_slot().
If the slot is added back, the slot should only exist once in its
parent's SlotMeta::next_slots Vec. Prior to this change, repeated clearing
and re-adding of a slot could result in the slot existing in the parent's
next_slots multiple times. The result is that the next time the
parent slot is processed (node restart or ledger-tool replay), the slot
could be added to the queue of slots to replay multiple times.
Added a test that failed before the change and passes with it.
#### Problem
Currently, the creation of ShredStorageType::RocksFifo is hard coded in validator/src/main.rs.
But this common code will also need to be used in other places like ledger-tool.
#### Summary of Changes
This PR creates a helper function ShredStorageType::rocks_fifo that takes a total shred_storage_size
and equally allocates it to data-shred and coding-shred storage.
Coding shreds can only be signed once erasure codings are already
generated. Therefore coding shreds recovered from erasure codings lack
slot leader's signature and so cannot be retransmitted to the rest of
the cluster.
shred/merkle.rs implements a new shred variant where we generate a merkle
tree for each erasure encoded batch, and each shred includes:
* root of the merkle tree (Hash truncated to 20 bytes).
* slot leader's signature of the root of the merkle tree.
* merkle tree nodes along the branch the shred belongs to, where hashes
are trimmed to 20 bytes during tree construction.
This schema results in the same signature for all shreds within an
erasure batch.
When recovering shreds from erasure codes, we can reconstruct the merkle
tree for the batch and, for each recovered shred, also recover the
respective merkle tree branch; then snap the slot leader's signature from
any of the shreds received from turbine and retransmit all recovered code
or data shreds.
Backward compatibility is achieved by encoding the shred variant at byte 65
of the payload (previously the shred-type at this position):
* 0b0101_1010 indicates a legacy coding shred, which is also equal to
ShredType::Code for backward compatibility.
* 0b1010_0101 indicates a legacy data shred, which is also equal to
ShredType::Data for backward compatibility.
* 0b0100_???? indicates a merkle coding shred with merkle branch size
indicated by the last 4 bits.
* 0b1000_???? indicates a merkle data shred with merkle branch size
indicated by the last 4 bits.
Merkle root and branch are encoded at the end of the shred payload.
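A minimal sketch of the node hashing described above, assuming SHA-256 via
solana_sdk and omitting any domain-separation prefixes the real
implementation may add:
```
use solana_sdk::hash::{hashv, Hash};

// Join two child nodes; hashes are trimmed to 20 bytes during tree
// construction, as described above.
fn join_nodes(lhs: &Hash, rhs: &Hash) -> Hash {
    hashv(&[&lhs.as_ref()[..20], &rhs.as_ref()[..20]])
}
```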
#### Problem
When FIFO compaction is used, the size ratio between data shred and coding
shred is set to 1:1 based on the `--rocksdb_fifo_shred_storage_size` arg.
However, BlockstoreRocksFifoOptions::default() uses a slightly optimized
5:4 ratio instead, and the default() function is only used in benchmarks.
#### Summary of Changes
This PR makes both the validator argument and BlockstoreRocksFifoOptions::default()
use a 1:1 ratio between data and coding shred size.
Packets are at the boundary of the system where, the vast majority of the
time, they are received from an untrusted source. Raw indexing into the
data buffer can open attack vectors if the offsets are invalid.
Validating offsets beforehand is verbose and error-prone.
The commit updates the Packet::data() api to take a SliceIndex and to
always return an Option. The call-sites are thus forced to explicitly
handle the case where the offsets are invalid.
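A sketch of the described API shape (the struct fields here are simplified
assumptions):
```
use std::slice::SliceIndex;

struct Meta { size: usize }
struct Packet { buffer: [u8; 1232], meta: Meta } // sizes illustrative

impl Packet {
    // Checked access: any invalid offset yields None instead of a panic.
    fn data<I>(&self, index: I) -> Option<&I::Output>
    where
        I: SliceIndex<[u8]>,
    {
        self.buffer.get(..self.meta.size)?.get(index)
    }
}
```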
In preparation for
https://github.com/solana-labs/solana/pull/25237
which adds a new shred variant with merkle tree branches, the commit
embeds versioning into the shred binary by encoding a new ShredVariant type
at byte 65 of the payload, replacing the previous ShredType at this offset.
enum ShredVariant {
LegacyCode, // 0b0101_1010
LegacyData, // 0b1010_0101
}
* 0b0101_1010 indicates a legacy coding shred, which is also equal to
ShredType::Code for backward compatibility.
* 0b1010_0101 indicates a legacy data shred, which is also equal to
ShredType::Data for backward compatibility.
Following commits will add merkle variants to this type:
enum ShredVariant {
LegacyCode, // 0b0101_1010
LegacyData, // 0b1010_0101
MerkleCode(/*proof_size:*/ u8), // 0b0100_????
MerkleData(/*proof_size:*/ u8), // 0b1000_????
}
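A sketch of decoding the variant byte under the stated bit patterns:
```
enum ShredVariant {
    LegacyCode,     // 0b0101_1010
    LegacyData,     // 0b1010_0101
    MerkleCode(u8), // 0b0100_????
    MerkleData(u8), // 0b1000_????
}

fn shred_variant_from_byte(byte: u8) -> Option<ShredVariant> {
    match byte {
        0b0101_1010 => Some(ShredVariant::LegacyCode),
        0b1010_0101 => Some(ShredVariant::LegacyData),
        _ => match byte & 0xF0 {
            0x40 => Some(ShredVariant::MerkleCode(byte & 0x0F)),
            0x80 => Some(ShredVariant::MerkleData(byte & 0x0F)),
            _ => None,
        },
    }
}
```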
Fix pre-check of blockstore slots during load_bank_forks. Now iterates from starting_slot to halt_slot via slot_meta.next_slots to confirm they are connected.
Data shreds are batched into MAX_DATA_SHREDS_PER_FEC_BLOCK shreds for
each erasure batch. If there are residual shreds not making a full
batch, then we cannot generate coding shreds and need to buffer shreds
until there is a full batch; this may add latency to coding shred
generation and broadcast.
In order to evaluate upcoming changes removing this buffering logic,
this commit adds metrics tracking residual number of data shreds which
don't make a full batch.
#### Summary of Changes
Use the new datapoint macro that supports group-by for RocksDB column family metrics.
By using the new macro, we can further remove large chunks of boilerplate code that try to work around the previous datapoint macro that does not support group-by.
#### Problem
blockstore_db.rs has a mutual dependency with blockstore_metrics.rs.
#### Summary of Changes
This PR removes the mutual dependency by moving the option-related stuff
out from blockstore_db.rs to its new home --- blockstore_options.rs.
By doing this, we address the mutual dependency and also make the code cleaner.
Indices for code and data shreds of the same slot overlap; and so they
will have the same random number generator seed when shuffling cluster
nodes for turbine broadcast.
This results in the same propagation path for code and data shreds of
the same index, and effectively a smaller sample size of re-transmitter
nodes. For example, a 32:32 batch (32 code + 32 data shreds) is
retransmitted through _at most_ 32 unique nodes, whereas ideally we want
~64 unique re-transmitters.
This commit adds shred-type to the seed function so that code and data
shreds of the same (slot, index) will (most likely) have different
propagation paths.
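A sketch of folding shred-type into the seed (the exact field order and
hash inputs are assumptions):
```
use solana_sdk::{hash::hashv, pubkey::Pubkey};

fn retransmit_seed(slot: u64, index: u32, shred_type: u8, leader: &Pubkey) -> [u8; 32] {
    // Including shred_type makes code and data shreds of the same
    // (slot, index) hash to different shuffle seeds.
    hashv(&[
        &slot.to_le_bytes(),
        &[shred_type],
        &index.to_le_bytes(),
        leader.as_ref(),
    ])
    .to_bytes()
}
```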
Bytes past Packet.meta.size are not valid to read from.
The commit makes the buffer field private and instead provides two
methods:
* Packet::data() which returns an immutable reference to the underlying
buffer up to Packet.meta.size. The rest of the buffer is not valid to
read from.
* Packet::buffer_mut() which returns a mutable reference to the entirety
of the underlying buffer to write into. The caller is responsible for
updating Packet.meta.size after writing to the buffer.
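A self-contained sketch of the described split (the field layout and the
UDP receive are simplified stand-ins):
```
use std::net::UdpSocket;

struct Meta { size: usize }
struct Packet { buffer: [u8; 1232], meta: Meta }

impl Packet {
    // Entire buffer, writable; caller must update meta.size afterwards.
    fn buffer_mut(&mut self) -> &mut [u8] {
        &mut self.buffer[..]
    }
    // Only the valid prefix is readable.
    fn data(&self) -> &[u8] {
        &self.buffer[..self.meta.size]
    }
}

fn recv_into(socket: &UdpSocket, packet: &mut Packet) -> std::io::Result<()> {
    let nbytes = socket.recv(packet.buffer_mut())?;
    packet.meta.size = nbytes;
    let _payload: &[u8] = packet.data(); // bounded by meta.size
    Ok(())
}
```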
Upcoming changes to PacketBatch to support variable sized packets will
modify the internals of PacketBatch. So, this change removes usage of
the internal packet struct and instead uses accessors (which are
currently just wrappers around Vec functions but will change down the
road).
#### Problem
The current RocksDB read/write perf metrics do not include the total operation nanos
and thus we have to include all fields that might contribute to the total operation nanos.
#### Summary of Changes
This PR includes the total operation nanos in RocksDB's read/write perf and reduces the
number of reported fields in its perf metric.
#### Problem
When the number of RocksDB read/write operations spikes, its payload size
might exceed the limit (413 Payload Too Large).
#### Summary of Changes
This PR rate-limits the perf-sampling of RocksDB read/write operations to at
most once per second, in addition to the existing sampling that is configurable
via the hidden validator argument --rocksdb-perf-sample-interval.
Working towards revising the shred struct to embed versioning so that a new
variant can contain merkle tree hashes of the erasure batch. To ease the
migration, the commit adds more type-safety by distinguishing data vs
code shreds at the type level.
Additionally having both data and coding headers in each shred is
redundant as only one is relevant for each shred. The revised shred type
in this commit will only have one type-specific header.
https://github.com/solana-labs/solana/blob/c785f1ffc/ledger/src/shred.rs#L198-L203
Adding const_assert_eq:
* Documents explicitly what the constants are equal to.
* Prevents introducing bugs by silently changing the constants as the
code is updated.
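For instance, with the static_assertions crate (the constant shown is
illustrative):
```
use static_assertions::const_assert_eq;

const MAX_DATA_SHREDS_PER_SLOT: usize = 32_768;
// Fails to compile if the constant silently drifts from its documented value.
const_assert_eq!(MAX_DATA_SHREDS_PER_SLOT, 32_768);
```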
Added multiple blockstore read threads.
Run the bigtable upload in tokio::spawn context.
Run bigtable tx and tx-by-addr uploads in tokio::spawn context.
#### Problem
After #25042, each LedgerColumn has its own BlockstoreRocksDbWritePerfMetrics
and BlockstoreRocksDbReadPerfMetrics instances. As it has total ownership,
its member field does not need to use Arc.
#### Summary of Changes
Change perf_samples_counter from Arc<AtomicUsize> to AtomicUsize
under BlockstoreRocksDbWritePerfMetrics and BlockstoreRocksDbReadPerfMetrics.
* Add ConfirmedBlockUploadConfig, no behavior changes
* Add comment
* A little DRY cleanup
* Add configurable limit to number of blocks to check in Blockstore and Bigtable before uploading
* Limit blockstore and bigtable look-ahead
* Exit iterator early when reach ending_slot
* Use rooted_slot_iterator instead of slot_meta_iterator
* Only check blocks in the ledger
* Added additional costs to block capacity computation, and pushed alloc of CostModel all the way to the top of the call chain, instead of reallocing
* Fix two compiler errors
* Update block processing to propagate computed costs, rather than re-computing deeper in the call stack
* Clippy fix
* Reformatting fix after merge
* Add CostModel::sum_without_bpf
#### Problem
LedgerColumnOptions contains two fields, perf_read_counter and perf_write_counter,
that are not really options but internal counters.
#### Summary of Changes
This PR introduces BlockstoreRocksDbPerfSamplingStatus, a struct that holds internal
status for RocksDB perf sampling and moves perf_read_counter and perf_write_counter
out from LedgerColumnOptions.
A VoteAccount may only wrap an account if the account owner is
solana_vote_program::id, or equivalently this check returns true:
solana_vote_program::check_id(account.owner())
In addition to the thread_local -> lazy_static change, a number of thread-pools are
initialized with get_max_thread_count to achieve parity with the older code in
terms of the number of validator threads.
#### Problem
blockstore_db.rs is becoming too big.
#### Summary of Changes
Move BlockstoreRocksDbColumnFamilyMetrics to blockstore_metric.rs out from blockstore_db.rs.
#### Problem
blockstore_db.rs is becoming too big.
#### Summary of Changes
Move trait ColumnMetrics and metric-macros to blockstore_metric.rs out from blockstore_db.rs.
#### Problem
blockstore_db.rs is becoming too big.
#### Summary of Changes
This PR creates blockstore_metric.rs and moves metric-related functions out from blockstore_db.rs.
In preparation for the upcoming changes to the shred struct, the
hard-coded tests added in this commit ensure that shreds remain backward
compatible when serialized and de-serialized.
#### Summary of Changes
This PR replaces the use of thread_rng in RocksDB perf metric samples with an
AtomicU32 using Ordering::Relaxed, to improve the performance of determining
whether to sample the current RocksDB read/write perf metric.
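A sketch of the counter-based check (names are hypothetical):
```
use std::sync::atomic::{AtomicU32, Ordering};

// Sample once every sample_interval operations; Relaxed ordering suffices
// since the counter only gates metrics collection.
fn should_sample(counter: &AtomicU32, sample_interval: u32) -> bool {
    if sample_interval == 0 {
        return false; // sampling disabled
    }
    counter.fetch_add(1, Ordering::Relaxed) % sample_interval == 0
}
```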
The extra wrapping and indirection by the Session struct is not used in
any form. The commit removes Session and instead uses Reed-Solomon
constructs directly.
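A minimal sketch of driving the reed-solomon-erasure crate directly (shard
sizes and counts are illustrative):
```
use reed_solomon_erasure::galois_8::ReedSolomon;

fn encode_batch() -> Result<(), reed_solomon_erasure::Error> {
    // 32 data shards + 32 parity shards, all of equal length.
    let rs = ReedSolomon::new(32, 32)?;
    let mut shards: Vec<Vec<u8>> = vec![vec![0u8; 1024]; 64];
    // The first 32 shards are data; encode fills the 32 parity shards.
    rs.encode(&mut shards)?;
    Ok(())
}
```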
#### Problem
Currently, the number of RocksDB perf samples is controlled by an env arg
which is later handled using a lazy_static variable. However, there is a known
performance overhead of using lazy_static as mentioned in
https://github.com/solana-labs/solana/pull/6472.
#### Summary of Changes
Instead, this PR uses a hidden validator argument, --rocksdb-perf-sample-interval,
for controlling how often RocksDB read/write performance sample is collected.
Removed Default implementation for ShredType. ShredType should always be
explicitly specified, and not rely on default values.
Simplified single-arg Shred Error variants to use shorter syntax.
Renamed erasure blocks to shards, to be consistent with the reed_solomon
crate and to avoid confusion with FEC blocks.
#### Problem
The RocksDB wrapper, `Rocks`, under blockstore_db is currently implemented
as a tuple struct with unnamed fields. Accessing its fields requires syntax
like `self.0`, which limits readability.
#### Summary of Changes
This PR converts Rocks from a tuple struct to a regular struct so that it
has more human-readable field names.
Shred::new_empty_data_shred returns an invalid shred (i.e.
shred.sanitize() returns error). The method is only used in tests and
can be easily replaced with Shred::new_from_data. To keep the shred api
surface small, this commit removes this method.
* nonblocking send when dropping banks
* detect and report drop signal queue full/disconnect events
* comments
* use counter for reporting bank_drop_queue events
* reduce log
* use datapoint to report stats
* logging instead of reporting bank drop signal full
* fix a corner case for reporting
* fix build
#### Problem
Currently, even if SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K == 0, we still do
the sampling check for every RocksDB read:
```
thread_rng().gen_range(0, METRIC_SAMPLES_1K) > *ROCKSDB_PERF_CONTEXT_SAMPLES_IN_1K
```
#### Summary of Changes
This PR skips the sampling check when SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K
is set to 0.