Shred versions are not verified until window-service where resources are
already wasted to sig-verify and deserialize shreds.
The commit verifies shred-version earlier in the pipeline in fetch stage.
A slot may be purged from the blockstore with clear_unconfirmed_slot().
If the slot is added back, the slot should only exist once in its'
parent SlotMeta::next_slots Vec. Prior to this change, repeated clearing
and re-adding of a slot could result in the slot existing in parent's
next_slots multiple times. The result is that if the next time the
parent slot is processed (node restart or ledger-tool-replay), slot
could be added to the queue of slots to play multiple times.
Added test that failed before change and works now as well
#### Problem
Currently, the creation of ShredStorageType::RocksFifo is hard coded in validator/src/main.rs.
But this common code will also need to be used in other places like ledger-tool.
#### Summary of Changes
This PR creates a helper functionShredStorageType::rocks_fifo that takes a total shred_storage_size
and equally allocates to data-shred and coding-shred storage.
Coding shreds can only be signed once erasure codings are already
generated. Therefore coding shreds recovered from erasure codings lack
slot leader's signature and so cannot be retransmitted to the rest of
the cluster.
shred/merkle.rs implements a new shred variant where we generate merkle
tree for each erasure encoded batch and each shred includes:
* root of the merkle tree (Hash truncated to 20 bytes).
* slot leader's signature of the root of the merkle tree.
* merkle tree nodes along the branch the shred belongs to, where hashes
are trimmed to 20 bytes during tree construction.
This schema results in the same signature for all shreds within an
erasure batch.
When recovering shreds from erasure codes, we can reconstruct merkle
tree for the batch and for each recovered shred also recover respective
merkle tree branch; then snap the slot leader's signature from any of
the shreds received from turbine and retransmit all recovered code or
data shreds.
Backward compatibility is achieved by encoding shred variant at byte 65
of payload (previously shred-type at this position):
* 0b0101_1010 indicates a legacy coding shred, which is also equal to
ShredType::Code for backward compatibility.
* 0b1010_0101 indicates a legacy data shred, which is also equal to
ShredType::Data for backward compatibility.
* 0b0100_???? indicates a merkle coding shred with merkle branch size
indicated by the last 4 bits.
* 0b1000_???? indicates a merkle data shred with merkle branch size
indicated by the last 4 bits.
Merkle root and branch are encoded at the end of the shred payload.
#### Problem
When FIFO compaction is used, the size ratio between data shred and coding
shred is set to 1:1 based on the `--rocksdb_fifo_shred_storage_size` arg.
However, BlockstoreRocksFifoOptions::default() uses a slightly optimized
5:4 ratio instead, and the default() function is only used in benchmarks.
#### Summary of Changes
This PR makes both validator argument and BlockstoreRocksFifoOptions::default()
to use 1:1 ratio between data and coding shred size.
Packets are at the boundary of the system where, vast majority of the
time, they are received from an untrusted source. Raw indexing into the
data buffer can open attack vectors if the offsets are invalid.
Validating offsets beforehand is verbose and error prone.
The commit updates Packet::data() api to take a SliceIndex and always to
return an Option. The call-sites are so forced to explicitly handle the
case where the offsets are invalid.
In preparation of
https://github.com/solana-labs/solana/pull/25237
which adds a new shred variant with merkle tree branches, the commit
embeds versioning into shred binary by encoding a new ShredVariant type
at byte 65 of payload replacing previously ShredType at this offset.
enum ShredVariant {
LegacyCode, // 0b0101_1010
LegacyData, // 0b0101_1010
}
* 0b0101_1010 indicates a legacy coding shred, which is also equal to
ShredType::Code for backward compatibility.
* 0b1010_0101 indicates a legacy data shred, which is also equal to
ShredType::Data for backward compatibility.
Following commits will add merkle variants to this type:
enum ShredVariant {
LegacyCode, // 0b0101_1010
LegacyData, // 0b1010_0101
MerkleCode(/*proof_size:*/ u8), // 0b0100_????
MerkleData(/*proof_size:*/ u8), // 0b1000_????
}
Fix pre-check of blockstore slts during load_bank_forks. Now iterates from starting_slot to halt_slot via slot_meta.next_slots to confirm they are connected.
Data shreds are batched into MAX_DATA_SHREDS_PER_FEC_BLOCK shreds for
each erasure batch. If there are residual shreds not making a full
batch, then we cannot generate coding shreds and need to buffer shreds
until there is a full batch; This may add latency to coding shreds
generation and broadcast.
In order to evaluate upcoming changes removing this buffering logic,
this commit adds metrics tracking residual number of data shreds which
don't make a full batch.
#### Summary of Changes
Use the new datapoint macro that supports group-by for RocksDB column family metrics.
By using the new macro, we can further remove large chunks of boilerplate code that try to work around the previous datapoint macro that does not support group-by.
#### Problem
blockstore_db.rs has a mutual dependency between blockstore_metrics.rs.
#### Summary of Changes
This PR removes the mutual dependency by moving the option-related stuff
out from blockstore_db.rs to its new home --- blockstore_options.rs.
By doing this, we address the mutual dependency and also make the code cleaner.
Indices for code and data shreds of the same slot overlap; and so they
will have the same random number generator seed when shuffling cluster
nodes for turbine broadcast.
This results in the same propagation path for code and data shreds of
the same index and effectively smaller sample size for re-transmitter
nodes. For example a 32:32 batch (32 code + 32 data shreds), is
retransmitted through _at most_ 32 unique nodes, whereas ideally we want
~64 unique re-transmitters.
This commit adds shred-type to seed function so that code and data
sherds of the same (slot, index) will (most likely) have different
propagation paths.
Bytes past Packet.meta.size are not valid to read from.
The commit makes the buffer field private and instead provides two
methods:
* Packet::data() which returns an immutable reference to the underlying
buffer up to Packet.meta.size. The rest of the buffer is not valid to
read from.
* Packet::buffer_mut() which returns a mutable reference to the entirety
of the underlying buffer to write into. The caller is responsible to
update Packet.meta.size after writing to the buffer.
Upcoming changes to PacketBatch to support variable sized packets will
modify the internals of PacketBatch. So, this change removes usage of
the internal packet struct and instead uses accessors (which are
currently just wrappers of Vector functions but will change down the
road).
#### Problem
The current RocksDB read/write perf metrics do not include the total operation nanos
and thus we have to include all fields that might contribute to the total operation nanos.
#### Summary of Changes
This PR includes the total operation nanos in RocksDB's read/write perf and reduces the
number of reported fields in its perf metric.
#### Problem
When the number of RocksDB read/write operations spikes, its payload size
might exceed the limit (413 Payload Too Large).
#### Summary of Changes
This PR rate-limit the perf-sampling of RocksDB read/write operations by one second
in addition to the existing sampling that is configurable via the hidden validator
argument --rocksdb-perf-sample-interval.
Working towards revising shred struct to embed versioning so that a new
variant can contain merkle tree hashes of the erasure batch. To ease out
migration the commit adds more type-safety by distinguishing data vs
code shreds at the type level.
Additionally having both data and coding headers in each shred is
redundant as only one is relevant for each shred. The revised shred type
in this commit will only have one type-specific header.
https://github.com/solana-labs/solana/blob/c785f1ffc/ledger/src/shred.rs#L198-L203
Adding const_assert_eq:
* Documents explicitly what the constants are equal to.
* Prevents introducing bugs by silently changing the constants as the
code is updated.
Added multiple blockstore read threads.
Run the bigtable upload in tokio::spawn context.
Run bigtable tx and tx-by-addr uploads in tokio::spawn context.
#### Problem
After #25042, each LedgerColumn has its own BlockstoreRocksDbWritePerfMetrics
and BlockstoreRocksDbReadPerfMetrics instances. As it has total ownership,
its member field does not need to use Arc.
#### Summary of Changes
Change perf_samples_counter from Arc<AtomicUsize> to AtomicUsize
under BlockstoreRocksDbWritePerfMetrics and BlockstoreRocksDbReadPerfMetrics.
* Add ConfirmedBlockUploadConfig, no behavior changes
* Add comment
* A little DRY cleanup
* Add configurable limit to number of blocks to check in Blockstore and Bigtable before uploading
* Limit blockstore and bigtable look-ahead
* Exit iterator early when reach ending_slot
* Use rooted_slot_iterator instead of slot_meta_iterator
* Only check blocks in the ledger
* Added additional costs to block capacity computation, and pushed alloc of CostModel all the way to the top of the call chain, instead of reallocing
* Fix two compiler errors
* Update block processing to propagate computed costs, rather than re-computing deeper in the call stack
* Clippy fix
* Reformatting fix after merge
* Add CostModel::sum_without_bpf
#### Problem
LedgerColumnOptions contain two fields, perf_read_counter and perf_write_counter,
that are not really options but internal counters.
#### Summary of Changes
This PR introduces BlockstoreRocksDbPerfSamplingStatus, a struct that holds internal
status for RocksDB perf sampling and moves perf_read_counter and perf_write_counter
out from LedgerColumnOptions.
A VoteAccount may only wrap an account if the account owner is
solana_vote_program:id or equivalently this check returns true:
solana_vote_program::check_id(account.owner())
In addition to thread_local -> lazy_static change, a number of thread-pools are
initialized with get_max_thread_count to achieve parity with the older code in
terms of number of validator threads.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
Move BlockstoreRocksDbColumnFamilyMetrics to blockstore_metric.rs out from blockstore_db.rs.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
Move trait ColumnMetrics and metric-macros to blockstore_metric.rs out from blockstore_db.rs.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
This PR creates blockstore_metric.rs and moves metric-related functions out from blockstore_db.rs.
In preparation of the upcoming changes to shred struct, the added
hard-coded tests in this commit ensure that shreds are backward
compatible when serialized and de-serialized.
#### Summary of Changes
This PR replaces the use of thread_rng in RocksDB perf metric samples by
AtomicU32 with Ordering::Relaxed to improve the performance of determining
whether to sample the current RocksDB's read/write perf metric.
The extra wrapping and indirection by the Session struct is not used in
any form. The commit removes Session and instead uses Reed-Solomon
constructs directly.
#### Problem
Currently, the number of RocksDB perf samples is controlled by an env arg
which is later handled using a lazy_static variable. However, there is a known
performance overhead of using lazy_static as mentioned in
https://github.com/solana-labs/solana/pull/6472.
#### Summary of Changes
Instead, this PR uses a hidden validator argument, --rocksdb-perf-sample-interval,
for controlling how often RocksDB read/write performance sample is collected.
Removed Default implementation for ShredType. ShredType should always be
explicitly specified, and not rely on default values.
Simplified single-arg Shred Error variants to use shorter syntax.
Renamed erasure blocks to shards, to be consistent with reed_solomon
crate and not to confuse with FEC blocks.
#### Problem
The RocksDB wrapper,`Rocks`, under blockstore_db is currently implemented
as a tuple with unnamed fields. Accessing its fields requires syntax like `self.0`
which limits readability.
#### Summary of Changes
This PR converts Rocks from tuple to struct so that it has more human-readable
fields.
Shred::new_empty_data_shred returns an invalid shred (i.e.
shred.sanitize() returns error). The method is only used in tests and
can be easily replaced with Shred::new_from_data. To keep the shred api
surface small, this commit removes this method.
* nonblocking send when when droping banks
* detect and report drop signal queue full/disconnect events
* comments
* use counter for reporting bank_drop_queue events
* reduce log
* use datapoint to report stats
* logging instead of reporting bank drop signal full
* fix a corner case for reporting
* fix build
#### Problem
Currently, even if SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K == 0, we are still doing
the sampling check for every RocksDB read.
```
thread_rng().gen_range(0, METRIC_SAMPLES_1K) > *ROCKSDB_PERF_CONTEXT_SAMPLES_IN_1K
```
#### Summary of Changes
This PR skips the sampling check when SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K
is set to 0.
#### Summary of Changes
This PR enables perf metrics reporting for RocksDB deletes.
Samples are reported under "blockstore_rocksdb_write_perf" with op=delete
The sampling rate is still controlled by env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K
and its default to 10 (meaning we report 10 in 1000 perf samples).
#### Summary of Changes
This PR enables perf metrics reporting for RocksDB write-batches.
Samples are reported under "blockstore_rocksdb_write_perf" with op=write_batch
Its cf_name tag is set to "write_batch" as well as each write-batch could include multiple column families.
The sampling rate is still controlled by env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K
and its default to 10 (meaning we report 10 in 1000 perf samples).
#### Summary of Changes
This PR implements the reporting of RocksDB write perf metrics to blockstore_rocksdb_write_perf
based on RocksDB's PerfContext. The default sample rate is 10 in 1000, and the env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K can control the sample rate.
Previously, the metric reporting functions are implemented under LedgerColumnMetric.
However, there're operations like write batch which is issued by the function inside Rocks.
This PR moves reporting functions to its own dedicate mod so that both LedgerColumn and
Rocks can report column perf metrics.
#### Summary of Changes
This PR enables RocksDB read side performance metrics to report to blockstore_rocksdb_read_perf.
The sampling rate is controlled by an env arg `SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K`,
specifies the number of perf samples for every 1000 operations. The default value is set to 10, meaning
we will report 10 out of 1000 (or 1/100) reads.
The metrics are based on the RocksDB [PerfContext](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/perf_context.h).
It includes many useful metrics including block read time, cache hit rate, and time spent on decompressing the block.
* initial work for poh timing report service
* add poh_timing_report_service to validator
* fix comments
* clippy
* imrove test coverage
* delete record when complete
* rename shred full to slot full.
* debug logging
* fix slot full
* remove debug comments
* adding fmt trait
* derive default
* default for poh timing reporter
* better comments
* remove commented code
* fix test
* more test fixes
* delete timestamps for slot that are older than root_slot
* debug log
* record poh start end in bank reset
* report full to start time instead
* fix poh slot offset
* report poh start for normal ticks
* fix typo
* refactor out poh point report fn
* rename
* optimize delete - delete only when last_root changed
* change log level to trace
* convert if to match
* remove redudant check
* fix SlotPohTiming comments
* review feedback on poh timing reporter
* review feedback on poh_recorder
* add test case for out-of-order arrival of timing points and incomplete timing points
* refactor poh_timing_points into its own mod
* remove option for poh_timing_report service
* move poh_timing_point_sender to constructor
* clippy
* better comments
* more clippy
* more clippy
* add slot poh timing point macro
* clippy
* assert in test
* comments and display fmt
* fix check
* assert format
* revise comments
* refactor
* extrac send fn
* revert reporting_poh_timing_point
* align loggin
* small refactor
* move type declaration to the top of the module
* replace macro with constructor
* clippy: remove redundant closure
* review comments
* simplify poh timing point creation
Co-authored-by: Haoran Yi <hyi@Haorans-MacBook-Air.local>
Now that nodes correctly populate position field in coding shreds, and
first_coding_index in erasure meta, the old code to maintain backward
compatibility can be removed.
The commit is working towards changing erasure coding schema to 32:64.
Current slot stats are removed when the slot is full or every 30 seconds
if the slot is before root:
https://github.com/solana-labs/solana/blob/493a8e234/ledger/src/blockstore.rs#L2017-L2027
In order to track if the slot is ultimately marked as dead or rooted and
emit more metrics, this commit expands lifetime of SlotStats while
bounding total size of cache using an LRU eviction policy.
This PR does a refactoring on column family-related metrics reporting.
As the metric reporting is per column family basis, the PR creates
ColumnMetrics trait and move the metric reporting logic into it.
This refactoring will make future column metric reporting (such as
read PerfContext) much cleaner.
* transaction-status: Add return data to meta
* Add return data to simulation results
* Use pretty-hex for printing return data
* Update arg name, make TransactionRecord struct
* Rename TransactionRecord -> ExecutionRecord
This PR adds `--rocksdb-ledger-compression` as a hidden argument to the validator
for specifying the compression algorithm for TransactionStatus. Available compression
algorithms include `lz4`, `snappy`, `zlib`. The default value is `none`.
Experimental results show that with lz4 compression, we can achieve ~37% size-reduction
on the TransactionStatus column family, or ~8% size-reduction of the ledger store size.
This PR renames BlockstoreAdvancedOptions to LedgerColumnOptions, as we will
pass-down this struct to LedgerColumn to allow it to perform metric reporting.
As we start adding more options into BlockstoreOptions, it's better to allow
new_cf_descriptor to take the reference to BlockstoreOptions so that
we can avoid future function API changes on new_cf_descriptor.
Creating a new ledger implicitly means that no other process could have
previously held access to it. Additionally, creating a new ledger
implicitly requires writing, so it follows that Primary access is
required and we can drop access type as an argument.
#### Summary of Changes
This PR further enables group by operation on storage type in blockstore_rocksdb_cfs metrics.
Such group-by allows us to further compare the performance metrics between rocks-level and
rocks-fifo.
To make things extensible, this PR introduces BlockstoreAdvancedOptions and move shred_storage_type.
All fields in BlockstoreAdvancedOptions will support group-by operation in blockstore_rocksdb_cfs.
Dependency: #23580
* Rename excludes_from_compaction to should_exclude_from_compaction
* Make subfunction to create all cf descriptors
* Condense logic for when to disable compactions
This PR enables blockstore to periodically report RocksDB column family properties.
The reported properties are under blockstore_rocksdb_cfs, and the properties also
support group by operation on cf_name.
#### Summary of Changes
This PR adds two hidden arguments to the validator that allow users to use RocksDB's FIFO compaction for storing shreds.
--shred-storage <SHRED_STORAGE>
EXPERIMENTAL: Controls how RocksDB compacts shreds. *WARNING*: You will lose your ledger data
when you switch between options. Possible values are: 'level': stores shreds using RocksDB's default (level)
compaction. 'fifo': stores shreds under RocksDB's FIFO compaction. This option is more efficient on
disk-write-bytes of the ledger store. [default: level] [possible values: level, fifo]
--shred-storage-size <SHRED_STORAGE_SIZE_BYTES>
The shred storage size in bytes. The suggested value is 50% of your ledger storage size in bytes. [default:
268435456000]
#### Summary of Changes
To avoid mixing the use of different shred storage types, each shred storage type
will have its blockstore in a different directory.
This PR still keeps the RocksFifo setting hidden. The default ShredStorageType and
blockstore directory are still RocksLevel and `rocksdb`.
Will follow-up with PRs on making FIFO option public in ledger-tool and validator.
#### Test Plan
* Added a new test to verify the existence of `rocksdb-fifo` directory when FIFO compaction is used.
* Updated existing test to verify the current setting still store ledger under `rocksdb` directory.
* Manually ran ledger_cleanup_test with both level and fifo compaction and verified the resulting ledger.
* Ran a validator with this PR.
* Bump first-available block to first complete block
* Remove obsolete purges in tests (PrimaryIndex toggling no longer in use
* Check first-available block in Rpc check_slot_cleaned_up
* Add failing test for precompile transition
* Skip adding builtins if they will be removed
* cargo clean
* nits
* fix abi check
* remove workaround
Co-authored-by: Jack May <jack@solana.com>
Transaction logs are not being saved to the database through the plugin interface.
Summary of Changes
Retain the transaction logs when transaction notification plugin is loaded.
Fixes #
lijunwangs/solana-accountsdb-plugin-postgres#6
* Refactor Bank::load_and_execute_transactions
* Refactor: improve type safety of TransactionExecutionResult
* Add enum for extra type safety in execution results
* feedback
* get_signatures_for_address does not correctly account for result sets that span Blockstore and Bigtable.
This causes Bigtable to return `RowNotFound` until the new tx is uploaded.
Check that `before` exists in Bigtable, and if not, set it to `None` to return the full data set.
References #21442Closes#22110
* Differentiate between before sig not found and no newer signatures
* Dedupe bigtable results to account for potential upload race
Co-authored-by: Tyera Eulberg <tyera@solana.com>
The indices for erasure coding shreds are tied to data shreds:
https://github.com/solana-labs/solana/blob/90f41fd9b/ledger/src/shred.rs#L921
However with the upcoming changes to erasure schema, there will be more
erasure coding shreds than data shreds and we can no longer infer coding
shreds indices from data shreds.
The commit adds constructs to track coding shreds indices explicitly.
next-shred-index is already readily available from returned data shreds.
The commit simplifies the api for upcoming changes to erasure coding
schema which will require explicit tracking of indices for coding shreds
as well as data shreds.
The number of shreds that result from a given number of entries is
variable and in our test case, somewhat unintuitive to think about when
trying to determine how much data we're pushing into the blockstore. So,
this change converts the unit of test parameters from entries to shreds.
This change also cleans up some variable naming for clarity and prints.
SlotMeta.last_index may be unknown and current code is using u64::MAX to
indicate that:
https://github.com/solana-labs/solana/blob/6c108c8fc/ledger/src/blockstore_meta.rs#L169-L174
This lacks type-safety and can introduce bugs if not always checked for
Several instances of slot_meta.last_index + 1 are also subject to
overflow.
This commit updates the type to Option<u64>. Backward compatibility is
maintained by customizing serde serialize/deserialize implementations.
https://github.com/solana-labs/solana/pull/16646
removed first_coding_index since the field is currently redundant and
always equal to fec_set_index.
However, with upcoming changes to erasure coding schema, this will no
longer be the same as fec_set_index and so requires a separate field to
represent.