### Problem
Documentation for each column family is missing.
### Summary
The goal is to add a comment block for each column family that gives a high-level
description of what it stores and documents its key/value format.
This PR is a first cut that covers the key/value format of each column family.
That should at least give readers an easy pointer: search for the value type to
see what the column family stores, and use the key type to see how to access
the data.
Several of the get() methods return a deserialized object (as opposed to
a Vec<u8>) by first getting a byte array out of Rocks, and then using
bincode::deserialize() to get the underlying type. However,
deserialize() only requires a u8 slice, not an owned Vec<u8>. So, we can
use get_pinned_cf() to reference memory owned by Rocks and avoid an
unnecessary copy.
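
A minimal sketch of the idea, using only the standard library. `PinnedStore`, `get_owned()`, `get_pinned()`, `decode_u64()`, and `get_decoded()` are hypothetical stand-ins for the Rocks wrapper, a copying get, get_pinned_cf(), and bincode::deserialize(); none of them is the real API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the RocksDB wrapper; not the real API.
pub struct PinnedStore {
    data: HashMap<u64, Vec<u8>>,
}

impl PinnedStore {
    pub fn new() -> Self {
        Self { data: HashMap::new() }
    }

    pub fn put(&mut self, key: u64, value: &[u8]) {
        self.data.insert(key, value.to_vec());
    }

    /// Analogous to a copying get: clones the bytes into a new Vec<u8>.
    pub fn get_owned(&self, key: u64) -> Option<Vec<u8>> {
        self.data.get(&key).cloned()
    }

    /// Analogous to a pinned get: borrows the bytes owned by the store.
    pub fn get_pinned(&self, key: u64) -> Option<&[u8]> {
        self.data.get(&key).map(|v| v.as_slice())
    }
}

/// Stand-in for deserialization: it only needs a byte slice.
pub fn decode_u64(bytes: &[u8]) -> Option<u64> {
    if bytes.len() != 8 {
        return None;
    }
    let mut array = [0u8; 8];
    array.copy_from_slice(bytes);
    Some(u64::from_le_bytes(array))
}

/// Reads and decodes a value without the intermediate copy.
pub fn get_decoded(store: &PinnedStore, key: u64) -> Option<u64> {
    store.get_pinned(key).and_then(decode_u64)
}
```

Because the decoder accepts `&[u8]`, the owned and the borrowed path decode to the same value; only the borrowed path skips the allocation.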
#### Problem
Before #26651, LedgerCleanupService relied on RocksDB background compactions
to reclaim ledger disk space via our custom CompactionFilter.
However, because RocksDB's compaction cannot tell which files are worth picking,
we relied on a 1-day compaction period that forces every file to be compacted
once a day so that ledger disk space is reclaimed in time. The downside is that
every ledger file is rewritten once per day.
#### Summary of Changes
Because #26651 makes LedgerCleanupService actively delete files whose entire slot-range
is older than both --limit-ledger-size and the current root, we can remove the 1-day compaction
period and eliminate the daily ledger file rewrite.
Results on mainnet-beta show that this PR reduces write-bytes-per-second by ~20%
and read-bytes-per-second by ~50% on the ledger disk.
#### Problem
Blockstore operations such as get_slots_since() issue multiple rocksdb::get()
calls at once, which is not optimal for performance.
#### Summary of Changes
This PR adds LedgerColumn::multi_get() based on rocksdb::batched_multi_get(),
the optimized version of multi_get() where get requests are processed in batch
to minimize read I/O.
Add ledger-tool command print-file-metadata
#### Summary of Changes
This PR adds a ledger tool subcommand print-file-metadata.
```
USAGE:
solana-ledger-tool print-file-metadata [FLAGS] [OPTIONS] [SST_FILE_NAME]
Prints the metadata of the specified ledger-store file.
If no file name is specified, it prints the metadata of all ledger files.
```
#### Problem
RocksDB's delete_range applies to [from, to) while delete_file_in_range
applies to [from, to] by default, and the rust-rocksdb api does not include
the option to make delete_file_in_range apply to [from, to). Such inconsistency
might cause `blockstore::run_purge` to produce an inconsistent result as it
invokes both delete_range and delete_file_in_range.
#### Summary of Changes
This PR makes all of our purge- and delete-related functions inclusive
of both the starting and ending slots.
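
The off-by-one between the two conventions can be illustrated with a small sketch. `inclusive_to_half_open()` and `half_open_contains()` are hypothetical helpers written for this example; they are not functions from blockstore or rust-rocksdb.

```rust
/// Converts an inclusive slot range [from, to] into the half-open
/// range [from, to + 1) expected by a delete_range-style API.
/// Returns None when `to` is u64::MAX, where no exclusive end exists.
pub fn inclusive_to_half_open(from: u64, to: u64) -> Option<(u64, u64)> {
    to.checked_add(1).map(|end| (from, end))
}

/// Membership test for a half-open range (start, end) meaning [start, end).
pub fn half_open_contains(range: (u64, u64), slot: u64) -> bool {
    range.0 <= slot && slot < range.1
}
```

Passing `(from, to)` unchanged to a half-open API would silently skip slot `to`, which is the kind of inconsistency described above.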
#### Problem
LedgerCleanupService requires compactions to propagate & digest range-delete tombstones
to eventually reclaim disk space.
#### Summary of Changes
This PR makes LedgerCleanupService::cleanup_ledger delete any file whose slot-range is
older than the lowest_cleanup_slot. This allows us to reclaim disk space more often with
fewer IOps. Experimental results on mainnet validators show that the PR can
reduce ledger disk size by 33% to 40%.
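
A minimal sketch of the selection rule, assuming "older than" means that the file's highest slot is strictly below lowest_cleanup_slot. `FileSlotRange` and `files_to_delete()` are hypothetical names for this example and do not reflect the actual implementation.

```rust
/// Hypothetical per-file metadata: the slot range covered by one ledger file.
#[derive(Debug, Clone, PartialEq)]
pub struct FileSlotRange {
    pub name: String,
    pub min_slot: u64,
    pub max_slot: u64,
}

/// Returns the names of files whose entire slot range lies below
/// `lowest_cleanup_slot`. Files that straddle the boundary are kept.
pub fn files_to_delete(files: &[FileSlotRange], lowest_cleanup_slot: u64) -> Vec<String> {
    files
        .iter()
        .filter(|f| f.max_slot < lowest_cleanup_slot)
        .map(|f| f.name.clone())
        .collect()
}
```

Dropping a whole file this way needs no compaction pass, which is where the reduction in IOps comes from.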
#### Summary of Changes
Define PERF_METRIC_OP_NAME_PUT and PERF_METRIC_OP_NAME_WRITE_BATCH
to replace repetitive / hard-coded operation names for report_rocksdb_write_perf.
#### Problem
report_rocksdb_read_perf() always uses the hard-coded operation name "get".
#### Summary of Changes
Because we will add a new read operation, multi_get(), report_rocksdb_read_perf()
needs an input parameter for the operation name.
#### Problem
When FIFO compaction is used, the size ratio between data shred and coding
shred is set to 1:1 based on the `--rocksdb_fifo_shred_storage_size` arg.
However, BlockstoreRocksFifoOptions::default() uses a slightly optimized
5:4 ratio instead, and the default() function is only used in benchmarks.
#### Summary of Changes
This PR makes both the validator argument and BlockstoreRocksFifoOptions::default()
use a 1:1 ratio between data and coding shred size.
#### Summary of Changes
Use the new datapoint macro that supports group-by for RocksDB column family metrics.
By using the new macro, we can further remove large chunks of boilerplate code that try to work around the previous datapoint macro that does not support group-by.
#### Problem
blockstore_db.rs and blockstore_metrics.rs depend on each other.
#### Summary of Changes
This PR removes the mutual dependency by moving the option-related code
out of blockstore_db.rs into a new file, blockstore_options.rs.
This resolves the mutual dependency and also makes the code cleaner.
#### Problem
The current RocksDB read/write perf metrics do not include the total operation nanos
and thus we have to include all fields that might contribute to the total operation nanos.
#### Summary of Changes
This PR includes the total operation nanos in RocksDB's read/write perf and reduces the
number of reported fields in its perf metric.
#### Problem
When the number of RocksDB read/write operations spikes, the size of the
reported metrics payload might exceed the limit (413 Payload Too Large).
#### Summary of Changes
This PR rate-limits the perf-sampling of RocksDB read/write operations to a
one-second interval, in addition to the existing sampling that is configurable
via the hidden validator argument --rocksdb-perf-sample-interval.
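
A minimal sketch of time-based rate limiting, assuming the limit means that two samples must be at least one interval apart. `SampleLimiter` and `try_sample()` are hypothetical names; the sketch takes `&mut self` for simplicity and is not how the PR implements it.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limiter that allows at most one sample per `min_interval`.
pub struct SampleLimiter {
    min_interval: Duration,
    last_sample: Option<Instant>,
}

impl SampleLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sample: None }
    }

    /// Returns true if a sample may be taken at `now`, and records it.
    /// The caller passes `now` explicitly so the logic is easy to test.
    pub fn try_sample(&mut self, now: Instant) -> bool {
        match self.last_sample {
            // Too soon after the previous sample: skip this one.
            Some(last) if now.saturating_duration_since(last) < self.min_interval => false,
            // First sample, or enough time has passed.
            _ => {
                self.last_sample = Some(now);
                true
            }
        }
    }
}
```

With a one-second interval, a burst of operations yields at most one sample per second regardless of how many operations pass the counter-based sampling check.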
#### Problem
After #25042, each LedgerColumn has its own BlockstoreRocksDbWritePerfMetrics
and BlockstoreRocksDbReadPerfMetrics instances. Because each instance is solely
owned, its member field does not need to use Arc.
#### Summary of Changes
Change perf_samples_counter from Arc<AtomicUsize> to AtomicUsize
under BlockstoreRocksDbWritePerfMetrics and BlockstoreRocksDbReadPerfMetrics.
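
A minimal sketch of why the Arc is unnecessary: an atomic can be updated through a shared reference, so a solely owned struct can hold it directly. `PerfMetricsSketch` and `record_sample()` are hypothetical simplifications, not the real metrics structs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified stand-in for a per-column perf metrics struct.
pub struct PerfMetricsSketch {
    // Held directly: no Arc is needed because nothing else shares it.
    perf_samples_counter: AtomicUsize,
}

impl PerfMetricsSketch {
    pub fn new() -> Self {
        Self { perf_samples_counter: AtomicUsize::new(0) }
    }

    /// Increments the counter through `&self` and returns the new count.
    pub fn record_sample(&self) -> usize {
        self.perf_samples_counter.fetch_add(1, Ordering::Relaxed) + 1
    }
}
```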
#### Problem
LedgerColumnOptions contains two fields, perf_read_counter and perf_write_counter,
that are not really options but internal counters.
#### Summary of Changes
This PR introduces BlockstoreRocksDbPerfSamplingStatus, a struct that holds internal
status for RocksDB perf sampling and moves perf_read_counter and perf_write_counter
out from LedgerColumnOptions.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
Move BlockstoreRocksDbColumnFamilyMetrics out of blockstore_db.rs into blockstore_metric.rs.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
Move the ColumnMetrics trait and the metric macros out of blockstore_db.rs into blockstore_metric.rs.
#### Problem
blockstore_db.rs becomes bigger.
#### Summary of Changes
This PR creates blockstore_metric.rs and moves metric-related functions out of blockstore_db.rs.
#### Summary of Changes
This PR replaces the use of thread_rng in RocksDB perf metric sampling with an
AtomicU32 accessed with Ordering::Relaxed, which makes it cheaper to decide
whether to sample the current RocksDB read/write perf metric.
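
A minimal sketch of counter-based sampling, assuming one sample is taken every `interval` operations and that an interval of 0 disables sampling. `PerfSampler` and `should_sample()` are hypothetical names, not the actual blockstore code.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical sampler that picks one operation out of every `interval`.
pub struct PerfSampler {
    interval: u32,
    counter: AtomicU32,
}

impl PerfSampler {
    pub fn new(interval: u32) -> Self {
        Self { interval, counter: AtomicU32::new(0) }
    }

    /// Returns true when the current operation should be sampled.
    pub fn should_sample(&self) -> bool {
        // Assumption: an interval of 0 means sampling is turned off.
        if self.interval == 0 {
            return false;
        }
        // Relaxed is enough: only the count matters, not ordering
        // relative to other memory operations. fetch_add wraps on overflow.
        let count = self.counter.fetch_add(1, Ordering::Relaxed);
        count % self.interval == 0
    }
}
```

Compared with drawing a random number per operation, a relaxed atomic increment is a single cheap instruction on most platforms, at the cost of deterministic rather than random sample positions.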
#### Problem
Currently, the number of RocksDB perf samples is controlled by an env arg
which is later handled using a lazy_static variable. However, there is a known
performance overhead of using lazy_static as mentioned in
https://github.com/solana-labs/solana/pull/6472.
#### Summary of Changes
Instead, this PR uses a hidden validator argument, --rocksdb-perf-sample-interval,
to control how often RocksDB read/write performance samples are collected.
#### Problem
The RocksDB wrapper, `Rocks`, under blockstore_db is currently implemented
as a tuple struct with unnamed fields. Accessing its fields requires syntax like
`self.0`, which hurts readability.
#### Summary of Changes
This PR converts Rocks from a tuple struct to a struct with named fields so that
its fields are more human-readable.
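
The readability difference can be sketched as follows. The field names `path` and `oldest_slot` and both struct names are hypothetical and chosen only for illustration; they are not the actual fields of Rocks.

```rust
/// Before: a tuple struct; callers must remember what `.0` and `.1` mean.
pub struct RocksTupleSketch(pub String, pub u64);

/// After: the same data with named fields.
pub struct RocksNamedSketch {
    pub path: String,
    pub oldest_slot: u64,
}

pub fn describe_tuple(rocks: &RocksTupleSketch) -> String {
    format!("{} (oldest slot {})", rocks.0, rocks.1)
}

pub fn describe_named(rocks: &RocksNamedSketch) -> String {
    format!("{} (oldest slot {})", rocks.path, rocks.oldest_slot)
}
```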
#### Problem
Currently, even if SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K == 0, we are still doing
the sampling check for every RocksDB read.
```
thread_rng().gen_range(0, METRIC_SAMPLES_1K) > *ROCKSDB_PERF_CONTEXT_SAMPLES_IN_1K
```
#### Summary of Changes
This PR skips the sampling check when SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K
is set to 0.
#### Summary of Changes
This PR enables perf metrics reporting for RocksDB deletes.
Samples are reported under "blockstore_rocksdb_write_perf" with op=delete.
The sampling rate is still controlled by the env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K,
which defaults to 10 (meaning we report 10 in 1000 perf samples).
#### Summary of Changes
This PR enables perf metrics reporting for RocksDB write-batches.
Samples are reported under "blockstore_rocksdb_write_perf" with op=write_batch.
The cf_name tag is also set to "write_batch", because a write-batch can include multiple column families.
The sampling rate is still controlled by the env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K,
which defaults to 10 (meaning we report 10 in 1000 perf samples).
#### Summary of Changes
This PR implements the reporting of RocksDB write perf metrics to blockstore_rocksdb_write_perf
based on RocksDB's PerfContext. The default sample rate is 10 in 1000, and the env arg SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K can control the sample rate.
Previously, the metric reporting functions were implemented under LedgerColumnMetric.
However, some operations, such as write batch, are issued by functions inside Rocks.
This PR moves the reporting functions into their own dedicated mod so that both LedgerColumn and
Rocks can report column perf metrics.
#### Summary of Changes
This PR enables reporting of RocksDB read-side performance metrics to blockstore_rocksdb_read_perf.
The sampling rate is controlled by the env arg `SOLANA_METRICS_ROCKSDB_PERF_SAMPLES_IN_1K`,
which specifies the number of perf samples per 1000 operations. The default value is 10, meaning
we report 10 out of every 1000 (1 in 100) reads.
The metrics are based on the RocksDB [PerfContext](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/perf_context.h).
It provides many useful metrics, including block read time, cache hit rate, and time spent decompressing blocks.
This PR refactors the reporting of column family-related metrics.
Because metrics are reported on a per-column-family basis, the PR creates
the ColumnMetrics trait and moves the metric reporting logic into it.
This refactoring will make future column metric reporting (such as
read PerfContext) much cleaner.
This PR adds `--rocksdb-ledger-compression` as a hidden argument to the validator
for specifying the compression algorithm for TransactionStatus. Available compression
algorithms are `lz4`, `snappy`, and `zlib`. The default value is `none`.
Experimental results show that with lz4 compression, we can achieve ~37% size-reduction
on the TransactionStatus column family, or ~8% size-reduction of the ledger store size.
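
A minimal sketch of how the four accepted values could be mapped to an enum. `CompressionChoice` and `parse_compression()` are hypothetical names for this example; only the string values come from the description above.

```rust
/// Hypothetical enum for the values accepted by the compression argument.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CompressionChoice {
    NoCompression,
    Snappy,
    Lz4,
    Zlib,
}

impl Default for CompressionChoice {
    /// The description above states that the default value is `none`.
    fn default() -> Self {
        CompressionChoice::NoCompression
    }
}

/// Parses the argument value, rejecting anything outside the known set.
pub fn parse_compression(value: &str) -> Result<CompressionChoice, String> {
    match value {
        "none" => Ok(CompressionChoice::NoCompression),
        "snappy" => Ok(CompressionChoice::Snappy),
        "lz4" => Ok(CompressionChoice::Lz4),
        "zlib" => Ok(CompressionChoice::Zlib),
        other => Err(format!("unsupported compression type: {}", other)),
    }
}
```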
This PR renames BlockstoreAdvancedOptions to LedgerColumnOptions, as we will
pass this struct down to LedgerColumn to allow it to perform metric reporting.
As we start adding more options to BlockstoreOptions, it's better to have
new_cf_descriptor take a reference to BlockstoreOptions so that
we can avoid future API changes to new_cf_descriptor.
#### Summary of Changes
This PR further enables group-by on storage type in the blockstore_rocksdb_cfs metrics.
This group-by allows us to compare performance metrics between rocks-level and
rocks-fifo.
To make things extensible, this PR introduces BlockstoreAdvancedOptions and moves shred_storage_type into it.
All fields in BlockstoreAdvancedOptions will support group-by in blockstore_rocksdb_cfs.
Dependency: #23580
* Rename excludes_from_compaction to should_exclude_from_compaction
* Make subfunction to create all cf descriptors
* Condense logic for when to disable compactions
This PR enables blockstore to periodically report RocksDB column family properties.
The properties are reported under blockstore_rocksdb_cfs and
support group-by on cf_name.