* Add fully-reproducible online tracer for banking
* Don't use eprintln!()...
* Update programs/sbf/Cargo.lock...
* Remove meaningless assert_eq
* Group test-only code under aptly named mod
* Remove needless overflow handling in receive_until
* Delay stat aggregation as it's possible now
* Use Cow to avoid needless heap allocs
* Properly consume metrics action as soon as hold
* Trace UnprocessedTransactionStorage::len() instead
* Loosen joining api over type safety for replaystage
* Introduce hash event to override these when simulating
* Use serde_with/serde_as instead of hacky workaround
* Update another Cargo.lock...
* Add detailed comment for Packet::buffer serialize
* Rename sender_overhead_minimized_receiver_loop()
* Use type inference for TraceError
* Another minor rename
* Retire now useless ForEach to simplify code
* Use type alias as much as possible
* Properly translate and propagate tracing errors
* Clarify --enable-banking-trace with better naming
* Consider unclean (signal-based) node restarts..
* Tweak logging and cli
* Remove Bank events as it's not needed anymore
* Make tpu own banking tracer thread
* Reduce diff a bit..
* Use latest serde_with
* Finally use the published rolling-file crate
* Make test code change more consistent
* Revive dead and non-terminating test code path...
* Dispose batches early now that possible
* Split off thread handle very early at ::new()
* Tweak message for TooSmallDirByteLimit
* Remove too much of indirection
* Remove needless pub from ::channel()
* Clarify test comments
* Avoid needless event creation if tracer is disabled
* Write tests around file rotation and spill-over
* Remove unneeded PathBuf::clone()s...
* Introduce inner struct instead of tuple...
* Remove unused enum BankStatus...
* Avoid .unwrap() for the case of disabled tracer...
* introduce workspace.package
* introduce workspace.dependencies
* read version from root cargo.toml
* pass check when version = { workspace = true }
* don't bump version when version = { workspace = true }
* including workspace Cargo.toml when bump version
* programs/sbf use workspace inheritance
* fix increasing cargo version ignore program/sbf/Cargo.toml
Currently, the cleanup service counts the number of shreds in the
database by iterating the entire SlotMeta column and reading the number
of received shreds for each slot. This gives us a fairly accurate count
at the expense of performing a good amount of IO.
Instead of counting the individual slots, use the live_files()
rust-rocksdb entrypoint that we expose in Blockstore. This API allows us
to get the number of entries (shreds) in the data shred column family by
reading file metadata. This is much more efficient from an IO perspective.
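
A minimal sketch of the metadata-based count, assuming each live file reports its column family name and entry count. The `FileMetadata` struct, its field names, and the column name strings are hypothetical stand-ins and not the actual Blockstore or rust-rocksdb types.

```rust
/// Hypothetical stand-in for the per-file metadata that a
/// live_files()-style call returns; field names are assumptions.
#[derive(Debug, Clone)]
pub struct FileMetadata {
    pub column_family_name: String,
    pub num_entries: u64,
}

/// Sums the entry counts of all files that belong to one column family.
/// No keys or values are read; only the file metadata is consulted.
pub fn count_entries_in_column(files: &[FileMetadata], column: &str) -> u64 {
    files
        .iter()
        .filter(|file| file.column_family_name == column)
        .map(|file| file.num_entries)
        .sum()
}
```

The count is only as precise as the file metadata, which is the trade-off the commit accepts in exchange for avoiding a full column scan.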
* Increase turbine propagation const
Value is used as a delay threshold for issuing shred repairs, and analysis shows we are overly aggressive in requesting repairs. Shreds show up via turbine before the repair completes the vast majority of the time.
* Use Duration type for MAX_TURBINE_PROPAGATION
Store non-vote transaction counts that are now recorded by the banks
into the `blockstore`.
`SamplePerformanceService` now populates `PerfSampleV2` with counts from
the banks.
{verify,sign}_shreds_gpu need to point to offsets within the packets for
the signed data. For merkle shreds this signed data is the merkle root
of the erasure batch and this would necessitate embedding the merkle
roots in the shreds payload.
However this is wasteful and reduces shreds capacity to store data
because the merkle root can already be recovered from the encoded merkle
proof.
Instead of pointing to offsets within the shreds payload, this commit
recovers merkle roots from the merkle proofs and stores them in an
allocated buffer. {verify,sign}_shreds_gpu would then point to offsets
within this new buffer for the respective signed data.
This would unblock us from removing merkle roots from shreds payload
which would save capacity to send more data with each shred.
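
A rough sketch of the idea, under loud assumptions: the hash below is the standard library's `DefaultHasher` used as a toy stand-in (the real code uses a cryptographic hash), nodes are 8 bytes wide, and all function names are hypothetical. It shows a root being recomputed from a leaf and its proof, and the recovered roots being packed into a separate buffer with one offset per shred.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Toy 8-byte tree node; an assumption for illustration only.
pub type Node = u64;

/// Toy leaf hash with a domain-separation prefix.
pub fn hash_leaf(data: &[u8]) -> Node {
    let mut hasher = DefaultHasher::new();
    hasher.write(&[0u8]);
    hasher.write(data);
    hasher.finish()
}

/// Toy hash of two child nodes into their parent.
pub fn hash_pair(left: Node, right: Node) -> Node {
    let mut hasher = DefaultHasher::new();
    hasher.write(&[1u8]);
    hasher.write(&left.to_le_bytes());
    hasher.write(&right.to_le_bytes());
    hasher.finish()
}

/// Recomputes the root from a leaf, its index, and the sibling nodes
/// listed bottom-up in the proof.
pub fn recover_root(leaf: Node, index: usize, proof: &[Node]) -> Node {
    let mut node = leaf;
    let mut index = index;
    for &sibling in proof {
        node = if index % 2 == 0 {
            hash_pair(node, sibling)
        } else {
            hash_pair(sibling, node)
        };
        index /= 2;
    }
    node
}

/// Packs recovered roots into one contiguous buffer and returns the byte
/// offset of each root, so that a verifier can point into this buffer
/// instead of into the shred payloads.
pub fn pack_roots(roots: &[Node]) -> (Vec<u8>, Vec<usize>) {
    let mut buffer = Vec::with_capacity(roots.len() * 8);
    let mut offsets = Vec::with_capacity(roots.len());
    for root in roots {
        offsets.push(buffer.len());
        buffer.extend_from_slice(&root.to_le_bytes());
    }
    (buffer, offsets)
}
```

Because the root never has to live inside the payload, the bytes it would occupy stay available for data.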
The commit adds an associated SignedData type to Shred trait so that
merkle and legacy shreds can return different types for signed_data
method.
This would allow legacy shreds to point to a section of the shred
payload, whereas merkle shreds would compute and return the merkle root.
Ultimately this would allow removing the merkle root from the shreds
binary.
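
The shape of such a trait might look like the sketch below. The struct layouts, the 64-byte signature prefix, the 8-byte toy root, and the `toy_root` helper are assumptions for illustration; only the associated `SignedData` type and the `signed_data` method come from the description above.

```rust
/// Assumed size of the signature prefix in a legacy payload.
pub const SIGNATURE_SIZE: usize = 64;

/// Each shred type chooses what its signed data looks like.
pub trait Shred<'a> {
    type SignedData: AsRef<[u8]>;
    fn signed_data(&'a self) -> Self::SignedData;
}

pub struct LegacyShred {
    pub payload: Vec<u8>,
}

pub struct MerkleShred {
    pub payload: Vec<u8>,
}

impl<'a> Shred<'a> for LegacyShred {
    // Legacy shreds borrow a section of their own payload.
    type SignedData = &'a [u8];
    fn signed_data(&'a self) -> Self::SignedData {
        self.payload.get(SIGNATURE_SIZE..).unwrap_or(&[])
    }
}

impl<'a> Shred<'a> for MerkleShred {
    // Merkle shreds compute an owned value (a toy 8-byte "root" here).
    type SignedData = [u8; 8];
    fn signed_data(&'a self) -> Self::SignedData {
        toy_root(&self.payload)
    }
}

/// Placeholder for real merkle root recovery: an FNV-1a style fold.
fn toy_root(bytes: &[u8]) -> [u8; 8] {
    let mut acc: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in bytes {
        acc ^= u64::from(byte);
        acc = acc.wrapping_mul(0x0000_0100_0000_01b3);
    }
    acc.to_le_bytes()
}

/// Generic callers only rely on `AsRef<[u8]>`.
pub fn signed_len<'a, S: Shred<'a>>(shred: &'a S) -> usize {
    shred.signed_data().as_ref().len()
}
```

The `AsRef<[u8]>` bound lets signing and verification code treat a borrowed slice and an owned array uniformly.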
Merkle shreds within the same erasure batch have the same merkle root.
The root of the merkle tree is signed. So either the signatures match
or one fails sigverify, and the comparison of merkle roots is redundant.
If data is empty, make_shreds_from_data will now return one data shred
with empty data. This preserves invariants verified in tests regardless
of data size.
We currently use the is_connected field to be able to signal to
ReplayStage that a slot has replayable updates. It was discovered that
this functionality is effectively broken, and that is_connected is never
true. In order to convey this information to ReplayStage more
effectively, we need extra state information so this PR changes the
existing bool to bitflags with two bits.
From a compatibility standpoint, the is_connected bool was already
occupying one byte in the serialized SlotMeta in blockstore. Thus, the
change from a bool to bitflags still "fits" in that one byte allotment.
In consideration of a case where a client may wish to downgrade software
and use the same ledger, deserializing the bitflags into a bool could
fail if the new bit is set. As such, this PR introduces the second bit
field, but does not set it anywhere. Once clusters have mass adopted a
software version with this PR, a subsequent change to actually set and
use the new field can be introduced.
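
The downgrade concern can be illustrated with plain `u8` constants. The flag names, the position of the second bit, and the strict legacy decoder below are assumptions for illustration; the sketch assumes the older reader rejects any byte other than 0 or 1 when decoding a bool.

```rust
/// Hypothetical flag layout within the single serialized byte.
pub const CONNECTED: u8 = 0b0000_0001;
pub const SECOND_FLAG: u8 = 0b0000_0010;

/// Models an older reader that decodes the byte as a strict bool.
pub fn decode_legacy_bool(byte: u8) -> Result<bool, String> {
    match byte {
        0 => Ok(false),
        1 => Ok(true),
        other => Err(format!("invalid bool encoding: {}", other)),
    }
}

/// Models a newer reader that inspects individual bits.
pub fn is_set(byte: u8, flag: u8) -> bool {
    byte & flag == flag
}
```

As long as only the first bit is ever written, both readers agree; a byte with the second bit set would be rejected by the older reader, which is why setting it is deferred.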
The num_repair field is the only blockstore insertion metric being updated
outside of Blockstore::insert() call chain; move the update to insert()
with the rest of the fields in BlockstoreInsertionMetrics struct.
* support CpiGuard and PermanentDelegate extensions in transaction-status and account-decoder
* update transaction-status and account-decoder to new ConfidentialTransfer interfaces
* Update cost model to use requested_cu instead of estimated cu #27608
* remove CostUpdate and CostModel from replay/tvu
* revive cost update service to send cost tracker stats
* CostModel is now static
* remove unused package
Co-authored-by: Tao Zhu <tao@solana.com>
The manual Blockstore compaction that was being initiated from
LedgerCleanupService has been disabled for quite some time in favor of
several optimizations.
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
PR #28317 previously attempted to fix a case where blockstore processing
would create children banks for slots past the halt_at_slot.
However, the previous fix didn't handle the case where a slot could be
strictly less than the halt_at_slot, but have children that were greater
than the halt_at_slot. For example, this could happen if a child of slot
S is S+n where n > 1.
Thus, this change extends our processing logic to cover this second case
as well.
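
A simplified sketch of the second case, using bare slot numbers instead of Banks. The function name and signature are hypothetical; the real logic lives in the blockstore processing code and carries far more state.

```rust
/// Returns the children of `parent` that may be processed given an
/// optional halt slot. Children past the halt slot are skipped even when
/// the parent itself is strictly below it.
pub fn children_within_halt(
    parent: u64,
    children: &[u64],
    halt_at_slot: Option<u64>,
) -> Vec<u64> {
    match halt_at_slot {
        // No halt slot configured: every child is eligible.
        None => children.to_vec(),
        // The parent already reached the halt slot: create nothing.
        Some(halt) if parent >= halt => Vec::new(),
        // Parent is below the halt slot: keep only children at or below it.
        Some(halt) => children
            .iter()
            .copied()
            .filter(|child| *child <= halt)
            .collect(),
    }
}
```

With `halt_at_slot = Some(7)`, a parent at slot 5 with children 6 and 9 yields only slot 6, which is the S and S+n situation described above.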
load_frozen_forks() finds new slots to process by creating new Banks for
the children of the current slot in process_next_slots(). Prior to this
change, we would then immediately check if we had reached the
halt_at_slot and correctly halt processing when appropriate. As such, it
would be possible for Banks to be created for slots beyond the
halt_at_slot.
While a potential child slot that is past halt_at_slot wouldn't be
replayed, the Bank being created still alters some universal state in
AccountsDb. So, this change moves the halt_at_slot check before we
create children Banks in process_next_slots().
A fifo rocksdb instance must be opened with max size parameter on the
fifo columns. To support this, we previously plumbed a constant up to
callers that provided a default if unbounded growth was desired.
This change attempts to be more rusty by exposing an option for this
value, and converting the option to a constant at the lowest level
possible.
#### Summary of Changes
Removes the constant default for ShredStorageType::RocksFifo
as the shred storage size is either user-specified or derived
from --limit-ledger-size in #27459.
### Problem
When FIFO compaction is used while --rocksdb_fifo_shred_storage_size
is unspecified, the FIFO shred storage size is set to a const default based
on the default `--limit-ledger-size`.
### Summary of the Change
When --rocksdb_fifo_shred_storage_size is unspecified, it is now
derived from `--limit-ledger-size` by reserving 1500 bytes for each
shred.
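
The derivation is a single multiplication. The sketch below assumes `--limit-ledger-size` is expressed as a number of shreds; the function and constant names are hypothetical.

```rust
/// Bytes reserved per shred when deriving the FIFO storage size.
pub const BYTES_PER_SHRED: u64 = 1500;

/// Uses the explicit size when given, otherwise derives it from the
/// ledger size limit (assumed to be a shred count).
pub fn fifo_shred_storage_size(explicit_size: Option<u64>, limit_ledger_size: u64) -> u64 {
    explicit_size.unwrap_or_else(|| limit_ledger_size.saturating_mul(BYTES_PER_SHRED))
}
```

For example, a limit of 1,000 shreds with no explicit size gives 1,500,000 bytes. The saturating multiply is a defensive choice for this sketch, not something the description specifies.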
### Problem
The documentation of each column family is missing
### Summary
The goal is to create a comment block that gives a high-level description of
what each column family is about and what its key/value formats are.
This PR is the first cut that includes the key/value format of each column family.
This should at least provide an easy pointer for readers to understand what this
column family stores by searching its value type and how to access the data based
on the key type.
#### Problem
The current implementation of get_slots_since() invokes multiple rocksdb::get().
As a result, each get() operation may end up requiring one disk read. This leads
to poor performance of get_slots_since described in #24878.
#### Summary of Changes
This PR makes get_slots_since() use the batched version of multi_get() instead,
which allows multiple get operations to be processed in batch so that they can
be answered with fewer disk reads.
Several of the get() methods return a deserialized object (as opposed to
a Vec<u8>) by first getting a byte array out of Rocks, and then using
bincode::deserialize() to get the underlying type. However,
deserialize() only requires a u8 slice, not an owned Vec<u8>. So, we can
use get_pinned_cf() to reference memory owned by Rocks and avoid an
unnecessary copy.
#### Problem
Before #26651, our LedgerCleanupService needed RocksDB background
compactions to reclaim ledger disk space via our custom CompactionFilter.
However, since RocksDB's compaction isn't smart enough to know which file to pick,
we relied on the 1-day compaction period so that each file would be forced to be compacted
once a day, allowing us to reclaim ledger disk space in time. The downside of this was that
each ledger file would be rewritten once per day.
#### Summary of Changes
As #26651 makes LedgerCleanupService actively delete those files whose entire slot-range
is older than both --limit-ledger-size and the current root, we can remove the 1-day compaction
period and get rid of the daily ledger file rewrite.
The results on mainnet-beta show that this PR reduces write-bytes-per-second by ~20%
and read-bytes-per-second by ~50% on the ledger disk.
#### Problem
Blockstore operations such as get_slots_since() issue multiple rocksdb::get()
calls at once, which is not optimal for performance.
#### Summary of Changes
This PR adds LedgerColumn::multi_get() based on rocksdb::batched_multi_get(),
the optimized version of multi_get() where get requests are processed in batch
to minimize read I/O.
These methods are only used in tests, but when invoked on a merkle shred they
will always invalidate the shred because the merkle proof will no longer
verify. As a result the shred will not sanitize and blockstore will
avoid inserting them. Their use in tests will result in spurious test
coverage because the shreds will not be ingested.
The commit removes the implementation of these methods for merkle shreds.
Follow-up commits will entirely remove these methods from the shreds api.
Add ledger-tool command print-file-metadata
#### Summary of Changes
This PR adds a ledger tool subcommand print-file-metadata.
```
USAGE:
solana-ledger-tool print-file-metadata [FLAGS] [OPTIONS] [SST_FILE_NAME]
Prints the metadata of the specified ledger-store file.
If no file name is specified, then it will print the metadata of all ledger files
```
#### Summary of Changes
Add code comments for lowest_cleanup_slot related functions to improve
the code readability for the consistency between blockstore purge logic
and the read side.
#### Problem
RocksDB's delete_range applies to [from, to) while delete_file_in_range
applies to [from, to] by default, and the rust-rocksdb api does not include
the option to make delete_file_in_range apply to [from, to). Such inconsistency
might cause `blockstore::run_purge` to produce an inconsistent result as it
invokes both delete_range and delete_file_in_range.
#### Summary of Changes
This PR makes all our purge / delete related functions inclusive
on both starting and ending slots.
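
The off-by-one between the two conventions can be shown with plain slot ranges. The helper names are hypothetical, and the sketch models only the range semantics, not the RocksDB calls themselves.

```rust
/// Whether `slot` is covered by a half-open range [from, to).
pub fn covered_half_open(slot: u64, from: u64, to: u64) -> bool {
    (from..to).contains(&slot)
}

/// Whether `slot` is covered by an inclusive range [from, to].
pub fn covered_inclusive(slot: u64, from: u64, to: u64) -> bool {
    (from..=to).contains(&slot)
}

/// Converts an inclusive end slot into the exclusive end that a
/// half-open API needs. Returns None when `to` is already u64::MAX.
pub fn exclusive_end(to_inclusive: u64) -> Option<u64> {
    to_inclusive.checked_add(1)
}
```

Passing the same `to` to both conventions leaves the last slot purged by one and kept by the other, which is the inconsistency the PR removes by making every entry point inclusive.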
Tenets:
1. Limit thread names to 15 characters
2. Prefix all Solana-controlled threads with "sol"
3. Use Camel case. It's more character dense than Snake or Kebab case
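
A small sketch of how the tenets could be checked and applied with the standard library. The validator function and the example thread names are hypothetical, not names used in the codebase.

```rust
use std::io;
use std::thread::{self, JoinHandle};

/// Maximum thread name length from tenet 1.
pub const MAX_THREAD_NAME_CHARS: usize = 15;

/// Checks a candidate name against the three tenets: at most 15
/// characters, a "sol" prefix, and no snake or kebab separators.
pub fn is_valid_thread_name(name: &str) -> bool {
    name.chars().count() <= MAX_THREAD_NAME_CHARS
        && name.starts_with("sol")
        && !name.contains('_')
        && !name.contains('-')
}

/// Spawns a named thread that reports its own name back to the caller.
pub fn spawn_named(name: &str) -> io::Result<JoinHandle<Option<String>>> {
    thread::Builder::new()
        .name(name.to_string())
        .spawn(|| thread::current().name().map(|n| n.to_owned()))
}
```

A name such as `solExampleThrd` passes the check, while `solana-example-thread` fails on both length and separators.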
The commit
* Identifies Merkle shreds when recovering from erasure codes and
dispatches specialized code to reconstruct shreds.
* Coding shred headers are added to recovered erasure shards.
* Merkle tree is reconstructed for the erasure batch and added to
recovered shreds.
* The common signature (for the root of Merkle tree) is attached to all
recovered shreds.
#### Problem
get_entries_in_data_block() panics when there's inconsistency between
slot_meta and data_shred.
However, as we don't lock on reads, reading across multiple column families is
not atomic (especially for older slots) and thus does not guarantee consistency
as the background cleanup service could purge the slot in the middle. Such
panic was reported in #26980 when the validator serves a high load of RPC calls.
#### Summary of Changes
This PR makes get_entries_in_data_block() panic only when the inconsistency
between slot-meta and data-shred happens on a slot older than lowest_cleanup_slot.
As a consequence of removing buffering when generating coding shreds:
https://github.com/solana-labs/solana/pull/25807
more coding shreds are generated than data shreds, and so
MAX_CODE_SHREDS_PER_SLOT needs to be adjusted accordingly.
The respective value is tied to ERASURE_BATCH_SIZE.
Given the 32:32 erasure recovery schema, the current implementation requires
exactly 32 data shreds to generate coding shreds for the batch (except
for the final erasure batch in each slot).
As a result, when serializing ledger entries to data shreds, if the
number of data shreds is not a multiple of 32, the coding shreds for the
last batch cannot be generated until there are more data shreds to
complete the batch to 32 data shreds. This adds latency in generating
and broadcasting coding shreds.
In addition, with Merkle variants for shreds, data shreds cannot be
signed and broadcasted until coding shreds are also generated. As a
result *both* code and data shreds will be delayed before broadcast if
we still require exactly 32 data shreds for each batch.
This commit instead always generates and broadcasts coding shreds as soon
as there is any number of data shreds available. When serializing entries
to shreds:
* if the number of resulting data shreds is less than 32, then more
coding shreds will be generated so that the resulting erasure batch
has the same recovery probabilities as a 32:32 batch.
* if the number of data shreds is more than 32, then the data shreds are
split uniformly into erasure batches with _at least_ 32 data shreds in
each batch. Each erasure batch will have the same number of code and
data shreds.
For example:
* If there are 19 data shreds, 27 coding shreds are generated. The
resulting 19(data):27(code) erasure batch has the same recovery
probabilities as a 32:32 batch.
* If there are 107 data shreds, they are split into 3 batches of 36:36,
36:36 and 35:35 data:code shreds each.
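
The uniform split for the case of more than 32 data shreds can be sketched as follows. The function name is hypothetical, and the sketch covers only the data-shred split: the number of coding shreds for a batch with fewer than 32 data shreds comes from a recovery-probability mapping that is not reproduced here.

```rust
/// Minimum number of data shreds per erasure batch when splitting.
pub const DATA_SHREDS_PER_BATCH: usize = 32;

/// Splits `num_data_shreds` into as many batches as possible while
/// keeping at least 32 data shreds in each, with sizes differing by at
/// most one. Fewer than 32 data shreds form a single smaller batch.
pub fn split_into_batches(num_data_shreds: usize) -> Vec<usize> {
    if num_data_shreds == 0 {
        return Vec::new();
    }
    let num_batches = (num_data_shreds / DATA_SHREDS_PER_BATCH).max(1);
    let base = num_data_shreds / num_batches;
    let extra = num_data_shreds % num_batches;
    // The first `extra` batches take one additional shred.
    (0..num_batches)
        .map(|index| base + usize::from(index < extra))
        .collect()
}
```

For 107 data shreds this gives batches of 36, 36 and 35, matching the example above; each of those batches then gets an equal number of coding shreds.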
A consequence of this change is that code and data shreds indices will
no longer align as there will be more coding shreds than data shreds
(not only in the last batch in each slot but also in the intermediate
ones);