If a slot is marked as optimistically confirmed, it is probable but not
guaranteed that its' ancestors will also be marked as optimistically
confirmed in the Blockstore. Given the importance of examining
optimistically confirmed slots around cluster restarts, manually walk
an AncestorIterator to avoid the chance of a slot improperly being
ignored in cluster restart scenarios.
The optional args allow reuse by ledger-tool repair roots command Also,
hold cleanup lock for duration of Blockstore::scan_and_fix_roots().
This prevents a scenario where scan_and_fix_roots() could identify a
slot as needing to be marked root, that slot getting cleaned by
LedgerCleanupService, and then scan_and_fix_roots() marking the slot as
root on the now purged slot.
* Notify replay of pruned duplicate confirmed slots
* Ingest replay signal and run ancestor hashes for pruned
* Forward PDC to ancestor hashes and ingest pruned dumps from ancestor hashes service
* Add local-cluster test
Additionally, change the function definition to use Slot instead of u64.
Slot is defined as an alias of u64 so these are functionally equivalent,
but using Slot over u64 is more expressive of the intent.
Slots refer to time windows where a block could be produced whereas
height refers to how many blocks are actually in a fork. This function
is operating on a list of slots, so the use of "height" is incorrect and
misleading.
In cluster restart scenarios, it is desirable to know the latest
optimistic slot(s), which the subcommand latest-optimistic-slots will
return. However, it is also useful to know whether slots contain only
votes or if they contain votes and user transactions.
This PR adds an extra column of output to show whether an optimistically
confirmed slot is vote only (contains zero non-vote transactions).
Additionally, a flag has been added to enable filtering output to
exclude vote only slots.
Fix SlotMeta is_connected tracking
The tracking of connected status was previously based upon an assumption
that would be practically false for all validators. The connected status
of slots played into whether Blockstore would signal ReplayStage that it
had new shreds ready to be replayed. Prior to the change, we would never
signal and ReplayStage would always wait the entire duration of a 100ms
timeout before restarting its' main processing loop.
This commit introduces a change where we mark snapshot slots as
connected. A validator may not have a path all the way back to genesis
itself; however, snapshots are taken at known roots so we extend the
connected status to these slots. Once a node has been bootstrapped once
to have is connected, the logic persists in Blockstore such that all
children on the main fork also get their connected status updated
properly.
For duplicate block detection, for each (slot, shred-index, shred-type)
we need to allow 2 different shreds to be retransmitted.
The commit implements this using two bloom-filter dedupers:
* Shreds are deduplicated using the 1st deduper.
* If a shred is not a duplicate, then we check if:
(slot, shred-index, shred-type, k)
is not a duplicate for either k = 0 or k = 1 using the 2nd deduper,
and if so then the shred is retransmitted.
This allows to achieve larger capactiy compared to current LRU-cache.
This deadlock could only occur on nodes that call
Blockstore::get_rooted_block(). Regular validators don't call this
function, RPC nodes and nodes that have BigTableUploadService enabled
do.
Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot
right at the start to check if the block has been cleaned up, and to
ensure it doesn't get cleaned up during execution. As part of the
callstack of get_rooted_block(), Blockstore::get_completed_ranges() will
get called, which also grabs a read lock on lowest_cleanup_slot.
If LedgerCleanupService attempts to grab a write lock between the read
lock calls, we could hit a deadlock if priority is given to the write
lock request in this scenario.
This change removes the call to get the read lock in
get_completed_ranges(). The lock is only held for the scope of this
function, which is a single rocksdb read and thus not needed. This does
mean that a different error will be returned if the requested slot was
below lowest_cleanup_slot. Previously, a BlockstoreError::SlotCleanedUp
would have been thrown; the rocksdb error will be bubbled up now. Note
that callers of get_rooted_block() will still get the SlotCleanedUp
error when appropriate because get_rooted_block() grabs the lock. If the
slot is unavailable, it will return immediately. If the slot is
available, get_rooted_block() holding the lock means the slot will
remain available.
* Add address lookup tables to minimized snapshot
* Add comment for future posterity
* Add reference to the issue
* Adjust comment a bit
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
---------
Co-authored-by: Andrew Fitzgerald <apfitzge@gmail.com>
* Add fully-reproducible online tracer for banking
* Don't use eprintln!()...
* Update programs/sbf/Cargo.lock...
* Remove meaningless assert_eq
* Group test-only code under aptly named mod
* Remove needless overflow handling in receive_until
* Delay stat aggregation as it's possible now
* Use Cow to avoid needless heap allocs
* Properly consume metrics action as soon as hold
* Trace UnprocessedTransactionStorage::len() instead
* Loosen joining api over type safety for replaystage
* Introce hash event to override these when simulating
* Use serde_with/serde_as instead of hacky workaround
* Update another Cargo.lock...
* Add detailed comment for Packet::buffer serialize
* Rename sender_overhead_minimized_receiver_loop()
* Use type interference for TraceError
* Another minor rename
* Retire now useless ForEach to simplify code
* Use type alias as much as possible
* Properly translate and propagate tracing errors
* Clarify --enable-banking-trace with better naming
* Consider unclean (signal-based) node restarts..
* Tweak logging and cli
* Remove Bank events as it's not needed anymore
* Make tpu own banking tracer thread
* Reduce diff a bit..
* Use latest serde_with
* Finally use the published rolling-file crate
* Make test code change more consistent
* Revive dead and non-terminating test code path...
* Dispose batches early now that possible
* Split off thread handle very early at ::new()
* Tweak message for TooSmallDirByteLimitl
* Remove too much of indirection
* Remove needless pub from ::channel()
* Clarify test comments
* Avoid needless event creation if tracer is disabled
* Write tests around file rotation and spill-over
* Remove unneeded PathBuf::clone()s...
* Introduce inner struct instead of tuple...
* Remove unused enum BankStatus...
* Avoid .unwrap() for the case of disabled tracer...
* Increase turbine propagation const
Value is used as a delay threshold for issuing shred repairs and analysis is showing we are overly aggressive in requesting repairs. Shreds show up via turbine before the repair completes the vast majority of the time
* Use Duration type for MAX_TURBINE_PROPAGATION
Store non-vote transaction counts that are now recorded by the banks
into the `blockstore`.
`SamplePerformanceService` now populates `PerfSampleV2` with counts from
the banks.
We currently use the is_connected field to be able to signal to
ReplayStage that a slot has replayable updates. It was discovered that
this functionality is effectively broken, and that is_connected is never
true. In order to convey this information to ReplayStage more
effectively, we need extra state information so this PR changes the
existing bool to bitflags with two bits.
From a compatibility standpoint, the is_connected bool was already
occupying one byte in the serialized SlotMeta in blockstore. Thus, the
change from a bool to bitflags still "fits" in that one byte allotment.
In consideration of a case where a client may wish to downgrade software
and use the same ledger, deserializing the bitflags into a bool could
fail if the new bit is set. As such, this PR introduces the second bit
field, but does not set it anywhere. Once clusters have mass adopted a
software version with this PR, a subsequent change to actually set and
use the new field can be introduced.
The num_repair field is only blockstore insertion metric being updated
outside of Blockstore::insert() call chain; move the update to insert()
with the rest of the fields in BlockstoreInsertionMetrics struct.
The manual Blockstore compaction that was being initiated from
LedgerCleanupService has been disabled for quite some time in favor of
several optimizations.
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
#### Summary of Changes
Removes the constant default for ShredStorageType::RocksFifo
as the shred storage size is either user-specified or derived
from --limit-ledger-size in #27459.
#### Problem
The current implementation of get_slots_since() invokes multiple rocksdb::get().
As a result, each get() operation may end up requiring one disk read. This leads
to poor performance of get_slots_since described in #24878.
#### Summary of Changes
This PR makes get_slots_since() use the batched version of multi_get() instead,
which allows multiple get operations to be processed in batch so that they can
be answered with fewer disk reads.
#### Problem
Blockstore operations such as get_slots_since() issues multiple rocksdb::get()
at once which is not optimal for performance.
#### Summary of Changes
This PR adds LedgerColumn::multi_get() based on rocksdb::batched_multi_get(),
the optimized version of multi_get() where get requests are processed in batch
to minimize read I/O.
Add ledger-tool command print-file-metadata
#### Summary of Changes
This PR adds a ledger tool subcommand print-file-metadata.
```
USAGE:
solana-ledger-tool print-file-metadata [FLAGS] [OPTIONS] [SST_FILE_NAME]
Prints the metadata of the specified ledger-store file.
If no file name is unspecified, then it will print the metadata of all ledger files
```