#### Problem
Snapshot names are overloaded, and there are multiple terms that mean the same thing. This is confusing. Here's a list of ones in the codebase that I've found:
```
- snapshot_dir
- snapshots_dir
- snapshot_path
- snapshot_output_dir
- snapshot_package_output_path
- snapshot_archives_dir
```
#### Summary of Changes
For all the ones that are about the directory where snapshot archives are stored, ensure they are `snapshot_archives_dir`. For the ones about the (bank) snapshots directory, set to `bank_snapshots_dir`.
Co-authored-by: Michael Vines <mvines@gmail.com>
While reviewing PR #18565, as issue was brought up to refactor some code
around verifying the bank after rebuilding from snapshots. A new
top-level function has been added to get the latest snapshot archives
and load the bank then verify. Additionally, new tests have been
written and existing tests have been updated to use this new function.
Fixes#18973
While resolving the issue, it became clear there was some additional
low-hanging fruit this change enabled. Specifically, the functions
`bank_to_xxx_snapshot_archive()` now return their respective
`SnapshotArchiveInfo`. And on the flip side,
`bank_from_snapshot_archives()` now takes `SnapshotArchiveInfo`s instead
of separate paths and archive formats. This bundling simplifies bank
rebuilding.
This commit also renames `snapshot_interval_slots` to
`full_snapshot_archive_interval_slots`, updates the comments on the
fields, and make appropriate updates where SnapshotConfig is used.
This commit adds high-level functions for creating and loading-from
incremental snapshots, plus all low-level functions required to perform
those tasks. This commit **does not** add taking incremental snapshots
as part of a running validator, nor starting up a node with an
incremental snapshot; just laying ground work.
Additionally, `snapshot_utils` and `serde_snapshot` have been
refactored to use a common code paths for the different snapshots.
Also of note, some renaming has happened:
1. Snapshots are now either `full_` or `incremental_` throughout the
codebase. If not specified, the code applies to both.
2. Bank snapshots now are called "bank snapshots"
(before they were called "slot snapshots", "bank snapshots", or
just "snapshots"). The one exception is within `Bank`, where they
are still just "snapshots", because they are already "bank
snapshots".
3. Snapshot archives now have `_archive` in the code. This
should clear up an ambiguity between bank snapshots and snapshot
archives.
* update ledger tool to restore cost model from blockstore when compute-slot-cost
* Move initialize_cost_table into cost_model, so the function can be tested and shared between validator and ledger-tool
* refactor and simplify a test
* Add `ProgramCosts` Column Family to blockstore, implement LedgerColumn; add `delete_cf` to Rocks
* Add ProgramCosts to compaction excluding list alone side with TransactionStatusIndex in one place: `excludes_from_compaction()`
* Write cost table to blockstore after `replay_stage` replayed active banks; add stats to measure persist time
* Deletes program from `ProgramCosts` in blockstore when they are removed from cost_table in memory
* Only try to persist to blockstore when cost_table is changed.
* Restore cost table during validator startup
* Offload `cost_model` related operations from replay main thread to dedicated service thread, add channel to send execute_timings between these threads;
* Move `cost_update_service` to its own module; replay_stage is now decoupled from cost_model.
When starting a validator, the node initially joins gossip with
shred_verison = 0, until it adopts the entrypoint's shred-version:
https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417
Depending on the load on the entrypoint, this adopting entrypoint
shred-version through gossip sometimes becomes very slow, and causes
several problems in gossip because we have to partially support
shred_version == 0 which is a source of leaking crds values from one
cluster to another. e.g. see
https://github.com/solana-labs/solana/pull/17899
and the other linked issues there.
In order to remove shred_version == 0 from gossip, this commit adds
shred-version to ip-echo-server response. Once the entrypoints are
updated, on validator start-up, if --expected_shred_version is not
specified we will obtain shred-version from the entrypoint using
ip-echo-server.
1. Added both options for measuring space usage using total accounts usage and for individual store shrink ratio using an enum. Validator CLI options: --accounts-shrink-optimize-total-space and --accounts-shrink-ratio
2. Added code for selecting candidates based on total usage in a separate function select_candidates_by_total_usage
3. Added unit tests for the new functions added
4. The default implementations is kept at 0.8 shrink ratio with --accounts-shrink-optimize-total-space set to true
Fixes#17544
* replay stage feeds back realtime per-program execution cost to cost model;
* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;
* changed cost unit to microsecond, using value collected from mainnet;
* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
* Create solana-poh crate
* Move BigTableUploadService to solana-ledger
* Add solana-rpc to workspace
* Move dependencies to solana-rpc
* Move remaining rpc modules to solana-rpc
* Single use statement solana-poh
* Single use statement solana-rpc
* Update rocksdb to v0.16.0
* Promote the infrequent and important log to info!
* Force background compaction by ttl without manual compaction
* Fix test
* Support no compaction mode in test_ledger_cleanup_compaction
* Fix comment
* Make compaction_interval customizable
* Avoid major compaction with periodic filtering...
* Adress lazy_static, special cfs and range check
* Clean up a bit and add comment
* Add comment
* More comments...
* Config code cleanup
* Add comment
* Use .conflicts_with()
* Nullify unneeded delete_range ops for special CFs
* Some clean ups
* Clarify the locking intention
* Ensure special CFs' consistency with PurgeType::CompactionFilter
* Fix comment
* Fix bad copy paste
* Fix various types...
* Don't use tuples
* Add a unit test for compaction_filter
* Fix typo...
* Remove flag and just use new behavior always
* Fix wrong condition negation...
* Doc. about no set_last_purged_slot in purge_slots
* Write a test and fix off-by-one bug....
* Apply suggestions from code review
Co-authored-by: Tyera Eulberg <teulberg@gmail.com>
* Follow up to github review suggestions
* Fix line-wrapping
* Fix conflict
Co-authored-by: Tyera Eulberg <teulberg@gmail.com>
* Add BlockHeight CF to blockstore
* Rename CacheBlockTimeService to be more general
* Cache block-height using service
* Fixup previous proto mishandling
* Add block_height to block structs
* Add block-height to solana block
* Fallback to BankForks if block time or block height are not yet written to Blockstore
* Add docs
* Review comments
* Move gossip modules to solana-gossip
* Update Protocol abi digest due to move
* Move gossip benches and hook up CI
* Remove unneeded Result entries
* Single use statements
* Add blockstore-root-scan for api nodes on boot
* Ensure cluster-confirmed root and parents are set as root in blockstore in load_frozen_forks()
* Plumb rpc-scan-and-fix-roots validator flag
* purge_old_snapshot_archives is changed to take an extra argument 'maximum_snapshots_to_retain' to control the max number of latest snapshot archives to retain. Note the oldest snapshot is always retained as before and is not subjected to this new options.
* The validator and ledger-tool executables are modified with a CLI argument --maximum-snapshots-to-retain. And the options are propagated down the call chains. Their corresponding shell scripts were changed accordingly.
* SnapshotConfig is modified to have an extra field for the maximum_snapshots_to_retain
* Unit tests are developed to cover purge_old_snapshot_archives
* Require that blockstore block-time only be recognized slot, instead of root
* Move cache_block_time to after Bank freeze
* Single use statement
* Pass transaction_status_sender by reference
* Remove unnecessary slot-existence check before caching block time altogether
* Move block-time existence check into Blockstore::cache_block_time, Blockstore no longer needed in blockstore_processor helper
rocksdb compaction can cause long stalls, so
make it more configurable to try and reduce those stalls
and also to coordinate between multiple nodes to not induce
stall at the same time.
* Add validator flag to opt in to cpi and logs storage
* Default TestValidator to opt-in; allow using in multinode-demo
* No clone
Co-authored-by: Carl Lin <carl@solana.com>
* Deprecate commitment variants
* Add new CommitmentConfig builders
* Add helpers to avoid allowing deprecated variants
* Remove deprecated transaction-status code
* Include new commitment variants in runtime commitment; allow deprecated as long as old variants persist
* Remove deprecated banks code
* Remove deprecated variants in core; allow deprecated in rpc/rpc-subscriptions for now
* Heavier hand with rpc/rpc-subscription commitment
* Remove deprecated variants from local-cluster
* Remove deprecated variants from various tools
* Remove deprecated variants from validator
* Update docs
* Remove deprecated client code
* Add new variants to cli; remove deprecated variants as possible
* Don't send new commitment variants to old clusters
* Retain deprecated method in test_validator_saves_tower
* Fix clippy matches! suggestion for BPF solana-sdk legacy compile test
* Refactor node version check to handle commitment variants and transaction encoding
* Hide deprecated variants from cli help
* Add cli App comments
* Adds a CLI option to the validator to enable just-in-time compilation of BPF.
* Refactoring to use bpf_loader_program instead of feature_set to pass JIT flag from the validator CLI to the executor.
Gossip and other places repeatedly de-serialize vote-state stored in
vote accounts. Ideally the first de-serialization should cache the
result.
This commit adds new VoteAccount type which lazily de-serializes
VoteState from Account data and caches the result internally.
Serialize and Deserialize traits are manually implemented to match
existing code. So, despite changes to frozen_abi, this commit should be
backward compatible.
The --rpc-pubsub-enable-vote-subscription flag may be used to enable it.
The current vote subscription is problematic because it emits a
notification for *every* vote, so hundreds a second in a real cluster.
Critically it's also missing information about *who* is voting,
rendering all those notifications practically useless.
Until these two issues can be resolved, the vote subscription is not
much more than a potential DoS vector.
* Discard pre hard fork persisted tower if hard-forking
* Relax config.require_tower
* Add cluster test
* nits
* Remove unnecessary check
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
Co-authored-by: Carl Lin <carl@solana.com>
* Add service to track the most recent optimistically confirmed bank
* Plumb service into ClusterInfoVoteListener and ReplayStage
* Clean up test
* Use OptimisticallyConfirmedBank in RPC
* Remove superfluous notifications from RpcSubscriptions
* Use crossbeam to avoid mpsc recv_timeout panic
* Review comments
* Remove superfluous last_checked_slots, but pass in OptimisticallyConfirmedBank for complete correctness
* Add blockstore column to store performance sampling data
* introduce timer and write performance metrics to blockstore
* introduce getRecentPerformanceSamples rpc
* only run on rpc nodes enabled with transaction history
* add unit tests for get_recent_performance_samples
* remove RpcResponse from rpc call
* refactor to use Instant::now and elapsed for timer
* switch to root bank and ensure not negative subraction
* Add PerfSamples to purge/compaction
* refactor to use Instant::now and elapsed for timer
* switch to root bank and ensure not negative subraction
* remove duplicate constants
Co-authored-by: Tyera Eulberg <tyera@solana.com>
* Save/restore Tower
* Avoid unwrap()
* Rebase cleanups
* Forcibly pass test
* Correct reconcilation of votes after validator resume
* d b g
* Add more tests
* fsync and fix test
* Add test
* Fix fmt
* Debug
* Fix tests...
* save
* Clarify error message and code cleaning around it
* Move most of code out of tower save hot codepath
* Proper comment for the lack of fsync on tower
* Clean up
* Clean up
* Simpler type alias
* Manage tower-restored ancestor slots without banks
* Add comment
* Extract long code blocks...
* Add comment
* Simplify returned tuple...
* Tweak too aggresive log
* Fix typo...
* Add test
* Update comment
* Improve test to require non-empty stray restored slots
* Measure tower save and dump all tower contents
* Log adjust and add threshold related assertions
* cleanup adjust
* Properly lower stray restored slots priority...
* Rust fmt
* Fix test....
* Clarify comments a bit and add TowerError::TooNew
* Further clean-up arround TowerError
* Truly create ancestors by excluding last vote slot
* Add comment for stray_restored_slots
* Add comment for stray_restored_slots
* Use BTreeSet
* Consider root_slot into post-replay adjustment
* Tweak logging
* Add test for stray_restored_ancestors
* Reorder some code
* Better names for unit tests
* Add frozen_abi to SavedTower
* Fold long lines
* Tweak stray ancestors and too old slot history
* Re-adjust error conditon of too old slot history
* Test normal ancestors is checked before stray ones
* Fix conflict, update tests, adjust behavior a bit
* Fix test
* Address review comments
* Last touch!
* Immediately after creating cleaning pr
* Revert stray slots
* Revert comment...
* Report error as metrics
* Revert not to panic! and ignore unfixable test...
* Normalize lockouts.root_slot more strictly
* Add comments for panic! and more assertions
* Proper initialize root without vote account
* Clarify code and comments based on review feedback
* Fix rebase
* Further simplify based on assured tower root
* Reorder code for more readability
Co-authored-by: Michael Vines <mvines@gmail.com>
* Add blockstore column to cache block times
* Add method to cache block time
* Add service to cache block time
* Update rpc getBlockTime to use new method, and refactor blockstore slightly
* Return block_time with confirmed block, if available
* Add measure and warning to cache-block-time
* Remove support for Budget
Also:
* Make "pay" command a deprecated alias for the "transfer" command
* chore: remove budget from web3.js
* Drop Budget depedency from core
Validators no longer ship with builtin Budget
* Plumb replay vote channel for notifying vote listener of replay votes
* Keep gossip only notification for debugging gossip in the future
Co-authored-by: Carl <carl@solana.com>
* Plumb votes into repair service
* Remove refactoring
* Fix tests
* Switch to using RepairWeight for generating repairs
* Revert "Weight repair slots based on vote stake (#10741)"
This reverts commit cabd0a09c3.
* Update logging
Co-authored-by: Carl <carl@solana.com>
* Remove Blockstore member variable from BlockCommitmentCache
* Hoist is_confirmed_rooted() to its only caller
BlockCommitmentCache no longer depends on Blockstore
* Move BlockCommitmentCache to solana_runtime