This is the 2nd installment for the AccountsDb replication.
Summary of Changes
The basic google protocol buffer protocol for replicating updated slots and accounts. tonic/tokio is used for transporting the messages.
The basic framework of the client and server for replicating slots and accounts -- the persisting of accounts in the replica-side will be done at the next PR -- right now -- the accounts are streamed to the replica-node and dumped. Replication for information about Bank is also not done in this PR -- to be addressed in the next PR to limit the change size.
Functionality used by both the client and server side are encapsulated in the replica-lib crate.
There is no impact to the existing validator by default.
Tests:
Observe the confirmed slots replicated to the replica-node.
Observe the accounts for the confirmed slot are received at the replica-node side.
#### Problem
Snapshot names are overloaded, and there are multiple terms that mean the same thing. This is confusing. Here's a list of ones in the codebase that I've found:
```
- snapshot_dir
- snapshots_dir
- snapshot_path
- snapshot_output_dir
- snapshot_package_output_path
- snapshot_archives_dir
```
#### Summary of Changes
For all the ones that are about the directory where snapshot archives are stored, ensure they are `snapshot_archives_dir`. For the ones about the (bank) snapshots directory, set to `bank_snapshots_dir`.
Co-authored-by: Michael Vines <mvines@gmail.com>
While reviewing PR #18565, as issue was brought up to refactor some code
around verifying the bank after rebuilding from snapshots. A new
top-level function has been added to get the latest snapshot archives
and load the bank then verify. Additionally, new tests have been
written and existing tests have been updated to use this new function.
Fixes#18973
While resolving the issue, it became clear there was some additional
low-hanging fruit this change enabled. Specifically, the functions
`bank_to_xxx_snapshot_archive()` now return their respective
`SnapshotArchiveInfo`. And on the flip side,
`bank_from_snapshot_archives()` now takes `SnapshotArchiveInfo`s instead
of separate paths and archive formats. This bundling simplifies bank
rebuilding.
This commit also renames `snapshot_interval_slots` to
`full_snapshot_archive_interval_slots`, updates the comments on the
fields, and make appropriate updates where SnapshotConfig is used.
This commit adds high-level functions for creating and loading-from
incremental snapshots, plus all low-level functions required to perform
those tasks. This commit **does not** add taking incremental snapshots
as part of a running validator, nor starting up a node with an
incremental snapshot; just laying ground work.
Additionally, `snapshot_utils` and `serde_snapshot` have been
refactored to use a common code paths for the different snapshots.
Also of note, some renaming has happened:
1. Snapshots are now either `full_` or `incremental_` throughout the
codebase. If not specified, the code applies to both.
2. Bank snapshots now are called "bank snapshots"
(before they were called "slot snapshots", "bank snapshots", or
just "snapshots"). The one exception is within `Bank`, where they
are still just "snapshots", because they are already "bank
snapshots".
3. Snapshot archives now have `_archive` in the code. This
should clear up an ambiguity between bank snapshots and snapshot
archives.
1. Added both options for measuring space usage using total accounts usage and for individual store shrink ratio using an enum. Validator CLI options: --accounts-shrink-optimize-total-space and --accounts-shrink-ratio
2. Added code for selecting candidates based on total usage in a separate function select_candidates_by_total_usage
3. Added unit tests for the new functions added
4. The default implementations is kept at 0.8 shrink ratio with --accounts-shrink-optimize-total-space set to true
Fixes#17544
* Create solana-poh crate
* Move BigTableUploadService to solana-ledger
* Add solana-rpc to workspace
* Move dependencies to solana-rpc
* Move remaining rpc modules to solana-rpc
* Single use statement solana-poh
* Single use statement solana-rpc
Convert to use type alias for the callback and cascade the changes to callers. Thanks @jeffwashington for the help making it possible.
Changed the closure for the progress update in the validator main to FnMut and modify the abort count in the closure which is more reliable.
* Move gossip modules to solana-gossip
* Update Protocol abi digest due to move
* Move gossip benches and hook up CI
* Remove unneeded Result entries
* Single use statements
1. Allow the validator bootstrap code to specify the minimal snapshot download speed. If the snapshot download speed is detected below that, a different RPC can be retried. The default is 10MB/sec.
2. To prevent spinning on a number of sub-optimal choices and not making progress, the abort/retry logic is implemented with the following safe guards:
2.1 at maximum we do this retry for 5 times -- this number is configurable with default 5.
2.2 if the download in one notification round (5 second) is more than 2%, do not do retry -- it is not as bad anyway.
2.3 if the remaining estimate time is less than 1 minutes, do not abort retry as it will be done quickly anyway.
2.4 We do this abort/retry logic only at the first notification to avoid wasting download efforts -- the reasoning is being opportunistic and too greedy may not achieve overall shorter download time.
3. The download_snapshot and download_file is modified with the option allowing caller to notified of download progress via a callback. This allows the business logic of retrying to the place it belongs.
* Upgrade Rust to 1.52.0
update nightly_version to newly pushed docker image
fix clippy lint errors
1.52 comes with grcov 0.8.0, include this version to script
* upgrade to Rust 1.52.1
* disabling Serum from downstream projects until it is upgraded to Rust 1.52.1
* purge_old_snapshot_archives is changed to take an extra argument 'maximum_snapshots_to_retain' to control the max number of latest snapshot archives to retain. Note the oldest snapshot is always retained as before and is not subjected to this new options.
* The validator and ledger-tool executables are modified with a CLI argument --maximum-snapshots-to-retain. And the options are propagated down the call chains. Their corresponding shell scripts were changed accordingly.
* SnapshotConfig is modified to have an extra field for the maximum_snapshots_to_retain
* Unit tests are developed to cover purge_old_snapshot_archives
* Deprecate commitment variants
* Add new CommitmentConfig builders
* Add helpers to avoid allowing deprecated variants
* Remove deprecated transaction-status code
* Include new commitment variants in runtime commitment; allow deprecated as long as old variants persist
* Remove deprecated banks code
* Remove deprecated variants in core; allow deprecated in rpc/rpc-subscriptions for now
* Heavier hand with rpc/rpc-subscription commitment
* Remove deprecated variants from local-cluster
* Remove deprecated variants from various tools
* Remove deprecated variants from validator
* Update docs
* Remove deprecated client code
* Add new variants to cli; remove deprecated variants as possible
* Don't send new commitment variants to old clusters
* Retain deprecated method in test_validator_saves_tower
* Fix clippy matches! suggestion for BPF solana-sdk legacy compile test
* Refactor node version check to handle commitment variants and transaction encoding
* Hide deprecated variants from cli help
* Add cli App comments
* Discard pre hard fork persisted tower if hard-forking
* Relax config.require_tower
* Add cluster test
* nits
* Remove unnecessary check
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
Co-authored-by: Carl Lin <carl@solana.com>
* Fix tower/blockstore unsync due to external causes
* Add and clean up long comments
* Clean up test
* Comment about warped_slot_history
* Run test_future_tower with master-only/master-slave
* Update comments about false leader condition
* Follow up to persistent tower
* Ignore for now...
* Hard-code validator identities for easy reasoning
* Add a test for opt. conf violation without tower
* Fix compile with rust < 1.47
* Remove unused method
* More move of assert tweak to the asser pr
* Add comments
* Clean up
* Clean the test addressing various review comments
* Clean up a bit
* Save/restore Tower
* Avoid unwrap()
* Rebase cleanups
* Forcibly pass test
* Correct reconcilation of votes after validator resume
* d b g
* Add more tests
* fsync and fix test
* Add test
* Fix fmt
* Debug
* Fix tests...
* save
* Clarify error message and code cleaning around it
* Move most of code out of tower save hot codepath
* Proper comment for the lack of fsync on tower
* Clean up
* Clean up
* Simpler type alias
* Manage tower-restored ancestor slots without banks
* Add comment
* Extract long code blocks...
* Add comment
* Simplify returned tuple...
* Tweak too aggresive log
* Fix typo...
* Add test
* Update comment
* Improve test to require non-empty stray restored slots
* Measure tower save and dump all tower contents
* Log adjust and add threshold related assertions
* cleanup adjust
* Properly lower stray restored slots priority...
* Rust fmt
* Fix test....
* Clarify comments a bit and add TowerError::TooNew
* Further clean-up arround TowerError
* Truly create ancestors by excluding last vote slot
* Add comment for stray_restored_slots
* Add comment for stray_restored_slots
* Use BTreeSet
* Consider root_slot into post-replay adjustment
* Tweak logging
* Add test for stray_restored_ancestors
* Reorder some code
* Better names for unit tests
* Add frozen_abi to SavedTower
* Fold long lines
* Tweak stray ancestors and too old slot history
* Re-adjust error conditon of too old slot history
* Test normal ancestors is checked before stray ones
* Fix conflict, update tests, adjust behavior a bit
* Fix test
* Address review comments
* Last touch!
* Immediately after creating cleaning pr
* Revert stray slots
* Revert comment...
* Report error as metrics
* Revert not to panic! and ignore unfixable test...
* Normalize lockouts.root_slot more strictly
* Add comments for panic! and more assertions
* Proper initialize root without vote account
* Clarify code and comments based on review feedback
* Fix rebase
* Further simplify based on assured tower root
* Reorder code for more readability
Co-authored-by: Michael Vines <mvines@gmail.com>
* Use untagged RpcSignatureResult enum to avoid breaking downstream consumers of current signature subscriptions
* Clean up client duplication
* Clippy
* Implement Debug for MessageProcessor
* Switch from delta-based gating to whole-set gating
* Remove dbg!
* Fix clippy
* Clippy
* Add test
* add loader to stable operating mode at proper epoch
* refresh_programs_and_inflation after ancestor setup
* Callback via snapshot; avoid account re-add; Debug
* Fix test
* Fix test and fix the past history
* Make callback management stricter and cleaner
* Fix test
* Test overwrite and frozen for native programs
* Test epoch callback with genesis-programs
* Add assertions for parent bank
* Add tests and some minor cleaning
* Remove unsteady assertion...
* Fix test...
* Fix DOS
* Skip ensuring account by dual (whole/delta) gating
* Fix frozen abi implementation...
* Move compute budget constatnt init back into bank
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
* Make Message::new_with_payer the default constructor
* Remove Transaction::new_[un]signed_instructions
These guess the fee-payer instead of stating it explicitly
* Revert "Untar is called for shred archives that do not exist. (#9565)"
This reverts commit 729cb5eec6.
* Revert "Dont insert shred payload into rocksdb (#9366)"
This reverts commit 5ed39de8c5.
* Accounts hash consistency halting
* Add option to inject account hash faults for testing.
Enable option in local cluster test to see that node halts.
* Consensus fix, don't consider threshold check if
lockouts are not increased
* Change partition tests to wait for epoch with > lockout slots
* Use atomic bool to signal partition
* Refactor local cluster to support killing a partition
* Rework run_network_partition
* Introduce fixed leader schedule
* Plumb fixed schedule into test
* Pass blocktree into execute_batch, if persist_transaction_status
* Add validator arg to enable persistent transaction status store
* Pass blocktree into banking_stage, if persist_transaction_status
* Add validator params to bash scripts
* Expose actual transaction statuses outside Bank; add tests
* Fix benches
* Offload transaction status writes to a separate thread
* Enable persistent transaction status along with rpc service
* nudge
* Review comments
* Remove the name "blob" from archivers
* Remove the name "blob" from broadcast
* Remove the name "blob" from Cluset Info
* Remove the name "blob" from Repair
* Remove the name "blob" from a bunch more places
* Remove the name "blob" from tests and book