In kin-sim, we found that the bounded channel causes the account
background services to halt. As the number of accounts grows, the time
for pruning and cleaning increases, which leads to longer intervals
between prunings of dead bank slots. With 1.7B accounts, we will
exceed the 10K bounded channel threshold, which halts the account
background services. Without pruning, the node will eventually run out
of memory.
Tenets:
1. Limit thread names to 15 characters
2. Prefix all Solana-controlled threads with "sol"
3. Use camel case. It's more character-dense than snake or kebab case
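The three tenets above can be checked mechanically. The following is a minimal sketch, not code from this change: `is_valid_thread_name`, `spawn_named`, and the example name `solBgAccounts` are hypothetical and used only for illustration, and "camel case" is approximated as "contains no `_` or `-` separators".

```rust
use std::thread::{Builder, JoinHandle};

/// Tenet 1: thread names are limited to 15 characters.
const MAX_THREAD_NAME_LEN: usize = 15;
/// Tenet 2: every Solana-controlled thread name starts with "sol".
const THREAD_NAME_PREFIX: &str = "sol";

/// Hypothetical helper: returns true if `name` follows the three tenets.
/// Camel case (tenet 3) is approximated by rejecting snake/kebab separators.
fn is_valid_thread_name(name: &str) -> bool {
    name.len() <= MAX_THREAD_NAME_LEN
        && name.starts_with(THREAD_NAME_PREFIX)
        && !name.contains('_')
        && !name.contains('-')
}

/// Hypothetical helper: spawns a thread, refusing names that break the tenets.
fn spawn_named<F, T>(name: &str, f: F) -> std::io::Result<JoinHandle<T>>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    assert!(is_valid_thread_name(name), "bad thread name: {name}");
    Builder::new().name(name.to_string()).spawn(f)
}
```

A call such as `spawn_named("solBgAccounts", || { /* work */ })` would pass the check, while `sol_bg_accounts` (snake case) or a name without the `sol` prefix would trip the assertion.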
Prior to this change, long running commands like `solana-ledger-tool
verify` would OOM due to AccountsDb cleanup not happening.
Co-authored-by: Michael Vines <mvines@gmail.com>
* allow initial hash calc to occur in bg
* validator_initialized -> startup_verification_complete
* add infos for leader and vote
* rework snapshot for startup verification
* change to assert
* nonblocking send when dropping banks
* detect and report drop signal queue full/disconnect events
* comments
* use counter for reporting bank_drop_queue events
* reduce log
* use datapoint to report stats
* logging instead of reporting bank drop signal full
* fix a corner case for reporting
* fix build
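The nonblocking send and the full/disconnect accounting listed above can be sketched roughly as follows. This is an illustration built on the standard library's `sync_channel`, not the code in this change: `DropSignal`, `DropQueueStats`, and `send_drop_signal` are hypothetical names, and the real signal type and reporting path may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical stand-in for the signal sent when a bank is dropped.
type DropSignal = u64;

/// Hypothetical counters for bank drop queue events.
#[derive(Default)]
struct DropQueueStats {
    full: AtomicUsize,
    disconnected: AtomicUsize,
}

/// Tries to enqueue a drop signal without blocking the caller.
/// Returns true if the signal was queued; otherwise the matching
/// counter is bumped so the event can be reported later.
fn send_drop_signal(
    sender: &SyncSender<DropSignal>,
    signal: DropSignal,
    stats: &DropQueueStats,
) -> bool {
    match sender.try_send(signal) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            // Queue is at capacity: record it instead of blocking.
            stats.full.fetch_add(1, Ordering::Relaxed);
            false
        }
        Err(TrySendError::Disconnected(_)) => {
            // Receiver is gone: record it; nothing more can be sent.
            stats.disconnected.fetch_add(1, Ordering::Relaxed);
            false
        }
    }
}
```

Counting events rather than logging each one keeps the drop path cheap; a periodic reporter could read and reset the counters.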
AccountsBackgroundService now knows about incremental snapshots. It is
now also in charge of deciding if an AccountsPackage is destined to be a
SnapshotPackage or not (or just used by AccountsHashVerifier).
!!! New behavior changes !!!
Taking snapshots (both bank and archive) **MUST** succeed.
This is required because of how the last full snapshot slot is
calculated, which is used by AccountsBackgroundService when calling
`clean_accounts()`.
File system calls are now unwrapped and will result in a crash. As Trent told me:
>Well I think if a snapshot fails due to some IO error, it's very likely that the operator is going to have to intervene before it works. We should exit error in this case, otherwise the validator might happily spin for several more hours, never successfully writing a complete snapshot, before something else brings it down. This would leave the validator's last local snapshot many more slots behind than it would be had we exited outright and potentially force the operator to abandon ledger continuity in favor of a quick catchup
Other errors will set the `exit` flag to `true`, and the node will gracefully shut down.
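The error policy described above (crash on file system errors, graceful shutdown on everything else) might look like the sketch below. The `SnapshotError` enum and `handle_snapshot_result` function are hypothetical and exist only to illustrate the two paths; the actual change unwraps file system calls at their call sites.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type separating IO failures from everything else.
#[derive(Debug)]
enum SnapshotError {
    Io(std::io::Error),
    Other(String),
}

/// Applies the policy: IO errors crash immediately, other errors
/// request a graceful shutdown by setting the shared `exit` flag.
fn handle_snapshot_result(result: Result<(), SnapshotError>, exit: &AtomicBool) {
    match result {
        Ok(()) => {}
        Err(SnapshotError::Io(err)) => {
            // An operator will likely need to intervene, so fail fast.
            panic!("snapshot failed with IO error: {err}");
        }
        Err(SnapshotError::Other(msg)) => {
            eprintln!("snapshot failed: {msg}; shutting down");
            exit.store(true, Ordering::Relaxed);
        }
    }
}
```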
Fixes #19167
Fixes #19168
#### Problem
Snapshot names are overloaded, and there are multiple terms that mean the same thing. This is confusing. Here's a list of ones in the codebase that I've found:
```
- snapshot_dir
- snapshots_dir
- snapshot_path
- snapshot_output_dir
- snapshot_package_output_path
- snapshot_archives_dir
```
#### Summary of Changes
For all the ones that are about the directory where snapshot archives are stored, ensure they are `snapshot_archives_dir`. For the ones about the (bank) snapshots directory, set to `bank_snapshots_dir`.
Co-authored-by: Michael Vines <mvines@gmail.com>
* Handle cleaning zero-lamport accounts
Handle cleaning zero-lamport accounts in slots higher than the last full
snapshot slot. This is part of the Incremental Snapshot work.
Fixes #18825
This commit adds high-level functions for creating and loading-from
incremental snapshots, plus all low-level functions required to perform
those tasks. This commit **does not** add taking incremental snapshots
as part of a running validator, nor starting up a node with an
incremental snapshot; it only lays the groundwork.
Additionally, `snapshot_utils` and `serde_snapshot` have been
refactored to use common code paths for the different snapshots.
Also of note, some renaming has happened:
1. Snapshots are now either `full_` or `incremental_` throughout the
codebase. If not specified, the code applies to both.
2. Bank snapshots now are called "bank snapshots"
(before they were called "slot snapshots", "bank snapshots", or
just "snapshots"). The one exception is within `Bank`, where they
are still just "snapshots", because they are already "bank
snapshots".
3. Snapshot archives now have `_archive` in the code. This
should clear up an ambiguity between bank snapshots and snapshot
archives.
* Account for possibility of cache flush in load()
* More cleaning
* More cleaning
* Remove unused method and some comment cleaning
* Fix typo
* Make the detected impossible purge race panic()!
* Finally revert to original .expect()
* Fix typos...
* Add assertion for max_root for easier reasoning
* Reframe races with LoadHint as possible opt.
* Fix test
* Make race bug tests run longer to be less flaky
* Delay the clone-in-lock slow path even for RPC
* Make get_account panic-free & add its onchain ver.
* Fix rebase conflicts...
* Clean up
* Clean up comment
* Revert fn name change
* Fix flaky test...
* fmt...
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>