Currently each outgoing repair request will attempt to establish a
connection if one does not already exist. This is very wasteful and
consumes many tokio tasks if the remote node is down or unresponsive.
The commit decouples routing packets from establishing connections by
adding a buffering channel for each remote address. Outgoing packets are
always sent down this channel to be processed once the connection is
established. If the connection attempt fails, all packets already pushed to
the channel are dropped at once, reducing the number of attempts to make
a connection if the remote node is down or unresponsive.
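A minimal synchronous sketch of the buffering idea, using only std
channels rather than tokio. Router, route and finish_connect are
hypothetical names for illustration and are not the names used in the
commit:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::mpsc::{channel, Receiver, SendError, Sender};

/// Hypothetical router holding one buffering channel per remote address.
struct Router {
    senders: HashMap<SocketAddr, Sender<Vec<u8>>>,
}

impl Router {
    fn new() -> Self {
        Self { senders: HashMap::new() }
    }

    /// Pushes a packet into the channel for `addr` without connecting.
    /// Returns the receiving end only when a new channel had to be created,
    /// so the caller starts exactly one connection attempt per channel.
    fn route(&mut self, addr: SocketAddr, packet: Vec<u8>) -> Option<Receiver<Vec<u8>>> {
        let packet = match self.senders.get(&addr) {
            Some(sender) => match sender.send(packet) {
                Ok(()) => return None,
                // The receiver was dropped by a failed attempt; recover the packet.
                Err(SendError(packet)) => packet,
            },
            None => packet,
        };
        let (sender, receiver) = channel();
        sender.send(packet).expect("receiver is still in scope");
        self.senders.insert(addr, sender);
        Some(receiver)
    }
}

/// Outcome of the single connection attempt. On success the buffered packets
/// are drained for sending; on failure the receiver is dropped, which
/// discards every buffered packet at once.
fn finish_connect(connected: bool, receiver: Receiver<Vec<u8>>) -> Vec<Vec<u8>> {
    if connected {
        receiver.try_iter().collect()
    } else {
        drop(receiver);
        Vec::new()
    }
}
```

In this sketch, any number of packets routed while an attempt is pending
share that one attempt; a new attempt only starts with the first packet
routed after a failure.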
The current getHealth mechanism checks a local accounts hash slot vs.
those of other nodes as specified by --known-validator. This is a
very coarse comparison given that the default for this value is 100
slots. Moreover, any node using a value larger than the default
(e.g. --incremental-snapshot-interval 500) will likely see getHealth
return status behind at some point.
Change the underlying mechanism of how health is computed. Instead of
using the accounts hash slots published in gossip, use the latest
optimistically confirmed slot from the cluster. Even when a node is
behind, it is able to observe cluster optimistically confirmed slots
by viewing votes published in gossip.
Thus, the latest cluster optimistically confirmed slot can be compared
against the latest optimistically confirmed bank from replay to
determine health. This new comparison is much more granular, and not
needing to depend on individual known validators is also a plus.
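The comparison described above can be sketched as follows. The names
Health, check_health and health_check_slot_distance are assumptions made
for illustration, and the threshold is a caller-supplied parameter
rather than a value taken from this change:

```rust
type Slot = u64;

#[derive(Debug, PartialEq, Eq)]
enum Health {
    Ok,
    Behind { num_slots: Slot },
}

/// Compares the latest optimistically confirmed slot seen for the cluster
/// (from gossip votes) against the latest optimistically confirmed slot
/// that local replay has reached.
fn check_health(
    cluster_confirmed: Slot,
    local_confirmed: Slot,
    health_check_slot_distance: Slot, // hypothetical tolerance, in slots
) -> Health {
    // A node that is level with or ahead of its gossip view is zero slots behind.
    let num_slots = cluster_confirmed.saturating_sub(local_confirmed);
    if num_slots <= health_check_slot_distance {
        Health::Ok
    } else {
        Health::Behind { num_slots }
    }
}
```

Because both inputs advance every slot, the result changes at slot
granularity instead of at snapshot-interval granularity.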
* Enable frozen_abi on banking trace file
* Fix ci with really correct bugfix...
* Remove tracker_callers
* Fix typo...
* Fix AbiExample for Arc/Rc's Weaks
* Added comment for AbiExample impl of SystemTime
* Simplify and document EvenAsOpaque with new usage
* Minor clean-ups
* Simplify SystemTime::example() with UNIX_EPOCH...
* Add comment for AbiExample subtleties
* Add wen_restart module:
- Implement reading LastVotedForkSlots from blockstore.
- Add proto file to record the intermediate results.
- Also link wen_restart into validator.
- Move recreation of tower outside replay_stage so we can get last_vote.
* Update lock file.
* Fix linter errors.
* Fix dependencies order.
* Update wen_restart explanation and small fixes.
* Generate tower outside tvu.
* Update validator/src/cli.rs
Co-authored-by: Tyera <teulberg@gmail.com>
* Update wen-restart/protos/wen_restart.proto
Co-authored-by: Tyera <teulberg@gmail.com>
* Update wen-restart/build.rs
Co-authored-by: Tyera <teulberg@gmail.com>
* Update wen-restart/src/wen_restart.rs
Co-authored-by: Tyera <teulberg@gmail.com>
* Rename proto directory.
* Rename InitRecord to MyLastVotedForkSlots, add imports.
* Update wen-restart/Cargo.toml
Co-authored-by: Tyera <teulberg@gmail.com>
* Update wen-restart/src/wen_restart.rs
Co-authored-by: Tyera <teulberg@gmail.com>
* Move prost-build dependency to project toml.
* No need to continue if the distance between slot and last_vote is
already larger than MAX_SLOTS_ON_VOTED_FORKS.
* Use 16k slots instead of 81k slots, a few more wording changes.
* Use AncestorIterator which does the same thing.
* Update Cargo.lock
* Update Cargo.lock
---------
Co-authored-by: Tyera <teulberg@gmail.com>
* Separate simple-vote transaction cost from non-vote transaction cost
* remove is_simple_vote flag from transaction UsageCostDetails
* update test and comment
* set static usage cost for SimpleVote transaction
* Move vote related code to its own crate
* Update imports in code and tests
* update programs/sbf/Cargo.lock
* fix check errors
* update abi_digest
* rebase fixes
* fixes after rebase
* Adds a module `address_lookup_table` to the SDK.
* Adds a module `address_lookup_table::instruction` to the SDK.
* Adds a module `address_lookup_table::error` to the SDK.
* Adds a module `address_lookup_table::state` to the SDK.
* Moves AddressLookupTable into SDK as well.
* Moves AddressLookupTableAccount into address_lookup_table.
* Adds deprecation messages.
* Disentangles dependencies across cargo files.
The commit implements server-side of repair using QUIC protocol.
UDP repair requests are adapted as RemoteRequest and sent down the same
channel as remote requests arriving over QUIC, and the rest of the
server code is updated to process the RemoteRequest type.
Working towards using QUIC protocol for repair, the commit adds a QUIC
endpoint for repair service.
Outgoing local requests are sent as
struct LocalRequest {
remote_address: SocketAddr,
bytes: Vec<u8>,
num_expected_responses: usize,
response_sender: Sender<(SocketAddr, Vec<u8>)>,
}
to the client-side of the endpoint. The client opens a bidirectional
stream with the LocalRequest.remote_address and, once the response is
received, sends it down the LocalRequest.response_sender channel.
Incoming requests from remote nodes are received from bidirectional
streams and sent as
struct RemoteRequest {
remote_pubkey: Option<Pubkey>,
remote_address: SocketAddr,
bytes: Vec<u8>,
response_sender: Option<OneShotSender<Vec<Vec<u8>>>>,
}
to the repair-service. The response is received from the receiver end of
RemoteRequest.response_sender channel and sent back to the remote node
using the send side of the bidirectional stream.
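A rough sketch of how one handler might serve both transports through
RemoteRequest. Everything beyond the struct fields is an assumption:
Pubkey and OneShotSender are std stand-ins, from_udp and handle are
hypothetical helpers, and treating a None response_sender as "reply over
the UDP socket" is a guess based on the Option types:

```rust
use std::net::SocketAddr;
use std::sync::mpsc::Sender;

// Stand-ins so the sketch compiles with std only (assumptions).
type Pubkey = [u8; 32];
type OneShotSender<T> = Sender<T>;

struct RemoteRequest {
    remote_pubkey: Option<Pubkey>,
    remote_address: SocketAddr,
    bytes: Vec<u8>,
    response_sender: Option<OneShotSender<Vec<Vec<u8>>>>,
}

/// Adapts a request received over UDP: no authenticated pubkey and no
/// stream to respond on.
fn from_udp(remote_address: SocketAddr, bytes: Vec<u8>) -> RemoteRequest {
    RemoteRequest {
        remote_pubkey: None,
        remote_address,
        bytes,
        response_sender: None,
    }
}

/// Hypothetical handler that echoes the request bytes as a single response.
/// QUIC responses go down the response channel; UDP responses are returned
/// to the caller together with the address to send them to.
fn handle(request: RemoteRequest) -> Option<(SocketAddr, Vec<Vec<u8>>)> {
    let responses = vec![request.bytes];
    match request.response_sender {
        Some(sender) => {
            // The remote may have closed the stream; ignore a failed send.
            let _ = sender.send(responses);
            None
        }
        None => Some((request.remote_address, responses)),
    }
}
```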
removes outdated matches crate from the dependencies
std::matches! has been stable since Rust 1.42.0.
Other use-cases are covered by assert_matches crate.
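For reference, a small example of the standard library macro replacing
the crate. The Status enum and the 100-slot cutoff are made up for
illustration:

```rust
#[derive(Debug)]
enum Status {
    Ok,
    Behind { num_slots: u64 },
}

/// `matches!` comes from std's prelude, so no external crate is needed.
/// It also accepts an `if` guard after the pattern.
fn is_far_behind(status: &Status) -> bool {
    matches!(status, Status::Behind { num_slots } if *num_slots > 100)
}
```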
This function used to contain feature gate activation checks that
required access to a bank. Those checks have been cleaned up, so we no
longer need access to a full Bank. Rather, we can momentarily get a Bank
from BankForks, calculate the necessary results and then drop the Bank
along with the BankForks read lock.
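A simplified sketch of the pattern, with stand-in Bank and BankForks
types; the fields and the root_slot_and_epoch helper are hypothetical:

```rust
use std::sync::{Arc, RwLock};

// Simplified stand-ins for the real types (assumptions).
struct Bank {
    slot: u64,
    epoch: u64,
}

struct BankForks {
    root_bank: Arc<Bank>,
}

/// Takes the read lock, grabs a Bank, computes what is needed, and returns
/// plain values so that neither the Bank nor the lock outlives the call.
fn root_slot_and_epoch(bank_forks: &RwLock<BankForks>) -> (u64, u64) {
    let guard = bank_forks.read().unwrap();
    let bank = Arc::clone(&guard.root_bank);
    (bank.slot, bank.epoch)
    // `bank` and then `guard` are dropped here, releasing the read lock.
}
```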
* allow pedantic invalid cast lint
* allow lint with false-positive triggered by `test-case` crate
* nightly `fmt` correction
* adapt to rust layout changes
* remove dubious test
* Use transmute instead of pointer cast and de/ref when check_aligned is false.
* Renames clippy::integer_arithmetic to clippy::arithmetic_side_effects.
* bump rust nightly to 2023-08-25
* Upgrades Rust to 1.72.0
---------
Co-authored-by: Trent Nelson <trent@solana.com>
- BankForks is not an optional argument, so remove dated comment
- Given that BankForks is always present, no need for special values to
initialize variables before the loop
- Root slot can be retrieved from root bank, no need to call
BankForks::root() which will load the underlying atomic a second time
- Use BankForks::highest_slot() instead of .slot() on .working_bank() to
avoid the extra clone that .working_bank() performs
- Move several operations outside of BankForks read lock scope to
minimize lock time
* remove unnecessary hashes around raw string literals
* remove unnecessary literal `unwrap()`s
* remove panicking `unwrap()`
* remove unnecessary `unwrap()`
* use `[]` instead of `vec![]` where applicable
* remove (more) unnecessary explicit `into_iter()` calls
* remove redundant pattern matching
* don't cast to same type and constness
* do not `cfg(any(...` a single item
* remove needless pass by `&mut`
* prefer `or_default()` to `or_insert_with(T::default())`
* `filter_map()` better written as `filter()`
* incorrect `PartialOrd` impl on `Ord` type
* replace "slow zero-filled `Vec` initializations"
* remove redundant local bindings
* add required lifetime to associated constant
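A few of the idioms from the list above, shown on made-up functions; the
code is illustrative and not taken from the changed files:

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

fn count_words(words: &[&str]) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in words {
        // `or_default()` rather than `or_insert_with(usize::default)`.
        *counts.entry(word.to_string()).or_default() += 1;
    }
    counts
}

fn zeroed(len: usize) -> Vec<u8> {
    // `vec![0; len]` rather than `Vec::with_capacity(len)` plus `resize(len, 0)`.
    vec![0; len]
}

fn evens(values: &[u32]) -> Vec<u32> {
    // `filter()` rather than a `filter_map()` that only returns Some(v) or None.
    values.iter().copied().filter(|v| *v % 2 == 0).collect()
}

#[derive(PartialEq, Eq)]
struct Stake(u64);

impl Ord for Stake {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

impl PartialOrd for Stake {
    // On an `Ord` type, `partial_cmp` should simply defer to `cmp`.
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```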
* sdk: Add concurrent support for rand 0.7 and 0.8
* Update rand, rand_chacha, and getrandom versions
* Run command to replace `gen_range`
Run `git grep -l gen_range | xargs sed -i'' -e 's/gen_range(\(\S*\), /gen_range(\1../'`
* sdk: Fix users of older `gen_range`
* Replace `hash::new_rand` with `hash::new_with_thread_rng`
Run:
```
git grep -l hash::new_rand | xargs sed -i'' -e 's/hash::new_rand([^)]*/hash::new_with_thread_rng(/'
```
* perf: Use `Keypair::new()` instead of `generate`
* Use older rand version in zk-token-sdk
* program-runtime: Inline random key generation
* bloom: Fix clippy warnings in tests
* streamer: Scope rng usage correctly
* perf: Fix clippy warning
* accounts-db: Map to char to generate a random string
* Remove `from_secret_key_bytes`, it's just `keypair_from_seed`
* ledger: Generate keypairs by hand
* ed25519-tests: Use new rand
* runtime: Use new rand in all tests
* gossip: Clean up clippy and inline keypair generators
* core: Inline keypair generation for tests
* Push sbf lockfile change
* sdk: Sort dependencies correctly
* Remove `hash::new_with_thread_rng`, use `Hash::new_unique()`
* Use Keypair::new where chacha isn't used
* sdk: Fix build by marking rand 0.7 optional
* Hardcode secret key length, add static assertion
* Unify `getrandom` crate usage to fix linking errors
* bloom: Fix tests that require a random hash
* Remove some dependencies, try to unify others
* Remove unnecessary uses of rand and rand_core
* Update lockfiles
* Add back some dependencies to reduce rebuilds
* Increase max rebuilds from 14 to 15
* frozen-abi: Remove `getrandom`
* Bump rebuilds to 17
* Remove getrandom from zk-token-proof
In most cases, either a &Bank or an Arc<Bank> is more proper.
- &Bank is used if the function only needs a momentary reference
- Arc<Bank> is used if the function needs its own copy
This PR leaves several instances of &Arc<Bank> around; these instances
are situations where a clone may only happen conditionally.
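The three cases can be illustrated with a stand-in Bank type; the
functions below are hypothetical examples, not code from this PR:

```rust
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Simplified stand-in for the real Bank (assumption).
struct Bank {
    slot: u64,
}

/// Only needs a momentary look at the bank: take `&Bank`.
fn slot_of(bank: &Bank) -> u64 {
    bank.slot
}

/// Needs its own copy that outlives the caller's borrow: take `Arc<Bank>`.
fn spawn_reader(bank: Arc<Bank>) -> JoinHandle<u64> {
    thread::spawn(move || bank.slot)
}

/// Clones only conditionally: `&Arc<Bank>` avoids a refcount bump on the
/// path that does not keep the bank.
fn maybe_retain(bank: &Arc<Bank>, retained: &mut Vec<Arc<Bank>>) {
    if bank.slot % 2 == 0 {
        retained.push(Arc::clone(bank));
    }
}
```

A caller holding an Arc<Bank> can pass `&bank` to slot_of directly,
since `&Arc<Bank>` deref-coerces to `&Bank`.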
When a consensus divergence occurs, the current workflow involves a
handful of manual steps to home in on the offending slot and
transaction. This process isn't overly difficult to execute; however, it
is tedious and currently involves creating and parsing logs.
This change introduces functionality to output a debug file that
contains the components that go into the bank hash. The file can be generated
in two ways:
- Via solana-validator when the node realizes it has diverged
- Via solana-ledger-tool verify by passing a flag
When a divergence occurs now, the steps to debug would be:
- Grab the file from the node that diverged
- Generate a file for the same slot with ledger-tool with a known good
version
- Diff the files; they are pretty-printed JSON
Some of the cleanup tasks include ...
- Make subfunctions return a Result and allow error handling above
- Add some clarifying comments
- Give the backup directory a more meaningful name
- Add some additional logs (with timing info) for long running parts
The existing signature unpacked elements from a Shred and took an owned
Vec<u8>, forcing a .clone() from the caller. The Shred can be passed in
directly to simplify argument list and avoid the clone.
* separates out turbine QUIC from TPU implementation
Turbine being tied to QUIC implementation for TPU hinders development
and makes it hard to optimize QUIC specifically for turbine.
The commit separates out turbine QUIC from TPU implementation.
* Update core/src/validator.rs
Co-authored-by: Jon Cinque <me@jonc.dev>
* Update turbine/src/retransmit_stage.rs
Co-authored-by: Jon Cinque <me@jonc.dev>
---------
Co-authored-by: Jon Cinque <me@jonc.dev>
* Move CostModel and CostTracker to their own crate
* compile new crate and update imports
* update sbf Cargo.lock
* fix AbiExample
* fix cargo sort
* Fix AbiExample
The optional args allow reuse by ledger-tool repair roots command. Also,
hold cleanup lock for duration of Blockstore::scan_and_fix_roots().
This prevents a scenario where scan_and_fix_roots() identifies a
slot as needing to be marked root, that slot gets cleaned by
LedgerCleanupService, and then scan_and_fix_roots() marks the now
purged slot as root.
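A toy model of why the lock must span both the scan and the fix. The
Blockstore below is a stand-in with a single mutex guarding both sets,
which is a simplification and not the real column-family layout; all
method bodies are assumptions:

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

struct Inner {
    slots: BTreeSet<u64>,
    roots: BTreeSet<u64>,
}

/// Toy blockstore whose mutex plays the role of the cleanup lock.
struct Blockstore {
    inner: Mutex<Inner>,
}

impl Blockstore {
    fn new(slots: &[u64]) -> Self {
        let inner = Inner {
            slots: slots.iter().copied().collect(),
            roots: BTreeSet::new(),
        };
        Self { inner: Mutex::new(inner) }
    }

    /// Identifies and marks missing roots under one lock acquisition, so a
    /// purge cannot run between the two steps.
    fn scan_and_fix_roots(&self, should_be_rooted: &[u64]) -> usize {
        let mut inner = self.inner.lock().unwrap();
        let missing: Vec<u64> = should_be_rooted
            .iter()
            .copied()
            .filter(|slot| inner.slots.contains(slot) && !inner.roots.contains(slot))
            .collect();
        for slot in &missing {
            inner.roots.insert(*slot);
        }
        missing.len()
    }

    /// Stand-in for the cleanup service: takes the same lock before purging.
    fn purge_below(&self, lowest: u64) {
        let mut inner = self.inner.lock().unwrap();
        inner.slots.retain(|slot| *slot >= lowest);
        inner.roots.retain(|slot| *slot >= lowest);
    }

    fn roots(&self) -> Vec<u64> {
        self.inner.lock().unwrap().roots.iter().copied().collect()
    }
}
```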
* When there are too many pubkeys in one slot, kick the one with lowest
stake out.
* Cache last_root to reduce read locks we need.
* Use slots_in_epoch to limit number of slots in the map.
* Fix lint errors.
* Only cache stake and slots per epoch once per epoch.
* Revert "Only cache stake and slots per epoch once per epoch."
This reverts commit 8658aad0083456794b4c4403adaf9c74d1a71d09.
* Vote at the tip of current fork if last vote is outside SlotHash
of the tip and last vote expired.
* Add unittest when last vote is outside slothash, we should vote at the tip
of the current fork.
* Revert "Use slots_in_epoch to limit number of slots in the map."
This reverts commit 93574f57a48d2a70fbbc0f62fa8810d3b6bee0af.
* Revert "Cache last_root to reduce read locks we need."
This reverts commit bb114ec2b62cb9c0207328b19c415f6116be0f1c.
* Revert "When there are too many pubkeys in one slot, kick the one with lowest"
This reverts commit 711e29a6a025fd4f11fbc97dcbbe90e4832be04c.
* Move new vote generation when last vote is outside slothash into the
main path, this actually makes more sense since we don't select where
to vote in two different places, and all the vote generation logic
is seamlessly inherited.
* - Move vote refresh to be behind select vote and do not refresh vote if a new
vote is selected.
- Check whether last vote is inside slothash inside select_vote_and_reset_forks
- rename slot_within_slothash to is_in_slothashes_history
- remove one unittest for now, more tests will be added in a separate CL
* Remove new test, it will be in another file.
* Add is_in_slot_hashes_history test in the new file.
* Add unittest for the case when last vote is outside slot hashes.
* Small improvements and more unittests.
* Fix bad merge.
* Update docs/src/terminology.md
Co-authored-by: mvines <mvines@gmail.com>
* Put SwitchForkDecision::FailedSwitchThreshold logic into separate function.
* Make linter happy.
---------
Co-authored-by: mvines <mvines@gmail.com>
Slot::MAX was used to specify that a type of snapshots should not be
created; define a constant to be that value and reference the constant
to have a single point of edit.
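A sketch of the idea; the constant name and the should_take_snapshot
helper are hypothetical and may not match the names in the change:

```rust
type Slot = u64;

/// Hypothetical name: the single place that says "never create this kind
/// of snapshot".
const DISABLED_SNAPSHOT_ARCHIVE_INTERVAL: Slot = Slot::MAX;

/// Hypothetical check that a snapshot is due at `slot`.
fn should_take_snapshot(slot: Slot, interval: Slot) -> bool {
    interval != DISABLED_SNAPSHOT_ARCHIVE_INTERVAL
        && interval != 0 // guard the modulo below against division by zero
        && slot % interval == 0
}
```

Comparing against the constant explicitly also avoids relying on the
modulo, which would otherwise report a snapshot as due at slot 0.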
* Restrict access to Bank's HardForks
Callers could previously obtain a lock to read/write HardForks from
any Bank. This would allow any caller to modify, and creates the
opportunity for inconsistent handling of what is considered a valid hard
fork (i.e. too old).
This PR adds a function to Bank so consistent sanity checks can be
applied; the caller will already have a Bank as that is where they would
have obtained the HardForks from in the first place. Additionally,
change the getter to return a copy of HardForks (simple Vec).
* Allow hard fork at bank slot if bank is not yet frozen
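A simplified sketch of the resulting shape. The stand-in Bank, the
method names, and the exact "too old" rule are assumptions pieced
together from the description above:

```rust
use std::sync::RwLock;

type Slot = u64;

// Simplified stand-in for the real Bank (assumption).
struct Bank {
    slot: Slot,
    frozen: bool,
    hard_forks: RwLock<Vec<Slot>>,
}

impl Bank {
    fn new(slot: Slot) -> Self {
        Self { slot, frozen: false, hard_forks: RwLock::new(Vec::new()) }
    }

    fn freeze(&mut self) {
        self.frozen = true;
    }

    /// The only way to add a hard fork, so the sanity check always runs.
    /// A fork at the bank's own slot is accepted while the bank is unfrozen.
    fn register_hard_fork(&self, new_slot: Slot) -> bool {
        let too_old = new_slot < self.slot || (new_slot == self.slot && self.frozen);
        if too_old {
            return false;
        }
        let mut forks = self.hard_forks.write().unwrap();
        if !forks.contains(&new_slot) {
            forks.push(new_slot);
            forks.sort_unstable();
        }
        true
    }

    /// Returns a copy, so callers cannot mutate the bank's state through it.
    fn hard_forks(&self) -> Vec<Slot> {
        self.hard_forks.read().unwrap().clone()
    }
}
```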
The core/src/ directory is already pretty crowded, and moving these
items into the subdirectory more clearly identifies that they are tied
to banking_stage.
* Add TpuEntryNotifier to send EntryNotifications from Tpu
* Optionally run TpuEntryNotifier to send out EntrySummarys alongside BroadcastStage messages
* Track entry index in TpuEntryNotifier
* Allow for leader slots that switch forks
* Exit if broadcast send fails
`Arc` is already a reference internally, so it does not seem to be
beneficial to pass a reference to it. Just adds an extra layer of
indirection.
Functions that need to be able to increment `Arc` reference count need
to take `Arc<AtomicBool>`, but those that just want to read the
`AtomicBool` value can accept `&AtomicBool`, making them a bit more
generic.
This change focuses specifically on `Arc<AtomicBool>`. There are other
uses of `&Arc<T>` in the code base that could be converted in a similar
manner. But it would make the change even larger.
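The distinction can be illustrated as follows; should_exit and
spawn_worker are hypothetical functions, not ones touched by this
change:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Only reads the flag, so `&AtomicBool` is enough. It works for a flag
/// inside an `Arc`, on the stack, or in a struct field.
fn should_exit(exit: &AtomicBool) -> bool {
    exit.load(Ordering::Relaxed)
}

/// Moves the flag into a thread, so it needs shared ownership and takes
/// `Arc<AtomicBool>` by value.
fn spawn_worker(exit: Arc<AtomicBool>) -> JoinHandle<u64> {
    thread::spawn(move || {
        let mut iterations = 0;
        // `&exit` deref-coerces from `&Arc<AtomicBool>` to `&AtomicBool`.
        while !should_exit(&exit) {
            iterations += 1;
            thread::yield_now();
        }
        iterations
    })
}
```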
The callstack updated in this PR passed an &Arc<...> down only to have
the bottom level clone the reference. Thus, we are giving shared
ownership so the reference is a bit redundant and arguably obscures the
intention to clone further down the callstack.
* Notify replay of pruned duplicate confirmed slots
* Ingest replay signal and run ancestor hashes for pruned
* Forward PDC to ancestor hashes and ingest pruned dumps from ancestor hashes service
* Add local-cluster test
* pass include_slot_in_hash through hash calcs to allow rehashing
* tests use each include_slot_in_hash value
* move include_slot_in_hash
* typo
* reorder struct init
* spelling is hard
* Move entry_notifier_interface
* Add EntryNotifierService
* Use descriptive struct in sender/receiver
* Optionally initialize EntryNotifierService in validator
* Plumb EntryNotifierSender into Tvu, blockstore_processor
* Plumb EntryNotifierSender into Tpu
* Only return one option when constructing EntryNotifierService
Counters incur additional overhead in sending points to the MetricsAgent
over a crossbeam channel. Additionally, some of these counters would be
submitted by non-voting nodes which is just extra overhead and noise.
This change condenses several updates of a counter into a field of the
existing BankingStageStats metrics struct.
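A rough sketch of the difference in shape: instead of submitting a point
on every update, updates accumulate in a field and are reported once per
interval. The field and method names are hypothetical, and this
BankingStageStats is a stand-in for the real struct:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in for the existing metrics struct; the field is hypothetical.
#[derive(Default)]
struct BankingStageStats {
    dropped_packets_count: AtomicUsize,
}

impl BankingStageStats {
    /// Cheap in-memory update; nothing is sent over a channel here.
    fn record_dropped(&self, count: usize) {
        self.dropped_packets_count.fetch_add(count, Ordering::Relaxed);
    }

    /// Called once per reporting interval: returns the accumulated value
    /// and resets the field for the next interval.
    fn report(&self) -> usize {
        self.dropped_packets_count.swap(0, Ordering::Relaxed)
    }
}
```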
replay_stage-voted_empty_bank has been converted into a datapoint that
now includes slot number. replay_stage-replay_transactions has been
removed altogether as we can get similar information on a per-slot basis
from replay-slot-stats metric.