* Add stats and counter around cost model ops, mainly:
- calculate transaction cost
- check transaction can fit in a block
- update block cost tracker after transactions are added to block
- replay_stage to update/insert execution cost to table
* Change mutex on cost_tracker to RwLock
* removed cloning cost_tracker for local use, as the metrics show clone is very expensive.
* acquire and hold locks for block of TXs, instead of acquire and release per transaction;
* remove redundant would_fit check from cost_tracker update execution path
* refactor cost checking with less frequent lock acquiring
* avoid many Transaction_cost heap allocation when calculate cost, which
is in the hot path - executed per transaction.
* create hashmap with new_capacity to reduce runtime heap realloc.
* code review changes: categorize stats, replace explicit drop calls, concisely initiate to default
* address potential deadlock by acquiring locks one at time
When starting a validator, the node initially joins gossip with
shred_verison = 0, until it adopts the entrypoint's shred-version:
https://github.com/solana-labs/solana/blob/9b182f408/validator/src/main.rs#L417
Depending on the load on the entrypoint, this adopting entrypoint
shred-version through gossip sometimes becomes very slow, and causes
several problems in gossip because we have to partially support
shred_version == 0 which is a source of leaking crds values from one
cluster to another. e.g. see
https://github.com/solana-labs/solana/pull/17899
and the other linked issues there.
In order to remove shred_version == 0 from gossip, this commit adds
shred-version to ip-echo-server response. Once the entrypoints are
updated, on validator start-up, if --expected_shred_version is not
specified we will obtain shred-version from the entrypoint using
ip-echo-server.
1. Added both options for measuring space usage using total accounts usage and for individual store shrink ratio using an enum. Validator CLI options: --accounts-shrink-optimize-total-space and --accounts-shrink-ratio
2. Added code for selecting candidates based on total usage in a separate function select_candidates_by_total_usage
3. Added unit tests for the new functions added
4. The default implementations is kept at 0.8 shrink ratio with --accounts-shrink-optimize-total-space set to true
Fixes#17544
* replay stage feeds back realtime per-program execution cost to cost model;
* program cost execution table is initialized into empty table, no longer populated with hardcoded numbers;
* changed cost unit to microsecond, using value collected from mainnet;
* add ExecuteCostTable with fixed capacity for security concern, when its limit is reached, programs with old age AND less occurrence will be pushed out to make room for new programs.
* Create solana-poh crate
* Move BigTableUploadService to solana-ledger
* Add solana-rpc to workspace
* Move dependencies to solana-rpc
* Move remaining rpc modules to solana-rpc
* Single use statement solana-poh
* Single use statement solana-rpc
* Accounts dumping logic
* Add test for interaction between cache flush and remove_unrooted_slot()
* Update comments
* Rename
* renaming
* Add more comments
* Renaming
* Fixup test and bad check
* * Add following to banking_stage:
1. CostModel as immutable ref shared between threads, to provide estimated cost for transactions.
2. CostTracker which is shared between threads, tracks transaction costs for each block.
* replace hard coded program ID with id() calls
* Add Account Access Cost as part of TransactionCost. Account Access cost are weighted differently between read and write, signed and non-signed.
* Establish instruction_execution_cost_table, add function to update or insert instruction cost, unit tested. It is read-only for now; it allows Replay to insert realtime instruction execution costs to the table.
* add test for cost_tracker atomically try_add operation, serves as safety guard for future changes
* check cost against local copy of cost_tracker, return transactions that would exceed limit as unprocessed transaction to be buffered; only apply bank processed transactions cost to tracker;
* bencher to new banking_stage with max cost limit to allow cost model being hit consistently during bench iterations
* Update rocksdb to v0.16.0
* Promote the infrequent and important log to info!
* Force background compaction by ttl without manual compaction
* Fix test
* Support no compaction mode in test_ledger_cleanup_compaction
* Fix comment
* Make compaction_interval customizable
* Avoid major compaction with periodic filtering...
* Adress lazy_static, special cfs and range check
* Clean up a bit and add comment
* Add comment
* More comments...
* Config code cleanup
* Add comment
* Use .conflicts_with()
* Nullify unneeded delete_range ops for special CFs
* Some clean ups
* Clarify the locking intention
* Ensure special CFs' consistency with PurgeType::CompactionFilter
* Fix comment
* Fix bad copy paste
* Fix various types...
* Don't use tuples
* Add a unit test for compaction_filter
* Fix typo...
* Remove flag and just use new behavior always
* Fix wrong condition negation...
* Doc. about no set_last_purged_slot in purge_slots
* Write a test and fix off-by-one bug....
* Apply suggestions from code review
Co-authored-by: Tyera Eulberg <teulberg@gmail.com>
* Follow up to github review suggestions
* Fix line-wrapping
* Fix conflict
Co-authored-by: Tyera Eulberg <teulberg@gmail.com>
* Add BlockHeight CF to blockstore
* Rename CacheBlockTimeService to be more general
* Cache block-height using service
* Fixup previous proto mishandling
* Add block_height to block structs
* Add block-height to solana block
* Fallback to BankForks if block time or block height are not yet written to Blockstore
* Add docs
* Review comments
* Move gossip modules to solana-gossip
* Update Protocol abi digest due to move
* Move gossip benches and hook up CI
* Remove unneeded Result entries
* Single use statements
* Add blockstore-root-scan for api nodes on boot
* Ensure cluster-confirmed root and parents are set as root in blockstore in load_frozen_forks()
* Plumb rpc-scan-and-fix-roots validator flag
For all code paths (gossip push, pull, purge, etc) that remove or
override a crds value, it is necessary to record hash of values purged
from crds table, in order to exclude them from subsequent pull-requests;
otherwise the next pull request will likely return outdated values,
wasting bandwidth:
https://github.com/solana-labs/solana/blob/ed51cde37/core/src/crds_gossip_pull.rs#L486-L491
Currently this is done all over the place in multiple modules, and this
has caused bugs in the past where purged values were not recorded.
This commit encapsulated this bookkeeping into crds module, so that any
code path which removes or overrides a crds value, also records the hash
of purged value in-place.
In order to remove port-based forwarding logic in turbine, we need to
first track how often the turbine retransmit/broadcast trees mismatch
across nodes.
One consistency condition is that if the node is on the critical path
(i.e. the first node in each neighborhood), then we expect that the
packet arrives at tvu socket as opposed to tvu-forwards.
This commit adds a metric to track how often above condition is not met.
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
On the receiving end, the outdated values are discarded, and they will
only waste bandwidth:
https://github.com/solana-labs/solana/blob/3f0480d06/core/src/crds_gossip_pull.rs#L385-L400
This is also exacerbating validator start, since the entrypoint is
returning old values in pull responses, and the validator immediately
discards those; resulting in huge delay until the validator obtains
contact-info of the entrypoint and is able to adopt shred-version and
fully start.
When a validator starts, it has an (almost) empty crds table and it only
sends one pull-request to the entrypoint. The bloom filter in the
pull-request targets 10% false rate given the number of items. So, if
the `num_items` is very wrong, it makes a very small bloom filter with a
very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48
As a result, it is very unlikely that the validator obtains entrypoint's
contact-info in response. This exacerbates how long the validator will
loop on:
> Waiting to adopt entrypoint shred version
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412
This commit increases the min number of bloom items when making gossip
pull requests. Effectively this will break the entrypoint crds table
into 64 shards, one pull-request for each, a larger bloom filter for
each shard, and increases the chances that the response will include
entrypoint's contact-info, which is needed for adopting shred version
and validator start.
The current implementations use only the id and disregard other fields,
in particular wallclock. This can lead to bugs where an outdated
contact-info shadows or overrides a current one because they compare
equal.
* Upgrade Rust to 1.52.0
update nightly_version to newly pushed docker image
fix clippy lint errors
1.52 comes with grcov 0.8.0, include this version to script
* upgrade to Rust 1.52.1
* disabling Serum from downstream projects until it is upgraded to Rust 1.52.1
crds table retains up to 32 node-instance values per each pubkey. This
is so because if there are multiple running instances of the same node,
then we want gossip to propagate node-instance values associated with
both instances, therefore the corresponding label/key includes the
randomly generated token in addition to the pubkey:
https://github.com/solana-labs/solana/blob/9c42a89a4/core/src/crds_value.rs#L448https://github.com/solana-labs/solana/pull/14037
As a result, the number of such values per pubkey are effectively
unbounded, requiring custom mitigations implemented in:
https://github.com/solana-labs/solana/pull/14467
but still taking redundant extra memory and bandwidth.
This commit instead retains only one node-instance per pubkey by
extending crds values override logic. If a crds value is of type
node-instance, it will always override an existing one with the same key
if it has more recent starting timestamp (not wallclock). As a result,
gossip will always propagate the node-instance with more recent
timestamp. Since the check_duplicate logic will stop the node with older
timestamp, this change should preserve existing functionality.
* purge_old_snapshot_archives is changed to take an extra argument 'maximum_snapshots_to_retain' to control the max number of latest snapshot archives to retain. Note the oldest snapshot is always retained as before and is not subjected to this new options.
* The validator and ledger-tool executables are modified with a CLI argument --maximum-snapshots-to-retain. And the options are propagated down the call chains. Their corresponding shell scripts were changed accordingly.
* SnapshotConfig is modified to have an extra field for the maximum_snapshots_to_retain
* Unit tests are developed to cover purge_old_snapshot_archives
* Require that blockstore block-time only be recognized slot, instead of root
* Move cache_block_time to after Bank freeze
* Single use statement
* Pass transaction_status_sender by reference
* Remove unnecessary slot-existence check before caching block time altogether
* Move block-time existence check into Blockstore::cache_block_time, Blockstore no longer needed in blockstore_processor helper
CodingShredHeader.position is equal to
ShredCommonHeader.index - ShredCommonHeader.fec_set_index
and is so redundant. The extra position field can add bugs if not
consistent with index and fec_set_index.
Having an ordinal index on crds values based on insert order allows to
efficiently filter values using a cursor. In particular
CrdsGossipPush::push_messages hash-map can be replaced with a cursor,
saving on the bookkeepings, purging, etc
VersionedCrdsValue.insert_timestamp is used for fetching crds values
inserted since last query:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1197-L1215https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1274-L1298
So it is crucial that insert_timestamp does not go backward in time when
new values are inserted into the table. However std::time::SystemTime is
not monotonic, or due to workload, lock contention, thread scheduling,
etc, ... new values may be inserted with a stalled timestamp way in the
past. Additionally, reading system time for the above purpose is
inefficient/unnecessary.
This commit adds an ordinal index to crds values indicating their insert
order. Additionally, it implements a new Cursor type for fetching values
inserted since last query.
IP addresses need to be validated before sending packets to them.
This commit, sends a ping packet to nodes before any pull requests.
Pull requests are then only sent to the nodes which have responded with
the correct hash of their respective ping packet.
It is crucial that VersionedCrdsValue::insert_timestamp does not go
backward in time:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/crds.rs#L67-L79
Otherwise methods such as get_votes and get_epoch_slots_since will
break, which will break their downstream flow, including vote-listener
and optimistic confirmation:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1197-L1215https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1274-L1298
For that, Crds::new_versioned is intended to be called "atomically" with
Crds::insert_verioned (as the comment already says so):
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/crds.rs#L126-L129
However, currently this is violated in the code. For example,
filter_pull_responses creates VersionedCrdsValues (with the current
timestamp), then acquires an exclusive lock on gossip, then
process_pull_responses writes those values to the crds table:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L2375-L2392
Depending on the workload and lock contention, the insert_timestamps may
well be in the past when these values finally are inserted into gossip.
To avoid such scenarios, this commit:
* removes Crds::new_versioned and Crd::insert_versioned.
* makes VersionedCrdsValue constructor private, only invoked in
Crds::insert, so that insert_timestamp is populated right before
insert.
This will improve insert_timestamp monotonicity as long as Crds::insert
is not called with a stalled timestamp. Following commits may further
improve this by calling timestamp() inside Crds::insert, and/or
switching to std::time::Instant which guarantees monotonicity.
Strip the zero-padding off of data shreds before insertion into blockstore
Co-authored-by: Stephen Akridge <sakridge@gmail.com>
Co-authored-by: Nathan Hawkins <utsl@utsl.org>
Local timestamps are updated for records associated with a pubkey if the
origin is still active:
https://github.com/solana-labs/solana/blob/c8ed14c64/core/src/crds.rs#L301-L311
However this is done inconsistently on some gossip paths (pull requests
and pull responses) but not all (e.g. push messages). Additionally
update_record_timestamp is inefficient since there can be ~800 values
associated with each pubkey.
This commit updates records timestamps only on contact-infos; and,
instead utilizes origin's timestamp when purging old values.