For all code paths (gossip push, pull, purge, etc) that remove or
override a crds value, it is necessary to record hash of values purged
from crds table, in order to exclude them from subsequent pull-requests;
otherwise the next pull request will likely return outdated values,
wasting bandwidth:
https://github.com/solana-labs/solana/blob/ed51cde37/core/src/crds_gossip_pull.rs#L486-L491
Currently this is done all over the place in multiple modules, and this
has caused bugs in the past where purged values were not recorded.
This commit encapsulated this bookkeeping into crds module, so that any
code path which removes or overrides a crds value, also records the hash
of purged value in-place.
If stakes are unknown, then timeouts will be short, resulting in values
being purged from the crds table, and consequently higher pull-response
load when they are obtained again from gossip. In particular, this slows
down validator start where almost all values obtained from entrypoint
are immediately discarded.
On the receiving end, the outdated values are discarded, and they will
only waste bandwidth:
https://github.com/solana-labs/solana/blob/3f0480d06/core/src/crds_gossip_pull.rs#L385-L400
This is also exacerbating validator start, since the entrypoint is
returning old values in pull responses, and the validator immediately
discards those; resulting in huge delay until the validator obtains
contact-info of the entrypoint and is able to adopt shred-version and
fully start.
When a validator starts, it has an (almost) empty crds table and it only
sends one pull-request to the entrypoint. The bloom filter in the
pull-request targets 10% false rate given the number of items. So, if
the `num_items` is very wrong, it makes a very small bloom filter with a
very high false rate:
https://github.com/solana-labs/solana/blob/2ae57c172/runtime/src/bloom.rs#L70-L80https://github.com/solana-labs/solana/blob/2ae57c172/core/src/crds_gossip_pull.rs#L48
As a result, it is very unlikely that the validator obtains entrypoint's
contact-info in response. This exacerbates how long the validator will
loop on:
> Waiting to adopt entrypoint shred version
https://github.com/solana-labs/solana/blob/ed51cde37/validator/src/main.rs#L390-L412
This commit increases the min number of bloom items when making gossip
pull requests. Effectively this will break the entrypoint crds table
into 64 shards, one pull-request for each, a larger bloom filter for
each shard, and increases the chances that the response will include
entrypoint's contact-info, which is needed for adopting shred version
and validator start.
* Upgrade Rust to 1.52.0
update nightly_version to newly pushed docker image
fix clippy lint errors
1.52 comes with grcov 0.8.0, include this version to script
* upgrade to Rust 1.52.1
* disabling Serum from downstream projects until it is upgraded to Rust 1.52.1
Having an ordinal index on crds values based on insert order allows to
efficiently filter values using a cursor. In particular
CrdsGossipPush::push_messages hash-map can be replaced with a cursor,
saving on the bookkeepings, purging, etc
VersionedCrdsValue.insert_timestamp is used for fetching crds values
inserted since last query:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1197-L1215https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1274-L1298
So it is crucial that insert_timestamp does not go backward in time when
new values are inserted into the table. However std::time::SystemTime is
not monotonic, or due to workload, lock contention, thread scheduling,
etc, ... new values may be inserted with a stalled timestamp way in the
past. Additionally, reading system time for the above purpose is
inefficient/unnecessary.
This commit adds an ordinal index to crds values indicating their insert
order. Additionally, it implements a new Cursor type for fetching values
inserted since last query.
IP addresses need to be validated before sending packets to them.
This commit, sends a ping packet to nodes before any pull requests.
Pull requests are then only sent to the nodes which have responded with
the correct hash of their respective ping packet.
It is crucial that VersionedCrdsValue::insert_timestamp does not go
backward in time:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/crds.rs#L67-L79
Otherwise methods such as get_votes and get_epoch_slots_since will
break, which will break their downstream flow, including vote-listener
and optimistic confirmation:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1197-L1215https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L1274-L1298
For that, Crds::new_versioned is intended to be called "atomically" with
Crds::insert_verioned (as the comment already says so):
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/crds.rs#L126-L129
However, currently this is violated in the code. For example,
filter_pull_responses creates VersionedCrdsValues (with the current
timestamp), then acquires an exclusive lock on gossip, then
process_pull_responses writes those values to the crds table:
https://github.com/solana-labs/solana/blob/ec37a843a/core/src/cluster_info.rs#L2375-L2392
Depending on the workload and lock contention, the insert_timestamps may
well be in the past when these values finally are inserted into gossip.
To avoid such scenarios, this commit:
* removes Crds::new_versioned and Crd::insert_versioned.
* makes VersionedCrdsValue constructor private, only invoked in
Crds::insert, so that insert_timestamp is populated right before
insert.
This will improve insert_timestamp monotonicity as long as Crds::insert
is not called with a stalled timestamp. Following commits may further
improve this by calling timestamp() inside Crds::insert, and/or
switching to std::time::Instant which guarantees monotonicity.
Crds values buffered when responding to pull-requests can be very large taking a lot of memory.
Added a limit for number of buffered crds values based on outbound data budget.
In several places in gossip code, the entire crds table is scanned only
to filter out nodes' contact infos. Currently on mainnet, crds table is
of size ~70k, while there are only ~470 nodes. So the full table scan is
inefficient. Instead we may maintain an index of only nodes' contact
infos.
https://hackerone.com/reports/991106
> It’s possible to use UDP gossip protocol to amplify DDoS attacks. An attacker
> can spoof IP address in UDP packet when sending PullRequest to the node.
> There's no any validation if provided source IP address is not spoofed and
> the node can send much larger PullResponse to victim's IP. As I checked,
> PullRequest is about 290 bytes, while PullResponse is about 10 kB. It means
> that amplification is about 34x. This way an attacker can easily perform DDoS
> attack both on Solana node and third-party server.
>
> To prevent it, need for example to implement ping-pong mechanism similar as
> in Ethereum: Before accepting requests from remote client needs to validate
> his IP. Local node sends Ping packet to the remote node and it needs to reply
> with Pong packet that contains hash of matching Ping packet. Content of Ping
> packet is unpredictable. If hash from Pong packet matches, local node can
> remember IP where Ping packet was sent as correct and allow further
> communication.
>
> More info:
> https://github.com/ethereum/devp2p/blob/master/discv4.md#endpoint-proof
> https://github.com/ethereum/devp2p/blob/master/discv4.md#wire-protocol
The commit adds a PingCache, which maintains records of remote nodes
which have returned a valid response to a ping message, and on-the-fly
ping messages pending a pong response from the remote node.
When handling pull-requests, those from addresses which have not passed
the ping-pong check are filtered out, and additionally ping packets are
added for addresses which need to be (re)verified.
Based on run-time profiles, the majority time of new_pull_requests is
spent building bloom filters, in hashing and bit-vec ops.
This commit builds crds filters in parallel using rayon constructs. The
added benchmark shows ~5x speedup (4-core machine, 8 threads).
filter_crds_values checks every crds filter against every hash value:
https://github.com/solana-labs/solana/blob/ee646aa7/core/src/crds_gossip_pull.rs#L432
which can be inefficient if the filter's bit-mask only matches small
portion of the entire crds table.
This commit shards crds values into separate tables based on shard_bits
first bits of their hash prefix. Given a (mask, mask_bits) filter,
filtering crds can be done by inspecting only relevant shards.
If CrdsFilter.mask_bits <= shard_bits, then precisely only the crds
values which match (mask, mask_bits) bit pattern are traversed.
If CrdsFilter.mask_bits > shard_bits, then approximately only
1/2^shard_bits of crds values are inspected.
Benchmarking on a gce cluster of 20 nodes, I see ~10% improvement in
generate_pull_responses metric, but with larger clusters, crds table and
2^mask_bits are both larger, so the impact should be more significant.
* Filter push/pulls from spies
* Don't pull from peers with shred version == 0, don't push to people with shred_version == 0
Co-authored-by: Carl <carl@solana.com>
* Gossip benchmark
* Rayon tweaking
* push pulls
* fanout to max nodes
* fixup! fanout to max nodes
* fixup! fixup! fanout to max nodes
* update
* multi vote test
* fixup prune
* fast propagation
* fixups
* compute up to 95%
* test for specific tx
* stats
* stats
* fixed tests
* rename
* track a lagging view of which nodes have the local node in their active set in the local received_cache
* test fixups
* dups are old now
* dont prune your own origin
* send vote to tpu
* tests
* fixed tests
* fixed test
* update
* ignore scale
* lint
* fixup
* fixup
* fixup
* cleanup
Co-authored-by: Stephen Akridge <sakridge@gmail.com>
* Batch process pull responses
* Generate pull requests at 1/2 rate
* Do filtering work of process_pull_response in read lock
Only take write lock to insert if needed.
* filter messages that are likely to be pushed from the response
* tests
* tests
* wait to start filtering responses, and push stats to influx
* wait to start filtering responses, and push stats to influx
* reduce the timers to match the publish self timeout
* fmt
* fmt