Commit Graph

281 Commits

Author SHA1 Message Date
carllin c2e8814dce
Add limit and shrink policy for recycler (#15320) 2021-02-24 00:15:58 -08:00
Michael Vines 5df36aec7d Pacify clippy 2021-02-19 20:08:41 -08:00
behzad nouri aa3aac766f
adds metrics for inbound/outbound gossip packets counts (#15407) 2021-02-19 22:49:35 +00:00
behzad nouri 076c20f1ca
checks that prune-messages have the same inner/outer pubkey (#15352) 2021-02-16 21:06:18 +00:00
behzad nouri 0ad063f4e9
adds flag to disable duplicate instance check (#15006) 2021-02-03 16:26:17 +00:00
dependabot[bot] 1df93fa2be
chore: bump serde from 1.0.112 to 1.0.118 (#14828)
* chore: bump serde from 1.0.112 to 1.0.122

Bumps [serde](https://github.com/serde-rs/serde) from 1.0.112 to 1.0.122.
- [Release notes](https://github.com/serde-rs/serde/releases)
- [Commits](https://github.com/serde-rs/serde/compare/v1.0.112...v1.0.122)

Signed-off-by: dependabot[bot] <support@github.com>

* [auto-commit] Update all Cargo lock files

* Update frozen_abi digest following serde update

* Revert "chore: bump serde from 1.0.112 to 1.0.122"

This reverts commit a3ef4442a4c985144ae2bd7ceaf8899a7ab8d7c0.

* Revert "[auto-commit] Update all Cargo lock files"

This reverts commit c41c3b005fb1ccade55155302c52cd5736c4b55f.

* chore: bump serde from 1.0.112 to 1.0.118

Bumps [serde](https://github.com/serde-rs/serde) from 1.0.112 to 1.0.118.
- [Release notes](https://github.com/serde-rs/serde/releases)
- [Commits](https://github.com/serde-rs/serde/compare/v1.0.112...v1.0.118)

Signed-off-by: dependabot[bot] <support@github.com>

* [auto-commit] Update all Cargo lock files

* Remove serum-dex pinning

* blind commit!

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: dependabot-buildkite <dependabot-buildkite@noreply.solana.com>
Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
2021-02-02 23:28:16 +09:00
behzad nouri e1021d9f83
removes redundant epoch stakes cache in retransmit (#14781)
Following d6d76219b, staked nodes computed from vote accounts are
already cached in runtime::Stakes, so the caching in retransmit_stage is
redundant.
2021-01-24 21:15:09 +00:00
behzad nouri 491b059755
broadcasts duplicate shreds through gossip (#14699) 2021-01-24 15:47:43 +00:00
behzad nouri 8e581601d6
patches crds vote-index assignment bug (#14438)
If the tower is full, old votes are evicted from the front of the deque:
https://github.com/solana-labs/solana/blob/2074e407c/programs/vote/src/vote_state/mod.rs#L367-L373
whereas recent votes, if expired, are evicted from the back:
https://github.com/solana-labs/solana/blob/2074e407c/programs/vote/src/vote_state/mod.rs#L529-L537

As a result, from a single tower_index scalar, we cannot infer which crds-vote
should be overwritten:
https://github.com/solana-labs/solana/blob/2074e407c/core/src/crds_value.rs#L576

In addition, there is an off-by-one bug in the existing code. tower_index is
bounded by MAX_LOCKOUT_HISTORY - 1:
https://github.com/solana-labs/solana/blob/2074e407c/core/src/consensus.rs#L382
So, it is at most 30, whereas MAX_VOTES is 32:
https://github.com/solana-labs/solana/blob/2074e407c/core/src/crds_value.rs#L29
Which means that this branch is never taken:
https://github.com/solana-labs/solana/blob/2074e407c/core/src/crds_value.rs#L590-L593
so the crds table always keeps the 29 **oldest** votes by wallclock, and then
only overwrites the 30th one each time (i.e. a tally of only the two most
recent votes).
2021-01-21 13:08:07 +00:00
behzad nouri b5fd0ed859
rewrites turbine retransmit peers computation (#14584) 2021-01-19 04:18:47 +00:00
Michael Vines 9ddd6f08e8 Persist gossip contact info 2020-12-27 20:46:54 -08:00
behzad nouri 2fd38d9912
indexes votes in crds table (#14272) 2020-12-27 13:31:05 +00:00
behzad nouri 49019c6613
obtains staked-nodes from the root-bank (#14257)
... as opposed to the working bank
2020-12-27 13:28:05 +00:00
Michael Vines ace360ade2 Multiple entrypoint support 2020-12-22 18:35:31 -08:00
Michael Vines 3373082ffa Update entrypoint contact info even when shred version adoption is not requested 2020-12-22 18:35:31 -08:00
behzad nouri a14cfd660a
removes &Arc<Self> receivers (#14234) 2020-12-22 23:51:53 +00:00
behzad nouri 691031fefd
limits number of crds values returned when responding to pull requests (#13739)
Crds values buffered when responding to pull-requests can be very large, taking a lot of memory.
Added a limit on the number of buffered crds values based on the outbound data budget.
2020-12-18 18:45:12 +00:00
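A rough sketch of the approach in #13739, with simplified types and a hypothetical helper name (not the actual implementation): truncate the buffered values against a byte budget, using bincode's serialized_size to account for each value.

```rust
use bincode::serialized_size;
use serde::Serialize;

/// Truncates `values` so that their total serialized size stays within
/// `outbound_budget` bytes; returns how many values were kept.
/// Hypothetical helper, illustrating the idea only.
fn apply_outbound_budget<T: Serialize>(values: &mut Vec<T>, outbound_budget: u64) -> usize {
    let mut total = 0u64;
    let mut kept = 0usize;
    for value in values.iter() {
        // serialized_size may fail; treat that as "stop buffering here".
        let size = match serialized_size(value) {
            Ok(size) => size,
            Err(_) => break,
        };
        if total + size > outbound_budget {
            break;
        }
        total += size;
        kept += 1;
    }
    values.truncate(kept);
    kept
}
```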
behzad nouri 6a3797e164
adds crds-value for broadcasting duplicate shreds through gossip (#14133)
In gossip, the header overhead we get from:
https://github.com/solana-labs/solana/blob/de9ac43eb/core/src/cluster_info.rs#L434-L435
https://github.com/solana-labs/solana/blob/de9ac43eb/core/src/crds_value.rs#L31-L36
https://github.com/solana-labs/solana/blob/de9ac43eb/core/src/crds_value.rs#L73
already exceeds SIZE_OF_NONCE in shreds. We also need additional
metadata (wallclock, source pubkey, ...), which means that, given
SHRED_PAYLOAD_SIZE, we cannot fit all of this in PACKET_DATA_SIZE:
https://github.com/solana-labs/solana/blob/de9ac43eb/ledger/src/shred.rs#L80

On top of that, we need 2 shred payloads as the proof of duplicate. So
each DuplicateShred crds value includes only a chunk of the payload,
along with the metadata to reconstruct the full payload from the chunks
on the receiving end.
2020-12-18 14:32:43 +00:00
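A simplified sketch of the chunking idea in #14133; the struct fields and chunk size below are illustrative, not the real DuplicateShred layout.

```rust
/// Simplified stand-in for the DuplicateShred crds value: one chunk of a
/// duplicate-shred proof plus enough metadata to reassemble it.
#[derive(Clone, Debug)]
struct DuplicateShredChunk {
    wallclock: u64,
    chunk_index: u8,
    num_chunks: u8,
    chunk: Vec<u8>,
}

/// Splits the concatenated proof (two conflicting shred payloads) into chunks
/// small enough to fit in a gossip packet alongside the metadata.
fn split_proof(proof: &[u8], wallclock: u64, max_chunk: usize) -> Vec<DuplicateShredChunk> {
    let num_chunks = proof.chunks(max_chunk).count() as u8;
    proof
        .chunks(max_chunk)
        .enumerate()
        .map(|(i, chunk)| DuplicateShredChunk {
            wallclock,
            chunk_index: i as u8,
            num_chunks,
            chunk: chunk.to_vec(),
        })
        .collect()
}

/// Reassembles the proof on the receiving end (assumes all chunks are present
/// and already sorted by chunk_index).
fn join_proof(chunks: &[DuplicateShredChunk]) -> Vec<u8> {
    chunks.iter().flat_map(|c| c.chunk.iter().copied()).collect()
}
```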
behzad nouri d6d76219b6
caches staked nodes computed from vote-accounts (#13929) 2020-12-17 21:22:50 +00:00
Michael Vines 7143aaa89b Clippy 2020-12-14 08:03:29 -08:00
behzad nouri 409fe3bca1
adds the instance token to crds-labels for node-instance crds-values (#14037)
If a node "a" receives instance-info from node "b1" it will override any
instance-info associated with "b1" pubkey in its crds table. This makes
it less likely that when "b1" receives crds values from "a" (either
through pull or push), it sees other instances of itself (because node
"a" discarded them when it received "b1" instance info).

In order for the crds table to contain all instance-info associated with
the same pubkey at the same time, we need to add the instance tokens to
the keys in the crds table (i.e. the CrdsValueLabel).
2020-12-10 17:01:55 +00:00
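A toy sketch of keying instance-info by (pubkey, token) as described in #14037; the types below are simplified stand-ins for the real CrdsValueLabel and node-instance value.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types.
type Pubkey = [u8; 32];

/// Keying node-instance values by (pubkey, token) lets the table hold
/// instance-info for several instances of the same pubkey at once.
#[derive(Clone, Hash, PartialEq, Eq)]
enum CrdsLabel {
    ContactInfo(Pubkey),
    NodeInstance(Pubkey, /* instance token */ u64),
}

#[derive(Clone)]
struct NodeInstanceInfo {
    wallclock: u64,
    token: u64,
}

fn insert_instance(
    table: &mut HashMap<CrdsLabel, NodeInstanceInfo>,
    from: Pubkey,
    info: NodeInstanceInfo,
) {
    // With the token in the key, a second instance of the same pubkey gets its
    // own entry instead of overwriting the first one.
    table.insert(CrdsLabel::NodeInstance(from, info.token), info);
}
```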
behzad nouri 1d267eae6b std::process::exit to kill all threads 2020-12-09 10:24:23 -08:00
behzad nouri 895d7d6a65 removes RwLock on ClusterInfo.instance 2020-12-09 10:24:23 -08:00
behzad nouri 542198180a pushes node-instance along with version early in gossip 2020-12-09 10:24:23 -08:00
behzad nouri 8cd5eb9863 checks for duplicate validator instances using gossip 2020-12-09 10:24:23 -08:00
behzad nouri 6706f2b3bb
removes recursive read-locks on gossip (#13973)
ClusterInfo::tvu_peers acquires a read-lock on gossip:
https://github.com/solana-labs/solana/blob/f0e934145/core/src/cluster_info.rs#L1171-L1185
and so ClusterInfo::repair_peers recursively acquires the read lock on
gossip twice:
https://github.com/solana-labs/solana/blob/f0e934145/core/src/cluster_info.rs#L1202-L1223
But std::sync::RwLock is not re-entrant (recursive).
2020-12-06 15:14:49 +00:00
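A minimal illustration of the hazard and the fix in #13973, with simplified types: the read guard is acquired once and the borrowed data is passed down, instead of re-locking inside the callee.

```rust
use std::sync::RwLock;

struct Gossip {
    peers: Vec<String>,
}

struct ClusterInfo {
    gossip: RwLock<Gossip>,
}

impl ClusterInfo {
    // Problematic shape: tvu_peers read-locks gossip, and repair_peers would
    // lock it again while already holding a read guard. std::sync::RwLock is
    // not re-entrant, so this can deadlock if a writer is waiting in between.
    //
    // The fix: take the read lock once and pass the borrowed data down.
    fn tvu_peers(gossip: &Gossip) -> Vec<String> {
        gossip.peers.clone()
    }

    fn repair_peers(&self) -> Vec<String> {
        let gossip = self.gossip.read().unwrap(); // single read-lock acquisition
        Self::tvu_peers(&gossip) // no second lock inside the callee
    }
}
```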
behzad nouri c3048b451d
samples repair peers using WeightedIndex (#13919)
To output one random sample, weighted_best generates n random numbers:
https://github.com/solana-labs/solana/blob/f751a5d4e/core/src/weighted_shuffle.rs#L38-L63
WeightedIndex does so with only one random number:
https://github.com/rust-random/rand/blob/eb02f0e46/src/distributions/weighted_index.rs#L223-L240
Additionally, if the index is already constructed, it only does O(log(n))
work in total, which can be achieved if RepairCache caches the weighted
index:
https://github.com/solana-labs/solana/blob/f751a5d4e/core/src/serve_repair.rs#L83

Also, the repair-peers code can be reorganized to have fewer redundant
unlock-then-lock sequences.
2020-12-03 14:26:07 +00:00
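For reference, a small example of the rand::distributions::WeightedIndex usage #13919 refers to: build the index once from the weights (cacheable, e.g. in a RepairCache), then each sample costs O(log(n)). Peer names and weights are illustrative.

```rust
use rand::distributions::{Distribution, WeightedIndex};
use rand::thread_rng;

fn main() {
    // Stake-like weights for four repair peers.
    let peers = ["peer-a", "peer-b", "peer-c", "peer-d"];
    let weights = [100u64, 10, 1, 50];

    // Building the index is O(n); if it is cached, repeated sampling only
    // costs O(log(n)) per sample and a single random number draw.
    let dist = WeightedIndex::new(&weights).unwrap();
    let mut rng = thread_rng();

    for _ in 0..3 {
        let choice = peers[dist.sample(&mut rng)];
        println!("sampled repair peer: {}", choice);
    }
}
```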
Tyera Eulberg 10c81a2448
Remove rpc_banks from validator (#13882)
* Remove rpc_banks from validator

* Bump abi-digest
2020-12-02 03:25:09 +00:00
behzad nouri 26bf2b7e45
processes pull-request callers only once per unique caller (#13750)
process_pull_requests acquires a write lock on the crds table to update
record timestamps for each of the pull-request callers:
https://github.com/solana-labs/solana/blob/3087c9049/core/src/crds_gossip_pull.rs#L287-L300
However, pull-requests overlap a lot in their callers, and this function
ends up doing a lot of redundant work.

This commit obtains unique callers before acquiring an exclusive lock on
crds table.
2020-11-22 17:51:14 +00:00
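A simplified sketch of the idea in #13750, with stand-in types: deduplicate callers (keeping the latest wallclock per pubkey) before taking the exclusive lock.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type Pubkey = [u8; 32];

struct Crds {
    // pubkey -> last time we saw a pull-request from it
    record_timestamps: HashMap<Pubkey, u64>,
}

/// Collapses (caller, wallclock) pairs to one entry per unique caller first,
/// so the write lock is held for O(unique callers) instead of O(requests).
fn update_record_timestamps(crds: &RwLock<Crds>, callers: &[(Pubkey, u64)]) {
    let mut unique: HashMap<Pubkey, u64> = HashMap::new();
    for &(pubkey, wallclock) in callers {
        let entry = unique.entry(pubkey).or_insert(wallclock);
        *entry = (*entry).max(wallclock);
    }
    let mut crds = crds.write().unwrap(); // exclusive lock taken once, after dedup
    for (pubkey, wallclock) in unique {
        let entry = crds.record_timestamps.entry(pubkey).or_insert(wallclock);
        *entry = (*entry).max(wallclock);
    }
}
```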
sakridge c1eb350c47
Allow contact debug interval to be adjusted (#13737) 2020-11-20 14:47:37 -08:00
behzad nouri b58f69297f
makes crds fields private (#13703)
Crds fields should maintain several invariants between themselves, so
exposing them as public fields can be bug-prone. In addition, these
invariants are asserted on every write:
https://github.com/solana-labs/solana/blob/9668dd85d/core/src/crds.rs#L138-L154
https://github.com/solana-labs/solana/blob/9668dd85d/core/src/crds.rs#L239-L262
which adds extra instructions and is not optimal. Once these fields are
private, the asserts will be redundant.
2020-11-19 20:57:40 +00:00
behzad nouri 1ffab5de77
breaks prunes data into chunks to fit into packets (#13613)
Validator logs show that prune messages are dropped because they exceed
packet data size:
https://github.com/solana-labs/solana/blob/f25c969ad/perf/src/packet.rs#L90-L92
This can exacerbate gossip traffic by redundantly increasing push
messages across the network. The workaround is to break prunes into smaller
chunks and send them over in multiple messages.
2020-11-19 16:38:01 +00:00
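A rough sketch of the workaround in #13613; the constants below are illustrative stand-ins, not the real PACKET_DATA_SIZE arithmetic or prune-message layout.

```rust
type Pubkey = [u8; 32];

// Illustrative numbers; the real code derives these from PACKET_DATA_SIZE and
// the serialized size of the prune-message header.
const MAX_PACKET_PAYLOAD: usize = 1232;
const PRUNE_HEADER_SIZE: usize = 200;
const PUBKEY_SIZE: usize = 32;

/// Splits one oversized prune list into several lists, each of which fits into
/// a single packet together with the prune-message header.
fn split_prunes(prunes: Vec<Pubkey>) -> Vec<Vec<Pubkey>> {
    let max_prunes = (MAX_PACKET_PAYLOAD - PRUNE_HEADER_SIZE) / PUBKEY_SIZE;
    prunes
        .chunks(max_prunes)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```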
behzad nouri 5e8490ab9d
packs more crds-values in a single gossip packet (#13500)
split_gossip_messages:
https://github.com/solana-labs/solana/blob/a97c04b40/core/src/cluster_info.rs#L1536-L1574
splits crds-values into chunks to fit into a gossip packet. However, it
uses a global upper bound for the header size across all protocols:
https://github.com/solana-labs/solana/blob/a97c04b40/core/src/cluster_info.rs#L90-L93
This can be wasteful, as the specific gossip protocol can have a smaller
header than this upper bound (e.g. Protocol::PushMessage is 170 bytes
smaller). Adding more crds-values to one gossip packet avoids the
overhead of separate packets and reduces the total number of bytes sent over
the wire.

This commit updates the splitting function to take a max-chunk-size
argument. At call-site, this value is set to the size of the protocol
which the values are sent over.
2020-11-15 18:23:59 +00:00
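A sketch of a splitting function that takes a per-protocol max-chunk-size argument, as described in #13500; bincode's serialized_size stands in for the real size computation, and the packing below is a simple greedy pass.

```rust
use bincode::serialized_size;
use serde::Serialize;

/// Greedily packs values into chunks whose serialized payload stays within
/// `max_chunk_size`, which the caller sets per gossip protocol (push messages
/// have a smaller header and therefore a larger allowance).
fn split_gossip_messages<T: Serialize + Clone>(values: &[T], max_chunk_size: u64) -> Vec<Vec<T>> {
    let mut chunks: Vec<Vec<T>> = Vec::new();
    let mut current: Vec<T> = Vec::new();
    let mut current_size = 0u64;
    for value in values {
        let size = serialized_size(value).unwrap_or(max_chunk_size);
        if current_size + size > max_chunk_size && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(value.clone());
        current_size += size;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```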
behzad nouri cbea9ebc34
indexes nodes' contact infos in crds table (#13553)
In several places in the gossip code, the entire crds table is scanned only
to filter out nodes' contact infos. Currently on mainnet, the crds table is
of size ~70k, while there are only ~470 nodes, so the full table scan is
inefficient. Instead, we may maintain an index of only nodes' contact
infos.
2020-11-15 16:38:04 +00:00
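A toy sketch of the side index described in #13553; the table below is keyed by pubkey only, a simplification of the real CrdsValueLabel keys.

```rust
use std::collections::{HashMap, HashSet};

type Pubkey = [u8; 32];

#[derive(Clone)]
enum CrdsValue {
    ContactInfo(Pubkey /* plus addresses, shred version, ... */),
    Other(Pubkey),
}

#[derive(Default)]
struct Crds {
    table: HashMap<Pubkey, CrdsValue>,
    // Index maintained on every insert so that "give me all nodes" touches
    // only ~#nodes entries instead of scanning the whole table.
    nodes: HashSet<Pubkey>,
}

impl Crds {
    fn insert(&mut self, key: Pubkey, value: CrdsValue) {
        if matches!(value, CrdsValue::ContactInfo(_)) {
            self.nodes.insert(key);
        }
        self.table.insert(key, value);
    }

    fn contact_infos<'a>(&'a self) -> impl Iterator<Item = &'a CrdsValue> + 'a {
        self.nodes.iter().filter_map(move |key| self.table.get(key))
    }
}
```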
behzad nouri 73ac104df2
propagates errors out of Packet::from_data (#13445)
Packet::from_data is ignoring serialization errors:
https://github.com/solana-labs/solana/blob/d08c3232e/sdk/src/packet.rs#L42-L48
This is likely never useful, as the packet will be sent over the wire,
taking bandwidth, but at the receiving end it will either fail to
deserialize or be invalid.
This commit propagates the errors out of the function to the
call-site, allowing the call-site to handle the error.
2020-11-08 15:10:03 +00:00
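A sketch of the shape of the change in #13445, with a simplified Packet stand-in: the bincode error is returned to the caller rather than swallowed.

```rust
use serde::Serialize;

// Simplified stand-in for the real Packet.
const PACKET_DATA_SIZE: usize = 1232;

struct Packet {
    data: [u8; PACKET_DATA_SIZE],
    size: usize,
}

/// Serializes `data` into a packet, propagating serialization failures (e.g.
/// payloads larger than PACKET_DATA_SIZE) to the caller instead of sending a
/// truncated or empty packet over the wire.
fn packet_from_data<T: Serialize>(data: &T) -> bincode::Result<Packet> {
    let bytes = bincode::serialize(data)?;
    if bytes.len() > PACKET_DATA_SIZE {
        return Err(Box::new(bincode::ErrorKind::SizeLimit));
    }
    let mut packet = Packet {
        data: [0u8; PACKET_DATA_SIZE],
        size: bytes.len(),
    };
    packet.data[..bytes.len()].copy_from_slice(&bytes);
    Ok(packet)
}
```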
behzad nouri 7f4debdad5
drops older gossip packets when load shedding (#13364)
Gossip drops incoming packets when overloaded:
https://github.com/solana-labs/solana/blob/f6a73098a/core/src/cluster_info.rs#L2462-L2475
However, newer packets are dropped in favor of the older ones.
This is probably not ideal, as newer packets are more likely to contain
more recent data, so dropping them keeps the validator state
lagging.
2020-11-05 17:14:28 +00:00
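A small sketch of the load-shedding policy suggested by #13364, with illustrative names: when more batches are queued than can be processed, keep the newest ones and drop the oldest.

```rust
use std::collections::VecDeque;

struct PacketBatch {
    wallclock: u64,
    payload: Vec<u8>,
}

/// Keeps only the `max_batches` most recently received batches, dropping the
/// oldest ones, so the validator works on the freshest gossip data when it
/// cannot keep up.
fn shed_load(queue: &mut VecDeque<PacketBatch>, max_batches: usize) {
    while queue.len() > max_batches {
        queue.pop_front(); // the front holds the oldest batches
    }
}
```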
behzad nouri 8f0796436a
shares the lock on gossip when processing prune messages (#13339)
Processing prune messages acquires an exclusive lock on gossip:
https://github.com/solana-labs/solana/blob/55b0428ff/core/src/cluster_info.rs#L1824-L1825
This can be reduced to a shared lock if active-sets are changed to use
atomic bloom filters:
https://github.com/solana-labs/solana/blob/55b0428ff/core/src/crds_gossip_push.rs#L50
2020-11-05 15:42:00 +00:00
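A minimal sketch of an atomic bloom-filter bit set along the lines of #13339 (hashing and sizing are simplified): bits can be set through a shared reference, so recording prunes only needs a read lock on gossip.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// A fixed-size bit set whose bits can be set through a shared reference,
/// so callers holding only a read lock on gossip can still record prunes.
struct AtomicBloom {
    bits: Vec<AtomicU64>,
}

impl AtomicBloom {
    fn new(num_bits: usize) -> Self {
        let words = (num_bits + 63) / 64;
        Self {
            bits: (0..words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Sets the bit for `pos` (callers would derive `pos` by hashing a pubkey).
    fn add(&self, pos: u64) {
        let index = (pos / 64) as usize % self.bits.len();
        self.bits[index].fetch_or(1u64 << (pos % 64), Ordering::Relaxed);
    }

    fn contains(&self, pos: u64) -> bool {
        let index = (pos / 64) as usize % self.bits.len();
        self.bits[index].load(Ordering::Relaxed) & (1u64 << (pos % 64)) != 0
    }
}
```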
behzad nouri 118ce47b97
measures processing time of each kind of gossip packets (#13366) 2020-11-05 15:34:34 +00:00
behzad nouri 10fa4f45ab
uses thread-pool when handling push messages (#13338)
From runtime profiles, the majority of the time of the solana-listen thread:
https://github.com/solana-labs/solana/blob/55b0428ff/core/src/cluster_info.rs#L2720
is spent handling push messages. The code here:
https://github.com/solana-labs/solana/blob/55b0428ff/core/src/cluster_info.rs#L2272-L2364
may utilize the idle gossip thread-pool.
2020-11-04 19:15:58 +00:00
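A sketch of handing push-message batches to an existing rayon thread pool via install + par_iter, as #13338 suggests; the message type and handler below are placeholders.

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn handle_push_message(message: &Vec<u8>) -> usize {
    // Placeholder for the real per-message processing.
    message.len()
}

fn main() {
    // Stand-in for the gossip thread pool that is otherwise idle here.
    let thread_pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    let messages: Vec<Vec<u8>> = (0..1024).map(|i| vec![0u8; i % 64]).collect();

    // install() runs the closure inside the pool, so par_iter uses its threads.
    let total: usize = thread_pool.install(|| {
        messages.par_iter().map(handle_push_message).sum()
    });
    println!("processed {} bytes of push messages", total);
}
```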
Michael Vines df8dab9d2b Native/builtin programs now receive an InvokeContext 2020-10-29 21:45:24 -07:00
behzad nouri 3738611f5c
adds more parallel processing to gossip packets handling (#12988) 2020-10-29 15:17:19 +00:00
behzad nouri ae91270961
implements ping-pong packets between nodes (#12794)
https://hackerone.com/reports/991106

> It’s possible to use UDP gossip protocol to amplify DDoS attacks. An attacker
> can spoof IP address in UDP packet when sending PullRequest to the node.
> There's no any validation if provided source IP address is not spoofed and
> the node can send much larger PullResponse to victim's IP. As I checked,
> PullRequest is about 290 bytes, while PullResponse is about 10 kB. It means
> that amplification is about 34x. This way an attacker can easily perform DDoS
> attack both on Solana node and third-party server.
>
> To prevent it, need for example to implement ping-pong mechanism similar as
> in Ethereum: Before accepting requests from remote client needs to validate
> his IP. Local node sends Ping packet to the remote node and it needs to reply
> with Pong packet that contains hash of matching Ping packet. Content of Ping
> packet is unpredictable. If hash from Pong packet matches, local node can
> remember IP where Ping packet was sent as correct and allow further
> communication.
>
> More info:
> https://github.com/ethereum/devp2p/blob/master/discv4.md#endpoint-proof
> https://github.com/ethereum/devp2p/blob/master/discv4.md#wire-protocol

The commit adds a PingCache, which maintains records of remote nodes
which have returned a valid response to a ping message, and on-the-fly
ping messages pending a pong response from the remote node.

When handling pull-requests, those from addresses which have not passed
the ping-pong check are filtered out, and additionally ping packets are
added for addresses which need to be (re)verified.
2020-10-28 17:03:02 +00:00
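A simplified sketch of the PingCache idea from #12794; token generation, hashing of tokens, and cache eviction are omitted, and the names are illustrative.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// Simplified PingCache: remembers which remote addresses have recently proven
/// ownership of their IP by echoing back our ping token.
struct PingCache {
    ttl: Duration,
    // addr -> (token we sent, when we sent it)
    pending: HashMap<SocketAddr, (u64, Instant)>,
    // addr -> when the last valid pong was received
    verified: HashMap<SocketAddr, Instant>,
}

impl PingCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, pending: HashMap::new(), verified: HashMap::new() }
    }

    /// Returns true if `addr` has passed the ping-pong check recently; if not,
    /// records a pending ping (the caller would actually send the ping packet).
    fn check(&mut self, addr: SocketAddr, token: u64, now: Instant) -> bool {
        if let Some(&when) = self.verified.get(&addr) {
            if now.duration_since(when) < self.ttl {
                return true;
            }
        }
        self.pending.entry(addr).or_insert((token, now));
        false // pull-requests from this address are filtered out for now
    }

    /// Called when a pong arrives; verifies it matches the pending ping token.
    fn add_pong(&mut self, addr: SocketAddr, token: u64, now: Instant) {
        if let Some(&(expected, _)) = self.pending.get(&addr) {
            if expected == token {
                self.pending.remove(&addr);
                self.verified.insert(addr, now);
            }
        }
    }
}
```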
behzad nouri 4bfda3e766
marks pull request creation time only once per peer (#13113)
mark_pull_request_creation time requires an exclusive lock on gossip:
https://github.com/solana-labs/solana/blob/16944e218/core/src/cluster_info.rs#L1547-L1548
The current code redundantly marks each peer once for every request.
There are at most only 2 unique peers, whereas there are hundreds of
requests from each, so the lock is held hundreds of times longer than
necessary.
2020-10-26 17:11:31 +00:00
Michael Vines a4956844bd Update frozen_abi hashes
The movement of files in sdk/ caused ABI hashes to change
2020-10-24 08:37:55 -07:00
behzad nouri 37c8842bcb
scans crds table in parallel for finding old labels (#13073)
From runtime profiles, the majority of the time of ClusterInfo::handle_purge
https://github.com/solana-labs/solana/blob/0776fa05c/core/src/cluster_info.rs#L1605-L1626
is spent scanning the crds table for old labels:
https://github.com/solana-labs/solana/blob/0776fa05c/core/src/crds.rs#L175-L197

This can be done in parallel, given that the gossip thread-pool:
https://github.com/solana-labs/solana/blob/0776fa05c/core/src/cluster_info.rs#L1637-L1641
is idle when handle_purge is invoked:
https://github.com/solana-labs/solana/blob/0776fa05c/core/src/cluster_info.rs#L1681
2020-10-23 14:17:37 +00:00
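A sketch of the parallel scan in #13073 using rayon, with simplified types standing in for the crds table and CrdsValueLabel.

```rust
use rayon::prelude::*;
use std::collections::HashMap;

type Label = u64; // stand-in for CrdsValueLabel

struct VersionedValue {
    wallclock: u64,
}

/// Finds labels of values older than `min_wallclock` by scanning the table in
/// parallel; with the gossip thread pool otherwise idle during handle_purge,
/// this turns a long sequential scan into a parallel one.
fn find_old_labels(table: &HashMap<Label, VersionedValue>, min_wallclock: u64) -> Vec<Label> {
    table
        .par_iter()
        .filter(|(_, value)| value.wallclock < min_wallclock)
        .map(|(&label, _)| label)
        .collect()
}
```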
Justin Starry c95f6c4b83
Remove spammy invalid rpc log (#13100) 2020-10-23 07:05:29 +00:00
Justin Starry 8b0242a5d8
Allow nodes to advertise a different rpc address over gossip (#13053)
* Allow nodes to advertise a different rpc address over gossip

* Feedback
2020-10-22 03:31:48 +00:00
Michael Vines 959880db60 Remove unused pubkey::Pubkey imports 2020-10-21 19:08:13 -07:00
Michael Vines 7bc073defe Run `codemod --extensions rs Pubkey::new_rand solana_sdk::pubkey::new_rand` 2020-10-21 19:08:13 -07:00
behzad nouri 75d62ca095
improves threads' utilization in processing gossip packets (#12962)
ClusterInfo::process_packets handles incoming packets in a thread_pool:
https://github.com/solana-labs/solana/blob/87311cce7/core/src/cluster_info.rs#L2118-L2134

However, profiling runtime shows that threads are not well utilized and
a lot of the processing is done sequentially.

This commit redistributes the work to be done in parallel. Testing on a GCE
cluster shows a 20%+ improvement in processing gossip packets, with much
smaller variation.
2020-10-19 19:03:38 +00:00