use std::{cmp::min, sync::Arc};

use futures::stream::{FuturesUnordered, StreamExt};
use tokio::time::{sleep_until, timeout, Instant};
use tower::{Service, ServiceExt};

use zebra_chain::serialization::DateTime32;

use crate::{constants, types::MetaAddr, AddressBook, BoxError, Request, Response};

#[cfg(test)]
mod tests;

/// The [`CandidateSet`] manages outbound peer connection attempts.
/// Successful connections become peers in the [`PeerSet`].
///
/// The candidate set divides the set of all possible outbound peers into
/// disjoint subsets, using the [`PeerAddrState`]:
///
/// 1. [`Responded`] peers, which we have had an outbound connection to.
/// 2. [`NeverAttemptedGossiped`] peers, which we learned about from other peers
///    but have never connected to.
/// 3. [`NeverAttemptedAlternate`] peers, canonical addresses which we learned
///    from the [`Version`] messages of inbound and outbound connections,
///    but have never connected to.
/// 4. [`Failed`] peers, which failed a connection attempt, or had an error
///    during an outbound connection.
/// 5. [`AttemptPending`] peers, which we've recently queued for a connection.
///
/// Never attempted peers are always available for connection.
///
/// If a peer's attempted, responded, or failure time is recent
/// (within the liveness limit), we avoid reconnecting to it.
/// Otherwise, we assume that it has disconnected or hung,
/// and attempt reconnection.
///
/// ```ascii,no_run
///                        ┌──────────────────┐
///                        │   Config / DNS   │
///            ┌───────────│       Seed       │───────────┐
///            │           │    Addresses     │           │
///            │           └──────────────────┘           │
///            │                   │ untrusted_last_seen  │
///            │                   │     is unknown       │
///            ▼                   │                      ▼
///  ┌──────────────────┐          │           ┌──────────────────┐
///  │    Handshake     │          │           │     Peer Set     │
///  │    Canonical     │──────────┼───────────│     Gossiped     │
///  │    Addresses     │          │           │    Addresses     │
///  └──────────────────┘          │           └──────────────────┘
///   untrusted_last_seen          │                 provides
///       set to now               │           untrusted_last_seen
///                                ▼
///                                Λ   if attempted, responded, or failed:
///                               ╱ ╲        ignore gossiped info
///                              ▕   ▏  otherwise, if never attempted:
///                               ╲ ╱   skip updates to existing fields
///                                V
///  ┌───────────────────────────────┼───────────────────────────────┐
///  │ AddressBook                   │                               │
///  │ disjoint `PeerAddrState`s     ▼                               │
///  │ ┌─────────────┐ ┌─────────────────────────┐  ┌─────────────┐  │
///  │ │ `Responded` │ │`NeverAttemptedGossiped` │  │  `Failed`   │  │
/// ┌┼▶│    Peers    │ │`NeverAttemptedAlternate`│  │    Peers    │◀┼┐
/// ││ │             │ │          Peers          │  │             │ ││
/// ││ └─────────────┘ └─────────────────────────┘  └─────────────┘ ││
/// ││        │                     │                      │        ││
/// ││ #1 oldest_first       #2 newest_first       #3 oldest_first  ││
/// ││        ├─────────────────────┴──────────────────────┘        ││
/// ││        ▼                                                     ││
/// ││        Λ                                                     ││
/// ││       ╱ ╲   filter by                                        ││
/// ││      ▕   ▏  is_ready_for_connection_attempt                  ││
/// ││       ╲ ╱   to remove recent `Responded`,                    ││
/// ││        V    `AttemptPending`, and `Failed` peers             ││
/// ││        │                                                     ││
/// ││        │    try outbound connection,                         ││
/// ││        ▼    update last_attempt to now()                     ││
/// ││┌────────────────┐                                            ││
/// │││`AttemptPending`│                                            ││
/// │││     Peers      │                                            ││
/// ││└────────────────┘                                            ││
/// │└───────┼─────────────────────────────────────────────────────┘│
/// │        ▼                                                      │
/// │        Λ                                                      │
/// │       ╱ ╲                                                     │
/// │      ▕   ▏────────────────────────────────────────────────────┘
/// │       ╲ ╱    connection failed, update last_failure to now()
/// │        V
/// │        │
/// │        │ connection succeeded
/// │        ▼
/// │ ┌────────────┐
/// │ │    send    │
/// │ │peer::Client│
/// │ │to Discover │
/// │ └────────────┘
/// │        │
/// │        ▼
/// │┌───────────────────────────────────────┐
/// ││ every time we receive a peer message: │
/// └│  * update state to `Responded`        │
///  │  * update last_response to now()      │
///  └───────────────────────────────────────┘
/// ```
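///
/// # Example
///
/// A sketch of how the crawler and dialer tasks are expected to drive a
/// [`CandidateSet`] (illustrative only, and not compiled: the real tasks
/// also handle demand signals and timeouts, and spawn each handshake in its
/// own task; `dial` is a placeholder for the connector):
///
/// ```ignore
/// // Crawl: ask a few live peers for more gossiped addresses.
/// // Rate-limited internally by MIN_PEER_GET_ADDR_INTERVAL.
/// candidates.update().await?;
///
/// // Dial: attempt candidates in reconnection order.
/// // Rate-limited to one attempt per MIN_PEER_CONNECTION_INTERVAL.
/// while let Some(candidate) = candidates.next().await {
///     if dial(candidate.addr).await.is_err() {
///         // Move failed attempts to the `Failed` state.
///         candidates.report_failed(&candidate);
///     }
/// }
/// ```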
// TODO:
// * show all possible transitions between Attempt/Responded/Failed,
//   except Failed -> Responded is invalid, must go through Attempt
// * for now, seed peers go straight to handshaking and responded,
//   but we'll fix that once we add the Seed state
// When we add the Seed state:
// * show that seed peers that transition to other never attempted
//   states are already in the address book
pub(crate) struct CandidateSet<S> {
    /// The shared address book, used to look up candidates and record
    /// connection attempts, successes, and failures.
    pub(super) address_book: Arc<std::sync::Mutex<AddressBook>>,

    /// The service used to send `GetAddr` requests to existing peers.
    pub(super) peer_service: S,

    /// The earliest time the next handshake is allowed to start:
    /// enforces `MIN_PEER_CONNECTION_INTERVAL` between connection attempts.
    min_next_handshake: Instant,

    /// The earliest time the next `GetAddr` crawl is allowed to start:
    /// enforces `MIN_PEER_GET_ADDR_INTERVAL` between fanouts.
    min_next_crawl: Instant,
}
impl<S> CandidateSet<S>
where
    S: Service<Request, Response = Response, Error = BoxError>,
    S::Future: Send + 'static,
{
    /// Uses `address_book` and `peer_service` to manage a [`CandidateSet`] of peers.
    pub fn new(
        address_book: Arc<std::sync::Mutex<AddressBook>>,
        peer_service: S,
    ) -> CandidateSet<S> {
        CandidateSet {
            address_book,
            peer_service,
            min_next_handshake: Instant::now(),
            min_next_crawl: Instant::now(),
        }
    }

    /// Update the peer set from the network, using the default fanout limit.
    ///
    /// See [`update_initial`][Self::update_initial] for details.
    pub async fn update(&mut self) -> Result<(), BoxError> {
        self.update_timeout(None).await
    }

    /// Update the peer set from the network, limiting the fanout to
    /// `fanout_limit`.
    ///
    /// - Ask a few live [`Responded`] peers to send us more peers.
    /// - Process all completed peer responses, adding new peers in the
    ///   [`NeverAttemptedGossiped`] state.
    ///
    /// ## Correctness
    ///
    /// Pass the initial peer set size as `fanout_limit` during initialization,
    /// so that Zebra does not send duplicate requests to the same peer.
    ///
    /// The crawler exits when `update` returns an error, so it must only return
    /// errors on permanent failures.
    ///
    /// The handshaker sets up the peer message receiver so it also sends a
    /// [`Responded`] peer address update.
    ///
    /// [`report_failed`][Self::report_failed] puts peers into the [`Failed`] state.
    ///
    /// [`next`][Self::next] puts peers into the [`AttemptPending`] state.
    ///
    /// ## Security
    ///
    /// This call is rate-limited to prevent sending a burst of repeated requests for new peer
    /// addresses. Each call will only update the [`CandidateSet`] if more time
    /// than [`MIN_PEER_GET_ADDR_INTERVAL`][constants::MIN_PEER_GET_ADDR_INTERVAL] has passed since
    /// the last call. Otherwise, the update is skipped.
    ///
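    /// # Example
    ///
    /// How the rate limit behaves across repeated calls (illustrative only,
    /// and not compiled):
    ///
    /// ```ignore
    /// candidates.update().await?; // sends a `GetAddr` fanout
    /// candidates.update().await?; // too soon: skips the fanout, returns Ok(())
    /// // ... after MIN_PEER_GET_ADDR_INTERVAL has passed ...
    /// candidates.update().await?; // sends another `GetAddr` fanout
    /// ```
    ///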
    /// [`Responded`]: crate::PeerAddrState::Responded
    /// [`NeverAttemptedGossiped`]: crate::PeerAddrState::NeverAttemptedGossiped
    /// [`Failed`]: crate::PeerAddrState::Failed
    /// [`AttemptPending`]: crate::PeerAddrState::AttemptPending
    pub async fn update_initial(&mut self, fanout_limit: usize) -> Result<(), BoxError> {
        self.update_timeout(Some(fanout_limit)).await
    }

    /// Update the peer set from the network, limiting the fanout to
    /// `fanout_limit`, and imposing a timeout on the entire fanout.
    ///
    /// See [`update_initial`][Self::update_initial] for details.
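    ///
    /// The timeout stops this task waiting forever when there are no connected
    /// peers; the body is roughly this shape (a sketch of the code below, not a
    /// separate API):
    ///
    /// ```ignore
    /// timeout(constants::PEER_GET_ADDR_TIMEOUT, self.update_fanout(fanout_limit)).await
    /// ```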
    async fn update_timeout(&mut self, fanout_limit: Option<usize>) -> Result<(), BoxError> {
        // SECURITY
        //
        // Rate limit sending `GetAddr` messages to peers.
        if self.min_next_crawl <= Instant::now() {
            // CORRECTNESS
            //
            // Use a timeout to avoid deadlocks when there are no connected
            // peers, and:
            // - we're waiting on a handshake to complete so there are peers, or
            // - another task that handles or adds peers is waiting on this task
            //   to complete.
            if let Ok(fanout_result) = timeout(
                constants::PEER_GET_ADDR_TIMEOUT,
                self.update_fanout(fanout_limit),
            )
            .await
            {
                fanout_result?;
            } else {
                // update must only return an error for permanent failures
                info!("timeout waiting for peer service readiness or peer responses");
            }

            self.min_next_crawl = Instant::now() + constants::MIN_PEER_GET_ADDR_INTERVAL;
        }

        Ok(())
    }

    /// Update the peer set from the network, limiting the fanout to
    /// `fanout_limit`.
    ///
    /// See [`update_initial`][Self::update_initial] for details.
    ///
    /// # Correctness
    ///
    /// This function does not have a timeout.
    /// Use [`update_timeout`][Self::update_timeout] instead.
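    ///
    /// The fanout size is clamped (a sketch of the clamping below, not a
    /// separate API):
    ///
    /// ```ignore
    /// // fanout_limit = None    sends constants::GET_ADDR_FANOUT requests,
    /// // fanout_limit = Some(n) sends min(n, constants::GET_ADDR_FANOUT) requests.
    /// ```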
    async fn update_fanout(&mut self, fanout_limit: Option<usize>) -> Result<(), BoxError> {
        // Opportunistically crawl the network on every update call to ensure
        // we're actively fetching peers. Continue independently of whether we
        // actually receive any peers, but always ask the network for more.
        //
        // Because requests are load-balanced across existing peers, we can make
        // multiple requests concurrently, which will be randomly assigned to
        // existing peers, but we don't make too many because update may be
        // called while the peer set is already loaded.
        let mut responses = FuturesUnordered::new();
        let fanout_limit = fanout_limit
            .map(|fanout_limit| min(fanout_limit, constants::GET_ADDR_FANOUT))
            .unwrap_or(constants::GET_ADDR_FANOUT);
        debug!(?fanout_limit, "sending GetPeers requests");
        // TODO: launch each fanout in its own task (might require tokio 1.6)
        for _ in 0..fanout_limit {
            let peer_service = self.peer_service.ready().await?;
            responses.push(peer_service.call(Request::Peers));
        }
        while let Some(rsp) = responses.next().await {
            match rsp {
                Ok(Response::Peers(addrs)) => {
                    trace!(
                        addr_count = ?addrs.len(),
                        ?addrs,
                        "got response to GetPeers"
                    );
                    let addrs = validate_addrs(addrs, DateTime32::now());
                    self.send_addrs(addrs);
                }
                Err(e) => {
                    // since we do a fanout, and new updates are triggered by
                    // each demand, we can ignore errors in individual responses
                    trace!(?e, "got error in GetPeers request");
                }
                Ok(_) => unreachable!("Peers requests always return Peers responses"),
            }
        }

        Ok(())
    }

    /// Add new `addrs` to the address book.
    fn send_addrs(&self, addrs: impl IntoIterator<Item = MetaAddr>) {
        let addrs = addrs
            .into_iter()
            .map(MetaAddr::new_gossiped_change)
            .map(|maybe_addr| {
                maybe_addr.expect("Received gossiped peers always have services set")
            });

        // # Correctness
        //
        // Briefly hold the address book threaded mutex, to extend
        // the address list.
        //
        // Extend handles duplicate addresses internally.
        self.address_book.lock().unwrap().extend(addrs);
    }

    /// Returns the next candidate for a connection attempt, if any are available.
    ///
    /// Returns peers in reconnection order, based on
    /// [`AddressBook::reconnection_peers`].
    ///
    /// Skips peers that have recently been active, attempted, or failed.
    ///
    /// ## Correctness
    ///
    /// `AttemptPending` peers will become `Responded` if they respond, or
    /// become `Failed` if they time out or provide a bad response.
    ///
    /// Live `Responded` peers will stay live if they keep responding, or
    /// become a reconnection candidate if they stop responding.
    ///
    /// ## Security
    ///
    /// Zebra resists distributed denial of service attacks by making sure that
    /// new peer connections are initiated at least
    /// [`MIN_PEER_CONNECTION_INTERVAL`][constants::MIN_PEER_CONNECTION_INTERVAL] apart.
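    ///
    /// A sketch of the rate limit (illustrative only, and not compiled):
    ///
    /// ```ignore
    /// // Consecutive calls can never yield addresses less than
    /// // MIN_PEER_CONNECTION_INTERVAL apart:
    /// let first = candidates.next().await;  // sleeps until `min_next_handshake`
    /// let second = candidates.next().await; // sleeps at least the full interval
    /// ```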
    pub async fn next(&mut self) -> Option<MetaAddr> {
        // # Correctness
        //
        // In this critical section, we hold the address mutex, blocking the
        // current thread, and all async tasks scheduled on that thread.
        //
        // To avoid deadlocks, the critical section:
        // - must not acquire any other locks
        // - must not await any futures
        //
        // To avoid hangs, any computation in the critical section should
        // be kept to a minimum.
        let reconnect = {
            let mut guard = self.address_book.lock().unwrap();
            // It's okay to return without sleeping here, because we're returning
            // `None`. We only need to sleep before yielding an address.
            let reconnect = guard.reconnection_peers().next()?;

            let reconnect = MetaAddr::new_reconnect(&reconnect.addr);
            guard.update(reconnect)?
        };

        // SECURITY: rate-limit new outbound peer connections
        sleep_until(self.min_next_handshake).await;
        self.min_next_handshake = Instant::now() + constants::MIN_PEER_CONNECTION_INTERVAL;

        Some(reconnect)
    }

    /// Mark `addr` as a failed peer.
    pub fn report_failed(&mut self, addr: &MetaAddr) {
        let addr = MetaAddr::new_errored(&addr.addr, addr.services);
        // # Correctness
        //
        // Briefly hold the address book threaded mutex, to update the state for
        // a single address.
        self.address_book.lock().unwrap().update(addr);
    }
}

/// Check new `addrs` before adding them to the address book.
///
/// `last_seen_limit` is the maximum permitted last seen time, typically
/// [`DateTime32::now`].
///
/// If the data in an address is invalid, this function can:
/// - modify the address data, or
/// - delete the address.
///
/// # Security
///
/// Adjusts untrusted last seen times so they are not in the future. This stops
/// malicious peers keeping all their addresses at the front of the connection
/// queue. Honest peers with future clock skew also get adjusted.
///
/// Rejects all addresses if any calculated times overflow or underflow.
fn validate_addrs(
    addrs: impl IntoIterator<Item = MetaAddr>,
    last_seen_limit: DateTime32,
) -> impl Iterator<Item = MetaAddr> {
    // Note: The address book handles duplicate addresses internally,
    // so we don't need to de-duplicate addresses here.

    // TODO:
    // We should eventually implement these checks in this function:
    // - Zebra should ignore peers that are older than 3 weeks (part of #1865)
    // - Zebra should count back 3 weeks from the newest peer timestamp sent
    //   by the other peer, to compensate for clock skew
    // - Zebra should limit the number of addresses it uses from a single Addrs
    //   response (#1869)

    let mut addrs: Vec<_> = addrs.into_iter().collect();

    limit_last_seen_times(&mut addrs, last_seen_limit);

    addrs.into_iter()
}

/// Ensure all reported `last_seen` times are less than or equal to `last_seen_limit`.
///
/// This will consider all addresses as invalid if trying to offset their
/// `last_seen` times to be before the limit causes an underflow.
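///
/// # Example
///
/// A worked example of the offset (times shown as plain seconds, hypothetical
/// values):
///
/// ```ignore
/// // last_seen_limit = 1000, gossiped last_seen times = [900, 980, 1060].
/// // newest_seen = 1060 is after the limit, so offset = 1060 - 1000 = 60.
/// // All times shift back by 60: [840, 920, 1000].
/// // Relative order is preserved, and no time is after the limit.
/// ```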
fn limit_last_seen_times(addrs: &mut Vec<MetaAddr>, last_seen_limit: DateTime32) {
    let last_seen_times = addrs.iter().map(|meta_addr| {
        meta_addr
            .untrusted_last_seen()
            .expect("unexpected missing last seen: should be provided by deserialization")
    });
    let oldest_seen = last_seen_times.clone().min().unwrap_or(DateTime32::MIN);
    let newest_seen = last_seen_times.max().unwrap_or(DateTime32::MAX);

    // If any time is in the future, adjust all times, to compensate for clock skew on honest peers
    if newest_seen > last_seen_limit {
        let offset = newest_seen
            .checked_duration_since(last_seen_limit)
            .expect("unexpected underflow: just checked newest_seen is greater");

        // Check for underflow
        if oldest_seen.checked_sub(offset).is_some() {
            // No underflow is possible, so apply offset to all addresses
            for addr in addrs {
                let last_seen = addr
                    .untrusted_last_seen()
                    .expect("unexpected missing last seen: should be provided by deserialization");
                let last_seen = last_seen
                    .checked_sub(offset)
                    .expect("unexpected underflow: just checked oldest_seen");

                addr.set_untrusted_last_seen(last_seen);
            }
        } else {
            // An underflow will occur, so reject all gossiped peers
            addrs.clear();
        }
    }
}