zebra/zebra-network/src/peer_set/initialize.rs

//! A peer set whose size is dynamically determined by resource constraints.
//!
//! The [`PeerSet`] implementation is adapted from the one in [tower::Balance][tower-balance].
//!
//! [tower-balance]: https://github.com/tower-rs/tower/tree/master/tower/src/balance

use std::{
    collections::{BTreeMap, HashSet},
    convert::Infallible,
    net::SocketAddr,
    pin::Pin,
    sync::Arc,
    time::Duration,
};

use futures::{
    future::{self, FutureExt},
    sink::SinkExt,
    stream::{FuturesUnordered, StreamExt},
    Future, TryFutureExt,
};
use rand::seq::SliceRandom;
use tokio::{
    net::{TcpListener, TcpStream},
    sync::broadcast,
    time::{sleep, Instant},
};
use tokio_stream::wrappers::IntervalStream;
use tower::{
    buffer::Buffer, discover::Change, layer::Layer, util::BoxService, Service, ServiceExt,
};
use tracing_futures::Instrument;
use zebra_chain::{chain_tip::ChainTip, diagnostic::task::WaitForPanics};

use crate::{
    address_book_updater::AddressBookUpdater,
    constants,
    meta_addr::{MetaAddr, MetaAddrChange},
    peer::{
        self, address_is_valid_for_inbound_listeners, HandshakeRequest, MinimumPeerVersion,
        OutboundConnectorRequest, PeerPreference,
    },
    peer_cache_updater::peer_cache_updater,
    peer_set::{set::MorePeers, ActiveConnectionCounter, CandidateSet, ConnectionTracker, PeerSet},
    AddressBook, BoxError, Config, PeerSocketAddr, Request, Response,
};

#[cfg(test)]
mod tests;

mod recent_by_ip;

/// A successful outbound peer connection attempt or inbound connection handshake.
///
/// The [`Handshake`](peer::Handshake) service returns a [`Result`]. Only successful connections
/// should be sent on the channel. Errors should be logged or ignored.
///
/// We don't allow any errors in this type, because:
/// - The connection limits don't include failed connections
/// - tower::Discover interprets an error as stream termination
type DiscoveredPeer = (PeerSocketAddr, peer::Client);

/// Initialize a peer set, using a network `config`, `inbound_service`,
/// and `latest_chain_tip`.
///
/// The peer set abstracts away peer management to provide a
/// [`tower::Service`] representing "the network" that load-balances requests
/// over available peers. The peer set automatically crawls the network to
/// find more peer addresses and opportunistically connects to new peers.
///
/// Each peer connection's message handling is isolated from other
/// connections, unlike in `zcashd`. The peer connection first attempts to
/// interpret inbound messages as part of a response to a previously-issued
/// request. Otherwise, inbound messages are interpreted as requests and sent
/// to the supplied `inbound_service`.
///
/// Wrapping the `inbound_service` in [`tower::load_shed`] middleware will
/// cause the peer set to shrink when the inbound service is unable to keep up
/// with the volume of inbound requests.
///
/// Use [`NoChainTip`][1] to explicitly provide no chain tip receiver.
///
/// In addition to returning a service for outbound requests, this method
/// returns a shared [`AddressBook`] updated with last-seen timestamps for
/// connected peers. The shared address book should be accessed using a
/// [blocking thread](https://docs.rs/tokio/1.15.0/tokio/task/index.html#blocking-and-yielding),
/// to avoid async task deadlocks.
///
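/// # Examples
///
/// A minimal calling sketch (marked `ignore`, not compiled here): the
/// `inbound_service` below is a placeholder that answers every peer request
/// with [`Response::Nil`], and [`NoChainTip`][1] stands in for a real chain
/// tip receiver.
///
/// ```ignore
/// use tower::service_fn;
/// use zebra_chain::chain_tip::NoChainTip;
/// use zebra_network::{init, Config, Response};
///
/// // A trivial inbound service: ignore every request from remote peers.
/// let inbound_service =
///     service_fn(|_req| async { Ok::<Response, zebra_network::BoxError>(Response::Nil) });
///
/// // Returns a buffered peer set service and a shared address book.
/// let (peer_set, address_book) = init(
///     Config::default(),
///     inbound_service,
///     NoChainTip,
///     "/Zebra:example/".to_string(),
/// )
/// .await;
/// ```
///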
/// # Panics
///
/// If `config.peerset_initial_target_size` is zero.
/// (zebra-network expects to be able to connect to at least one peer.)
///
/// [1]: zebra_chain::chain_tip::NoChainTip
pub async fn init<S, C>(
    config: Config,
    inbound_service: S,
    latest_chain_tip: C,
    user_agent: String,
) -> (
    Buffer<BoxService<Request, Response, BoxError>, Request>,
    Arc<std::sync::Mutex<AddressBook>>,
)
where
    S: Service<Request, Response = Response, Error = BoxError> + Clone + Send + Sync + 'static,
    S::Future: Send + 'static,
    C: ChainTip + Clone + Send + Sync + 'static,
{
    let (tcp_listener, listen_addr) = open_listener(&config.clone()).await;

    let (address_book, address_book_updater, address_metrics, address_book_updater_guard) =
        AddressBookUpdater::spawn(&config, listen_addr);

    // Create a broadcast channel for peer inventory advertisements.
    // If it reaches capacity, this channel drops older inventory advertisements.
    //
    // When Zebra is at the chain tip with an up-to-date mempool,
    // we expect to have at most 1 new transaction per connected peer,
    // and 1-2 new blocks across the entire network.
    // (The block syncer and mempool crawler handle bulk fetches of blocks and transactions.)
    let (inv_sender, inv_receiver) = broadcast::channel(config.peerset_total_connection_limit());

    // Construct services that handle inbound handshakes and perform outbound
    // handshakes. These use the same handshake service internally to detect
    // self-connection attempts. Both are decorated with a tower TimeoutLayer to
    // enforce timeouts as specified in the Config.
    let (listen_handshaker, outbound_connector) = {
        use tower::timeout::TimeoutLayer;
        let hs_timeout = TimeoutLayer::new(constants::HANDSHAKE_TIMEOUT);
        use crate::protocol::external::types::PeerServices;
        let hs = peer::Handshake::builder()
            .with_config(config.clone())
            .with_inbound_service(inbound_service)
            .with_inventory_collector(inv_sender)
            .with_address_book_updater(address_book_updater.clone())
            .with_advertised_services(PeerServices::NODE_NETWORK)
            .with_user_agent(user_agent)
            .with_latest_chain_tip(latest_chain_tip.clone())
            .want_transactions(true)
            .finish()
            .expect("configured all required parameters");

        (
            hs_timeout.layer(hs.clone()),
            hs_timeout.layer(peer::Connector::new(hs)),
        )
    };

    // Create an mpsc channel for peer changes,
    // based on the maximum number of inbound and outbound peers.
    //
    // The connection limit does not apply to errors,
    // so they need to be handled before sending to this channel.
    let (peerset_tx, peerset_rx) =
        futures::channel::mpsc::channel::<DiscoveredPeer>(config.peerset_total_connection_limit());

    let discovered_peers = peerset_rx.map(|(address, client)| {
        Result::<_, Infallible>::Ok(Change::Insert(address, client.into()))
    });

    // Create an mpsc channel for peerset demand signaling,
    // based on the maximum number of outbound peers.
    let (mut demand_tx, demand_rx) =
        futures::channel::mpsc::channel::<MorePeers>(config.peerset_outbound_connection_limit());

    // Create a oneshot to send background task JoinHandles to the peer set
    let (handle_tx, handle_rx) = tokio::sync::oneshot::channel();

    // Connect the rx end to a PeerSet, wrapping new peers in load instruments.
    let peer_set = PeerSet::new(
        &config,
        discovered_peers,
        demand_tx.clone(),
        handle_rx,
        inv_receiver,
        address_metrics,
        MinimumPeerVersion::new(latest_chain_tip, &config.network),
        None,
    );
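
    // Wrap the peer set in a buffer, so the boxed service can be cloned and shared between tasks.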
    let peer_set = Buffer::new(BoxService::new(peer_set), constants::PEERSET_BUFFER_SIZE);

    // Connect peerset_tx to the 3 peer sources:
    //
    // 1. Incoming peer connections, via a listener.
    let listen_fut = accept_inbound_connections(
        config.clone(),
        tcp_listener,
        constants::MIN_INBOUND_PEER_CONNECTION_INTERVAL,
        listen_handshaker,
        peerset_tx.clone(),
    );
    let listen_guard = tokio::spawn(listen_fut.in_current_span());

    // 2. Initial peers, specified in the config and cached on disk.
    let initial_peers_fut = add_initial_peers(
        config.clone(),
        outbound_connector.clone(),
        peerset_tx.clone(),
        address_book_updater.clone(),
    );
    let initial_peers_join = tokio::spawn(initial_peers_fut.in_current_span());

    // 3. Outgoing peers we connect to in response to load.
    let mut candidates = CandidateSet::new(address_book.clone(), peer_set.clone());

    // Wait for the initial seed peer count
    let mut active_outbound_connections = initial_peers_join
        .wait_for_panics()
        .await
        .expect("unexpected error connecting to initial peers");
    let active_initial_peer_count = active_outbound_connections.update_count();

    // We need to await candidates.update() here,
    // because zcashd rate-limits `addr`/`addrv2` messages per connection,
    // and if we only have one initial peer,
    // we need to ensure that its `Response::Addr` is used by the crawler.
    //
    // TODO: this might not be needed after we added the Connection peer address cache,
    // try removing it in a future release?
    info!(
        ?active_initial_peer_count,
        "sending initial request for peers"
    );
    let _ = candidates.update_initial(active_initial_peer_count).await;

    // Compute remaining connections to open.
    let demand_count = config
        .peerset_initial_target_size
        .saturating_sub(active_outbound_connections.update_count());
    for _ in 0..demand_count {
        let _ = demand_tx.try_send(MorePeers);
    }

    // Start the peer crawler
    let crawl_fut = crawl_and_dial(
        config.clone(),
        demand_tx,
        demand_rx,
        candidates,
        outbound_connector,
        peerset_tx,
        active_outbound_connections,
        address_book_updater,
    );
    let crawl_guard = tokio::spawn(crawl_fut.in_current_span());

    // Start the peer disk cache updater
    let peer_cache_updater_fut = peer_cache_updater(config, address_book.clone());
    let peer_cache_updater_guard = tokio::spawn(peer_cache_updater_fut.in_current_span());

    handle_tx
        .send(vec![
            listen_guard,
            crawl_guard,
            address_book_updater_guard,
            peer_cache_updater_guard,
        ])
        .unwrap();

    (peer_set, address_book)
}

/// Use the provided `outbound_connector` to connect to the configured DNS seeder and
/// disk cache initial peers, then send the resulting peer connections over `peerset_tx`.
///
/// Also sends every initial peer address to the `address_book_updater`.
#[instrument(skip(config, outbound_connector, peerset_tx, address_book_updater))]
async fn add_initial_peers<S>(
    config: Config,
    outbound_connector: S,
    mut peerset_tx: futures::channel::mpsc::Sender<DiscoveredPeer>,
    address_book_updater: tokio::sync::mpsc::Sender<MetaAddrChange>,
) -> Result<ActiveConnectionCounter, BoxError>
where
    S: Service<
            OutboundConnectorRequest,
            Response = (PeerSocketAddr, peer::Client),
            Error = BoxError,
        > + Clone
        + Send
        + 'static,
    S::Future: Send + 'static,
{
    let initial_peers = limit_initial_peers(&config, address_book_updater).await;

    let mut handshake_success_total: usize = 0;
    let mut handshake_error_total: usize = 0;

    let mut active_outbound_connections = ActiveConnectionCounter::new_counter_with(
        config.peerset_outbound_connection_limit(),
        "Outbound Connections",
    );

    // TODO: update when we add Tor peers or other kinds of addresses.
    let ipv4_peer_count = initial_peers.iter().filter(|ip| ip.is_ipv4()).count();
    let ipv6_peer_count = initial_peers.iter().filter(|ip| ip.is_ipv6()).count();
    info!(
        ?ipv4_peer_count,
        ?ipv6_peer_count,
        "connecting to initial peer set"
    );

    // # Security
    //
    // Resists distributed denial of service attacks by making sure that
    // new peer connections are initiated at least `MIN_OUTBOUND_PEER_CONNECTION_INTERVAL` apart.
    //
    // # Correctness
    //
    // Each `FuturesUnordered` can hold one `Buffer` or `Batch` reservation for
    // an indefinite period. We can use `FuturesUnordered` without filling
    // the underlying network buffers, because we immediately drive this
    // single `FuturesUnordered` to completion, and handshakes have a short timeout.
    let mut handshakes: FuturesUnordered<_> = initial_peers
        .into_iter()
        .enumerate()
        .map(|(i, addr)| {
            let connection_tracker = active_outbound_connections.track_connection();
            let req = OutboundConnectorRequest {
                addr,
                connection_tracker,
            };
            let outbound_connector = outbound_connector.clone();

            // Spawn a new task to make the outbound connection.
            tokio::spawn(
                async move {
                    // Only spawn one outbound connector per
                    // `MIN_OUTBOUND_PEER_CONNECTION_INTERVAL`,
                    // by sleeping for the interval multiplied by the peer's index in the list.
                    sleep(
                        constants::MIN_OUTBOUND_PEER_CONNECTION_INTERVAL.saturating_mul(i as u32),
                    )
                    .await;

                    // As soon as we create the connector future,
                    // the handshake starts running as a spawned task.
                    outbound_connector
                        .oneshot(req)
                        .map_err(move |e| (addr, e))
                        .await
                }
                .in_current_span(),
            )
            .wait_for_panics()
        })
        .collect();

    while let Some(handshake_result) = handshakes.next().await {
        match handshake_result {
            Ok(change) => {
                handshake_success_total += 1;
                debug!(
                    ?handshake_success_total,
                    ?handshake_error_total,
                    ?change,
                    "an initial peer handshake succeeded"
                );

                // The connection limit makes sure this send doesn't block
                peerset_tx.send(change).await?;
            }
            Err((addr, ref e)) => {
                handshake_error_total += 1;

                // this is verbose, but it's better than just hanging with no output when there are errors
                let mut expected_error = false;
                if let Some(io_error) = e.downcast_ref::<tokio::io::Error>() {
                    // Some systems only have IPv4, or only have IPv6,
                    // so these errors are not particularly interesting.
                    if io_error.kind() == tokio::io::ErrorKind::AddrNotAvailable {
                        expected_error = true;
                    }
                }

                if expected_error {
                    debug!(
                        successes = ?handshake_success_total,
                        errors = ?handshake_error_total,
                        ?addr,
                        ?e,
                        "an initial peer connection failed"
                    );
                } else {
                    info!(
                        successes = ?handshake_success_total,
                        errors = ?handshake_error_total,
                        ?addr,
                        %e,
                        "an initial peer connection failed"
                    );
                }
            }
        }

        // Security: Let other tasks run after each connection is processed.
        //
        // Avoids remote peers starving other Zebra tasks using initial connection successes or errors.
        tokio::task::yield_now().await;
    }
let outbound_connections = active_outbound_connections.update_count();
info!(
?handshake_success_total,
?handshake_error_total,
?outbound_connections,
"finished connecting to initial seed and disk cache peers"
);
Ok(active_outbound_connections)
}
/// Limit the number of `initial_peers` address entries to the configured
/// `peerset_initial_target_size`.
///
/// Returns randomly chosen entries from the provided set of addresses,
/// in a random order.
///
/// Also sends every initial peer to the `address_book_updater`.
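///
/// # Example
///
/// An illustrative sketch; the `config` and `address_book_updater` values are
/// assumed to come from the surrounding peer set initialization code:
///
/// ```ignore
/// // Returns at most `config.peerset_initial_target_size` randomly chosen peers.
/// let initial_peers = limit_initial_peers(&config, address_book_updater).await;
/// assert!(initial_peers.len() <= config.peerset_initial_target_size);
/// ```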
async fn limit_initial_peers(
config: &Config,
address_book_updater: tokio::sync::mpsc::Sender<MetaAddrChange>,
) -> HashSet<PeerSocketAddr> {
let all_peers: HashSet<PeerSocketAddr> = config.initial_peers().await;
let mut preferred_peers: BTreeMap<PeerPreference, Vec<PeerSocketAddr>> = BTreeMap::new();
let all_peers_count = all_peers.len();
if all_peers_count > config.peerset_initial_target_size {
info!(
"limiting the initial peers list from {} to {}",
all_peers_count, config.peerset_initial_target_size,
);
}
// Filter out invalid initial peers, and prioritise valid peers for initial connections.
// (This treats initial peers the same way we treat gossiped peers.)
for peer_addr in all_peers {
let preference = PeerPreference::new(peer_addr, config.network.clone());
match preference {
Ok(preference) => preferred_peers
.entry(preference)
.or_default()
.push(peer_addr),
Err(error) => info!(
?peer_addr,
?error,
"invalid initial peer from DNS seeder, configured IP address, or disk cache",
),
}
}
// Send every initial peer to the address book, in preferred order.
// (This treats initial peers the same way we treat gossiped peers.)
//
// # Security
//
// This send can't flood the address book updater because:
// - the number of initial peers is limited
// - this code only runs once at startup
for peer in preferred_peers.values().flatten() {
let peer_addr = MetaAddr::new_initial_peer(*peer);
// `send` only waits when the channel is full.
// The address book updater runs in its own thread, so we will only wait for a short time.
let _ = address_book_updater.send(peer_addr).await;
}
// Split out the `initial_peers` that will be shuffled and returned,
// choosing preferred peers first.
let mut initial_peers: HashSet<PeerSocketAddr> = HashSet::new();
for better_peers in preferred_peers.values() {
let mut better_peers = better_peers.clone();
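// `partial_shuffle` randomly chooses and shuffles up to the remaining quota of
// peers from this preference tier, returning `(chosen, rest)`.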
let (chosen_peers, _unused_peers) = better_peers.partial_shuffle(
&mut rand::thread_rng(),
config.peerset_initial_target_size - initial_peers.len(),
);
initial_peers.extend(chosen_peers.iter());
if initial_peers.len() >= config.peerset_initial_target_size {
break;
}
}
initial_peers
}
/// Open a peer connection listener on `config.listen_addr`,
/// returning the opened [`TcpListener`] and the address it is bound to.
///
/// If the listener is configured to use an automatically chosen port (port `0`),
/// then the returned address will contain the actual port.
///
/// # Panics
///
/// If opening the listener fails.
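///
/// # Example
///
/// An illustrative sketch; the `config` value is an assumption:
///
/// ```ignore
/// let (listener, local_addr) = open_listener(&config).await;
/// // If `config.listen_addr` used port 0, `local_addr` contains the actual port.
/// info!("listening for Zcash peers on {local_addr}");
/// ```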
#[instrument(skip(config), fields(addr = ?config.listen_addr))]
pub(crate) async fn open_listener(config: &Config) -> (TcpListener, SocketAddr) {
// Warn if we're configured using the wrong network port.
if let Err(wrong_addr) =
address_is_valid_for_inbound_listeners(config.listen_addr, config.network.clone())
{
warn!(
"We are configured with address {} on {:?}, but it could cause network issues. \
The default port for {:?} is {}. Error: {wrong_addr:?}",
config.listen_addr,
config.network,
config.network,
config.network.default_port(),
);
}
info!(
"Trying to open Zcash protocol endpoint at {}...",
config.listen_addr
);
let listener_result = TcpListener::bind(config.listen_addr).await;
let listener = match listener_result {
Ok(l) => l,
Err(e) => panic!(
"Opening Zcash network protocol listener {:?} failed: {e:?}. \
Hint: Check if another zebrad or zcashd process is running. \
Try changing the network listen_addr in the Zebra config.",
config.listen_addr,
),
};
let local_addr = listener
.local_addr()
.expect("unexpected missing local addr for open listener");
info!("Opened Zcash protocol endpoint at {}", local_addr);
(listener, local_addr)
}
/// Listens for peer connections on `addr`, then sets up each connection as a
/// Zcash peer.
///
/// Uses `handshaker` to perform a Zcash network protocol handshake, and sends
/// the [`peer::Client`] result over `peerset_tx`.
///
/// Limits the number of active inbound connections based on `config`,
/// and waits `min_inbound_peer_connection_interval` between connections.
#[instrument(skip(config, listener, handshaker, peerset_tx), fields(listener_addr = ?listener.local_addr()))]
async fn accept_inbound_connections<S>(
config: Config,
listener: TcpListener,
min_inbound_peer_connection_interval: Duration,
handshaker: S,
peerset_tx: futures::channel::mpsc::Sender<DiscoveredPeer>,
) -> Result<(), BoxError>
where
S: Service<peer::HandshakeRequest<TcpStream>, Response = peer::Client, Error = BoxError>
+ Clone,
S::Future: Send + 'static,
{
let mut recent_inbound_connections =
recent_by_ip::RecentByIp::new(None, Some(config.max_connections_per_ip));
let mut active_inbound_connections = ActiveConnectionCounter::new_counter_with(
config.peerset_inbound_connection_limit(),
"Inbound Connections",
);
let mut handshakes: FuturesUnordered<Pin<Box<dyn Future<Output = ()> + Send>>> =
FuturesUnordered::new();
// Keeping an unresolved future in the pool means the stream never terminates.
handshakes.push(future::pending().boxed());
loop {
// Check for panics in finished tasks, before accepting new connections
let inbound_result = tokio::select! {
biased;
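// `biased` polls the arms in order, so finished handshake tasks are always
// drained before any new inbound connection is accepted.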
next_handshake_res = handshakes.next() => match next_handshake_res {
// The task has already sent the peer change to the peer set.
Some(()) => continue,
None => unreachable!("handshakes never terminates, because it contains a future that never resolves"),
},
// This future must wait until new connections are available: it can't have a timeout.
inbound_result = listener.accept() => inbound_result,
};
if let Ok((tcp_stream, addr)) = inbound_result {
let addr: PeerSocketAddr = addr.into();
if active_inbound_connections.update_count()
>= config.peerset_inbound_connection_limit()
|| recent_inbound_connections.is_past_limit_or_add(addr.ip())
{
// Too many open inbound connections or pending handshakes already.
// Close the connection.
std::mem::drop(tcp_stream);
// Allow invalid connections to be cleared quickly,
// but still put a limit on our CPU and network usage from failed connections.
tokio::time::sleep(constants::MIN_INBOUND_PEER_FAILED_CONNECTION_INTERVAL).await;
continue;
}
// The peer already opened a connection to us.
// So we want to increment the connection count as soon as possible.
let connection_tracker = active_inbound_connections.track_connection();
debug!(
inbound_connections = ?active_inbound_connections.update_count(),
"handshaking on an open inbound peer connection"
);
let handshake_task = accept_inbound_handshake(
addr,
handshaker.clone(),
tcp_stream,
connection_tracker,
peerset_tx.clone(),
)
.await?
.wait_for_panics();
handshakes.push(handshake_task);
// Rate-limit inbound connection handshakes.
// But sleep longer after a successful connection,
// so we can clear out failed connections at a higher rate.
//
// If there is a flood of connections,
// this stops Zebra overloading the network with handshake data.
//
// Zebra can't control how many queued connections are waiting,
// but most OSes also limit the number of queued inbound connections on a listener port.
tokio::time::sleep(min_inbound_peer_connection_interval).await;
} else {
// Allow invalid connections to be cleared quickly,
// but still put a limit on our CPU and network usage from failed connections.
debug!(?inbound_result, "error accepting inbound connection");
tokio::time::sleep(constants::MIN_INBOUND_PEER_FAILED_CONNECTION_INTERVAL).await;
}
// Security: Let other tasks run after each connection is processed.
//
// Avoids remote peers starving other Zebra tasks using inbound connection successes or
// errors.
//
// Preventing a denial of service is important in this code, so we want to sleep *and* make
// the next connection after other tasks have run. (Sleeps are not guaranteed to do that.)
tokio::task::yield_now().await;
}
}
/// Set up a new inbound connection as a Zcash peer.
///
/// Uses `handshaker` to perform a Zcash network protocol handshake, and sends
/// the [`peer::Client`] result over `peerset_tx`.
//
// TODO: when we support inbound proxies, distinguish between proxied listeners and
// direct listeners in the span generated by this instrument macro
#[instrument(skip(handshaker, tcp_stream, connection_tracker, peerset_tx))]
async fn accept_inbound_handshake<S>(
addr: PeerSocketAddr,
mut handshaker: S,
tcp_stream: TcpStream,
connection_tracker: ConnectionTracker,
peerset_tx: futures::channel::mpsc::Sender<DiscoveredPeer>,
) -> Result<tokio::task::JoinHandle<()>, BoxError>
where
S: Service<peer::HandshakeRequest<TcpStream>, Response = peer::Client, Error = BoxError>
+ Clone,
S::Future: Send + 'static,
{
let connected_addr = peer::ConnectedAddr::new_inbound_direct(addr);
debug!("got incoming connection");
// # Correctness
//
// Holding the drop guard returned by Span::enter across .await points will
// result in incorrect traces if it yields.
//
// This await is okay because the handshaker's `poll_ready` method always returns Ready.
handshaker.ready().await?;
// Construct a handshake future but do not drive it yet...
let handshake = handshaker.call(HandshakeRequest {
data_stream: tcp_stream,
connected_addr,
connection_tracker,
});
// ... instead, spawn a new task to handle this connection
let mut peerset_tx = peerset_tx.clone();
let handshake_task = tokio::spawn(
async move {
let handshake_result = handshake.await;
if let Ok(client) = handshake_result {
// The connection limit makes sure this send doesn't block
let _ = peerset_tx.send((addr, client)).await;
} else {
debug!(?handshake_result, "error handshaking with inbound peer");
}
}
.in_current_span(),
);
Ok(handshake_task)
}
/// An action that the peer crawler can take.
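///
/// Actions are produced by the `select!` loop in `crawl_and_dial`: new demand
/// becomes `DemandDrop` or `DemandHandshakeOrCrawl`, the crawl timer produces
/// `TimerCrawl`, and completed spawned tasks yield the `*Finished` variants.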
enum CrawlerAction {
/// Drop the demand signal because there are too many pending handshakes.
DemandDrop,
/// Initiate a handshake to the next candidate peer in response to demand.
///
/// If there are no available candidates, crawl existing peers.
DemandHandshakeOrCrawl,
/// Crawl existing peers for more peers in response to a timer `tick`.
TimerCrawl { tick: Instant },
/// Clear a finished handshake.
HandshakeFinished,
/// Clear a finished demand crawl (DemandHandshakeOrCrawl with no peers).
DemandCrawlFinished,
/// Clear a finished TimerCrawl.
TimerCrawlFinished,
}
/// Given a channel `demand_rx` that signals a need for new peers, try to find
/// and connect to new peers, and send the resulting `peer::Client`s through the
/// `peerset_tx` channel.
///
/// Crawl for new peers every `config.crawl_new_peer_interval`.
/// Also crawl whenever there is demand, but no new peers in `candidates`.
/// After crawling, try to connect to one new peer using `outbound_connector`.
///
/// If a handshake fails, restore the unused demand signal by sending it to
/// `demand_tx`.
///
/// The crawler terminates when `candidates.update()` or `peerset_tx` returns a
/// permanent internal error. Transient errors and individual peer errors should
/// be handled within the crawler.
///
/// Uses `active_outbound_connections` to limit the number of active outbound connections
/// across both the initial peers and crawler. The limit is based on `config`.
#[allow(clippy::too_many_arguments)]
#[instrument(
skip(
config,
demand_tx,
demand_rx,
candidates,
outbound_connector,
peerset_tx,
active_outbound_connections,
address_book_updater,
),
fields(
new_peer_interval = ?config.crawl_new_peer_interval,
)
)]
async fn crawl_and_dial<C, S>(
config: Config,
demand_tx: futures::channel::mpsc::Sender<MorePeers>,
mut demand_rx: futures::channel::mpsc::Receiver<MorePeers>,
candidates: CandidateSet<S>,
outbound_connector: C,
peerset_tx: futures::channel::mpsc::Sender<DiscoveredPeer>,
mut active_outbound_connections: ActiveConnectionCounter,
address_book_updater: tokio::sync::mpsc::Sender<MetaAddrChange>,
) -> Result<(), BoxError>
where
C: Service<
OutboundConnectorRequest,
Response = (PeerSocketAddr, peer::Client),
Error = BoxError,
> + Clone
+ Send
+ 'static,
C::Future: Send + 'static,
S: Service<Request, Response = Response, Error = BoxError> + Send + Sync + 'static,
S::Future: Send + 'static,
{
use CrawlerAction::*;
info!(
crawl_new_peer_interval = ?config.crawl_new_peer_interval,
outbound_connections = ?active_outbound_connections.update_count(),
"starting the peer address crawler",
);
// # Concurrency
//
// Allow tasks using the candidate set to be spawned, so they can run concurrently.
// Previously, Zebra had deadlocks and long hangs caused by running dependent
// candidate set futures in the same async task.
let candidates = Arc::new(futures::lock::Mutex::new(candidates));
// This contains both crawl and handshake tasks.
let mut handshakes: FuturesUnordered<
Pin<Box<dyn Future<Output = Result<CrawlerAction, BoxError>> + Send>>,
> = FuturesUnordered::new();
// <FuturesUnordered as Stream> returns None when empty.
// Keeping an unresolved future in the pool means the stream never terminates.
handshakes.push(future::pending().boxed());
let mut crawl_timer = tokio::time::interval(config.crawl_new_peer_interval);
// If the crawl is delayed, also delay all future crawls.
// (Shorter intervals just add load, without any benefit.)
crawl_timer.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
let mut crawl_timer = IntervalStream::new(crawl_timer).map(|tick| TimerCrawl { tick });
// # Concurrency
//
// To avoid hangs and starvation, the crawler must spawn a separate task for each crawl
// and handshake, so they can make progress independently (and avoid deadlocking each other).
loop {
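// Report the number of in-flight crawls and handshakes,
// excluding the always-pending placeholder future.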
metrics::gauge!("crawler.in_flight_handshakes").set(
handshakes
.len()
.checked_sub(1)
.expect("the pool always contains an unresolved future") as f64,
);
let crawler_action = tokio::select! {
biased;
// Check for completed handshakes first, because the rest of the app needs them.
// Pending handshakes are limited by the connection limit.
next_handshake_res = handshakes.next() => next_handshake_res.expect(
"handshakes never terminates, because it contains a future that never resolves"
),
// The timer is rate-limited
next_timer = crawl_timer.next() => Ok(next_timer.expect("timers never terminate")),
// Turn any new demand into an action, based on the crawler's current state.
//
// # Concurrency
//
// Demand is potentially unlimited, so it must go last in a biased select!.
next_demand = demand_rx.next() => next_demand.ok_or("demand stream closed, is Zebra shutting down?".into()).map(|MorePeers|{
if active_outbound_connections.update_count() >= config.peerset_outbound_connection_limit() {
// Too many open outbound connections or pending handshakes already
DemandDrop
} else {
DemandHandshakeOrCrawl
}
})
};
match crawler_action {
// Dummy actions
Ok(DemandDrop) => {
// This is set to trace level because when the peerset is
// congested it can generate a lot of demand signal very rapidly.
trace!("too many open connections or in-flight handshakes, dropping demand signal");
}
// Spawned tasks
Ok(DemandHandshakeOrCrawl) => {
let candidates = candidates.clone();
let outbound_connector = outbound_connector.clone();
let peerset_tx = peerset_tx.clone();
let address_book_updater = address_book_updater.clone();
let demand_tx = demand_tx.clone();
// Increment the connection count before we spawn the connection.
let outbound_connection_tracker = active_outbound_connections.track_connection();
let outbound_connections = active_outbound_connections.update_count();
debug!(?outbound_connections, "opening an outbound peer connection");
// Spawn each handshake or crawl into an independent task, so handshakes can make
// progress while crawls are running.
//
// # Concurrency
//
// The peer crawler must be able to make progress even if some handshakes are
// rate-limited. So the async mutex and next peer timeout are awaited inside the
// spawned task.
let handshake_or_crawl_handle = tokio::spawn(
async move {
// Try to get the next available peer for a handshake.
//
// candidates.next() has a short timeout, and briefly holds the address
// book lock, so it shouldn't hang.
//
// Hold the lock for as short a time as possible.
let candidate = { candidates.lock().await.next().await };
if let Some(candidate) = candidate {
// We don't need to spawn here, because nothing else runs concurrently in this task
dial(
candidate,
outbound_connector,
outbound_connection_tracker,
outbound_connections,
peerset_tx,
address_book_updater,
demand_tx,
)
.await?;
Ok(HandshakeFinished)
} else {
// There weren't any peers, so try to get more peers.
debug!("demand for peers but no available candidates");
crawl(candidates, demand_tx, false).await?;
Ok(DemandCrawlFinished)
}
}
.in_current_span(),
)
.wait_for_panics();
handshakes.push(handshake_or_crawl_handle);
}
Ok(TimerCrawl { tick }) => {
let candidates = candidates.clone();
let demand_tx = demand_tx.clone();
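// If Zebra has no active outbound connections, always queue a connection
// attempt, even if the crawl finds no new peers.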
let should_always_dial = active_outbound_connections.update_count() == 0;
let crawl_handle = tokio::spawn(
async move {
debug!(
?tick,
"crawling for more peers in response to the crawl timer"
);
crawl(candidates, demand_tx, should_always_dial).await?;
Ok(TimerCrawlFinished)
}
.in_current_span(),
)
.wait_for_panics();
handshakes.push(crawl_handle);
}
// Completed spawned tasks
Ok(HandshakeFinished) => {
// Already logged in dial()
}
Ok(DemandCrawlFinished) => {
// This is set to trace level because when the peerset is
// congested it can generate a lot of demand signals very rapidly.
trace!("demand-based crawl finished");
}
Ok(TimerCrawlFinished) => {
debug!("timer-based crawl finished");
}
// Fatal errors and shutdowns
Err(error) => {
info!(?error, "crawler task exiting due to an error");
return Err(error);
}
}
// Security: Let other tasks run after each crawler action is processed.
//
// Avoids remote peers starving other Zebra tasks using outbound connection errors.
tokio::task::yield_now().await;
}
}
/// Try to get more peers using `candidates`, then queue a connection attempt using `demand_tx`.
/// If there were no new peers and `should_always_dial` is false, the connection attempt is skipped.
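///
/// # Example
///
/// A minimal sketch of how the crawler invokes this function. The `candidate_set`
/// value and the channel capacity are placeholders, not the real crawler
/// configuration:
///
/// ```ignore
/// use std::sync::Arc;
///
/// let candidates = Arc::new(futures::lock::Mutex::new(candidate_set));
/// let (demand_tx, _demand_rx) = futures::channel::mpsc::channel(100);
///
/// // With `should_always_dial` set to true, a successful crawl queues a
/// // `MorePeers` demand signal even if no new peers were found.
/// crawl(candidates.clone(), demand_tx.clone(), true).await?;
/// ```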
#[instrument(skip(candidates, demand_tx))]
async fn crawl<S>(
candidates: Arc<futures::lock::Mutex<CandidateSet<S>>>,
mut demand_tx: futures::channel::mpsc::Sender<MorePeers>,
should_always_dial: bool,
) -> Result<(), BoxError>
where
S: Service<Request, Response = Response, Error = BoxError> + Send + Sync + 'static,
S::Future: Send + 'static,
{
// update() has timeouts, and briefly holds the address book
// lock, so it shouldn't hang.
// Try to get new peers, holding the lock for as short a time as possible.
let result = {
let result = candidates.lock().await.update().await;
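// The mutex guard is a temporary that has already been released at the end of
// the statement above; dropping the `Arc` makes it explicit that `candidates`
// is not used again in this function.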
std::mem::drop(candidates);
result
};
let more_peers = match result {
Ok(more_peers) => more_peers.or_else(|| should_always_dial.then_some(MorePeers)),
Err(e) => {
info!(
?e,
"candidate set returned an error, is Zebra shutting down?"
);
return Err(e);
}
};
// If we got more peers, try to connect to a new peer on our next loop.
//
// # Security
//
// Update attempts are rate-limited by the candidate set,
// and we only try peers if there was actually an update.
//
// So if all peers have had a recent attempt, and there was a recent update
// with no peers, the channel will drain. This prevents useless update attempt
// loops.
if let Some(more_peers) = more_peers {
if let Err(send_error) = demand_tx.try_send(more_peers) {
if send_error.is_disconnected() {
// Zebra is shutting down
return Err(send_error.into());
}
}
}
Ok(())
}

/// Try to connect to `candidate` using `outbound_connector`.
/// Uses `outbound_connection_tracker` to track the active connection count.
///
/// On success, sends peers to `peerset_tx`.
/// On failure, marks the peer as failed in the address book,
/// then re-adds demand to `demand_tx`.
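///
/// # Example
///
/// A minimal sketch of how the crawler drives a single dial attempt. The
/// `candidate`, connector, and channel values are placeholders taken from the
/// surrounding crawler state, not a standalone setup:
///
/// ```ignore
/// // Track the new connection before starting the handshake.
/// let tracker = active_outbound_connections.track_connection();
/// let connection_count = active_outbound_connections.update_count();
///
/// // On success, the peer is sent to `peerset_tx`; on failure, the peer is
/// // marked as failed and a `MorePeers` demand signal is re-queued.
/// dial(
///     candidate,
///     outbound_connector.clone(),
///     tracker,
///     connection_count,
///     peerset_tx.clone(),
///     address_book_updater.clone(),
///     demand_tx.clone(),
/// )
/// .await?;
/// ```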
#[instrument(skip(
outbound_connector,
outbound_connection_tracker,
outbound_connections,
peerset_tx,
address_book_updater,
demand_tx
))]
async fn dial<C>(
candidate: MetaAddr,
mut outbound_connector: C,
outbound_connection_tracker: ConnectionTracker,
outbound_connections: usize,
mut peerset_tx: futures::channel::mpsc::Sender<DiscoveredPeer>,
address_book_updater: tokio::sync::mpsc::Sender<MetaAddrChange>,
mut demand_tx: futures::channel::mpsc::Sender<MorePeers>,
) -> Result<(), BoxError>
where
C: Service<
OutboundConnectorRequest,
Response = (PeerSocketAddr, peer::Client),
Error = BoxError,
> + Clone
+ Send
+ 'static,
C::Future: Send + 'static,
{
// If Zebra only has a few connections, we log connection failures at info level,
// so users can diagnose and fix the problem. This defines the threshold for info logs.
const MAX_CONNECTIONS_FOR_INFO_LOG: usize = 5;
// # Correctness
//
// To avoid hangs, the dialer must only await:
// - functions that return immediately, or
// - functions that have a reasonable timeout
debug!(?candidate.addr, "attempting outbound connection in response to demand");
// the connector is always ready, so this can't hang
let outbound_connector = outbound_connector.ready().await?;
let req = OutboundConnectorRequest {
addr: candidate.addr,
connection_tracker: outbound_connection_tracker,
};
// the handshake has timeouts, so it shouldn't hang
let handshake_result = outbound_connector.call(req).map(Into::into).await;
match handshake_result {
Ok((address, client)) => {
debug!(?candidate.addr, "successfully dialed new peer");
// The connection limit makes sure this send doesn't block.
peerset_tx.send((address, client)).await?;
}
// The connection was never opened, or it failed the handshake and was dropped.
Err(error) => {
// Silence verbose info logs in production, but keep logs if the number of connections is low.
// Also silence them completely in tests.
if outbound_connections <= MAX_CONNECTIONS_FOR_INFO_LOG && !cfg!(test) {
info!(?error, ?candidate.addr, "failed to make outbound connection to peer");
} else {
debug!(?error, ?candidate.addr, "failed to make outbound connection to peer");
}
report_failed(address_book_updater.clone(), candidate).await;
// The demand signal that was taken out of the queue to attempt to connect to the
// failed candidate never turned into a connection, so add it back.
//
// # Security
//
// Handshake failures are rate-limited by peer attempt timeouts.
if let Err(send_error) = demand_tx.try_send(MorePeers) {
if send_error.is_disconnected() {
// Zebra is shutting down
return Err(send_error.into());
}
}
}
}
Ok(())
}

/// Report `addr` as a failed peer to the `address_book_updater`.
#[instrument(skip(address_book_updater))]
async fn report_failed(
address_book_updater: tokio::sync::mpsc::Sender<MetaAddrChange>,
addr: MetaAddr,
) {
// The connection info is the same as what's already in the address book.
let addr = MetaAddr::new_errored(addr.addr, None);
// Ignore send errors on Zebra shutdown.
let _ = address_book_updater.send(addr).await;
}