* warn: if there are no peers at all
* info: if there are no ready peers
* trace: the number of ready and unready peers for every request
Log at most one warn or info log per minute, to avoid flooding the
terminal with log lines. Suppress warn and info logs for the first
minute, while the peer set is starting up.
Closes#1183.
The peer set maintains a preselected ready service that it can use to
perform power-of-two-choices (p2c) routing of requests. Ready services
are stored by key (socket address) in an `IndexMap`, and the preselected
service is represented by an `Option<usize>` indexing that map. This
means that whenever the set of ready services changes (e.g., a service
is removed from the peer set, or a service is taken to be used to
process a request), the preselected index is invalidated. The original
P2C-only implementation maintained this invariant but did not document
it.
The change to inventory-based routing introduced a bug by failing to
maintain this invariant and appropriately invalidate the preselected
index. However, this was only noticeable approximately 1/N of the time
on the next request after an inventory-directed request, so the bug
occurred infrequently. Luckily, the use of `.expect` caused the bug to
be an immediate panic, making it possible to identify by inspecting all
uses of the ready service map.
This addresses at least three pain points:
- we were affected by bugs that were already fixed in git, but not in
the released crate;
- we can use service combinators to transform requests and responses;
- we can use the hedge middleware.
The version in git is still marked as 0.3.1 but these changes will be
part of tower 0.4: https://github.com/tower-rs/tower/issues/431
This commit makes several related changes to the network code:
- adds a `TransactionsByHash(HashSet<transaction::Hash>)` request and
`Transactions(Vec<Arc<Transaction>>)` response pair that allows
fetching transactions from a remote peer;
- adds a `PushTransaction(Arc<Transaction>)` request that pushes an
unsolicited transaction to a remote peer;
- adds an `AdvertiseTransactions(HashSet<transaction::Hash>)` request
that advertises transactions by hash to a remote peer;
- adds an `AdvertiseBlock(block::Hash)` request that advertises a block
by hash to a remote peer;
Then, it modifies the connection state machine so that outbound
requests to remote peers are handled properly:
- `TransactionsByHash` generates a `getdata` message and collects the
results, like the existing `BlocksByHash` request.
- `PushTransaction` generates a `tx` message, and returns `Nil` immediately.
- `AdvertiseTransactions` and `AdvertiseBlock` generate an `inv`
message, and return `Nil` immediately.
Next, it modifies the connection state machine so that messages
from remote peers generate requests to the inbound service:
- `getdata` messages generate `BlocksByHash` or `TransactionsByHash`
requests, depending on the content of the message;
- `tx` messages generate `PushTransaction` requests;
- `inv` messages generate `AdvertiseBlock` or `AdvertiseTransactions`
requests.
Finally, it refactors the request routing logic for the peer set to
handle advertisement messages, providing three routing methods:
- `route_p2c`, which uses p2c as normal (default);
- `route_inv`, which uses the inventory registry and falls back to p2c
(used for `BlocksByHash` or `TransactionsByHash`);
- `route_all`, which broadcasts a request to all ready peers (used for
`AdvertiseBlock` and `AdvertiseTransactions`).
Prior to this change, we required that services that are canceled do not
have a cancel handle in the `cancel_handles` list, based on the
assumption that the handle must have been removed in the process of
canceling this service.
This doesn't holding up though, because it is currently possible for us
to have the same peer connect to us multiple times, the second connect
removes the cancel handle of the original connect and inserts it's own
cancel handle in its place. In this scenario, when the first service is
polled for readiness it will see that it has been canceled and go to
clean itself up, but when it asserts that it doesn't have a cancel
handle it will see the cancel handle of the second connect event, which
uses the same key as the first connect, and fail its debug assertion.
This change removes that debug assert on the assumption that it is okay
for a peer to connect multiple times consecutively, and that the correct
behavior in that case is to just cancel the first connection and
continue as normal.
Prior to this change, the service returned by `zebra_network::init` would spawn background tasks that could silently fail, causing unexpected errors in the zebra_network service.
This change modifies the `PeerSet` that backs `zebra_network::init` to store all of the `JoinHandle`s for each background task it depends on. The `PeerSet` then checks this set of futures to see if any of them have exited with an error or a panic, and if they have it returns the error as part of `poll_ready`.
Co-authored-by: Jane Lusby <jane@zfnd.org>
Prior to this change, the seed subcommand would consistently encounter a panic in one of the background tasks, but would continue running after the panic. This is indicative of two bugs.
First, zebrad was not configured to treat panics as non recoverable and instead defaulted to the tokio defaults, which are to catch panics in tasks and return them via the join handle if available, or to print them if the join handle has been discarded. This is likely a poor fit for zebrad as an application, we do not need to maximize uptime or minimize the extent of an outage should one of our tasks / services start encountering panics. Ignoring a panic increases our risk of observing invalid state, causing all sorts of wild and bad bugs. To deal with this we've switched the default panic behavior from `unwind` to `abort`. This makes panics fail immediately and take down the entire application, regardless of where they occur, which is consistent with our treatment of misbehaving connections.
The second bug is the panic itself. This was triggered by a duplicate entry in the initial_peers set. To fix this we've switched the storage for the peers from a `Vec` to a `HashSet`, which has similar properties but guarantees uniqueness of its keys.
- Add a total peers metric to prevent races between measurements of
ready/unready peers (which can cause the sum to be wrong).
- Add an outbound request counter.
* Don't expose submodules of zebra_network::peer.
* PeerSet, PeerDiscover stubs.
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
* Initial work on PeerSet.
This is adapted from the MIT-licensed tower-balance implementation.
* Use PeerSet in the connect stub.