We can't rule out the connection state changing between the state checks
and any eventual failures, particularly in the presence of async code.
So we turn this panic into a warning.
zebra-network's Connection expects that `fail_with` is only called once
per connection, but the overload handling code continues to process the
current request after an overload error, potentially leading to further
failures.
Closes #1599
## Motivation
This PR is motivated by the regression identified in https://github.com/ZcashFoundation/zebra/issues/1349. That issue notes that metrics stopped working for most of the crates other than `zebrad`.
## Solution
This PR resolves the regression by deduplicating the `metrics` crate dependency. During a recent change we upgraded the `metrics` version in `zebrad` and a couple of our other crates, but never updated the dependency in `zebra-state`, `zebra-consensus`, or `zebra-network`. As a result, the metrics macros in those crates retrieved the current metrics exporter through the wrong function: we installed the exporter via the `0.13` crate, but then looked it up through the `0.12` crate, which contains a separate, unset instance of the exporter's static variable. The macros therefore found no current exporter and silently gave up.
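To make the failure mode concrete, here is a minimal, self-contained illustration; the module names are made-up stand-ins for the two copies of the `metrics` crate, not real APIs:

```rust
// Two copies of a crate each carry their own global recorder slot, so
// installing a recorder through one copy leaves the other copy's slot empty.
mod metrics_v0_13 {
    use std::sync::atomic::{AtomicBool, Ordering};
    pub static RECORDER_INSTALLED: AtomicBool = AtomicBool::new(false);
    pub fn install_exporter() {
        RECORDER_INSTALLED.store(true, Ordering::SeqCst);
    }
}

mod metrics_v0_12 {
    use std::sync::atomic::{AtomicBool, Ordering};
    pub static RECORDER_INSTALLED: AtomicBool = AtomicBool::new(false);
    pub fn counter(name: &str, value: u64) {
        // Looks up *this* copy's recorder, not the one installed above.
        if !RECORDER_INSTALLED.load(Ordering::SeqCst) {
            return; // no exporter found: silently give up, as the macros did
        }
        println!("{} += {}", name, value);
    }
}

fn main() {
    // zebrad installs the exporter through the 0.13 copy...
    metrics_v0_13::install_exporter();
    // ...but a crate still depending on 0.12 records through the other copy,
    // so nothing ever reaches the installed exporter.
    metrics_v0_12::counter("zebra.example.counter", 1);
}
```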
## Related Issues
closes https://github.com/ZcashFoundation/zebra/issues/1349
## Follow Up Work
I noticed we have quite a few duplicate dependencies in our tree. We might be able to save some compilation time by auditing those and deduplicating them as much as possible.
- https://github.com/ZcashFoundation/zebra/issues/1582
Co-authored-by: teor <teor@riseup.net>
The `peer::Client` translates `Request`s into `ClientRequest`s, which
it sends to a background task. If the send is `Ok(())`, it will assume
that it is safe to unconditionally poll the `Receiver` tied to the
`Sender` used to create the `ClientRequest`.
We enforce this invariant via the type system, by converting
`ClientRequest`s to `InProgressClientRequest`s when they are received by
the background task. These conversions are implemented by
`ClientRequestReceiver`.
Changes:
* Revert `ClientRequest` so it uses a `oneshot::Sender`
* Add `InProgressClientRequest`, which is the same as `ClientRequest`,
but has a `MustUseOneshotSender`
* `impl From<ClientRequest> for InProgressClientRequest`
* Add a new `ClientRequestReceiver` type that wraps a
`mpsc::Receiver<ClientRequest>`
* `impl Stream<InProgressClientRequest> for ClientRequestReceiver`,
converting the successful result of `inner.poll_next_unpin` into an
`InProgressClientRequest`
* Replace `client_rx: mpsc::Receiver<ClientRequest>` in `Connection`
with the new `ClientRequestReceiver` type
* `impl From<mpsc::Receiver<ClientRequest>> for ClientRequestReceiver`
This fix also changes heartbeat behaviour in the following ways:
* if the queue is full, the connection is closed. Previously, the sender
would wait until the queue had emptied
* if the queue flush fails, Zebra panics, because it can't send an error
on the ClientRequest sender, so the invariant is broken
Add a MustUseOneshotSender, which panics if its inner sender is unused.
Callers must call `send()` on the MustUseOneshotSender, or ensure that
the sender is canceled.
Replaces an unreliable panic in `Client::call()` with a reliable panic
when a must-use sender is dropped.
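As a rough sketch, a must-use wrapper along these lines (using `futures`' oneshot channel; the names and details are illustrative, not the exact zebra-network types) turns a dropped-but-unused sender into an immediate panic:

```rust
use futures::channel::oneshot;

/// Panics on drop unless the value was sent or the receiver already hung up.
struct MustUseOneshotSender<T> {
    tx: Option<oneshot::Sender<T>>,
}

impl<T> MustUseOneshotSender<T> {
    fn send(mut self, value: T) -> Result<(), T> {
        self.tx
            .take()
            .expect("the inner sender is only taken once")
            .send(value)
    }
}

impl<T> Drop for MustUseOneshotSender<T> {
    fn drop(&mut self) {
        if let Some(tx) = self.tx.take() {
            // Silently dropping an unused sender would leave the caller
            // waiting forever, so turn the broken invariant into a panic.
            assert!(
                tx.is_canceled(),
                "MustUseOneshotSender dropped without send or cancellation"
            );
        }
    }
}
```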
Previously, tx would be dropped before send if:
- the success case would have used tx to wait for further messages,
- but the response was actually an error.
Instead, send the error on `tx` and call `fail_with()` using the same
error.
To support this change, allow `fail_with()` to take a `PeerError` or a
`SharedPeerError`.
* Rewrite GetData handling to match the zcashd implementation
`zcashd` silently ignores missing blocks, but sends found transactions
followed by a `NotFound` message:
e7b425298f/src/main.cpp (L5497)
This is significantly different to the behaviour expected by the old
Zebra connection state machine, which expected `NotFound` for blocks.
Also change Zebra's GetData responses to peer requests so they ignore
missing blocks (see the sketch below).
* Stop hanging on incomplete transaction or block responses
Instead, if the peer sends an unexpected block, unexpected transaction,
or NotFound message:
1. end the request, and return a partial response containing any items
that were successfully received
2. if none of the expected blocks or transactions were received, return
an error, and close the connection
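As a rough sketch, the zcashd-compatible response assembly described in the first bullet looks like this, with plain byte arrays as hypothetical stand-ins for the real hash and transaction types:

```rust
use std::collections::HashMap;

/// Send each transaction we have, silently skip missing blocks, and finish
/// with a single `notfound` listing the transactions we don't have.
fn respond_to_getdata(
    requested_tx_hashes: &[[u8; 32]],
    mempool: &HashMap<[u8; 32], Vec<u8>>,
) -> (Vec<Vec<u8>>, Vec<[u8; 32]>) {
    let mut found_txs = Vec::new();
    let mut missing = Vec::new();
    for hash in requested_tx_hashes {
        match mempool.get(hash) {
            // Each found transaction becomes its own `tx` message.
            Some(tx_bytes) => found_txs.push(tx_bytes.clone()),
            // Missing transactions end up in one trailing `notfound` message.
            None => missing.push(*hash),
        }
    }
    (found_txs, missing)
}
```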
Previously we set the crate versions to 3.x, so that the major version was
aligned with the NU version. But we want to be able to make API changes
independently of the NU schedule.
We modeled a Bitcoin `headers` message as being a list of block headers.
However, the actual data structure is slightly different: it's a list of (block
header, transaction count) pairs. This caused zcashd to reject our headers
messages.
To fix this, introduce a new `CountedHeader` struct with a `block::Header` and
transaction count `usize`, then thread it through the inbound service and the
state.
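A minimal sketch of the corrected entry, matching the description above (the count is always zero on the wire, since `headers` messages never carry transactions):

```rust
use zebra_chain::block;

/// One entry in a Bitcoin `headers` message: a block header plus a
/// transaction count.
pub struct CountedHeader {
    pub header: block::Header,
    pub transaction_count: usize,
}
```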
I tested this locally by running Zebra with these changes and inspecting a
trace-level log of the span of a peer connection that requested a nontrivial
headers packet from us, and verified that it did not reject our message.
The cancellation implementation changes made to the connection state machine
mean that if a response oneshot is dropped, the connection will cancel the
request. So the heartbeat task does have to wait on the response.
Not all reject messages include a data field. This change partially addresses
a problem that could lead to a depleted peer set:
1. We send a response to a `getheaders` message;
2. The remote peer `reject`s our `headers` message for some reason;
3. We fail to parse their `reject` message and close the connection;
4. Repeating this process, we have no more peers.
This commit fixes (3) but does not address (2).
This makes the span data more compact (e.g., `msg_as_req{msg=block}`) and
restores the Debug impl for Message to show all of the data contained in the
message. The full message is added as a single event at trace level in the
span to preserve the previous full-inspectability.
As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.
To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s GitHub issue URL auto-generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.
Co-authored-by: teor <teor@riseup.net>
* warn: if there are no peers at all
* info: if there are no ready peers
* trace: the number of ready and unready peers for every request
Log at most one warn or info log per minute, to avoid flooding the
terminal with log lines. Suppress warn and info logs for the first
minute, while the peer set is starting up.
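A minimal sketch of this rate limiting as a hypothetical helper (the real peer set keeps this state internally; the one-minute interval matches the description above):

```rust
use std::time::{Duration, Instant};

const LOG_INTERVAL: Duration = Duration::from_secs(60);

/// Emit the trace line on every request, but a warn/info line at most once
/// per interval, and never during the first interval after startup.
fn log_peer_status(last_log: &mut Option<Instant>, ready: usize, unready: usize) {
    tracing::trace!(ready, unready, "peer set status");

    let now = Instant::now();
    if last_log.map_or(false, |last| now.duration_since(last) >= LOG_INTERVAL) {
        if ready + unready == 0 {
            tracing::warn!("no peers in the peer set");
        } else if ready == 0 {
            tracing::info!("no ready peers in the peer set");
        }
        *last_log = Some(now);
    } else if last_log.is_none() {
        // Startup: start the clock, but stay quiet for the first interval.
        *last_log = Some(now);
    }
}
```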
* implement inbound `FindBlocks`
* Handle inbound peer FindHeaders requests
* handle request before having any chain tip
* Split `find_chain_hashes` into smaller functions
Add a `max_len` argument to support `FindHeaders` requests.
Rewrite the hash collection code to use heights, so we can handle the
`stop` hash and "no intersection" cases correctly (see the sketch after this list).
* Split state height functions into "any chain" and "best chain"
* Rename the best chain block method to `best_block`
* Move fmt utilities to zebra_chain::fmt
* Summarise Debug for some Message variants
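A rough sketch of the height-based hash collection mentioned above, with a hypothetical signature and plain byte arrays standing in for block hashes:

```rust
/// Walk the best chain forward from just after the last common block,
/// stopping at the `stop` hash or after `max_len` hashes.
fn collect_chain_hashes(
    best_chain: &[[u8; 32]], // hashes indexed by height
    intersection_height: Option<usize>,
    stop: Option<[u8; 32]>,
    max_len: usize,
) -> Vec<[u8; 32]> {
    // With no intersection, start from the genesis block.
    let start = intersection_height.map_or(0, |height| height + 1);
    let mut hashes = Vec::new();
    for hash in best_chain.iter().skip(start).take(max_len) {
        hashes.push(*hash);
        if Some(*hash) == stop {
            break;
        }
    }
    hashes
}
```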
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
Closes #1183.
The peer set maintains a preselected ready service that it can use to
perform power-of-two-choices (p2c) routing of requests. Ready services
are stored by key (socket address) in an `IndexMap`, and the preselected
service is represented by an `Option<usize>` indexing that map. This
means that whenever the set of ready services changes (e.g., a service
is removed from the peer set, or a service is taken to be used to
process a request), the preselected index is invalidated. The original
P2C-only implementation maintained this invariant but did not document
it.
The change to inventory-based routing introduced a bug by failing to
maintain this invariant and appropriately invalidate the preselected
index. However, this was only noticeable approximately 1/N of the time
on the next request after an inventory-directed request, so the bug
occurred infrequently. Luckily, the use of `.expect` caused the bug to
be an immediate panic, making it possible to identify by inspecting all
uses of the ready service map.
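A minimal sketch of the invariant, with hypothetical names for the map and index:

```rust
use indexmap::IndexMap;
use std::net::SocketAddr;

/// The preselected index is only meaningful for the current contents of the
/// ready service map, so every mutation of the map must also clear it.
fn take_ready_service<S>(
    ready_services: &mut IndexMap<SocketAddr, S>,
    preselected_index: &mut Option<usize>,
    key: &SocketAddr,
) -> Option<S> {
    // `swap_remove` moves the last entry into the removed slot, so any index
    // chosen before this call may now point at a different peer.
    *preselected_index = None;
    ready_services.swap_remove(key)
}
```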
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
Per https://zips.z.cash/zip-0251, nodes compatible with Canopy
activation on mainnet MUST advertise protocol version 170013 or later.
Once Canopy activates on testnet or mainnet, Canopy nodes SHOULD reject
new connections from pre-Canopy nodes, so this also increases the
minimum version.
This change explicitly documents cancellation contracts for our Tower services,
and tries to correct a bug in the implementation of the CheckpointVerifier,
which duplicated information from the state service but did not ensure that it
would be kept in sync.
These messages might be unsolicited, or they might be a response to a
request we already canceled. So don't fail the whole connection, just
drop the message and move on.
We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse. We
need both places, or else if we started processing a request, we
wouldn't process the cancellation until the timeout elapsed.
The first is a check that the oneshot is not already canceled.
For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
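A rough sketch of these two cancellation points, assuming `futures`' oneshot sender and tokio's timer and `select!`; the request and response plumbing here is hypothetical, and the real connection also selects on the peer message stream:

```rust
use futures::channel::oneshot;
use std::time::Duration;

async fn handle_client_request(mut tx: oneshot::Sender<Vec<u8>>) {
    // 1. Before transitioning into AwaitingResponse: if the caller already
    //    dropped its receiver, don't start processing the request at all.
    if tx.is_canceled() {
        return;
    }

    // 2. While in AwaitingResponse: wait on a cancellation, either from the
    //    request timeout or from the tx channel closing.
    let timeout = tokio::time::sleep(Duration::from_secs(20));
    tokio::select! {
        _ = tx.cancellation() => {
            // The caller gave up: drop the request, don't fail the connection.
        }
        _ = timeout => {
            // Fail just this request with a timeout error.
        }
    }
}
```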
This addresses at least three pain points:
- we were affected by bugs that were already fixed in git, but not in
the released crate;
- we can use service combinators to transform requests and responses;
- we can use the hedge middleware.
The version in git is still marked as 0.3.1 but these changes will be
part of tower 0.4: https://github.com/tower-rs/tower/issues/431
The GetPeers requests sent while crawling the network are randomly
load-balanced over available peers. But at the very beginning, they may
be both routed to the same peer, causing network initialization to be
delayed while the second one times out (since zcashd only ever responds
to the first addr message).
Only sending one GetPeers request per candidate set update means we
crawl the network a little more slowly, but avoids hanging on start.
This cleans up the response processing logic a little bit along the way,
but the overall division of responsibility should be better documented
in a future commit.
This lets us distinguish between cases where the message was unsupported
(e.g., BIP11 messages), and cases where the message was uninterpretable
in context (e.g., unsolicited messages).
> Added a test that the handshake's version message matches specified fields, but the test does not compile, because rustc doesn't believe that the Box<dyn std::error::Error + Send + Sync + 'static> is 'static, and therefore isn't a Box<dyn std::error::Error + Send + Sync + 'static>. This manifests as being unable to spawn the connect_isolated task. From digging through Tokio issues I believe that this is an instance of rust-lang/rust#64552 .
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
The peer set provides an automatically managed connection pool, abstracting
away all the details of handling individual peer connections. However, it's
also useful to be able to create completely isolated and
minimally-distinguishable connections to individual peers, in order to be able
to send specific messages over Tor, or to implement some custom network crawler
logic.
* increase the EWMA default and decay
* increase the block download retries
* increase the request and block download timeouts
* increase the sync timeout
This commit makes several related changes to the network code:
- adds a `TransactionsByHash(HashSet<transaction::Hash>)` request and
`Transactions(Vec<Arc<Transaction>>)` response pair that allows
fetching transactions from a remote peer;
- adds a `PushTransaction(Arc<Transaction>)` request that pushes an
unsolicited transaction to a remote peer;
- adds an `AdvertiseTransactions(HashSet<transaction::Hash>)` request
that advertises transactions by hash to a remote peer;
- adds an `AdvertiseBlock(block::Hash)` request that advertises a block
by hash to a remote peer;
Then, it modifies the connection state machine so that outbound
requests to remote peers are handled properly:
- `TransactionsByHash` generates a `getdata` message and collects the
results, like the existing `BlocksByHash` request.
- `PushTransaction` generates a `tx` message, and returns `Nil` immediately.
- `AdvertiseTransactions` and `AdvertiseBlock` generate an `inv`
message, and return `Nil` immediately.
Next, it modifies the connection state machine so that messages
from remote peers generate requests to the inbound service:
- `getdata` messages generate `BlocksByHash` or `TransactionsByHash`
requests, depending on the content of the message;
- `tx` messages generate `PushTransaction` requests;
- `inv` messages generate `AdvertiseBlock` or `AdvertiseTransactions`
requests.
Finally, it refactors the request routing logic for the peer set to
handle advertisement messages, providing three routing methods (sketched after this list):
- `route_p2c`, which uses p2c as normal (default);
- `route_inv`, which uses the inventory registry and falls back to p2c
(used for `BlocksByHash` or `TransactionsByHash`);
- `route_all`, which broadcasts a request to all ready peers (used for
`AdvertiseBlock` and `AdvertiseTransactions`).
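A simplified, hypothetical sketch of that routing split, with trimmed-down request types standing in for the real `Request` enum:

```rust
/// Trimmed-down stand-ins for the real request types.
enum Request {
    BlocksByHash(Vec<[u8; 32]>),
    TransactionsByHash(Vec<[u8; 32]>),
    PushTransaction(Vec<u8>),
    AdvertiseBlock([u8; 32]),
    AdvertiseTransactions(Vec<[u8; 32]>),
}

enum Route {
    P2c,
    Inv,
    All,
}

fn route_for(request: &Request) -> Route {
    match request {
        // Advertisements are broadcast to every ready peer (`route_all`).
        Request::AdvertiseBlock(_) | Request::AdvertiseTransactions(_) => Route::All,
        // Inventory fetches prefer peers that advertised the item, falling
        // back to p2c if none did (`route_inv`).
        Request::BlocksByHash(_) | Request::TransactionsByHash(_) => Route::Inv,
        // Everything else uses power-of-two-choices load balancing (`route_p2c`).
        _ => Route::P2c,
    }
}
```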
* network: fix bug in inventory advertisement handling
The RFC https://zebra.zfnd.org/dev/rfcs/0003-inventory-tracking.html described
the use of a `broadcast` channel in place of an `mpsc` channel to get
ring-buffer behavior, keeping a bound on the size of the channel but dropping
old entries when the channel is full.
However, it didn't explicitly describe how this works (the `broadcast` channel
returns a `RecvError::Lagged(u64)` to inform receivers that they lost
messages), so the lag-handling wasn't implemented and I didn't notice in
review.
Instead, the `?` operator bubbled the lag error all the way up from
`InventoryRegistry::poll_inventory` through `<PeerSet as Service>::poll_ready`
through various Tower wrappers to users of the peer set. The error propagation
is bad enough, because it caused client errors that shouldn't have happened,
but there's a worse interaction.
The `Service` contract distinguishes between request errors (from
`Service::call`, scoped to the request) and service errors (from
`Service::poll_ready`, scoped to the service). The `Service` contract
specifies that once a service returns an error from `poll_ready`, the service
can be assumed to be failed permanently.
I believe (but haven't tested or carefully worked through the details) that
this caused various tower middleware to report the entire peer set service as
permanently failed due to a transient inventory "error" (more of an indicator),
and I suspect that this is the cause of #1003, where all of the sync
component's requests end up failing because the peer set reported that it
failed permanently. I am able to reproduce #1003 locally before this change
and unable to reproduce it locally after this change, though I have not tested
exhaustively.
* network: add metric for dropped inventory advertisements
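A rough sketch of the lag handling and drop accounting, as a hypothetical helper (a `tracing` event stands in for the metric added by this change, `String` is a placeholder item type, and tokio's `broadcast` channel is assumed):

```rust
use tokio::sync::broadcast::{self, error::TryRecvError};

/// Drain pending inventory advertisements without treating lag as an error.
fn drain_advertisements(rx: &mut broadcast::Receiver<String>) -> Vec<String> {
    let mut items = Vec::new();
    loop {
        match rx.try_recv() {
            Ok(item) => items.push(item),
            // Lagging just means old advertisements were overwritten: that is
            // the ring-buffer behavior we want, not a service failure, so
            // record the drop count and keep going.
            Err(TryRecvError::Lagged(dropped)) => {
                tracing::debug!(dropped, "dropped lagged inventory advertisements");
            }
            Err(TryRecvError::Empty) | Err(TryRecvError::Closed) => break,
        }
    }
    items
}
```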
Co-authored-by: teor <teor@riseup.net>
The relay flag in the version message is used in conjunction with BIP37 to
receive bloom-filtered transactions. When it is set to false, transactions are
not relayed until a bloom filter is set. Since we don't implement BIP37 (it's
not useful for shielded transactions), this means we'll never receive
transactions.
This is the first in a sequence of changes that rename the `block::` items
so they no longer include `Block` as a prefix in their names, in accordance
with the Rust API guidelines.
This extracts the SHA256d code from being split across two modules and puts it
in one module, under serialization.
The code is unchanged except for three deleted tests:
* `sha256d_flush` in `sha256d_writer` (not a meaningful test);
* `transactionhash_debug` (constructs an invalid transaction hash, and the
behavior is tested in the next test);
* `decode_state_debug` (we do not need to test the Debug output of
DecodeState);
* add bytes read and written metrics
* Apply suggestions from code review
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* store address as string
* Apply suggestions from code review
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* change addr to label
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* remove newline
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* checkpoint: reject older of duplicate verification requests.
If we get a duplicate block verification request, we should drop the older one
in favor of the newer one, because the older request is likely to have been
canceled. Previously, this code would accept up to four duplicate verification
requests, then fail all subsequent ones.
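A minimal sketch of that replacement behaviour, with hypothetical key and response types:

```rust
use futures::channel::oneshot;
use std::collections::HashMap;

/// Queue a verification request, replacing (and failing) any older request
/// for the same height.
fn queue_block(
    queued: &mut HashMap<u32, oneshot::Sender<Result<(), String>>>,
    height: u32,
    rsp_tx: oneshot::Sender<Result<(), String>>,
) {
    if let Some(old_tx) = queued.insert(height, rsp_tx) {
        // The older request was probably canceled by its caller already;
        // fail it explicitly rather than keeping both around.
        let _ = old_tx.send(Err("duplicate block verification request".to_string()));
    }
}
```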
* sync: add a timeout layer to block requests.
Note that if this timeout is too short, we'll bring down the peer set in a
retry storm.
* sync: restart syncing on error
Restart the syncing process when an error occurs, rather than ignoring it.
Restarting means we discard all tips and start over with a new block locator,
so we can have another chance to "unstuck" ourselves.
* sync: additional debug info
* sync: handle lookahead limit correctly.
Instead of extracting all the completed task results, the previous code pulled
results out until there were fewer tasks than the lookahead limit, then
stopped. This meant that completed tasks could be left until the limit was
exceeded again. Instead, extract all completed results, and use the number of
pending tasks to decide whether to extend the tip or wait for blocks to finish.
* network: add debug instrumentation to retry policy
* sync: instrument the spawned task
* sync: streamline ObtainTips/ExtendTips logic & tracing
This change does three things:
1. It aligns the implementation of ObtainTips and ExtendTips so that they use
the same deduplication method. This means that when debugging we only have one
deduplication algorithm to focus on.
2. It streamlines the tracing output to not include information already
included in spans. Both obtain_tips and extend_tips have their own spans
attached to the events, so it's not necessary to add Scope: prefixes in
messages.
3. It changes the messages to be focused on reporting the actual
events rather than the interpretation of the events (e.g., "got genesis hash in
response" rather than "peer could not extend tip"). The motivation for this
change is that when debugging, the interpretation of events is already known to
be incorrect, in the sense that the mental model of the code (no bug) does not
match its behavior (has bug), so presenting minimally-interpreted events forces
interpretation relative to the actual code.
* sync: hack to work around zcashd behavior
* sync: localize debug statement in extend_tips
* sync: change algorithm to define tips as pairs of hashes.
This is different enough from the existing description that its comments no
longer apply, so I removed them. A further chunk of work is to change the sync
RFC to document this algorithm.
* sync: reduce block timeout
* state: add resource limits for sled
Closes #888
* sync: add a restart timeout constant
* sync: de-pub constants
* network: move gossiped peer selection logic into address book.
* network: return BoxService from init.
* zebrad: add note on why we truncate the gossiped peer list
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Remove unused .rustfmt.toml
Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job). This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR. Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt. There's no loss to us
since we were using the defaults anyways.
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
Closes #536.
This removes:
- the user-agent (we can add a mechanism to specify extra BIP14 components later, if any users ask us for that feature);
- the EWMA parameters (these were put in the config just to avoid making a choice);
- the peer connection timeout (we can change the default value if anyone ever has a problem with it);
- the peer set request buffer size (setting this too low can make the application deadlock);
The new peer interval is left in.
We can use this network upgrade to implement different consensus rules
and chain context handling for genesis blocks.
Part of the chain state design in #682.
We had a brief discussion on discord and it seemed like we had consensus on the
following versioning policy:
* zebrad: match major version to NU version, so we will start by releasing
zebrad 3.0.0;
* zebra-* libraries: start by matching zebrad's version, then increment major
versions of each library as we need to make breaking changes (potentially
faster than the zebrad version, always respecting semver but making no
guarantees about the longevity of major releases).
This commit sets all of the crate versions to 3.0.0-alpha.0 -- the -alpha.0
marks it as a prerelease not subject to perfect adherence to compatibility
guarantees.
When the connection sees the client_rx channel close it knows it will never get
any more requests, and it should terminate. But instead of terminating, it
errored itself, and the method that errors the connection tried to pull all the
outstanding client requests from the channel in order to fail them before
shutting down. This resulted in reading from a closed channel, causing a panic.
Instead we return cleanly rather than failing (since we know there are no
outstanding requests, as the channel is closed).
This fixes a bug introduced when we added heartbeat support. Recall that we
handle the Bitcoin connection state machine on a per-peer basis. Each
connection has a task created from the `Connection` struct, and a `Client:
tower::Service` "frontend" that passes requests to it via a channel. In the
`Connection` event loop, the connection checks whether the request channel has
been closed, indicating no further requests from the `Client`, in which case it
shuts itself down and cleans up resources. This occurs when all of the senders
have been dropped.
However, this behavior broke when we introduced heartbeat support, because we
spawned an additional task to send heartbeat messages along the request
channel. This meant that instead of having a single sender, dropped by the
`Client`, we have two senders, the `Client` and the "shadow client" task that
generates heartbeat messages. This means that when the `Client` is dropped, we
still have a live sender and the connection is not closed. To fix this, the
`Client` now uses a `oneshot` to shut down its corresponding heartbeat task.
This closes all senders.
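A minimal sketch of the shutdown wiring, with hypothetical field names and a stub request type:

```rust
use futures::channel::{mpsc, oneshot};

struct ClientRequest;

struct Client {
    /// Requests from this sender and from the heartbeat task share one channel.
    server_tx: mpsc::Sender<ClientRequest>,
    /// Fired when the Client goes away, stopping the heartbeat task.
    shutdown_tx: Option<oneshot::Sender<()>>,
}

impl Drop for Client {
    fn drop(&mut self) {
        if let Some(shutdown_tx) = self.shutdown_tx.take() {
            // Ignore errors: the heartbeat task may have already exited.
            let _ = shutdown_tx.send(());
        }
        // With the heartbeat task gone, every sender on the request channel
        // is dropped, so the connection sees the channel close and shuts down.
    }
}
```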
Each subsection has to have `serde(default)` to get the behaviour we want
(delete all fields except the ones that have been changed); otherwise, we can
delete only entire sections.
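A minimal sketch of the pattern, with hypothetical section and field names:

```rust
use serde::Deserialize;

/// Without `#[serde(default)]` on the subsection itself, deleting a single
/// field from `[network]` would be a deserialization error; with it, any
/// missing field falls back to its default.
#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct NetworkSection {
    listen_addr: String,
    new_peer_interval: u64,
}

#[derive(Debug, Default, Deserialize)]
#[serde(default)]
struct Config {
    network: NetworkSection,
}
```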
Prior to this change, we required that services that are canceled do not
have a cancel handle in the `cancel_handles` list, based on the
assumption that the handle must have been removed in the process of
canceling this service.
This doesn't hold up, though, because it is currently possible for the same
peer to connect to us multiple times; the second connect removes the cancel
handle of the original connect and inserts its own cancel handle in its
place. In this scenario, when the first service is
polled for readiness it will see that it has been canceled and go to
clean itself up, but when it asserts that it doesn't have a cancel
handle it will see the cancel handle of the second connect event, which
uses the same key as the first connect, and fail its debug assertion.
This change removes that debug assert on the assumption that it is okay
for a peer to connect multiple times consecutively, and that the correct
behavior in that case is to just cancel the first connection and
continue as normal.
Prior to this change, the service returned by `zebra_network::init` would spawn background tasks that could silently fail, causing unexpected errors in the zebra_network service.
This change modifies the `PeerSet` that backs `zebra_network::init` to store all of the `JoinHandle`s for each background task it depends on. The `PeerSet` then checks this set of futures to see if any of them have exited with an error or a panic, and if they have it returns the error as part of `poll_ready`.
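A rough sketch of that check as a standalone helper (hypothetical; the real `PeerSet` performs it inside `poll_ready` and keeps the handles as a set of futures):

```rust
use futures::FutureExt;
use tokio::task::JoinHandle;

type BoxError = Box<dyn std::error::Error + Send + Sync + 'static>;

/// If any background task has already exited, surface its error or panic
/// instead of letting it fail silently.
fn check_background_tasks(
    handles: &mut Vec<JoinHandle<Result<(), BoxError>>>,
) -> Result<(), BoxError> {
    for handle in handles.iter_mut() {
        // `now_or_never` returns Some(..) only if the task has finished.
        if let Some(result) = (&mut *handle).now_or_never() {
            return match result {
                Ok(Ok(())) => Err("background task exited unexpectedly".into()),
                Ok(Err(error)) => Err(error),
                Err(join_error) => Err(join_error.into()),
            };
        }
    }
    Ok(())
}
```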
Co-authored-by: Jane Lusby <jane@zfnd.org>
Prior to this change, the seed subcommand would consistently encounter a panic in one of the background tasks, but would continue running after the panic. This is indicative of two bugs.
First, zebrad was not configured to treat panics as non-recoverable and instead defaulted to the tokio defaults, which are to catch panics in tasks and return them via the join handle if available, or to print them if the join handle has been discarded. This is likely a poor fit for zebrad as an application: we do not need to maximize uptime or minimize the extent of an outage should one of our tasks / services start encountering panics, and ignoring a panic increases our risk of observing invalid state, causing all sorts of wild and bad bugs. To deal with this we've switched the default panic behavior from `unwind` to `abort`. This makes panics fail immediately and take down the entire application, regardless of where they occur, which is consistent with our treatment of misbehaving connections.
The second bug is the panic itself. This was triggered by a duplicate entry in the initial_peers set. To fix this we've switched the storage for the peers from a `Vec` to a `HashSet`, which has similar properties but guarantees uniqueness of its keys.
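A minimal sketch of the second fix, assuming the configured peers arrive as a `Vec<SocketAddr>`:

```rust
use std::{collections::HashSet, net::SocketAddr};

/// Collecting into a HashSet removes duplicate initial peers before they can
/// trigger the panic described above.
fn dedup_initial_peers(config_peers: Vec<SocketAddr>) -> HashSet<SocketAddr> {
    config_peers.into_iter().collect()
}
```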