If any of the times gossiped by a peer are in the future, apply the
necessary offset to all the times gossiped by that peer. This ensures
that all gossiped peers from a malicious peer are moved further back in
the queue.
Co-authored-by: teor <teor@riseup.net>
- Make the security impact clearer and in a separate section.
- Instead of listing an assumption as almost a side-note, describe it
clearly inside a `Panics` section.
Co-authored-by: teor <teor@riseup.net>
Due to clock skew, the peers could end up at the front of the
reconnection queue or far at the back. The solution is to offset
the reported times by the difference between the most recent reported
sighting (in the remote clock) and the current time (in the local clock).
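A minimal sketch of that offset, assuming hypothetical names and a `chrono` clock; the real code works on `MetaAddr` entries rather than bare timestamps:

```rust
// If the newest time reported by a peer is in the future, shift every time from
// that peer back by the same amount, so a skewed or malicious peer cannot jump
// ahead in the reconnection queue.
use chrono::{DateTime, Utc};

fn offset_gossiped_times(reported: &mut [DateTime<Utc>], local_now: DateTime<Utc>) {
    // The newest timestamp the peer reported, in the remote clock.
    if let Some(most_recent) = reported.iter().copied().max() {
        if most_recent > local_now {
            // Difference between the remote clock and our local clock.
            let offset = most_recent - local_now;
            for time in reported.iter_mut() {
                // Move every reported time back by the same offset.
                *time = *time - offset;
            }
        }
    }
}
```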
Returning `impl IntoIterator` means that the caller will always be
forced to call `.into_iter()`, and returning `impl Iterator` still
allows them to call `.into_iter()` because it becomes the identity
function.
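A small illustration of that return-type choice, with a hypothetical function name:

```rust
use std::net::SocketAddr;

// Hypothetical function, just to illustrate the return type.
fn recently_live_peers() -> impl Iterator<Item = SocketAddr> {
    let peers: Vec<SocketAddr> = vec!["127.0.0.1:8233".parse().unwrap()];
    peers.into_iter()
}

fn main() {
    // `impl Iterator` can be used directly, and calling `.into_iter()` on it
    // still compiles, because for iterators it is the identity function.
    for addr in recently_live_peers().into_iter() {
        println!("{}", addr);
    }
}
```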
Zebra assumes that deserialized times are always able to be serialized.
But this assumption is wrong because:
- sanitization can modify times
- gossiped `MetaAddr` validation can modify times
* Refactor: Split CandidateSet::update into separate functions
* Security: Apply a timeout to the entire CandidateSet::update
* Security: Stop using very large fanout limits during initialization
Previously, Zebra based the initial fanout on the number of resolved
peer addresses. So it was possible for all peers to fail, and for Zebra
to hang on the first update.
And Zebra could send a fanout request for each initial peer, regardless
of whether its connection was successful.
Also:
- wait for at least one successful peer before trying an update
- warn if there are no successful initial peers
When peers ask for peer addresses, add our local listener address to the
set of addresses, sanitize, then truncate. Sanitize shuffles addresses,
so if there are lots of addresses in the address book, our address will
only be sent to some peers.
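A rough sketch of that response flow, with an assumed message limit and placeholder types; the real code sanitizes `MetaAddr` entries from the address book:

```rust
use std::net::SocketAddr;

const MAX_ADDRS_IN_RESPONSE: usize = 1000; // assumed message limit

fn peers_response(
    mut addrs: Vec<SocketAddr>,
    local_listener: SocketAddr,
    sanitize: impl Fn(&mut Vec<SocketAddr>),
) -> Vec<SocketAddr> {
    // Add our own listener address to the candidate set.
    addrs.push(local_listener);
    // Sanitization also shuffles, so the truncation below keeps a random subset,
    // and our address is only sent to some peers.
    sanitize(&mut addrs);
    addrs.truncate(MAX_ADDRS_IN_RESPONSE);
    addrs
}
```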
Add canonical addresses from inbound connections to the address book,
so that Zebra can use them for reconnection attempts.
Use the newly added `NeverAttemptedAlternate` state for these addresses,
so we try gossiped addresses first, then canonical addresses. This avoids
duplicate connections to inbound peers.
If there is a small number of initial peers, and they are slow, the
initial candidate set update can appear to hang. To avoid this issue,
limit the initial candidate set fanout to the number of initial peers.
Once the initial peers have sent us more peer addresses, there is no need
to limit the fanouts for future updates.
Reported by Niklas Long of Equilibrium.
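A minimal sketch of the cap, assuming a hypothetical default fanout constant:

```rust
const GET_ADDR_FANOUT: usize = 2; // assumed default fanout

// The first candidate set update can only usefully fan out to peers we are
// actually connected to, so cap it at the initial peer count.
fn initial_fanout(initial_peer_count: usize) -> usize {
    GET_ADDR_FANOUT.min(initial_peer_count)
}
```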
* Security: panic if an internally generated time is out of range
If Zebra has a bug where it generates blocks, transactions, or meta
addresses with bad times, panic. This avoids sending bad data onto the
network.
(Previously, Zebra would truncate some of these times, silently
corrupting the underlying data.)
Make it clear that deserialization of these objects is infallible.
* Instrument the crawl task
When we created the crawl task, we forgot to instrument it with the
global span. This fix makes sure that the git and network spans appear on
crawl logs.
* Instrument the connector
* Improve handshake instrumentation
Make some spans debug, so there are not too many spans.
* Add the address to initial peer connection errors
- stop putting inbound addresses in the address book
- drop address book entries that can't be used for outbound connections
- distinguish between temporary inbound and permanent outbound peer
addresses
- also create variants to handle proxy connections
(but don't use them yet)
- avoid tracking connection state for isolated connections
- document security constraints for the address book and peer set
* Security: stop panicking on out-of-range version timestamps
Instead, return a deserialization error, and close the connection.
This issue was reported by Equilibrium.
* Disable clippy warnings about comparing a newly created struct
In Sapling, we compare canonical JubJub bytes with a supplied byte array.
Since we need to perform calculations to get it into canonical form, we
need to create a newly owned object.
* Clippy: use assert rather than assert_eq on a bool
* Allow using a listen address in the config without a port
* update comments
* remove unused alias
* use Network::default_port
* Move tests and use toml instead of json
* change error message
* Make match more readable
Co-authored-by: teor <teor@riseup.net>
* Add functions for serializing and deserializing split arrays
In Transaction::V5, Zcash splits some types into multiple arrays, with a
single prefix count before the first array.
Add utility functions for serializing and deserializing the subsequent
arrays, with a parameter for the original array's length.
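A hedged sketch of the external-count pattern (not the exact zebra-chain signatures of `zcash_deserialize_external_count` and friends): the count is read once before the first array, then passed explicitly to each later array's deserializer instead of a per-array length prefix.

```rust
use std::io;

fn deserialize_with_external_count<R, T, F>(
    external_count: usize,
    mut reader: R,
    mut read_one: F,
) -> io::Result<Vec<T>>
where
    R: io::Read,
    F: FnMut(&mut R) -> io::Result<T>,
{
    // The count comes from an earlier, already-deserialized array, so it is not
    // a fresh untrusted length; the real code also bounds preallocation
    // (see TrustedPreallocate).
    let mut items = Vec::with_capacity(external_count);
    for _ in 0..external_count {
        items.push(read_one(&mut reader)?);
    }
    Ok(items)
}
```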
* Use zcash_deserialize_bytes_external_count in zebra-network
* Move some preallocate proptests to their own file
And fix the test module structure so it is consistent with the rest of
zebra-chain.
* Add a convenience alias zcash_serialize_external_count
* Explain why u64::MAX items will never be reached
Zebra avoids having a majority of addresses from a single peer by asking
3 peers for new addresses.
Also update a bunch of security comments and related documentation.
* Make proptest dependencies consistent between chain and network
* Implement Arbitrary for InventoryHash and use it in tests
* Impl Arbitrary for MetaAddr and use it in tests
Also test some extreme times in MetaAddr sanitization.
* Stop ignoring inbound message errors and handshake timeouts
To avoid hangs, Zebra needs to maintain the following invariants in the
handshake and heartbeat code:
- each handshake should run in a separate spawned task
(not yet implemented)
- every message, error, timeout, and shutdown must update the peer address state
- every await that depends on the network must have a timeout
Once the Connection is created, it should handle timeouts.
But we need to handle timeouts during handshake setup.
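A minimal sketch of that timeout rule using `tokio::time::timeout`; the constant value and function names are placeholders, not Zebra's real identifiers:

```rust
use std::time::Duration;
use tokio::time::timeout;

const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4); // assumed value

async fn handshake_with_timeout<F, T, E>(handshake: F) -> Result<T, String>
where
    F: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    match timeout(HANDSHAKE_TIMEOUT, handshake).await {
        // In the real code, both the timeout and the error must also update the
        // peer address state; here they are just surfaced as errors.
        Err(_elapsed) => Err("handshake timed out".to_string()),
        Ok(Err(e)) => Err(format!("handshake failed: {}", e)),
        Ok(Ok(peer)) => Ok(peer),
    }
}
```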
* Avoid hangs by adding a timeout to the candidate set update
Also increase the fanout from 1 to 2, to increase address diversity.
But only return permanent errors from `CandidateSet::update`, because
the crawler task exits if `update` returns an error.
Also log Peers response errors in the CandidateSet.
* Use the select macro in the crawler to reduce hangs
The `select` function is biased towards its first argument, risking
starvation.
As a side-benefit, this change also makes the code a lot easier to read
and maintain.
* Split CrawlerAction::Demand into separate actions
This refactor makes the code a bit easier to read, at the cost of
sometimes blocking the crawler on `candidates.next()`.
That's ok, because `next` only has a short (< 100 ms) delay. And we're
just about to spawn a separate task for each handshake.
* Spawn a separate task for each handshake
This change avoids deadlocks by letting each handshake make progress
independently.
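A rough sketch of the per-handshake spawn, with placeholder functions standing in for the real handshake service:

```rust
use std::net::SocketAddr;

async fn dial(candidate: SocketAddr) -> Result<(), String> {
    // Each handshake runs in its own tokio task, so one slow or stalled peer
    // cannot block the crawler or the other handshakes.
    let handle = tokio::spawn(async move { connect_and_handshake(candidate).await });

    // A panic in the handshake surfaces as a JoinError instead of taking the
    // crawler down with it.
    handle.await.map_err(|e| e.to_string())?
}

async fn connect_and_handshake(_addr: SocketAddr) -> Result<(), String> {
    // Placeholder for the real handshake.
    Ok(())
}
```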
* Move the dial task into a separate function
This refactor improves readability.
* Fix buggy future::select function usage
And document the correctness of the new code.
* Move the preallocate tests into their own files
And move the MetaAddr proptest into its own file.
Also do some minor formatting and cleanups.
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
* Implement SafePreallocate. Resolves #1880
* Add proptests for SafePreallocate
* Apply suggestions from code review
Comments which did not include replacement code will be addressed in a follow-up commit.
Co-authored-by: teor <teor@riseup.net>
* Rename [Safe-> Trusted]Allocate. Add doc and tests
Add tests to show that the largest allowed vec under TrustedPreallocate
is small enough to fit in a Zcash block/message (depending on type).
Add doc comments to all TrustedPreallocate test cases.
Tighten bounds on max_trusted_alloc for some types.
Note - this commit does NOT include TrustedPreallocate
impls for JoinSplitData, String, and Script.
These impls will be added in a follow up commit
* Impl TrustedPreallocate for Joinsplit
* Impl ZcashDeserialize for Vec<u8>
* Arbitrary, TrustedPreallocate, Serialize, and tests for Spend<SharedAnchor>
Co-authored-by: teor <teor@riseup.net>
`sanitize` could be misused in two ways:
* accidentally modifying the addresses in the address book itself
* forgetting to sanitize new fields added to `MetaAddr`
This change prevents accidental modification by taking `&self`, and
explicitly creates a new sanitized `MetaAddr` with all fields listed.
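A hedged sketch of that `sanitize` shape, with illustrative field names:

```rust
use std::net::SocketAddr;
use chrono::{DateTime, Utc};

#[derive(Clone, Copy)]
struct MetaAddrSketch {
    addr: SocketAddr,
    services: u64,
    last_seen: DateTime<Utc>,
}

impl MetaAddrSketch {
    // Taking `&self` prevents accidental in-place modification of the address
    // book entry, and listing every field makes a newly added, unsanitized
    // field a compile error.
    fn sanitize(&self) -> MetaAddrSketch {
        MetaAddrSketch {
            addr: self.addr,
            services: self.services,
            // The real code also coarsens times here, so peers can't track
            // precise activity; this sketch just copies the field.
            last_seen: self.last_seen,
        }
    }
}
```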
Zebra's latest alpha checkpoints on Canopy activation, continues our work on NU5, and fixes a security issue.
Some notable changes include:
## Added
- Log address book metrics when PeerSet or CandidateSet don't have many peers (#1906)
- Document test coverage workflow (#1919)
- Add a final job to CI, so we can easily require all the CI jobs to pass (#1927)
## Changed
- Zebra has moved its mandatory checkpoint from Sapling to Canopy (#1898, #1926)
- This is a breaking change for users that depend on the exact height of the mandatory checkpoint.
## Fixed
- tower-batch: wake waiting workers on close to avoid hangs (#1908)
- Assert that pre-Canopy blocks use checkpointing (#1909)
- Fix CI disk space usage by disabling incremental compilation in coverage builds (#1923)
## Security
- Stop relying on unchecked length fields when preallocating vectors (#1925)
Zebra already uses `Read::take` to enforce message, body, and block
maximum sizes.
So using `Read::take` on untrusted sizes can result in short reads,
without a corresponding `UnexpectedEof` error. (The old code was
correct, but copying it elsewhere would have been risky.)
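A small, self-contained illustration of the short-read hazard:

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    let data = [1u8, 2, 3]; // only 3 bytes are actually available
    let untrusted_len: u64 = 100; // attacker-controlled count

    // `take` just ends the stream early, so trusting the length gives us fewer
    // bytes than expected with no `UnexpectedEof` error.
    let mut buf = Vec::new();
    data.as_slice().take(untrusted_len).read_to_end(&mut buf)?;

    // No error: we silently read 3 bytes instead of 100.
    assert_eq!(buf.len(), 3);
    Ok(())
}
```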
* Add NU5 variant to NetworkUpgrade
* Add consensus branch ID for NU5
* Add network protocol versions for NU5
* Add NU5 to the protocol::version_consistent test
* Make unimplemented panic messages more specific
* Block target spacing doesn't change in NU5
* add comments for future updates for NU5
Co-authored-by: teor <teor@riseup.net>
* Retry each peer DNS a few times individually
We retry each peer individually, as well as retrying if there are no
peers in the combined list.
DNS failures are correlated, so all peers can fail DNS, leaving Zebra
with a small list of custom-configured IP address peers.
Individual retries avoid this issue.
* Rename parse_peers to resolve_peers
Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
- Add an `ExitClient` transition, used when the internal client channel
is closed or dropped, and there are no more pending requests
- Ignore pending requests after an `ExitClient` transition
- Reject pending requests when the peer has caused an error
(the `Exit` and `ExitRequest` transitions)
- Remove `PeerError::ConnectionDropped`, because it is now handled by
`ExitClient`. (Which is an internal error, not a peer error.)
Log a "Trying..." message before each listener opens, to see if the
delay is inside Zebra, or in the test harness or OS.
Also report the configured and actual ports where possible, for better
diagnostics.
Design:
- Add a `PeerAddrState` to each `MetaAddr`
- Use a single peer set for all peers, regardless of state
- Implement time-based liveness as an `AddressBook` method, rather than
a `PeerAddrState` variant (see the sketch after this commit message)
- Delete `AddressBook.by_state`
Implementation:
- Simplify `AddressBook` changes using `update` and `take` modifier
methods
- Simplify the `AddressBook` iterator implementation, replacing it with
methods that are more obviously correct
- Consistently collect peer set metrics
Documentation:
- Expand and update the peer set documentation
We can optimise later, but for now we want simple code that is more
obviously correct.
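A hedged sketch of the liveness part of this design; the variant names and window are illustrative, not the exact zebra-network values:

```rust
use chrono::{DateTime, Duration, Utc};

const LIVE_PEER_WINDOW_SECS: i64 = 3 * 60 * 60; // assumed liveness window

enum PeerAddrState {
    NeverAttemptedGossiped,
    NeverAttemptedAlternate,
    Responded,
    Failed,
}

struct AddressBookEntry {
    state: PeerAddrState,
    last_seen: DateTime<Utc>,
}

impl AddressBookEntry {
    // Liveness is computed from the last-seen time on demand, so it never goes
    // stale the way a stored "live" state variant would.
    fn is_recently_live(&self, now: DateTime<Utc>) -> bool {
        now - self.last_seen <= Duration::seconds(LIVE_PEER_WINDOW_SECS)
    }
}
```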
* replace to_socket_addrs
* refactor `resolve()` into `resolve_host()`
* use `resolve_host()` to resolve config peers
* add DNS_LOOKUP_TIMEOUT constant
* don't block the main thread in initialize
* add hint for port error
* add issue filter for port panic
* add lock file hint
* add metrics endpoint port conflict hint
* add hint for tracing endpoint port conflict
* add acceptance test for resource conflicts
* Split out common conflict test code into a function
* Add state, metrics, and tracing conflict tests
* Add a full set of stderr acceptance test functions
This change makes the stdout and stderr acceptance test interfaces
identical.
* move Zcash listener opening
* add todo about hint for disk full
* add constant for lock file
* match path in state cache
* don't match windows cache path
* Use Display for state path logs
Avoids weird escaping on Windows when using Debug
* Add Windows conflict error messages
* Turn PORT_IN_USE_ERROR into a regex
And add another alternative Windows-specific port error
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jane@zfnd.org>
* Bump versions where appropriate
Tested with cargo install --locked --path etc
* Remove fixed panics from 'Known Issues'
* Change to alpha release series in the README
Co-authored-by: teor <teor@riseup.net>
The clippy unknown lints attribute was deprecated in
nightly in rust-lang/rust#80524. The old lint name now produces a
warning.
Since we're using `allow(unknown_lints)` to suppress warnings, we need to
add the canonical name, so we can continue to build without warnings on
nightly.
But we also need to keep the old name, so we can continue to build
without warnings on stable.
And therefore, we also need to disable the "removed lints" warning,
otherwise we'll get warnings about the old name on nightly.
We'll need to keep this transitional clippy config until rustc 1.51 is
stable.
We can't rule out the connection state changing between the state checks
and any eventual failures, particularly in the presence of async code.
So we turn this panic into a warning.
zebra-network's Connection expects that `fail_with` is only called once
per connection, but the overload handling code continues to process the
current request after an overload error, potentially leading to further
failures.
Closes #1599
## Motivation
This PR is motivated by the regression identified in https://github.com/ZcashFoundation/zebra/issues/1349. That PR notes that the metrics stopped working for most of the crates other than `zebrad`.
## Solution
This PR resolves the regression by deduplicating the `metrics` crate dependency. During a recent change we upgraded the metrics version in `zebrad` and a couple of our other crates, but we never updated the dependencies in `zebra-state`, `zebra-consensus`, or `zebra-network`. This caused the metrics macros to attempt to retrieve the current metrics exporter through the wrong function. We would install the metrics exporter in `0.13`, but then look it up through the `0.12` crate, which contains a different, unset instance of the metrics exporter static variable. This causes the metrics macros to return `None` for the current exporter, after which they just silently give up.
## Related Issues
closes https://github.com/ZcashFoundation/zebra/issues/1349
## Follow Up Work
I noticed we have quite a few duplicate dependencies in our tree. We might be able to save some compilation time by auditing those and deduplicating them as much as possible.
- https://github.com/ZcashFoundation/zebra/issues/1582
Co-authored-by: teor <teor@riseup.net>
The `peer::Client` translates `Request`s into `ClientRequest`s, which
it sends to a background task. If the send is `Ok(())`, it will assume
that it is safe to unconditionally poll the `Receiver` tied to the
`Sender` used to create the `ClientRequest`.
We enforce this invariant via the type system, by converting
`ClientRequest`s to `InProgressClientRequest`s when they are received by
the background task. These conversions are implemented by
`ClientRequestReceiver`.
Changes:
* Revert `ClientRequest` so it uses a `oneshot::Sender`
* Add `InProgressClientRequest`, which is the same as `ClientRequest`,
but has a `MustUseOneshotSender`
* `impl From<ClientRequest> for InProgressClientRequest`
* Add a new `ClientRequestReceiver` type that wraps a
`mpsc::Receiver<ClientRequest>`
* `impl Stream<InProgressClientRequest> for ClientRequestReceiver`,
converting the successful result of `inner.poll_next_unpin` into an
`InProgressClientRequest`
* Replace `client_rx: mpsc::Receiver<ClientRequest>` in `Connection`
with the new `ClientRequestReceiver` type
* `impl From<mpsc::Receiver<ClientRequest>> for ClientRequestReceiver`
This fix also changes heartbeat behaviour in the following ways:
* if the queue is full, the connection is closed. Previously, the sender
would wait until the queue had emptied
* if the queue flush fails, Zebra panics, because it can't send an error
on the ClientRequest sender, so the invariant is broken
Add a MustUseOneshotSender, which panics if its inner sender is unused.
Callers must call `send()` on the MustUseOneshotSender, or ensure that
the sender is canceled.
Replaces an unreliable panic in `Client::call()` with a reliable panic
when a must-use sender is dropped.
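A hedged sketch of such a must-use sender, using a `tokio` oneshot as a stand-in for the real channel type:

```rust
use tokio::sync::oneshot;

struct MustUseOneshotSender<T> {
    inner: Option<oneshot::Sender<T>>,
}

impl<T> MustUseOneshotSender<T> {
    fn new(inner: oneshot::Sender<T>) -> Self {
        Self { inner: Some(inner) }
    }

    fn send(mut self, value: T) -> Result<(), T> {
        // Taking the sender marks this request as completed, so Drop stays quiet.
        self.inner
            .take()
            .expect("inner sender is only taken once")
            .send(value)
    }
}

impl<T> Drop for MustUseOneshotSender<T> {
    fn drop(&mut self) {
        let unused = self
            .inner
            .as_ref()
            .map(|sender| !sender.is_closed())
            .unwrap_or(false);
        // A reliable panic: an unused sender means the waiting `Client` would
        // otherwise hang forever on a response that never arrives.
        assert!(!unused, "MustUseOneshotSender dropped without sending");
    }
}
```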
Previously, tx would be dropped before send if:
- the success case would have used tx to wait for further messages,
- but the response was actually an error.
Instead, send the error on `tx` and call `fail_with()` using the same
error.
To support this change, allow `fail_with()` to take a `PeerError` or a
`SharedPeerError`.
* Rewrite GetData handling to match the zcashd implementation
`zcashd` silently ignores missing blocks, but sends found transactions
followed by a `NotFound` message:
e7b425298f/src/main.cpp (L5497)
This is significantly different to the behaviour expected by the old
Zebra connection state machine, which expected `NotFound` for blocks.
Also change Zebra's GetData responses to peer requests so they ignore
missing blocks.
* Stop hanging on incomplete transaction or block responses
Instead, if the peer sends an unexpected block, unexpected transaction,
or NotFound message:
1. end the request, and return a partial response containing any items
that were successfully received
2. if none of the expected blocks or transactions were received, return
an error, and close the connection
Previously we set the crate versions to 3.x, so that the major version was
aligned with the NU version. But we want to be able to make API changes
independently of the NU schedule.
We modeled a Bitcoin `headers` message as being a list of block headers.
However, the actual data structure is slightly different: it's a list of (block
header, transaction count) pairs. This caused zcashd to reject our headers
messages.
To fix this, introduce a new `CountedHeader` struct with a `block::Header` and
transaction count `usize`, then thread it through the inbound service and the
state.
I tested this locally by running Zebra with these changes and inspecting a
trace-level log of the span of a peer connection that requested a nontrivial
headers packet from us, and verified that it did not reject our message.
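A sketch of the corrected wire shape, with a placeholder header type standing in for `block::Header`:

```rust
struct HeaderSketch {
    // The full Zcash block header fields in the real `block::Header`.
    raw: Vec<u8>,
}

// Each entry in a `headers` message is a (header, transaction count) pair;
// the count is serialized as a compactSize, but it still has to be present
// for zcashd to accept the message.
struct CountedHeader {
    header: HeaderSketch,
    transaction_count: usize,
}

// The message body is then a length-prefixed list of these pairs,
// not a list of bare headers.
type HeadersMessageSketch = Vec<CountedHeader>;
```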
The cancellation implementation changes made to the connection state machine
mean that if a response oneshot is dropped, the connection will avoid
cancelling the request. So the heartbeat task does have to wait on the response.
Not all reject messages include a data field. This change partially addresses
a problem that could lead to a depleted peer set:
1. We send a response to a `getheaders` message;
2. The remote peer `reject`s our `headers` message for some reason;
3. We fail to parse their `reject` message and close the connection;
4. Repeating this process, we have no more peers.
This commit fixes (3) but does not address (2).
This makes the span data more compact (e.g., `msg_as_req{msg=block}`) and
restores the Debug impl for Message to show all of the data contained in the
message. The full message is added as a single event at trace level in the
span to preserve the previous full-inspectability.
As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.
To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s github issue url auto generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.
Co-authored-by: teor <teor@riseup.net>
* warn: if there are no peers at all
* info: if there are no ready peers
* trace: the number of ready and unready peers for every request
Log at most one warn or info log per minute, to avoid flooding the
terminal with log lines. Suppress warn and info logs for the first
minute, while the peer set is starting up.
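A minimal sketch of that rate limit, with assumed names and interval:

```rust
use std::time::{Duration, Instant};

const MIN_LOG_INTERVAL: Duration = Duration::from_secs(60); // assumed interval

struct PeerSetLogger {
    last_logged: Instant,
}

impl PeerSetLogger {
    fn new() -> Self {
        // Starting the timer now also suppresses warn/info for the first minute.
        Self { last_logged: Instant::now() }
    }

    fn maybe_log(&mut self, ready: usize, unready: usize) {
        // Trace-level logging is not rate limited.
        tracing::trace!("peer set: {} ready, {} unready", ready, unready);

        if self.last_logged.elapsed() < MIN_LOG_INTERVAL {
            return;
        }
        self.last_logged = Instant::now();

        if ready + unready == 0 {
            tracing::warn!("no peers in the peer set");
        } else if ready == 0 {
            tracing::info!("no ready peers in the peer set ({} unready)", unready);
        }
    }
}
```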
* implement inbound `FindBlocks`
* Handle inbound peer FindHeaders requests
* handle request before having any chain tip
* Split `find_chain_hashes` into smaller functions
Add a `max_len` argument to support `FindHeaders` requests.
Rewrite the hash collection code to use heights, so we can handle the
`stop` hash and "no intersection" cases correctly.
* Split state height functions into "any chain" and "best chain"
* Rename the best chain block method to `best_block`
* Move fmt utilities to zebra_chain::fmt
* Summarise Debug for some Message variants
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
Closes #1183.
The peer set maintains a preselected ready service that it can use to
perform power-of-two-choices (p2c) routing of requests. Ready services
are stored by key (socket address) in an `IndexMap`, and the preselected
service is represented by an `Option<usize>` indexing that map. This
means that whenever the set of ready services changes (e.g., a service
is removed from the peer set, or a service is taken to be used to
process a request), the preselected index is invalidated. The original
P2C-only implementation maintained this invariant but did not document
it.
The change to inventory-based routing introduced a bug by failing to
maintain this invariant and appropriately invalidate the preselected
index. However, this was only noticeable approximately 1/N of the time
on the next request after an inventory-directed request, so the bug
occurred infrequently. Luckily, the use of `.expect` caused the bug to
be an immediate panic, making it possible to identify by inspecting all
uses of the ready service map.
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
Per https://zips.z.cash/zip-0251, nodes compatible with Canopy
activation on mainnet MUST advertise protocol version 170013 or later.
Once Canopy activates on testnet or mainnet, Canopy nodes SHOULD reject
new connections from pre-Canopy nodes, so this also increases the
minimum version.
This change explicitly documents cancellation contracts for our Tower services,
and tries to correct a bug in the implementation of the CheckpointVerifier,
which duplicates information from the state service but did not ensure that it
would be kept in sync.
These messages might be unsolicited, or they might be a response to a
request we already canceled. So don't fail the whole connection, just
drop the message and move on.
We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse. We
need both places, or else if we started processing a request, we
wouldn't process the cancellation until the timeout elapsed.
The first is a check that the oneshot is not already canceled.
For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
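A rough sketch of those two cancellation points, with placeholder types and a `tokio` oneshot standing in for the real response channel:

```rust
use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::sleep;

enum Outcome {
    Canceled,
    TimedOut,
    Response(Vec<u8>),
}

async fn await_response(
    mut tx: oneshot::Sender<Result<Vec<u8>, String>>,
    peer_response: impl std::future::Future<Output = Vec<u8>>,
) {
    // First cancellation point: the request may already be canceled before we
    // transition into AwaitingResponse.
    if tx.is_closed() {
        return;
    }

    // Second cancellation point: while awaiting, race the peer's response
    // against a timeout and the client dropping its receiver.
    let outcome = tokio::select! {
        _ = tx.closed() => Outcome::Canceled,
        _ = sleep(Duration::from_secs(20)) => Outcome::TimedOut,
        bytes = peer_response => Outcome::Response(bytes),
    };

    match outcome {
        Outcome::Canceled => { /* drop the in-flight request */ }
        Outcome::TimedOut => {
            let _ = tx.send(Err("request timed out".to_string()));
        }
        Outcome::Response(bytes) => {
            let _ = tx.send(Ok(bytes));
        }
    }
}
```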
This addresses at least three pain points:
- we were affected by bugs that were already fixed in git, but not in
the released crate;
- we can use service combinators to transform requests and responses;
- we can use the hedge middleware.
The version in git is still marked as 0.3.1 but these changes will be
part of tower 0.4: https://github.com/tower-rs/tower/issues/431
The GetPeers requests sent while crawling the network are randomly
load-balanced over available peers. But at the very beginning, they may
both be routed to the same peer, causing network initialization to be
delayed while the second one times out (since zcashd only ever responds
to the first addr message).
Only sending one GetPeers request per candidate set update means we
crawl the network a little more slowly, but avoids hanging on start.
This cleans up the response processing logic a little bit along the way,
but the overall division of responsibility should be better documented
in a future commit.
This lets us distinguish between cases where the message was unsupported
(e.g., BIP11 messages), and cases where the message was uninterpretable
in context (e.g., unsolicited messages).