* increase the EWMA default and decay
* increase the block download retries
* increase the request and block download timeouts
* increase the sync timeout
* add metrics and tracking endpoint tests
* test endpoints more
* add change filter test for tracing
* add await to post
* separate metrics and tracing tests
* Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
* Ignore sync errors when the block is already verified
If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.
Also:
Process any ready tasks before reset, so the logs and metrics are
up to date. (But ignore the errors, because we're about to reset.)
Improve sync logging and metrics during the download and verify task.
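A minimal sketch of that check, reusing the `state_contains` idea mentioned later in this log; the types and signature are illustrative stand-ins, not the real syncer code:
```rust
type BlockHash = [u8; 32]; // stand-in for the real hash type

async fn handle_download_error(
    already_verified: impl std::future::Future<Output = bool>, // e.g. a state_contains query
    hash: BlockHash,
    error: String,
) -> Result<(), String> {
    if already_verified.await {
        // Probably a duplicate download: the block is already in our state,
        // so there is no need to restart the sync.
        tracing::debug!(?hash, "ignoring error for already-verified block");
        Ok(())
    } else {
        // A real failure: bubble it up so the sync restarts with a new locator.
        Err(error)
    }
}
```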
* Remove duplicate hashes in logs
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Log the sync hash span at warn level
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Remove in-memory state service
* make the config compatible with toml again
* checkpoint commit to see how much I still have to revert
* back to the starting point...
* remove unused dependency
* reorganize error handling a bit
* need to make a new color-eyre release now
* reorder again because I have problems
* remove unnecessary helpers
* revert changes to config loading
* add back missing space
* Switch to released color-eyre version
* add back missing newline again...
* improve error message on unix when terminated by signal
* add context to last few asserts in acceptance tests
* instrument some of the helpers
* remove accidental extra space
* try to make this compile on windows
* reorg platform specific code
* hide on_disk module and fix broken link
* add CheckpointList::new_up_to(limit: NetworkUpgrade)
* if checkpoint_sync is false, limit checkpoints to Sapling
* update tests for CheckpointList and chain::init
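A rough sketch of what `new_up_to` might look like, with the `NetworkUpgrade` argument reduced to its activation height to keep the sketch self-contained; types are illustrative stand-ins:
```rust
use std::collections::BTreeMap;

type Height = u32;    // stand-in for block::Height
type Hash = [u8; 32]; // stand-in for block::Hash

struct CheckpointList(BTreeMap<Height, Hash>);

impl CheckpointList {
    /// Keep only the checkpoints at or below the limiting upgrade's activation height.
    fn new_up_to(full: BTreeMap<Height, Hash>, activation_height: Height) -> Self {
        CheckpointList(
            full.range(..=activation_height)
                .map(|(height, hash)| (*height, *hash))
                .collect(),
        )
    }
}
```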
This is the first in a sequence of changes that rename the block:: items
so they no longer include Block as a prefix in their names, in accordance
with the Rust API guidelines.
* add valid generated config test
* change to pathbuf
* use -c to make sure we are using the generated file
* add and use a ZebraTestDir type
* change approach to generate tempdir in top of each test
* pass tempdir to test_cmd and set current dir to it
* add and use a `generated_config_path` variable in tests
* checkpoint: reject older of duplicate verification requests.
If we get a duplicate block verification request, we should drop the older one
in favor of the newer one, because the older request is likely to have been
canceled. Previously, this code would accept up to four duplicate verification
requests, then fail all subsequent ones.
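A sketch of the drop-older policy with illustrative types; the real verifier's queue and error types differ:
```rust
use std::collections::BTreeMap;
use tokio::sync::oneshot;

type Height = u32;

struct QueuedBlock {
    hash: [u8; 32],
    tx: oneshot::Sender<Result<(), String>>, // dropped => caller sees cancellation
}

fn queue(queued: &mut BTreeMap<Height, Vec<QueuedBlock>>, height: Height, new: QueuedBlock) {
    let list = queued.entry(height).or_default();
    // Drop any older request for the same hash: its sender is dropped, so the
    // older caller observes a cancellation instead of occupying a slot.
    list.retain(|old| old.hash != new.hash);
    list.push(new);
}
```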
* sync: add a timeout layer to block requests.
Note that if this timeout is too short, we'll bring down the peer set in a
retry storm.
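A sketch of the layer using tower's `Timeout` middleware; the constant's value is illustrative, not the actual tuning:
```rust
use std::time::Duration;
use tower::timeout::Timeout;

// Illustrative value only; too short and we risk the retry storm noted above.
const BLOCK_DOWNLOAD_TIMEOUT: Duration = Duration::from_secs(30);

fn with_block_timeout<S>(block_network: S) -> Timeout<S> {
    Timeout::new(block_network, BLOCK_DOWNLOAD_TIMEOUT)
}
```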
* sync: restart syncing on error
Restart the syncing process when an error occurs, rather than ignoring it.
Restarting means we discard all tips and start over with a new block locator,
so we can have another chance to "unstuck" ourselves.
* sync: additional debug info
* sync: handle lookahead limit correctly.
Instead of extracting all the completed task results, the previous code pulled
results out until there were fewer tasks than the lookahead limit, then
stopped. This meant that completed tasks could be left until the limit was
exceeded again. Instead, extract all completed results, and use the number of
pending tasks to decide whether to extend the tip or wait for blocks to finish.
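A sketch of the corrected drain logic; `LOOKAHEAD_LIMIT`, the task types, and the helper are illustrative:
```rust
use futures::{future::FutureExt, stream::{FuturesUnordered, StreamExt}};
use tokio::task::JoinHandle;

const LOOKAHEAD_LIMIT: usize = 2_000; // illustrative value

/// Returns true when there is room to extend the tips.
fn drain_completed(pending: &mut FuturesUnordered<JoinHandle<Result<(), String>>>) -> bool {
    // Extract *every* already-completed task result, without blocking on the
    // tasks that are still running.
    while let Some(Some(join_result)) = pending.next().now_or_never() {
        // handle join_result: log, record metrics, restart the sync on error
        let _ = join_result;
    }
    // Use the number of still-pending tasks, not the extraction count, to
    // decide between extending the tips and waiting for blocks to finish.
    pending.len() < LOOKAHEAD_LIMIT
}
```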
* network: add debug instrumentation to retry policy
* sync: instrument the spawned task
* sync: streamline ObtainTips/ExtendTips logic & tracing
This change does three things:
1. It aligns the implementation of ObtainTips and ExtendTips so that they use
the same deduplication method. This means that when debugging we only have one
deduplication algorithm to focus on.
2. It streamlines the tracing output to not include information already
included in spans. Both obtain_tips and extend_tips have their own spans
attached to the events, so it's not necessary to add Scope: prefixes in
messages.
3. It changes the messages to be focused on reporting the actual
events rather than the interpretation of the events (e.g., "got genesis hash in
response" rather than "peer could not extend tip"). The motivation for this
change is that when debugging, the interpretation of events is already known to
be incorrect, in the sense that the mental model of the code (no bug) does not
match its behavior (has bug), so presenting minimally-interpreted events forces
interpretation relative to the actual code.
* sync: hack to work around zcashd behavior
* sync: localize debug statement in extend_tips
* sync: change algorithm to define tips as pairs of hashes.
This is different enough from the existing description that its comments no
longer apply, so I removed them. A further chunk of work is to change the sync
RFC to document this algorithm.
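A sketch of the pairs-of-hashes idea with illustrative names:
```rust
type BlockHash = [u8; 32]; // stand-in for the real hash type

/// A tip is tracked as a *pair* of hashes: the last hash we scheduled for
/// download, plus the hash a valid extension response is expected to start
/// from, so responses can be checked to actually extend the tip.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CheckedTip {
    tip: BlockHash,
    expected_next: BlockHash,
}
```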
* sync: reduce block timeout
* state: add resource limits for sled
Closes #888
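A sketch of capping sled's resource use via its `Config` builder; the path and cache size are illustrative, not the values chosen here:
```rust
fn open_state_db() -> sled::Db {
    sled::Config::default()
        .path("/tmp/zebra-state")          // illustrative path
        .cache_capacity(512 * 1024 * 1024) // bytes of page cache
        .open()
        .expect("failed to open the state database")
}
```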
* sync: add a restart timeout constant
* sync: de-pub constants
* network: move gossiped peer selection logic into address book.
* network: return BoxService from init.
* zebrad: add note on why we truncate the gossiped peer list
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Remove unused .rustfmt.toml
Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job). This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR. Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt. There's no loss to us
since we were using the defaults anyway.
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* fix: Handle known ObtainTips correctly
enumerate never returns a value beyond the end of the vector.
* fix: Ignore known tips in ExtendTips
Some peers send us known tips when we try to extend.
* fix: Ignore known hashes when downloading
Despite all our other checks, we still end up downloading some hashes
multiple times.
* fix: Increase the number of retries
The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.
Instead, just retry the fetches that fail.
* fix: Tweak comments
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* fix: Cleanup the state_contains interface in Sync
* Fix brackets
Oops
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* make sure no info is printed in non-server tests
* check exact full output for validity instead of log msgs
* add end of output character to version regex
* use coercions, use equality operator
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
We have a counter for pending "download and verify" futures. But these
futures are spawned, so they can complete in any order. They can also
complete before we receive their results.
* Split tracing component code into modules.
* Repatriate Tracing and simplify config handling.
We upstreamed our Tracing component, expecting not to have to exert fine
control over the tracing settings. But this turned out not to be the case, and
now that we want to do other things (flamegraphs, journalctl, opentelemetry,
etc), we end up with really awkward code (as in the current flamegraph
handling).
This also makes use of the changes to `init()` to load the config early to pass
configuration data into the components, which avoids the need for the
refactoring in #775.
Finally, we restore support for the `-v` flag when the filter is unset. Closes #831.
* Disable tracing and metrics endpoints by default.
Closes #660.
* Switch back to upstream Abscissa.
* Integrate flamegraph support into the new Tracing component.
* Pass -v in acceptance tests to get info-level output.
* Clean up acceptance test code.
* Set up tracing-flame for profiling zebrad
* start work on conditional flamegraph generation
* review time!
* update comments
* Update Cargo.toml
* disable default features for inferno
* reorganize
* missing one trait
* Apply suggestions from code review
* graceful shutdown!
* remove special case handling on ctrlc for cleanup
* rename signal fn to better represent its responsibility
* remove unused global hook for flushing flamegraph
* move tracing logic to the right file
* just copy linkerd's signal handling logic
* update book
* make zebrad app drop on shutdown normally
* Update zebrad/src/components/tokio.rs
Co-authored-by: teor <teor@riseup.net>
* Update zebrad/src/application.rs
Co-authored-by: teor <teor@riseup.net>
* Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
* cleanup a little
* ooh yea there's an API for that
* setup env-filter for backup subscriber
* document env filter
* document return codes
* forgot to save
* Update book/src/applications/zebrad.md
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: teor <teor@riseup.net>
* Load tracing filter only from config and simplify logic.
* Configure the state storage in the config, not an environment variable.
This also changes the config so that the path is always set rather than being
optional, because Zebra always needs a place to store its state.
* add zebrad acceptance tests
* add custom command test helpers that work with kill
* add and use info event for start and seed commands
* combine conflicting tests into one test case
Co-authored-by: Jane Lusby <jane@zfnd.org>
We get the injected TokioComponent dependency before the config is
loaded, so we can't use it to open the endpoints.
And we can't define after_config, because we use derive(Component).
So we work around these issues by opening the endpoints manually,
from the application's after_config.
* make zebra-checkpoints
* fix LOOKAHEAD_LIMIT scope
* add a default cli path
* change doc usage text
* add tracing
* move MAX_CHECKPOINT_HEIGHT_GAP to zebra-consensus
* do byte_reverse_hex in a map
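A small sketch of the helper and its map-based use; the function is illustrative, but the reversal itself is how Zcash tools conventionally display hashes:
```rust
/// Zcash tools display block hashes with the byte order reversed, so the
/// hex conversion reverses before formatting.
fn byte_reverse_hex(hash: &[u8; 32]) -> String {
    hash.iter().rev().map(|byte| format!("{:02x}", byte)).collect()
}

// used as e.g. `hashes.iter().map(byte_reverse_hex)` in the output loop
```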
This seems like a better design on principle but also appears to give a much
nicer sawtooth pattern of queued blocks in the checkpointer and a much smoother
pattern of block requests.
We had a brief discussion on Discord and it seemed like we had consensus on the
following versioning policy:
* zebrad: match major version to NU version, so we will start by releasing
zebrad 3.0.0;
* zebra-* libraries: start by matching zebrad's version, then increment major
versions of each library as we need to make breaking changes (potentially
faster than the zebrad version, always respecting semver but making no
guarantees about the longevity of major releases).
This commit sets all of the crate versions to 3.0.0-alpha.0 -- the -alpha.0
marks it as a prerelease not subject to perfect adherence to compatibility
guarantees.
Closes #697.
per https://github.com/ZcashFoundation/zebra/issues/697#issuecomment-662742971
The response to a getblocks message is an inv message with the hashes of the
following blocks. However, inv messages are also sent unsolicited to gossip new
blocks across the network. Normally, this wouldn't be a problem, because for
every other request we filter only for the messages that are relevant to us.
But because the response to a getblocks message is an inv, the network layer
doesn't (and can't) distinguish between the response inv and the unsolicited
inv.
But there is a mitigation we can do. In our sync algorithm we have two phases:
(1) "ObtainTips" to get a set of tips to chase down, (2) repeatedly call
"ExtendTips" to extend those as far as possible. The unsolicited inv messages
have length 1, but when extending tips we expect to get more than one hash. So
we could reject responses in ExtendTips that have length 1 in order to ignore
these messages. This way we automatically ignore gossip messages during initial
block sync (while we're extending a tip) but we don't ignore length-1 responses
while trying to obtain tips (while querying the network for new tips).
Closes #617.
Closes #698.
The remaining work on the syncer is alluded to in a new comment:
1. Correctly constructing a block locator object
2. Detecting when we've stopped making progress syncing and restarting obtain_tips.
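A sketch of the length-1 filter described above, with stand-in types:
```rust
type BlockHash = [u8; 32]; // stand-in for the real hash type

/// In ExtendTips, treat a one-hash inv as gossip rather than as a getblocks
/// response: getblocks responses during sync should carry many hashes, while
/// a single hash is almost certainly an unsolicited new-block announcement.
fn is_unsolicited_gossip(response_hashes: &[BlockHash]) -> bool {
    response_hashes.len() == 1
}
```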
* fix: Resist CheckpointVerifier memory DoS attacks
Allow a maximum of 2 queued blocks at each height, as a tradeoff between
efficient bad block rejection, and memory usage.
Closes #628.
* fix: Make max queued blocks at height equal to fanout
* fix: Just allocate all the capacity upfront
* fix: Use with_capacity(1) and reserve_exact(1)
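A sketch tying these fixes together, with illustrative constants and types:
```rust
const FANOUT: usize = 4; // illustrative
const MAX_QUEUED_BLOCKS_PER_HEIGHT: usize = FANOUT;

struct QueuedBlock; // stand-in for the real queued block type

fn queue_block(queue: &mut Vec<QueuedBlock>, block: QueuedBlock) -> Result<(), &'static str> {
    if queue.len() >= MAX_QUEUED_BLOCKS_PER_HEIGHT {
        // Reject rather than buffer: memory stays bounded even under a flood
        // of bad blocks at the same height.
        return Err("too many queued blocks at this height");
    }
    queue.push(block);
    Ok(())
}
```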
* Log at warn level for commands that use stdout
* Let zebrad revhex read from stdin
Most unix tools support reading from stdin, so they can be used in
pipelines.
Part of #564.
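A minimal sketch of the stdin mode; the chunk-reversal assumes even-length hex input and is illustrative rather than the real revhex implementation:
```rust
use std::io::{self, BufRead};

/// Read one hex string per line and print it with the byte order reversed,
/// so revhex composes with other tools in shell pipelines.
fn main() -> io::Result<()> {
    for line in io::stdin().lock().lines() {
        let hex = line?;
        // Reverse two-character (one-byte) chunks.
        let reversed: String = hex
            .trim()
            .as_bytes()
            .chunks(2)
            .rev()
            .map(|pair| std::str::from_utf8(pair).expect("input was valid UTF-8"))
            .collect();
        println!("{}", reversed);
    }
    Ok(())
}
```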
* Flatten consensus::verify::* to consensus::*
* Move consensus::*::tests into their own files
* Move CheckpointList into its own file
* Move Progress and Target into a types module
QueuedBlock and QueuedBlockList can stay in checkpoint.rs, because
they are tightly coupled to CheckpointVerifier.
* also download tips and filter tips
* dispatch all block downloads together
* tweak to match Henry's changes
* switch to more intuitive match
Co-authored-by: Jane Lusby <jane@zfnd.org>
Each subsection has to have `serde(default)` to get the behaviour we want
(users can delete every field except the ones they have changed); otherwise,
they can delete only entire sections.
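A sketch of the pattern on a hypothetical subsection:
```rust
use serde::Deserialize;
use std::path::PathBuf;

// With `default`, a config file may omit any field and the omitted ones
// deserialize to their Default values; without it, omitting a single field
// fails parsing for the whole section.
#[derive(Debug, Default, Deserialize)]
#[serde(deny_unknown_fields, default)]
struct TracingSection {
    filter: Option<String>,
    flamegraph: Option<PathBuf>,
}
```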
Prior to this change, the service returned by `zebra_network::init` would spawn background tasks that could silently fail, causing unexpected errors in the zebra_network service.
This change modifies the `PeerSet` that backs `zebra_network::init` to store all of the `JoinHandle`s for each background task it depends on. The `PeerSet` then checks this set of futures to see if any of them have exited with an error or a panic, and if they have it returns the error as part of `poll_ready`.
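A sketch of that check, assuming the `PeerSet` keeps its handles in a `FuturesUnordered`; the types are illustrative:
```rust
use std::{
    pin::Pin,
    task::{Context, Poll},
};
use futures::stream::{FuturesUnordered, Stream};
use tokio::task::JoinHandle;

type BoxError = Box<dyn std::error::Error + Send + Sync>;

fn check_for_background_errors(
    guards: &mut FuturesUnordered<JoinHandle<Result<(), BoxError>>>,
    cx: &mut Context<'_>,
) -> Result<(), BoxError> {
    match Pin::new(guards).poll_next(cx) {
        Poll::Pending => Ok(()),                            // everything still running
        Poll::Ready(Some(Ok(Err(e)))) => Err(e),            // a task returned an error
        Poll::Ready(Some(Err(panic))) => Err(panic.into()), // a task panicked
        // These tasks are supposed to run forever, so a clean exit (or an
        // empty set) is also an error here.
        Poll::Ready(Some(Ok(Ok(())))) | Poll::Ready(None) => {
            Err("background task exited unexpectedly".into())
        }
    }
}
```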
* rename zebra-storage to zebra-state
* Setup initial skeleton for zebra-state
* add test
* Apply suggestions from code review
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* move shared test vectors to a common crate
Co-authored-by: Jane Lusby <jane@zfnd.org>
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
Co-authored-by: Jane Lusby <jane@zfnd.org>
Prior to this change, the seed subcommand would consistently encounter a panic in one of the background tasks, but would continue running after the panic. This is indicative of two bugs.
First, zebrad was not configured to treat panics as non-recoverable and instead defaulted to the tokio defaults, which are to catch panics in tasks and return them via the join handle if available, or to print them if the join handle has been discarded. This is likely a poor fit for zebrad as an application: we do not need to maximize uptime or minimize the extent of an outage should one of our tasks / services start encountering panics. Ignoring a panic increases our risk of observing invalid state, causing all sorts of wild and bad bugs. To deal with this we've switched the default panic behavior from `unwind` to `abort`. This makes panics fail immediately and take down the entire application, regardless of where they occur, which is consistent with our treatment of misbehaving connections.
The second bug is the panic itself. This was triggered by a duplicate entry in the initial_peers set. To fix this we've switched the storage for the peers from a `Vec` to a `HashSet`, which has similar properties but guarantees uniqueness of its keys.
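A sketch of the second fix; the config struct is an illustrative stand-in (the first fix is a Cargo profile setting, not Rust code):
```rust
use std::{collections::HashSet, net::SocketAddr};

// A HashSet deduplicates initial peers by construction, so a repeated
// address can no longer trigger the panic described above.
#[derive(Debug)]
struct NetworkConfig {
    initial_peers: HashSet<SocketAddr>,
}
```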
Bitcoin does this either with `getblocks` (returns up to 500 following block
hashes) or `getheaders` (returns up to 2000 following block headers, not
just hashes). However, Bitcoin headers are much smaller than Zcash
headers, which contain a giant Equihash solution block, and many Zcash
blocks don't have many transactions in them, so the block header is
often similarly sized to the block itself. Because we're
aiming to have a highly parallel network layer, it seems better to use
`getblocks` to implement `FindBlocks` (which is necessarily sequential)
and parallelize the processing of the block downloads.
Attempting to implement requests for block data revealed a problem with
the previous connection logic. Block data is requested by sending a
`getdata` message with hashes of the requested blocks; the peer responds
with a sequence of `block` messages with the blocks themselves.
However, this wasn't possible to handle with the previous connection
logic, which could only convert a single Bitcoin message into a
Response. Instead, we factor out the message handling logic into a
Handler, which can statefully accumulate arbitrary data into a Response
and signal completion. This is still pretty ugly but it does work.
As a side effect, the HeartbeatNonceMismatch error is removed; because
the Handler now tries to process messages until it comes to a Response,
it just ignores mismatched nonces (and will eventually time out).
The previous Mempool and Transaction requests were removed but could be
re-added in a different form later. Also, the `Get` prefixes are
removed from `Request` to tidy the name.
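A sketch of the Handler idea with stand-in types; the real enum covers more request kinds:
```rust
use std::{collections::HashSet, sync::Arc};

type BlockHash = [u8; 32];
struct Block; // stand-in for the real block type

enum Response {
    Blocks(Vec<Arc<Block>>),
}

/// Statefully accumulates `block` messages until every requested hash has
/// arrived, then carries the finished Response.
enum Handler {
    BlocksByHash {
        remaining: HashSet<BlockHash>,
        blocks: Vec<Arc<Block>>,
    },
    Finished(Response),
}

impl Handler {
    /// Feed one received block into the handler; switch state on completion.
    fn process_block(&mut self, hash: BlockHash, block: Arc<Block>) {
        if let Handler::BlocksByHash { remaining, blocks } = self {
            if remaining.remove(&hash) {
                blocks.push(block);
            }
            if remaining.is_empty() {
                let blocks = std::mem::take(blocks);
                *self = Handler::Finished(Response::Blocks(blocks));
            }
        }
    }
}
```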
The components are accessed by a lock on application state. When some command
calls block_on to enter an async context, it obtains a write lock on the
entire application state. This meant that if the application state were
accessed later in an async context, a deadlock would occur. Instead the
TokioComponent holds an Option<Runtime> now, so that before calling block_on,
the caller can .take() the runtime and release the lock. Since we only ever
enter an async context once, it's not a problem that the component is then
missing its runtime, as once we are inside of a task we can access the runtime.
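A sketch of the take-then-block_on workaround with illustrative lock types:
```rust
use std::sync::RwLock;
use tokio::runtime::Runtime;

struct TokioComponent {
    rt: Option<Runtime>,
}

fn run_async_command(app_state: &RwLock<TokioComponent>, future: impl std::future::Future) {
    // Take the runtime out while holding the write lock...
    let rt = app_state
        .write()
        .expect("lock poisoned")
        .rt
        .take()
        .expect("the runtime is only ever taken once");
    // ...the guard is dropped here, so the future may re-lock app_state
    // without deadlocking.
    rt.block_on(future);
}
```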
With a 'Transactions' response that gets turned into an 'Inv(Vec<InventoryHash::Tx>)' message.
We don't yet handle a response from our peer for a 'mempool', which will have to be
a more generic 'Inv' type because we might receive transaction hashes we don't know about yet.
Pertains to #26
Moved SeedService out of the command closure. The command currently spawns
a tokio task to DoS the seed service with `Request::GetPeers` every
second.
Pertains to #54
This splits out the connection handling code into a try_connect closure, which
could be refactored into a Service of its own.
On creation, when we are likely to have very few peers, launch many concurrent
connections to the first few candidates in the initial candidate set, before
continuing to grow the peer set according to demand signals.
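A sketch of the startup burst; the constant and connector signature are illustrative:
```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::net::SocketAddr;

const INITIAL_CONNECTION_BURST: usize = 10; // illustrative

async fn initial_burst(
    candidates: Vec<SocketAddr>,
    try_connect: impl Fn(SocketAddr) -> tokio::task::JoinHandle<Result<(), String>>,
) {
    // Dial the first few candidates concurrently instead of one at a time.
    let mut handshakes: FuturesUnordered<_> = candidates
        .into_iter()
        .take(INITIAL_CONNECTION_BURST)
        .map(try_connect)
        .collect();

    while let Some(result) = handshakes.next().await {
        // Successful handshakes feed the peer set; failures are just dropped.
        let _ = result;
    }
}
```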
The previous implementation failed when timestamps were duplicated between
peers, because there was not a 1-1 relationship between timestamps and peers.
The disconnected_peers() function allows us to prevent duplicate
connections without maintaining shared state between the peerset and the
dial-additional-peers task.
Previously, the TimestampCollector was intended to own the address book
data, so it was intended to be cloneable and hold shared state among all
of its handles. This is now modeled more directly by an
`Arc<Mutex<AddressBook>>`, so the only functionality left in the
`TimestampCollector` is setting up the initial worker, which is better
called `spawn` than `new`.
This also fixes a problem introduced in the previous commit where the
`TimestampCollector` was dropped, causing the worker task to shut down
early.
This allows us to hide the `TimestampCollector` and to expose only the
address book data required by the inbound request service. It also lets
us have a common data structure (the `AddressBook`) for collecting peer
information that can be used to manage information that other peers
report to us.
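A sketch of the resulting shape, with an illustrative AddressBook and worker:
```rust
use std::sync::{Arc, Mutex};

#[derive(Default)]
struct AddressBook {
    // peer metadata, timestamps, ...
}

/// Handles share the address book directly; the collector only spawns the
/// worker task that writes into it.
fn spawn_collector() -> Arc<Mutex<AddressBook>> {
    let address_book = Arc::new(Mutex::new(AddressBook::default()));
    let worker_book = address_book.clone();
    tokio::spawn(async move {
        // Receive timestamp/address events and write them into worker_book;
        // keeping this task's handles alive avoids the early-shutdown bug
        // mentioned above.
        let _ = worker_book;
    });
    address_book
}
```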
This was commented out because making the PeerConnector take a TcpStream
meant that the PeerConnector futures couldn't be constructed in the same
way as before, but now that the PeerConnector is Buffer'able, we can
just clone a buffered copy.
* Don't expose submodules of zebra_network::peer.
* PeerSet, PeerDiscover stubs.
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
* Initial work on PeerSet.
This is adapted from the MIT-licensed tower-balance implementation.
* Use PeerSet in the connect stub.
* Fix authorship, license information.
I *thought* I had done a sed pass over the Cargo defaults when doing
repository initialization, but I guess I missed it or something.
Anyways, fixed now.
Add a tower-based peer implementation.
Tower provides middleware for request-response oriented protocols, while Bitcoin/Zcash just send messages which could be interpreted either as requests or responses, depending on context. To bridge this mismatch we define our own internal request/response protocol, and implement a per-peer event loop that scans incoming messages and interprets them either as requests from the remote peer to our node, or as responses to requests we made previously. This is performed by the `PeerService` task, and a corresponding `PeerClient: tower::Service` can send it requests. These tasks are themselves created by a `PeerConnector: tower::Service` which dials a remote peer and performs a handshake.
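A sketch of the internal bridge with stand-in types; the real `PeerClient` implements `tower::Service`, while here a plain async method stands in:
```rust
use tokio::sync::{mpsc, oneshot};

enum Request { Peers /* , blocks, ... */ }
enum Response { Peers(Vec<std::net::SocketAddr>) }

/// Pairs a request with a oneshot channel for its eventual response.
struct ClientRequest {
    request: Request,
    tx: oneshot::Sender<Result<Response, String>>,
}

/// The per-peer event loop owns the socket; the client talks to it over a channel.
struct PeerClient {
    server_tx: mpsc::Sender<ClientRequest>,
}

impl PeerClient {
    async fn call(&mut self, request: Request) -> Result<Response, String> {
        let (tx, rx) = oneshot::channel();
        self.server_tx
            .send(ClientRequest { request, tx })
            .await
            .map_err(|_| "peer event loop shut down".to_string())?;
        rx.await.map_err(|_| "request canceled".to_string())?
    }
}
```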
This field is called `services` in Bitcoin and Zcash, but because we use
that word internally for other purposes, calling it `PeerServices`
disambiguates the meaning to "the services advertised by the peer",
rather than, e.g., a `tower::Service`.
Prior to this commit, the tracing endpoint would attempt to bind the
given address or panic; now, if it is unable to bind the given address
it displays an error but continues running the rest of the application.
This means that we can spin up multiple Zebra instances for load
testing.
This provides a significantly cleaner API to consumers, because it
allows using adaptors that convert a TCP stream to a stream of messages,
and potentially allows more efficient message handling.
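A sketch of the adaptor pattern; `LinesCodec` stands in for the real Bitcoin message codec:
```rust
use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LinesCodec};

/// Framed turns the raw TCP stream into a typed message Stream + Sink,
/// which is what the cleaner API builds on.
async fn connect(addr: &str) -> std::io::Result<Framed<TcpStream, LinesCodec>> {
    let tcp = TcpStream::connect(addr).await?;
    Ok(Framed::new(tcp, LinesCodec::new()))
}
```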
This avoids some crate selection conflicts, but makes some futures
extension traits fall out of order? This seems to be an issue with
`pin-project` resolved in the git branch of `hyper` (but not yet
released).
An updated tracing-subscriber version changed one of the public types;
because we hardcode the type instead of being generic over S:
Subscriber, this was actually a breaking change. As noted in the
comment adjacent to this line, we would rather be generic over S, but
this requires fixing a bug in abscissa's proc-macros, so in the meantime
we hardcode the type.
* Add a TracingConfig and some components
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
* Restructure, use dependency injection, initialize tracing
* Start a placeholder loop in start command
* Add hyper alpha.1, bump tokio to alpha.4
* Hello world endpoint using async/await from hyper 0.13 alpha
Also cleaned up some linter messages.
Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>
* Update to tracing_subscriber 0.1
* fmt
* add rust-toolchain
* Remove hyper::Version import
* wip: start filter_handler impl
* Add .rustfmt.toml
* rustfmt
* Tidy up .rustfmt.toml
* Add filter reloading handling.
* bump toolchain
* Remove generated hello world acceptance tests.
These test the behaviour of the autogenerated binary and work as examples of
how to test the behaviour of abscissa binaries. Since we don't print "Hello
World" any more, they fail, but we don't yet have replacement behaviour to add
tests for, so they're removed for now.
* Clean up config file handling with Option::and_then.