This timeout stops the sync service hanging when it is missing required
blocks, but the lookahead queue is full of dependent verify tasks, so the
missing blocks never get downloaded.
Check misconfigured ephemeral doesn't create a state dir
Add extra misconfigured `zebrad` ephemeral mode checks:
* doesn't create a state directory
* doesn't create unexpected files or directories in the working directory
Check ephemeral doesn't delete an existing state directory
Refactor all the ephemeral configs and checks into a single test
function.
Also:
* cleanup acceptance tests using utility functions
* make some checks consistent between tests
* make error messages consistent
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
* Add the configured network to error reports
* Log the configured network at error level
* Create the global span immediately after activating tracing
And leak the span guard, so the span is always active.
* Include panic metadata in the report and URL
* Use `Main` and `Test` in the global span
`net=Mainnet` is a bit redundant
When `cargo run` is run in the workspace directory, it can see two
executables:
- `zebrad`
- `zebra_checkpoints`
Adding `default-run = "zebrad"` to `zebrad/Cargo.toml` makes the
workspace run `zebrad` by default. (Even though it's redundant for the
`zebrad` crate itself.)
* Rewrite GetData handling to match the zcashd implementation
`zcashd` silently ignores missing blocks, but sends found transactions
followed by a `NotFound` message:
e7b425298f/src/main.cpp (L5497)
This is significantly different to the behaviour expected by the old
Zebra connection state machine, which expected `NotFound` for blocks.
Also change Zebra's GetData responses to peer request so they ignore
missing blocks.
* Stop hanging on incomplete transaction or block responses
Instead, if the peer sends an unexpected block, unexpected transaction,
or NotFound message:
1. end the request, and return a partial response containing any items
that were successfully received
2. if none of the expected blocks or transactions were received, return
an error, and close the connection
In our README, we tell users to ignore these errors, so we should also
disable the issue URL.
Also include the hash in the error. (We don't want the span active for
all messages, we just want the hash in the error.)
This change avoids errors when tests are cancelled and re-run within a
short period of time, for example, using `cargo watch`.
It introduces a slight risk of port conflicts between the endpoint tests,
and with (ephemeral) ports used by other services. The risk of conflicts
across 2 tests is very low, and tests should be run in an isolated
environment on busy servers.
vergen's implementation of REBUILD_ON_HEAD_CHANGE assumes that the .git
directory is in the crate root, but Zebra uses a workspace.
Temporary fix for rustyhorde/vergen#21.
Because the new version of the prometheus exporter launches its own
single-threaded runtime on a dedicated worker thread, there's no need
for the tokio and hyper versions it uses internally to align with the
versions used in other crates. So we don't need to use our fork with
tokio 0.3, and can just use the published alpha. Advancing to a later
alpha may fix the missing-metrics issues.
As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.
To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s github issue url auto generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.
Co-authored-by: teor <teor@riseup.net>
* implement inbound `FindBlocks`
* Handle inbound peer FindHeaders requests
* handle request before having any chain tip
* Split `find_chain_hashes` into smaller functions
Add a `max_len` argument to support `FindHeaders` requests.
Rewrite the hash collection code to use heights, so we can handle the
`stop` hash and "no intersection" cases correctly.
* Split state height functions into "any chain" and "best chain"
* Rename the best chain block method to `best_block`
* Move fmt utilities to zebra_chain::fmt
* Summarise Debug for some Message variants
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
This provides useful and not too noisy output at INFO level. We do an
info-level message on every block commit instead of trying to do one
message every N blocks, because this is useful both for initial block
sync as well as continuous state updates on new blocks.
The metrics code becomes much simpler because the current version of the
metrics crate builds its own single-threaded runtime on a dedicated worker
thread, so no dependency on the main Zebra Tokio runtime is required.
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
## Motivation
Prior to this PR we've been using `sled` as our database for storing persistent chain data on the disk between boots. We picked sled over rocksdb to minimize our c++ dependencies despite it being a less mature codebase. The theory was if it worked well enough we'd prefer to have a pure rust codebase, but if we ever ran into problems we knew we could easily swap it out with rocksdb.
Well, we ran into problems. Sled's memory usage was particularly high, and it seemed to be leaking memory. On top of all that, the performance for writes was pretty poor, causing us to become bottle-necked on sled instead of the network.
## Solution
This PR replaces `sled` with `rocksdb`. We've seen a 10x improvement in memory usage out of the box, no more leaking, and much better write performance. With this change writing chain data to disk is no longer a limiting factor in how quickly we can sync the chain.
The code in this pull request has:
- [x] Documentation Comments
- [x] Unit Tests and Property Tests
## Review
@hdevalence
This helps prevent overloading the network with too many concurrent
block requests. On a fast network, we're likely to still have enough
room to saturate our bandwidth. In the worst case, with 2MB blocks,
downloading 50 blocks concurrently is 100MB of queued downloads. If we
need to download this in 20 seconds to avoid peer connection timeouts,
the implied worst-case minimum speed is 5MB/s. In practice, this
minimum speed will likely be much lower.
This reverts commit 656bd24ba7.
The Hedge middleware keeps a pair of histograms, writing into one in the
current time interval and reading from the previous time interval's
data. This means that the reverted change resulted in doubling all
block downloads until after at least the second measurement interval
(which means that the time measurements are also incorrect, as they're
operating under double the network load...)
* Use the default memory limit in the acceptance tests
PR #1233 changed the default `memory_cache_bytes`, but left the
acceptance tests with their old value.
Sets the default value to the previous lookahead limit. My testing on
mainnet suggested that the newly lower value (changed when the
checkpoint frequency was decreased) is low enough to cause stalls, even
when using hedged requests.
Remove the minimum data points from the syncer hedge configuragtion.
When there are no data points, hedge sends the second request
immediately.
Where there are less than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.
This change should improve genesis and post-restart sync latency.
* Run large checkpoint sync tests in CI
* Improve test child output match error context
* Add a debug_stop_at_height config
* Use stop at height in acceptance tests
And add some restart acceptance tests, to make sure the stop at
height feature works correctly.
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.