As we approach our alpha release we've decided we want to plan ahead for the user bug reports we will eventually receive. One of the bigger issues we foresee is determining exactly what version of the software users are running, and particularly how easy it may or may not be for users to accidentally discard this information when reporting bugs.
To defend against this, we've decided to include the exact git sha for any given build in the compiled artifact. This information will then be re-exported as a span early in the application startup process, so that all logs and error messages should include the sha as their very first span. We've also added this sha as issue metadata for `color-eyre`'s github issue url auto generation feature, which should make sure that the sha is easily available in bug reports we receive, even in the absence of logs.
Co-authored-by: teor <teor@riseup.net>
* implement inbound `FindBlocks`
* Handle inbound peer FindHeaders requests
* handle request before having any chain tip
* Split `find_chain_hashes` into smaller functions
Add a `max_len` argument to support `FindHeaders` requests.
Rewrite the hash collection code to use heights, so we can handle the
`stop` hash and "no intersection" cases correctly.
* Split state height functions into "any chain" and "best chain"
* Rename the best chain block method to `best_block`
* Move fmt utilities to zebra_chain::fmt
* Summarise Debug for some Message variants
Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
This provides useful and not too noisy output at INFO level. We do an
info-level message on every block commit instead of trying to do one
message every N blocks, because this is useful both for initial block
sync as well as continuous state updates on new blocks.
The metrics code becomes much simpler because the current version of the
metrics crate builds its own single-threaded runtime on a dedicated worker
thread, so no dependency on the main Zebra Tokio runtime is required.
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
This helps prevent overloading the network with too many concurrent
block requests. On a fast network, we're likely to still have enough
room to saturate our bandwidth. In the worst case, with 2MB blocks,
downloading 50 blocks concurrently is 100MB of queued downloads. If we
need to download this in 20 seconds to avoid peer connection timeouts,
the implied worst-case minimum speed is 5MB/s. In practice, this
minimum speed will likely be much lower.
This reverts commit 656bd24ba7.
The Hedge middleware keeps a pair of histograms, writing into one in the
current time interval and reading from the previous time interval's
data. This means that the reverted change resulted in doubling all
block downloads until after at least the second measurement interval
(which means that the time measurements are also incorrect, as they're
operating under double the network load...)
Sets the default value to the previous lookahead limit. My testing on
mainnet suggested that the newly lower value (changed when the
checkpoint frequency was decreased) is low enough to cause stalls, even
when using hedged requests.
Remove the minimum data points from the syncer hedge configuragtion.
When there are no data points, hedge sends the second request
immediately.
Where there are less than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.
This change should improve genesis and post-restart sync latency.
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).
Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip. To mitigate this, a check against the state was
added. However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state. Because the sync algorithm now has a
a `CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.
This change results in significantly smoother progress on mainnet.
There's no reason to return a pre-Buffer'd service (there's no need for
internal access to the state service, as in zebra-network), but wrapping
it internally removes control of the buffer size from the caller.
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API. So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
Using the cancel_handles, we can deduplicate requests. This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too slow on testnet.
Try to use the better cancellation logic to revert to previous sync
algorithm. As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handle errors by flushing the
pipeline and starting over. This hasn't worked well, because we didn't
previously cancel tasks properly. Now that we can, try to use something
in the spirit of the original sync algorithm.
This makes two changes relative to the existing download code:
1. It uses a oneshot to attempt to cancel the download task after it
has started;
2. It encapsulates the download creation and cancellation logic into a
Downloads struct.
* Reverse displayed endianness of transaction and block hashes
* fix zebra-checkpoints utility for new hash order
* Stop using "zebrad revhex" in zebrad-hash-lookup
* Rebuild checkpoint lists in new hash order
This change also adds additional checkpoints to the end of each list.
* Replace TransactionHash with transaction::Hash
This change should have been made in #905, but we missed Debug impls
and some docs.
Co-authored-by: Ramana Venkata <vramana@users.noreply.github.com>
Co-authored-by: teor <teor@riseup.net>
This reduces the API surface to the minimum required for functionality,
and cleans up module documentation. The stub mempool module is deleted
entirely, since it will need to be redone later anyways.