Remove the minimum data points from the syncer hedge configuration.
When there are no data points, hedge sends the second request
immediately.
When there are fewer than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.
This change should improve genesis and post-restart sync latency.
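As a rough sketch of that delay rule (hypothetical names and threshold handling, not the actual tower `hedge` internals):

```rust
use std::time::Duration;

/// Hypothetical sketch of the delay rule described above. `recent_latencies`
/// holds the most recent observed download times, and `latency_percentile`
/// is the configured percentile (e.g. 0.95, giving a threshold of
/// 1 / (1 - 0.95) = 20 data points).
fn hedge_delay(recent_latencies: &[Duration], latency_percentile: f64) -> Duration {
    let threshold = (1.0 / (1.0 - latency_percentile)).round() as usize;

    if recent_latencies.is_empty() {
        // No data yet: send the second request immediately.
        Duration::ZERO
    } else if recent_latencies.len() < threshold {
        // Too few data points for a meaningful percentile:
        // delay by the highest recent download time.
        *recent_latencies.iter().max().unwrap()
    } else {
        // Enough data: delay by the configured latency percentile.
        let mut sorted = recent_latencies.to_vec();
        sorted.sort();
        let idx = ((sorted.len() as f64) * latency_percentile) as usize;
        sorted[idx.min(sorted.len() - 1)]
    }
}
```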
* Run large checkpoint sync tests in CI
* Improve test child output match error context
* Add a debug_stop_at_height config
* Use stop at height in acceptance tests
Also add some restart acceptance tests to make sure the stop-at-height
feature works correctly.
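A minimal sketch of what a stop-at-height check could look like, with `debug_stop_at_height` as the optional configured height (the function and its signature are illustrative):

```rust
/// Hypothetical sketch: decide whether to shut down after committing a block.
/// `debug_stop_at_height` is the optional height from the config, and
/// `committed_height` is the height of the block that was just committed.
fn should_stop(debug_stop_at_height: Option<u32>, committed_height: u32) -> bool {
    match debug_stop_at_height {
        // Stop at the configured height, or above it in case that exact
        // height was never committed individually.
        Some(stop_height) => committed_height >= stop_height,
        None => false,
    }
}
```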
## Motivation
The zebra-state service needs to be able to handle duplicate blocks.
## Solution
This implements changes already outlined by [The State
RFC](https://zebra.zfnd.org/dev/rfcs/0005-state-updates.html). We check for
successfully committed blocks first, since interacting with the queued blocks
struct at this point just complicates the implementation. If the block has not
already been committed, we then check whether it has already been queued; if
not, we handle the block normally (the normal path being the handling we had
already implemented).
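A simplified sketch of that check order, with a plain `HashSet` and `HashMap` standing in for zebra-state's real structures (all names here are illustrative):

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified state: a set of committed hashes and a map of
/// queued blocks, standing in for zebra-state's real structures.
struct State {
    committed: HashSet<[u8; 32]>,
    queued: HashMap<[u8; 32], Vec<u8>>,
}

enum QueueResult {
    AlreadyCommitted,
    AlreadyQueued,
    Queued,
}

impl State {
    /// Handle a new block, checking for duplicates in the order described
    /// above: committed blocks first, then queued blocks, then the normal path.
    fn queue_block(&mut self, hash: [u8; 32], block: Vec<u8>) -> QueueResult {
        // 1. Committed blocks are checked first, so we never have to ask the
        //    queued-blocks structure about blocks that are already final.
        if self.committed.contains(&hash) {
            return QueueResult::AlreadyCommitted;
        }
        // 2. Then check whether the block is already waiting in the queue.
        if self.queued.contains_key(&hash) {
            return QueueResult::AlreadyQueued;
        }
        // 3. Otherwise handle the block normally (queue it for later commit).
        self.queued.insert(hash, block);
        QueueResult::Queued
    }
}
```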
## Documentation Changes
- [x] Update the state RFC to match the ways this fix departs from the design
- the main thing is that I switched the order of checking for duplicates
- [x] ~~Add newly added functions to the state rfc~~ Decided not to do this because they're minor getters that don't influence the rest of the design and aren't exposed as part of the API
- [x] Document newly added functions inline
## Testing
## Related Issues
- fixes https://github.com/ZcashFoundation/zebra/issues/1182
- tracking issue https://github.com/ZcashFoundation/zebra/issues/1049
Co-authored-by: teor <teor@riseup.net>
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
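A small sketch of that check, assuming a hypothetical set of in-flight block hashes:

```rust
use std::collections::HashSet;

/// Hypothetical sketch: track in-flight block downloads and treat a
/// duplicate request as an error, since it means peers gave us bad
/// information or we misinterpreted their responses.
fn start_download(
    in_flight: &mut HashSet<[u8; 32]>,
    hash: [u8; 32],
) -> Result<(), String> {
    // `insert` returns false if the hash was already present.
    if !in_flight.insert(hash) {
        return Err(format!("duplicate block download requested: {:?}", hash));
    }
    // ... spawn the actual download task here ...
    Ok(())
}
```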
We already use an actor model for the state service, so we get an
ordered sequence of state queries by message-passing. Instead of
performing reads in the futures we return, this commit performs them
synchronously. This means that all sled access is done from the same
task, which
(1) might reduce contention
(2) allows us to avoid using sled transactions when writing to the
state.
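A sketch of the resulting pattern, with a `HashMap` standing in for the sled-backed storage (the names are illustrative, not the zebra-state API): the read happens inside the service call, and the caller gets an already-resolved future.

```rust
use std::collections::HashMap;
use futures::future::{self, Ready};

/// Stand-in for the state's sled-backed storage.
struct Storage {
    blocks: HashMap<[u8; 32], Vec<u8>>,
}

impl Storage {
    /// Perform the read synchronously, on the state service's own task,
    /// and hand the caller an already-resolved future. All storage access
    /// stays on one task, so no cross-task coordination is needed.
    fn block(&self, hash: [u8; 32]) -> Ready<Option<Vec<u8>>> {
        future::ready(self.blocks.get(&hash).cloned())
    }
}
```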
Co-authored-by: Jane Lusby <jane@zfnd.org>
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).
Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip. To mitigate this, a check against the state was
added. However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state. Because the sync algorithm now has a
`CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.
This change results in significantly smoother progress on mainnet.
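A sketch of the linkage check this enables (the field names and types are assumptions, not necessarily the real `CheckedTip`):

```rust
/// Illustrative version of a checked tip: the last hash in a verified
/// segment, and the hash the next segment is expected to start from.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CheckedTip {
    tip: [u8; 32],
    expected_next: [u8; 32],
}

/// Accept a new segment of hashes only if it links onto the tip we already
/// have; responses that don't link up are discarded instead of being checked
/// against (and rejected by) the state.
fn extend_tip(tip: CheckedTip, segment: &[[u8; 32]]) -> Option<CheckedTip> {
    match segment {
        // The segment must start exactly where the previous response said
        // the chain continues, and must be long enough to set up the next
        // expected link.
        [first, .., second_last, last] if *first == tip.expected_next => {
            Some(CheckedTip {
                tip: *second_last,
                expected_next: *last,
            })
        }
        _ => None,
    }
}
```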
There's no reason to return a pre-`Buffer`'d service: unlike zebra-network,
we don't need internal access to the buffered service, and wrapping it
internally takes control of the buffer size away from the caller.
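For example, the caller can wrap the returned service in tower's `Buffer` with a bound of its own choosing; the trivial `service_fn` below is a stand-in for the real state service:

```rust
use tower::{buffer::Buffer, service_fn, Service, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for the unbuffered state service: a trivial tower service.
    // The real zebra-state service is much more involved; this only shows
    // the caller choosing the Buffer bound itself.
    let state_service =
        service_fn(|req: String| async move { Ok::<_, std::convert::Infallible>(req.len()) });

    // The caller, not the library, picks the buffer size.
    let mut state = Buffer::new(state_service, 32);

    let len = state
        .ready()
        .await
        .expect("buffer worker is running")
        .call("hello".to_string())
        .await
        .expect("service call succeeds");
    assert_eq!(len, 5);
}
```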
The previous debug output printed a message that the chain verifier had
received a block. But this provides no additional information compared
to printing no message in chain::Verifier and a message in whichever
verifier the block was sent to, since the resulting spans indicate where
the block was dispatched.
This commit also removes the "unexpected high block" detection; this was
an artefact of the original sync algorithm failing to handle block
advertisements, but we don't have that problem any more, so we can
simplify the code by eliminating that logic.
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API. So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
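For example, a caller that does want timeouts can layer tower's `Timeout` over the service itself; `inner` here is a stand-in for the peer set:

```rust
use std::time::Duration;
use tower::timeout::Timeout;

/// Illustrative: the caller, not zebra-network, decides whether requests
/// time out, by layering tower's `Timeout` over the peer set service.
fn with_timeout<S>(inner: S) -> Timeout<S> {
    Timeout::new(inner, Duration::from_secs(20))
}
```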
Using the cancel_handles, we can deduplicate requests. This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
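A small sketch of that dedup check, assuming a hypothetical map of cancel handles keyed by block hash:

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

/// Only register a new download if there is no cancel handle for this hash
/// yet: inserting blindly would drop the existing handle and cancel an
/// in-flight task for no reason. (Illustrative, not the exact zebrad code.)
fn register_download(
    cancel_handles: &mut HashMap<[u8; 32], oneshot::Sender<()>>,
    hash: [u8; 32],
    cancel_tx: oneshot::Sender<()>,
) -> bool {
    if cancel_handles.contains_key(&hash) {
        // Duplicate request: keep the existing download and its handle.
        return false;
    }
    cancel_handles.insert(hash, cancel_tx);
    true
}
```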
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too short on testnet.
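As a rough illustration of the idea (not the tower `hedge` middleware itself), a hedged request races the original request against a second copy issued once an estimated latency percentile has elapsed; this is only safe for idempotent requests like block downloads:

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Illustrative hedged request: start a second copy of the request if the
/// first hasn't finished within `p95_latency` (a percentile estimated from
/// recent requests), and take whichever response arrives first.
async fn hedged<F, Fut, T>(make_request: F, p95_latency: Duration) -> T
where
    F: Fn() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let first = make_request();
    tokio::pin!(first);

    tokio::select! {
        // The original request finished within the latency budget.
        result = &mut first => result,
        // It didn't: issue the hedge and race both copies.
        _ = sleep(p95_latency) => {
            let second = make_request();
            tokio::pin!(second);
            tokio::select! {
                result = &mut first => result,
                result = &mut second => result,
            }
        }
    }
}
```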
Try to use the better cancellation logic to revert to the previous sync
algorithm. As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handle errors by flushing the
pipeline and starting over. This hasn't worked well, because we didn't
previously cancel tasks properly. Now that we can, try to use something
in the spirit of the original sync algorithm.
This makes two changes relative to the existing download code:
1. It uses a oneshot to attempt to cancel the download task after it
has started;
2. It encapsulates the download creation and cancellation logic into a
Downloads struct.
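A sketch of that shape, with hypothetical names and fields (the real `Downloads` struct carries more machinery):

```rust
use std::collections::HashMap;
use tokio::{sync::oneshot, task::JoinHandle};

/// Illustrative `Downloads`: owns the in-flight download tasks and one
/// cancel handle per task.
struct Downloads {
    cancel_handles: HashMap<[u8; 32], oneshot::Sender<()>>,
    pending: Vec<JoinHandle<()>>,
}

impl Downloads {
    /// Start a download that can still be cancelled after it has started.
    /// (Callers are expected to deduplicate first: inserting a second handle
    /// for the same hash would drop, and thereby cancel, the first one.)
    fn download(&mut self, hash: [u8; 32]) {
        let (cancel_tx, mut cancel_rx) = oneshot::channel::<()>();

        let task = tokio::spawn(async move {
            tokio::select! {
                // A cancel message, or the handle being dropped, aborts the
                // download even if it is already in progress.
                _ = &mut cancel_rx => {
                    tracing::debug!(?hash, "download cancelled");
                }
                // Otherwise run the download to completion.
                _ = fetch_and_verify(hash) => {}
            }
        });

        self.cancel_handles.insert(hash, cancel_tx);
        self.pending.push(task);
    }

    /// Cancel every in-flight download, e.g. when the sync restarts.
    fn cancel_all(&mut self) {
        for (_hash, handle) in self.cancel_handles.drain() {
            let _ = handle.send(());
        }
    }
}

/// Stand-in for the real download-and-verify future.
async fn fetch_and_verify(_hash: [u8; 32]) {
    // ... request the block from a peer and hand it to the verifier ...
}
```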
These messages might be unsolicited, or they might be a response to a
request we already canceled. So don't fail the whole connection, just
drop the message and move on.
We handle request cancellation in two places: before we transition into
the AwaitingResponse state, and while we are in AwaitingResponse. We
need both places, or else if we started processing a request, we
wouldn't process the cancellation until the timeout elapsed.
The first is a check that the oneshot is not already canceled.
For the second, we wait on a cancellation, either from a timeout or from
the tx channel closing.
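A sketch of both checks, assuming a tokio oneshot sender as the response channel (illustrative, not the exact zebra-network connection code):

```rust
use std::time::Duration;
use tokio::{sync::oneshot, time::timeout};

/// Illustrative request handling with cancellation in both places described
/// above. `tx` is the oneshot the caller waits on for the response, and
/// `do_request` stands in for writing the request to the peer and reading
/// the response.
async fn handle_request(mut tx: oneshot::Sender<Vec<u8>>, response_timeout: Duration) {
    // 1. Before entering AwaitingResponse: if the caller has already given
    //    up (dropped its receiver), don't send the request at all.
    if tx.is_closed() {
        return;
    }

    // 2. While in AwaitingResponse: wait for the response, but give up on a
    //    timeout or as soon as the caller drops its receiver.
    tokio::select! {
        result = timeout(response_timeout, do_request()) => {
            if let Ok(response) = result {
                let _ = tx.send(response);
            }
            // On timeout, `tx` is dropped, so the caller sees an error.
        }
        _ = tx.closed() => {
            // The request was cancelled; the response is dropped if it
            // arrives later.
        }
    }
}

/// Stand-in for sending the request and reading the peer's response.
async fn do_request() -> Vec<u8> {
    Vec::new()
}
```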
Binding grafana to localhost makes it inaccessible from the wider internet,
which is a secure default.
Since we run docker with host networking, docker containers have access to D-Bus and other
security-related services on localhost. So it's risky to also expose them to the wider internet.