Commit Graph

63 Commits

Author SHA1 Message Date
teor d076b999f3
Fix syncer download order and add sync tests (#3168)
* Refactor so that RetryLimit::Future is std::marker::Sync

* Make the syncer future std::marker::Send by spawning tips futures

* Download synced blocks in chain order, not HashSet order

* Improve MockService failure messages

* Add closure-based responses to the MockService API

* Move MockChainTip to zebra-chain

* Add a MockChainTipSender type alias

* Support MockChainTip in ChainSync and its downloader

* Add syncer tests for obtain tips, extend tips, and wrong block hashes

* Add block too high tests for obtain tips and extend tips

* Add syncer tests for duplicate FindBlocks response hashes

* Allow longer request delays for mocked services in syncer tests
2022-01-11 14:11:35 -03:00
Janito Vaqueiro Ferreira Filho b71833292d
Use `MockedClientHandle` in other tests (#3241)
* Move `MockedClientHandle` to `peer` module

It's more closely related to a `Client` than the `PeerSet`, and this
prepares it to be used by other tests.

* Rename `MockedClientHandle` to `ClientTestHarness`

Reduce confusion, and clarify that the client is not mocked.

Co-authored-by: teor <teor@riseup.net>

* Add clarification to `mock_peers` documentation

Explicitly say how the generated data is returned.

* Rename method to `wants_connection_heartbeats`

The `Client` service only represents one direction of a connection, so
`is_connected` is not the exact term.

Co-authored-by: teor <teor@riseup.net>

* Mock `Client` instead of `LoadTrackedClient`

Move where the conversion from mocked `Client` to mocked
`LoadTrackedClient` in order to make the test helper more easily used by
other tests.

* Use `ClientTestHarness` in `initialize` tests

Replace the boilerplate code to create a fake `Client` instance with
usages of the `ClientTestHarness` constructor.

* Allow receiving requests from `Client` instance

Create a helper type to wrap the result, to make it easier to assert on
specific events after trying to receive a request.

* Allow inspecting the current error in the slot

Share the `ErrorSlot` between the `Client` and the handle, so that the
handle can be used to inspect the contents of the `ErrorSlot`.

* Allow placing an error into the `ErrorSlot`

Assuming it is initially empty. If it already has an error, the code
will panic.

* Allow gracefully closing the request receiver

Close the endpoint with the appropriate call to the `close()` method.

* Allow dropping the request receiver endpoint

Forcefully closes the endpoint.

* Rename field to `client_request_receiver`

Also rename the related methods to include
`outbound_client_request_receiver` to make it more precise.

Co-authored-by: teor <teor@riseup.net>

* Allow dropping the heartbeat shutdown receiver

Allows the `Client` to detect that the channel has been closed.

* Rename fn. to `drop_heartbeat_shutdown_receiver`

Make it clear that it affects the heartbeat task.

Co-authored-by: teor <teor@riseup.net>

* Move `NowOrLater` into a new `now-or-later` crate

Make it easily accessible to other crates.

* Add `IsReady` extension trait for `Service`

Simplifies checking if a service is immediately ready to be called.

* Add extension method to check for readiness error

Checks if the `Service` isn't immediately ready because a call to
`ready` immediately returns an error.

* Rename method to `is_failed`

Avoid negated method names.

Co-authored-by: teor <teor@riseup.net>

* Add a `IsReady::is_pending` extension method

Checks if a `Service` is not ready to be called.

* Use `ClientTestHarness` in `Client` test vectors

Reduce repeated code and try to improve readability.

* Create a new `ClientTestHarnessBuilder` type

A builder to create test `Client` instances using mock data which can be
tracked and manipulated through a `ClientTestHarness`.

* Allow configuring the `Client`'s mocked version

Add a `with_version` builder method.

* Use `ClientTestHarnessBuilder` in `PeerVersions`

Use the builder to set the peer version, so that the `version` parameter
can be removed from the constructor later.

* Use a default mock version where possible

Reduce noise when setting up the harness for tests that don't really
care about the remote peer version.

* Remove `Version` parameter from the `build` method

The `with_version` builder method should be used instead.

* Fix some typos and outdated info in the release checklist

* Add extra client tests for zero and multiple readiness checks (#3273)

And document existing tests.

* Replace `NowOrLater` with `futures::poll!` (#3272)

* Replace NowOrLater with the futures::poll! macro in zebrad

* Replace NowOrLater with the futures::poll! macro in zebra-test

* Remove the now-or-later crate

* remove unused imports

* rustfmt

Co-authored-by: teor <teor@riseup.net>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-12-22 06:13:26 +10:00
teor 6cbd7dce43
Fix task handling bugs, so peers are more likely to be available (#3191)
* Tweak crawler timings so peers are more likely to be available

* Tweak min peer connection interval so we try all peers

* Let other tasks run between fanouts, so we're more likely to choose different peers

* Let other tasks run between retries, so we're more likely to choose different peers

* Let other tasks run after peer crawler DemandDrop

This makes it more likely that peers will become ready.
2021-12-20 09:02:31 +10:00
teor a4d1a1801c
Security: Drop blocks that are a long way ahead of the tip (#3167)
* Document the chain verifier

* Drop gossiped blocks that are too far ahead of the tip

* Add extra gossiped block metrics

* Allow extra gossiped blocks, now we have a stricter limit

* Fix a comment

* Check the exact number of blocks in a downloaded block response

* Drop synced blocks that are too far ahead of the tip

* Add extra synced block metrics

* Test dropping gossiped blocks that are too far ahead of the tip

* Allow an extra checkpoint's worth of blocks in the verifier queues

* Actually let's try two extra checkpoints

* Scale extra height limit with lookahead limit

* Also drop blocks that are behind the finalized tip

* Downgrade a noisy log

* Use a debug log for already verified gossiped blocks

* Use debug logs for already verified synced blocks
2021-12-17 13:31:51 -03:00
teor a92c431c03
Ignore NotFound errors in the syncer (#3131) 2021-12-02 11:28:20 -03:00
teor c85ea18b43
Fix slow Zebra startup times, to reduce CI failures (#3104)
* Tweak a log message

* Only retry failed DNS once, then use the other DNS responses

* Limit broadcasts to half the peers

* Use a longer minimum interval for GetAddr requests

* Reduce the syncer and mempool crawler fanouts

* Stop resetting the mempool twice when it starts up

This spawns two crawlers, which send two fanouts,
so it can use up a lot of peers.

Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-11-30 21:04:32 +00:00
teor 375a997d2f
Stop downloading unnecessary blocks in Zebra acceptance tests (#3072)
* Implement graceful shutdown for the peer set

* Use the minimum lookahead limit in acceptance tests

* Enable a doctest that compiles with newly public modules
2021-11-19 01:55:38 +00:00
Janito Vaqueiro Ferreira Filho 0960e4fb0b
Update to Tokio 1.13.0 (#2994)
* Update `tower` to version `0.4.9`

Update to latest version to add support for Tokio version 1.

* Replace usage of `ServiceExt::ready_and`

It was deprecated in favor of `ServiceExt::ready`.

* Update Tokio dependency to version `1.13.0`

This will break the build because the code isn't ready for the update,
but future commits will fix the issues.

* Replace import of `tokio::stream::StreamExt`

Use `futures::stream::StreamExt` instead, because newer versions of
Tokio don't have the `stream` feature.

* Use `IntervalStream` in `zebra-network`

In newer versions of Tokio `Interval` doesn't implement `Stream`, so the
wrapper types from `tokio-stream` have to be used instead.

* Use `IntervalStream` in `inventory_registry`

In newer versions of Tokio the `Interval` type doesn't implement
`Stream`, so `tokio_stream::wrappers::IntervalStream` has to be used
instead.

* Use `BroadcastStream` in `inventory_registry`

In newer versions of Tokio `broadcast::Receiver` doesn't implement
`Stream`, so `tokio_stream::wrappers::BroadcastStream` instead. This
also requires changing the error type that is used.

* Handle `Semaphore::acquire` error in `tower-batch`

Newer versions of Tokio can return an error if the semaphore is closed.
This shouldn't happen in `tower-batch` because the semaphore is never
closed.

* Handle `Semaphore::acquire` error in `zebrad` test

On newer versions of Tokio `Semaphore::acquire` can return an error if
the semaphore is closed. This shouldn't happen in the test because the
semaphore is never closed.

* Update some `zebra-network` dependencies

Use versions compatible with Tokio version 1.

* Upgrade Hyper to version 0.14

Use a version that supports Tokio version 1.

* Update `metrics` dependency to version 0.17

And also update the `metrics-exporter-prometheus` to version 0.6.1.
These updates are to make sure Tokio 1 is supported.

* Use `f64` as the histogram data type

`u64` isn't supported as the histogram data type in newer versions of
`metrics`.

* Update the initialization of the metrics component

Make it compatible with the new version of `metrics`.

* Simplify build version counter

Remove all constants and use the new `metrics::incement_counter!` macro.

* Change metrics output line to match on

The snapshot string isn't included in the newer version of
`metrics-exporter-prometheus`.

* Update `sentry` to version 0.23.0

Use a version compatible with Tokio version 1.

* Remove usage of `TracingIntegration`

This seems to not be available from `sentry-tracing` anymore, so it
needs to be replaced.

* Add sentry layer to tracing initialization

This seems like the replacement for `TracingIntegration`.

* Remove unnecessary conversion

Suggested by a Clippy lint.

* Update Cargo lock file

Apply all of the updates to dependencies.

* Ban duplicate tokio dependencies

Also ban git sources for tokio dependencies.

* Stop allowing sentry-tracing git repository in `deny.toml`

* Allow remaining duplicates after the tokio upgrade

* Use C: drive for CI build output on Windows

GitHub Actions uses a Windows image with two disk drives, and the
default D: drive is smaller than the C: drive. Zebra currently uses a
lot of space to build, so it has to use the C: drive to avoid CI build
failures because of insufficient space.

Co-authored-by: teor <teor@riseup.net>
2021-11-02 18:46:57 +00:00
Janito Vaqueiro Ferreira Filho 595d75d5fb
Fix synchronization delay issue (#2921)
* Create a `NowOrLater` helper type

A replacement for `FutureExt::now_or_never` that ensures that the task
is scheduled for waking up later when the inner future is ready.

* Use `NowOrLater` to fix possible delay bug

Previous usage of `now_or_never` meant that the underlying task wasn't
being scheduled to awake when the `Downloads` stream produced a new
item. Using `NowOrLater` instead fixes that issue.
2021-10-21 10:34:12 +10:00
Conrado Gouvea 84f2c07fbc
Ignore AlreadyInChain error in the syncer (#2890)
* Ignore AlreadyInChain error in the syncer

* Split Cancelled errors; add them to should_restart_sync exceptions

* Also filter 'block is already comitted'; try to detect a wrong downcast
2021-10-20 11:07:19 +10:00
Alfredo Garcia 724967d488
Send `AdvertiseTransactionIds` to peers (#2823)
* bradcast transactions to peers after they get inserted into mempool

* remove network argument from mempool init

* remove dbg left

* remove return value in mempool enable call

* rename channel sender and receiver vars

* change unwrap() to expect()

* change the channel to a hashset

* fix build

* fix tests

* rustfmt

* fix tiny space issue inside macro

Co-authored-by: teor <teor@riseup.net>

* check errors/panics in transaction gossip tests

* fix build of newly added tests

* Stop dropping the inbound service and mempool in a test

Keeping the mempool around avoids a transaction broadcast task error,
so we can test that there are no other errors in the task.

* Tweak variable names and add comments

* Avoid unexpected drops by returning a mempool guard in tests

* Use BoxError to simplify service types in tests

* Make all returned service types consistent in tests

We want to be able to change the setup without changing the tests.

Co-authored-by: teor <teor@riseup.net>
2021-10-08 08:59:46 -03:00
teor 04d2cfb3d0
Gossip recently verified block hashes to peers (#2729)
* Implement a task that gossips verified block hashes

* Log an info message for block broadcasts

* Simplify the gossip task

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>

* Re-use the old tip change if there is no new tip change

Also improve the comments.

* Add an assertion message

* Rename task join handles and futures in start method

* Add a dedicated BlockGossipError type

This type helps distinguish between syncer and state errors.

* Test that committed blocks are gossiped to peers

Also do a minor type cleanup on the existing test code,
replacing `Option<Vec<_>>` with `Vec<_>`.

* Formatting

* Remove excess newlines

Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>

* Clear the initial gossiped blocks during test setup

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
2021-10-07 07:46:37 -03:00
Conrado Gouvea c6878d9b63
Cancel download and verify tasks when the mempool is deactivated (#2764)
* Cancel download and verify tasks when the mempool is deactivated

* Refactor enable/disable logic to use a state enum

* Add helper test functions to enable/disable the mempool

* Add documentation about errors on service calls

* Improvements from review

* Improve documentation

* Fix bug in test

* Apply suggestions from code review

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2021-09-29 09:06:40 +10:00
Janito Vaqueiro Ferreira Filho 83a2e30e33
Create a `SyncStatus` helper type (#2685)
* Create a `SyncStatus` helper type

Keeps track if the synchronizer is close to the chain tip or not.

* Refactor `ChainSync` ctor. to return `SyncStatus`

Change the constructor API so that it returns a higher level construct.

* Test if `SyncStatus` waits for the chain tip

Test if waiting for the chain tip to be reached correctly finishes when
the chain tip is reached. This is done by sending recent sync lengths to
the `SyncStatus` instance, and checking that every time a separate
`SyncStatus` instance determines it has reached the tip the original
instance wakes up.

* Add a temporary attribute to allow dead code

The code added isn't used yet, so we'll add a temporary waiver until
another PR is merged to use them.
2021-08-30 10:01:33 +10:00
teor 6f8f4d8987
Provide recent syncer response lengths as a watch channel (#2602)
* Minimal recent sync lengths implementation

Also includes metrics and logging, to make diagnosing bugs easier.

* Add logging to check what happens when Zebra reaches the chain tip

* Add tests for recent sync lengths

- initially empty
- pruned to correct length
- newest entries go first

* Drop a redundant `/` from a Cargo.lock URL

This seems to be a nightly or beta Rust change,
but hopefully stable just accepts it.

* Use metrics histograms to avoid overwriting values

* Add detailed syncer monitoring dashboard

* Increase the recent sync length to 4

This length makes it easier to distinguish between temporary and
sustained errors/syncs.

Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
2021-08-19 23:16:16 +00:00
Alfredo Garcia 96a1b661f0
Rate limit initial genesis block download retries, Credit: Equilibrium (#2255)
* implement and test a rate limit in `request_genesis()`
* add `request_genesis_is_rate_limited` test to sync
* add ensure_timeouts constraint for GENESIS_TIMEOUT_RETRY
* Suppress expected warning logs in zebrad tests

Co-authored-by: teor <teor@riseup.net>
2021-06-09 23:39:51 +00:00
teor d494af1e90 Document how the syncer resists memory DoS 2021-03-11 06:24:46 -05:00
teor 972103d797 Fix tracing macro syntax 2021-02-17 11:09:22 -05:00
teor 253d1c02b3 Make sync logging a bit less verbose
And tweak some log content
2021-02-17 11:09:22 -05:00
teor 0b76352468
Document a state_contains bug (#1715)
* Document a state_contains bug in the syncer and Inbound
2021-02-10 09:05:14 +10:00
teor dce11358d7
Log when the syncer awaits peer readiness (#1714) 2021-02-10 07:09:27 +10:00
teor 3d9888f736 Rewrite a sync comment 2021-01-29 11:02:26 +10:00
teor 391c53aa60 Move BoxError to zebrad's lib.rs
For consistency with other crates.
2021-01-27 12:14:27 -08:00
teor 9cdf41f5f4
Panic if the lookahead limit is misconfigured (#1589) 2021-01-14 14:06:30 +10:00
teor fb76eb2e6b Add download and verify timeouts to the inbound service 2021-01-13 20:46:25 -05:00
teor 973aec8ccc Refactor sync members into a consistent order
And add comments about correctness and usage.
2021-01-13 20:46:25 -05:00
teor c2893dce51 Warn when the user's configured lookahead limit is ignored 2021-01-13 20:46:25 -05:00
teor 3699bbdae6 Add some additional sync correctness constraints
And adjust the sync restart delay as a consequence.
2021-01-13 20:46:25 -05:00
teor cef0a492d8 Add a timeout to sync service block verification
This timeout stops the sync service hanging when it is missing required
blocks, but the lookahead queue is full of dependent verify tasks, so the
missing blocks never get downloaded.
2021-01-13 20:46:25 -05:00
Deirdre Connolly 44e1051dee Debug 2020-12-09 13:06:18 -05:00
Deirdre Connolly 25f6fd25b3 Test catching panic 2020-12-09 13:06:18 -05:00
Henry de Valence ba3c19142c deps: update hyper, metrics to tokio 0.3
The metrics code becomes much simpler because the current version of the
metrics crate builds its own single-threaded runtime on a dedicated worker
thread, so no dependency on the main Zebra Tokio runtime is required.
2020-11-20 10:08:16 -08:00
Henry de Valence add94c1c45 deps: move to tokio 0.3, tower 0.4
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware.  This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method.  See

ddc64e8d4d

for more context on the Tower changes.  To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
2020-11-20 10:08:16 -08:00
Henry de Valence 4953f21670 fixup! zebrad: hack to skip alreadyverified errors 2020-11-18 03:09:06 -05:00
Henry de Valence aa7538ab15 zebrad: hack to skip alreadyverified errors 2020-11-17 14:56:27 -08:00
Henry de Valence 6de824bd99 zebrad: remove block verification timeout
Because we set the lookahead limit to be at least twice the size of a checkpoint, we don't have a risk of timeouts.
2020-11-17 14:56:27 -08:00
Henry de Valence e9c847bbd7 zebrad: avoid a borrow in the ChainSync future 2020-11-17 14:56:27 -08:00
Henry de Valence ec411574ee zebrad: improve sync diagnostics 2020-11-17 14:56:27 -08:00
Henry de Valence e0c92167bc Revert "Hedge every syncer block download request"
This reverts commit 656bd24ba7.

The Hedge middleware keeps a pair of histograms, writing into one in the
current time interval and reading from the previous time interval's
data.  This means that the reverted change resulted in doubling all
block downloads until after at least the second measurement interval
(which means that the time measurements are also incorrect, as they're
operating under double the network load...)
2020-11-12 16:45:47 -05:00
Alfredo Garcia 128643d81e
Call `zebra_test::init` where needed. (#1227)
* Add missing `zebra_test::init()` to zebra-chain
* Add missing `zebra_test::init()` to zebra-consensus
* Add missing `zebra_test::init()` to zebra-network
* Add missing `zebra_test::init()` to zebra-state
* Add missing `zebra_test::init()` to zebra-test
* Add missing `zebra_test::init()` to zebrad
2020-11-10 10:29:25 +10:00
Henry de Valence 0ad648fb6a zebrad: make lookahead limit configurable.
Sets the default value to the previous lookahead limit.  My testing on
mainnet suggested that the newly lower value (changed when the
checkpoint frequency was decreased) is low enough to cause stalls, even
when using hedged requests.
2020-11-01 10:47:46 -08:00
teor 92c623eddf Log each genesis download
This change helps us diagnose sync hangs.
2020-10-28 11:31:04 -04:00
teor 656bd24ba7 Hedge every syncer block download request
Remove the minimum data points from the syncer hedge configuragtion.
When there are no data points, hedge sends the second request
immediately.

Where there are less than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.

This change should improve genesis and post-restart sync latency.
2020-10-28 11:31:04 -04:00
Henry de Valence 4c960c4e6d zebrad: treat duplicate downloads as an error
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
2020-10-26 12:05:35 -07:00
Henry de Valence 4127d086ea zebrad: clarify hedge layering motivation
Co-authored-by: teor <teor@riseup.net>
2020-10-26 12:05:35 -07:00
Henry de Valence 253bab042e sync: add a concurrency limit for block downloads 2020-10-26 12:05:35 -07:00
Henry de Valence 0a405c737d zebrad: check state in obtaintips, not extendtips.
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).

Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip.  To mitigate this, a check against the state was
added.  However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state.  Because the sync algorithm now has a
a `CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.

This change results in significantly smoother progress on mainnet.
2020-10-26 12:05:35 -07:00
Henry de Valence ce2ac3336f zebrad: add debug message before state check
This reveals that there may be contention in access to the state, as
this takes a long time.
2020-10-26 12:05:35 -07:00
Henry de Valence 91469faf3c zebrad: eliminate duplicate span in sync 2020-10-26 12:05:35 -07:00
Henry de Valence b5a43f4516 zebrad: remove implementation details from docs
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API.  So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
2020-10-26 12:05:35 -07:00