* Tweak a log message
* Only retry failed DNS once, then use the other DNS responses
* Limit broadcasts to half the peers
* Use a longer minimum interval for GetAddr requests
* Reduce the syncer and mempool crawler fanouts
* Stop resetting the mempool twice when it starts up
This spawns two crawlers, each sending its own fanout,
so together they can use up a lot of peers.
Co-authored-by: Conrado Gouvea <conrado@zfnd.org>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Implement graceful shutdown for the peer set
* Use the minimum lookahead limit in acceptance tests
* Enable a doctest that compiles with newly public modules
* Update `tower` to version `0.4.9`
Update to latest version to add support for Tokio version 1.
* Replace usage of `ServiceExt::ready_and`
It was deprecated in favor of `ServiceExt::ready`.
* Update Tokio dependency to version `1.13.0`
This will break the build because the code isn't ready for the update,
but future commits will fix the issues.
* Replace import of `tokio::stream::StreamExt`
Use `futures::stream::StreamExt` instead, because newer versions of
Tokio don't have the `stream` feature.
* Use `IntervalStream` in `zebra-network`
In newer versions of Tokio `Interval` doesn't implement `Stream`, so the
wrapper types from `tokio-stream` have to be used instead.
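A minimal sketch of the wrapper pattern, used here and in the `inventory_registry` change below; the interval duration and loop body are illustrative:

```rust
use std::time::Duration;

use futures::stream::StreamExt;
use tokio_stream::wrappers::IntervalStream;

async fn tick_loop() {
    // In Tokio 1, `Interval` no longer implements `Stream`, so it is
    // wrapped in `IntervalStream` from the `tokio-stream` crate.
    let interval = tokio::time::interval(Duration::from_secs(60));
    let mut ticks = IntervalStream::new(interval);

    while let Some(_instant) = ticks.next().await {
        // ... periodic work, such as crawling for new peer addresses ...
    }
}
```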
* Use `IntervalStream` in `inventory_registry`
In newer versions of Tokio the `Interval` type doesn't implement
`Stream`, so `tokio_stream::wrappers::IntervalStream` has to be used
instead.
* Use `BroadcastStream` in `inventory_registry`
In newer versions of Tokio `broadcast::Receiver` doesn't implement
`Stream`, so `tokio_stream::wrappers::BroadcastStream` has to be used
instead. This also requires changing the error type that is used.
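A minimal sketch of that change; the channel item type and the recovery strategy are assumptions:

```rust
use futures::stream::StreamExt;
use tokio::sync::broadcast;
use tokio_stream::wrappers::BroadcastStream;

async fn watch_broadcasts(receiver: broadcast::Receiver<u32>) {
    // `BroadcastStream` wraps the receiver, but its items are `Result`s:
    // a slow receiver can lag and miss messages, and that lag is the new
    // error type (`BroadcastStreamRecvError`) mentioned above.
    let mut stream = BroadcastStream::new(receiver);

    while let Some(item) = stream.next().await {
        match item {
            Ok(_value) => { /* handle the broadcast value */ }
            Err(_lagged) => { /* the receiver fell behind; recover or skip */ }
        }
    }
}
```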
* Handle `Semaphore::acquire` error in `tower-batch`
Newer versions of Tokio can return an error if the semaphore is closed.
This shouldn't happen in `tower-batch` because the semaphore is never
closed.
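A sketch of the handling (the same pattern applies in the test fixed by the next change); the surrounding function is illustrative:

```rust
use std::sync::Arc;

use tokio::sync::Semaphore;

async fn with_permit(semaphore: Arc<Semaphore>) {
    // In Tokio 1, `acquire` returns `Result<SemaphorePermit, AcquireError>`
    // and only fails if the semaphore has been closed. That never happens
    // here, so the error is turned into a panic with a clear message.
    let _permit = semaphore
        .acquire()
        .await
        .expect("semaphore is never closed");

    // ... do work while holding the permit ...
}
```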
* Handle `Semaphore::acquire` error in `zebrad` test
On newer versions of Tokio `Semaphore::acquire` can return an error if
the semaphore is closed. This shouldn't happen in the test because the
semaphore is never closed.
* Update some `zebra-network` dependencies
Use versions compatible with Tokio version 1.
* Upgrade Hyper to version 0.14
Use a version that supports Tokio version 1.
* Update `metrics` dependency to version 0.17
Also update `metrics-exporter-prometheus` to version 0.6.1.
These updates are to make sure Tokio 1 is supported.
* Use `f64` as the histogram data type
`u64` isn't supported as the histogram data type in newer versions of
`metrics`.
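A sketch of the call-site change; the metric name and value are illustrative:

```rust
fn record_response_size(byte_count: u64) {
    // Histograms in `metrics` 0.17 take `f64` values, so integer
    // measurements are converted where they are recorded.
    metrics::histogram!("example.response.size", byte_count as f64);
}
```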
* Update the initialization of the metrics component
Make it compatible with the new version of `metrics`.
* Simplify build version counter
Remove all constants and use the new `metrics::increment_counter!` macro.
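A sketch of the simplified counter; the metric name and label are illustrative:

```rust
fn report_build_version(version: &'static str) {
    // `metrics` 0.17 registers and increments the counter in one step,
    // so the old per-metric constants are no longer needed.
    metrics::increment_counter!("zebrad.build.info", "version" => version);
}
```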
* Change the metrics output line to match on
The snapshot string isn't included in the newer version of
`metrics-exporter-prometheus`.
* Update `sentry` to version 0.23.0
Use a version compatible with Tokio version 1.
* Remove usage of `TracingIntegration`
This seems to not be available from `sentry-tracing` anymore, so it
needs to be replaced.
* Add sentry layer to tracing initialization
This seems like the replacement for `TracingIntegration`.
* Remove unnecessary conversion
Suggested by a Clippy lint.
* Update Cargo lock file
Apply all of the updates to dependencies.
* Ban duplicate tokio dependencies
Also ban git sources for tokio dependencies.
* Stop allowing sentry-tracing git repository in `deny.toml`
* Allow remaining duplicates after the tokio upgrade
* Use C: drive for CI build output on Windows
GitHub Actions uses a Windows image with two disk drives, and the
default D: drive is smaller than the C: drive. Zebra currently uses a
lot of space to build, so it has to use the C: drive to avoid CI build
failures because of insufficient space.
Co-authored-by: teor <teor@riseup.net>
* Create a `NowOrLater` helper type
A replacement for `FutureExt::now_or_never` that ensures that the task
is scheduled for waking up later when the inner future is ready.
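The real type lives in Zebra; this is a minimal sketch of the idea, assuming a boxed inner future:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Polls the inner future once with the *real* task context, so that
/// unlike `FutureExt::now_or_never`, the waker stays registered and the
/// task is woken when the inner future becomes ready.
struct NowOrLater<F>(Pin<Box<F>>);

impl<F: Future> Future for NowOrLater<F> {
    type Output = Option<F::Output>;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        match self.get_mut().0.as_mut().poll(cx) {
            // Ready right now: behave like `now_or_never` returning `Some`.
            Poll::Ready(value) => Poll::Ready(Some(value)),
            // Not ready: report `None`, but the inner future has already
            // registered `cx.waker()`, so the task will be polled again.
            Poll::Pending => Poll::Ready(None),
        }
    }
}
```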
* Use `NowOrLater` to fix possible delay bug
Previous usage of `now_or_never` meant that the underlying task wasn't
scheduled to wake when the `Downloads` stream produced a new item.
Using `NowOrLater` instead fixes that issue.
* Ignore AlreadyInChain error in the syncer
* Split Cancelled errors; add them to should_restart_sync exceptions
* Also filter 'block is already committed'; try to detect a wrong downcast
* broadcast transactions to peers after they get inserted into the mempool
* remove network argument from mempool init
* remove leftover dbg!
* remove return value in mempool enable call
* rename channel sender and receiver vars
* change unwrap() to expect()
* change the channel to a hashset
* fix build
* fix tests
* rustfmt
* fix tiny space issue inside macro
Co-authored-by: teor <teor@riseup.net>
* check errors/panics in transaction gossip tests
* fix build of newly added tests
* Stop dropping the inbound service and mempool in a test
Keeping the mempool around avoids a transaction broadcast task error,
so we can test that there are no other errors in the task.
* Tweak variable names and add comments
* Avoid unexpected drops by returning a mempool guard in tests
* Use BoxError to simplify service types in tests
* Make all returned service types consistent in tests
We want to be able to change the setup without changing the tests.
Co-authored-by: teor <teor@riseup.net>
* Implement a task that gossips verified block hashes
* Log an info message for block broadcasts
* Simplify the gossip task
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* Re-use the old tip change if there is no new tip change
Also improve the comments.
* Add an assertion message
* Rename task join handles and futures in start method
* Add a dedicated BlockGossipError type
This type helps distinguish between syncer and state errors.
* Test that committed blocks are gossiped to peers
Also do a minor type cleanup on the existing test code,
replacing `Option<Vec<_>>` with `Vec<_>`.
* Formatting
* Remove excess newlines
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Clear the initial gossiped blocks during test setup
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
Co-authored-by: Alfredo Garcia <oxarbitrage@gmail.com>
* Cancel download and verify tasks when the mempool is deactivated
* Refactor enable/disable logic to use a state enum
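A sketch of the enum, with variant contents elided; the names are assumptions:

```rust
/// The mempool's download and verify tasks only exist while it is
/// enabled, so switching to `Disabled` drops (and thereby cancels) them.
enum ActiveState {
    /// The mempool is not tracking any transactions.
    Disabled,
    /// The mempool is active, holding its storage and in-flight downloads.
    Enabled {
        // storage: Storage,
        // tx_downloads: TxDownloads,
    },
}
```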
* Add helper test functions to enable/disable the mempool
* Add documentation about errors on service calls
* Improvements from review
* Improve documentation
* Fix bug in test
* Apply suggestions from code review
Co-authored-by: teor <teor@riseup.net>
* Create a `SyncStatus` helper type
Keeps track of whether the synchronizer is close to the chain tip or not.
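A minimal sketch of the idea, assuming the syncer publishes its recent sync lengths over a watch channel; the field, method, and threshold are assumptions:

```rust
use tokio::sync::watch;

struct SyncStatus {
    /// The most recent sync lengths, newest first.
    recent_lengths: watch::Receiver<Vec<usize>>,
}

impl SyncStatus {
    /// We are close to the tip when every recent sync
    /// obtained only a handful of blocks.
    fn is_close_to_tip(&self) -> bool {
        let lengths = self.recent_lengths.borrow();
        !lengths.is_empty() && lengths.iter().all(|&len| len < 2)
    }
}
```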
* Refactor the `ChainSync` constructor to return `SyncStatus`
Change the constructor API so that it returns a higher level construct.
* Test if `SyncStatus` waits for the chain tip
Test if waiting for the chain tip to be reached correctly finishes when
the chain tip is reached. This is done by sending recent sync lengths to
the `SyncStatus` instance, and checking that every time a separate
`SyncStatus` instance determines it has reached the tip the original
instance wakes up.
* Add a temporary attribute to allow dead code
The added code isn't used yet, so we'll add a temporary waiver until
another PR that uses it is merged.
* Minimal recent sync lengths implementation
Also includes metrics and logging, to make diagnosing bugs easier.
* Add logging to check what happens when Zebra reaches the chain tip
* Add tests for recent sync lengths
- initially empty
- pruned to correct length
- newest entries go first
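A sketch of the behavior those tests pin down; the names and capacity are assumptions:

```rust
/// Keep only the most recent sync lengths.
const MAX_RECENT_LENGTHS: usize = 4;

#[derive(Default)]
struct RecentSyncLengths {
    lengths: Vec<usize>,
}

impl RecentSyncLengths {
    fn push(&mut self, length: usize) {
        // Newest entries go first...
        self.lengths.insert(0, length);
        // ...and the list is pruned to the configured length.
        self.lengths.truncate(MAX_RECENT_LENGTHS);
    }
}
```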
* Drop a redundant `/` from a Cargo.lock URL
This seems to be a nightly or beta Rust change,
but hopefully stable just accepts it.
* Use metrics histograms to avoid overwriting values
* Add detailed syncer monitoring dashboard
* Increase the recent sync length to 4
This length makes it easier to distinguish between temporary and
sustained errors/syncs.
Co-authored-by: Janito Vaqueiro Ferreira Filho <janito.vff@gmail.com>
* implement and test a rate limit in `request_genesis()`
* add `request_genesis_is_rate_limited` test to sync
* add ensure_timeouts constraint for GENESIS_TIMEOUT_RETRY
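A sketch of the rate limit; the retry interval value and helper function are illustrative (the real constant is the one checked by `ensure_timeouts`):

```rust
use std::time::Duration;

use tokio::time::sleep;

const GENESIS_TIMEOUT_RETRY: Duration = Duration::from_secs(5);

async fn request_genesis() {
    // Instead of retrying in a tight loop, wait between attempts so we
    // don't flood peers with requests for the genesis block.
    while !try_download_and_verify_genesis().await {
        sleep(GENESIS_TIMEOUT_RETRY).await;
    }
}

async fn try_download_and_verify_genesis() -> bool {
    // ... request the genesis block from peers and verify it ...
    false
}
```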
* Suppress expected warning logs in zebrad tests
Co-authored-by: teor <teor@riseup.net>
This timeout stops the sync service hanging when it is missing required
blocks, but the lookahead queue is full of dependent verify tasks, so the
missing blocks never get downloaded.
The metrics code becomes much simpler because the current version of the
metrics crate builds its own single-threaded runtime on a dedicated worker
thread, so no dependency on the main Zebra Tokio runtime is required.
This change is mostly mechanical, with the exception of the changes to the
`tower-batch` middleware. This middleware was adapted from `tower::buffer`,
and the `tower::buffer` code was changed to implement its own bounded queue,
because Tokio 0.3 removed the `mpsc::Sender::poll_send` method. See
ddc64e8d4d
for more context on the Tower changes. To match Tower as closely as possible
in order to be able to upstream `tower-batch`, those changes are copied from
`tower::Buffer` to `tower-batch`.
This reverts commit 656bd24ba7.
The Hedge middleware keeps a pair of histograms, writing into one in the
current time interval and reading from the previous time interval's
data. This means that the reverted change resulted in doubling all
block downloads until after at least the second measurement interval
(which means that the time measurements are also incorrect, as they're
operating under double the network load...)
Sets the default value to the previous lookahead limit. My testing on
mainnet suggested that the newly lower value (changed when the
checkpoint frequency was decreased) is low enough to cause stalls, even
when using hedged requests.
Remove the minimum data points from the syncer hedge configuration.
When there are no data points, hedge sends the second request
immediately.
When there are fewer than 1/(1 - latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.
This change should improve genesis and post-restart sync latency.
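A sketch of the configuration, using `tower`'s hedge middleware; the policy mirrors the spirit of the syncer's, and the commented construction values are assumptions:

```rust
use tower::hedge::Policy;

/// Allow any request to be hedged by cloning it.
#[derive(Clone, Debug)]
struct AlwaysHedge;

impl<Request: Clone> Policy<Request> for AlwaysHedge {
    fn clone_request(&self, req: &Request) -> Option<Request> {
        Some(req.clone())
    }
    fn can_retry(&self, _req: &Request) -> bool {
        true
    }
}

// Construction shape (service and period are assumptions for this sketch):
//
//     let hedged = tower::hedge::Hedge::new(
//         download_service,
//         AlwaysHedge,
//         0,    // minimum data points removed, as described above
//         0.95, // latency percentile: 1 / (1 - 0.95) = 20 samples
//         std::time::Duration::from_secs(60),
//     );
```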
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
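A sketch of the check; the hash type and error representation are simplified:

```rust
use std::collections::HashSet;

/// Refuse to start a download for a hash that is already in flight.
fn queue_download(
    in_flight: &mut HashSet<[u8; 32]>,
    hash: [u8; 32],
) -> Result<(), String> {
    if !in_flight.insert(hash) {
        // A duplicate means peers reported bad information,
        // or we misinterpreted their responses.
        return Err(format!("duplicate block download: {:?}", hash));
    }
    Ok(())
}
```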
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).
Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip. To mitigate this, a check against the state was
added. However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state. Because the sync algorithm now has a
`CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.
This change results in significantly smoother progress on mainnet.
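A sketch of the linkage check; the hash type is simplified and the method name is an assumption:

```rust
#[derive(Copy, Clone, PartialEq, Eq)]
struct Hash([u8; 32]);

/// A tip whose extensions can be verified without querying the state:
/// `expected_next` is the hash the next segment must start with.
struct CheckedTip {
    tip: Hash,
    expected_next: Hash,
}

impl CheckedTip {
    /// A new segment extends this tip only if its first hash is the one
    /// the tip expects next; any other response is discarded.
    fn is_extended_by(&self, segment: &[Hash]) -> bool {
        segment.first() == Some(&self.expected_next)
    }
}
```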
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API. So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too short on testnet.
Try to use the better cancellation logic to revert to previous sync
algorithm. As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handle errors by flushing the
pipeline and starting over. This hasn't worked well, because we didn't
previously cancel tasks properly. Now that we can, try to use something
in the spirit of the original sync algorithm.
This makes two changes relative to the existing download code:
1. It uses a oneshot to attempt to cancel the download task after it
has started (see the sketch after this list);
2. It encapsulates the download creation and cancellation logic into a
Downloads struct.
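A sketch of the cancellation in point 1; the function shape and cleanup are illustrative:

```rust
use tokio::sync::oneshot;

/// The `Downloads` struct keeps the `Sender` half of this channel,
/// so firing (or dropping) it cancels the spawned task.
async fn download_and_verify(mut cancel_rx: oneshot::Receiver<()>) {
    tokio::select! {
        // Cancellation wins the race: abandon the download.
        _ = &mut cancel_rx => {
            // clean up without committing anything to the state
        }
        // Otherwise run the download and verification to completion.
        _ = fetch_and_verify_block() => {}
    }
}

async fn fetch_and_verify_block() {
    // ... request the block from peers and verify it ...
}
```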