Commit Graph

247 Commits

Author SHA1 Message Date
teor 656bd24ba7 Hedge every syncer block download request
Remove the minimum data points from the syncer hedge configuragtion.
When there are no data points, hedge sends the second request
immediately.

Where there are less than 1/(1-latency_percentile) data points (20),
hedge delays the second request by the highest recent download time.

This change should improve genesis and post-restart sync latency.
2020-10-28 11:31:04 -04:00
Henry de Valence 4c960c4e6d zebrad: treat duplicate downloads as an error
We should error if we notice that we're attempting to download the same
blocks multiple times, because that indicates that peers reported bad
information to us, or we got confused trying to interpret their
responses.
2020-10-26 12:05:35 -07:00
Henry de Valence 4127d086ea zebrad: clarify hedge layering motivation
Co-authored-by: teor <teor@riseup.net>
2020-10-26 12:05:35 -07:00
Henry de Valence 253bab042e sync: add a concurrency limit for block downloads 2020-10-26 12:05:35 -07:00
Henry de Valence 0a405c737d zebrad: check state in obtaintips, not extendtips.
The original sync algorithm split the sync process into two phases, one
that obtained prospective chain tips, and another that attempted to
extend those chain tips as far as possible until encountering an error
(at which point the prospective state is discarded and the process
restarts).

Because a previous implementation of this algorithm didn't properly
enforce linkage between segments of the chain while extending tips,
sometimes it would get confused and fail to discard responses that did
not extend a tip.  To mitigate this, a check against the state was
added.  However, this check can cause stalls while checkpointing,
because when a checkpoint is reached we may suddenly need to commit
thousands of blocks to the state.  Because the sync algorithm now has a
a `CheckedTip` structure that ensures that a new segment of hashes
actually extends an existing one, we don't need to check against the
state while extending a tip, because we don't get confused while
interpreting responses.

This change results in significantly smoother progress on mainnet.
2020-10-26 12:05:35 -07:00
Henry de Valence ce2ac3336f zebrad: add debug message before state check
This reveals that there may be contention in access to the state, as
this takes a long time.
2020-10-26 12:05:35 -07:00
Henry de Valence 91469faf3c zebrad: eliminate duplicate span in sync 2020-10-26 12:05:35 -07:00
Henry de Valence b5a43f4516 zebrad: remove implementation details from docs
The timeout behavior in zebra-network is an implementation detail, not a
feature of the public API.  So it shouldn't be mentioned in the doc
comments -- if we want timeout behavior, we have to layer it ourselves.
2020-10-26 12:05:35 -07:00
Henry de Valence 1d7309afe2 zebrad: correctly handle duplicates in DownloadSet
Using the cancel_handles, we can deduplicate requests.  This is
important to do, because otherwise when we insert the second cancel
handle, we'd drop the first one, cancelling an existing task for no
reason.
2020-10-26 12:05:35 -07:00
Henry de Valence 56fe4f4379 zebrad: unify sync restart logic
This lets us keep the main loop simple and just write `continue 'sync;`
to keep going.
2020-10-26 12:05:35 -07:00
Henry de Valence 12d25159c6 zebrad: use hedged requests in sync
The hedge middleware implements hedged requests, as described in _The
Tail At Scale_. The idea is that we auto-tune our retry logic according
to the actual network conditions, pre-emptively retrying requests that
exceed some latency percentile. This would hopefully solve the problem
where our timeouts are too long on mainnet and too slow on testnet.
2020-10-26 12:05:35 -07:00
Henry de Valence 5f229d1475 zebrad: use Downloads in sync
Try to use the better cancellation logic to revert to previous sync
algorithm.  As designed, the sync algorithm is supposed to proceed by
downloading state prospectively and handle errors by flushing the
pipeline and starting over.  This hasn't worked well, because we didn't
previously cancel tasks properly.  Now that we can, try to use something
in the spirit of the original sync algorithm.
2020-10-26 12:05:35 -07:00
Henry de Valence b90581a3d7 zebrad: create a Downloads Stream for syncing.
This makes two changes relative to the existing download code:

1.  It uses a oneshot to attempt to cancel the download task after it
    has started;

2.  It encapsulates the download creation and cancellation logic into a
    Downloads struct.
2020-10-26 12:05:35 -07:00
Henry de Valence b636660d6a zebrad: rename sync::Error alias to BoxError. 2020-10-26 12:05:35 -07:00
Henry de Valence eb43893de0 consensus: minimize API, clean docs
This reduces the API surface to the minimum required for functionality,
and cleans up module documentation.  The stub mempool module is deleted
entirely, since it will need to be redone later anyways.
2020-10-20 11:16:22 -04:00
Alfredo Garcia c0a14ecc8c
move genesis parameters to zebra-chain (#1151) 2020-10-12 14:08:23 -07:00
Jane Lusby 855f9b5bcb
Implement MVP of NonFinalizedState and integrate it with the state service (#1101)
* implement most of the chain functions
* implement fork
* fix outpoint handling in Chain struct
* update expect for work
* split utxo into two sets
* update the Chain definition
* remove allow attribute in zebra-state/lib.rs
* merge ChainSet type into MemoryState
* Add error messages to asserts
* export proptest impls for use in downstream crates
* add testjob for disabled feature in zebra-chain
* try to fix github actions syntax
* add module doc comment
* update RFC for utxos
* add missing header
* working proptest for Chain
* propagate back results over channel
* Start updating RFC to match changes
* implement queued block pruning
* and now it syncs wooo!
* remove empty modules
* setup config for proptests
* re-enable missing_docs lint
* update RFC to match changes in impl
* add documentation
* use more explicit variable names
2020-10-08 13:07:32 +10:00
Henry de Valence fe61090a64 zebrad: make Inbound Poll::Ready before setup.
The Inbound service only needs the network setup for some requests, but
it can service other requests without it.  Making it return
Poll::Pending until the network setup finishes means that initial
network connections may view the Inbound service as overloaded and
attempt to load-shed.
2020-09-21 09:26:39 -07:00
Henry de Valence 9c021025a7 network: fill in remaining request/response pairs 2020-09-20 10:21:18 -07:00
Henry de Valence 4b35fea492 zebrad: document Inbound, ChainSync responsibilities 2020-09-18 18:34:25 -07:00
Henry de Valence 65877cb4b1 zebrad: make Inbound propagate backpressure 2020-09-18 18:34:25 -07:00
Henry de Valence 55f46967b2 zebrad: serve blocks from Inbound service
The original version of this commit ran into

https://github.com/rust-lang/rust/issues/64552

again.  Thanks to @yaahc for suggesting a workaround (using futures combinators
to avoid writing an async block).
2020-09-18 18:34:25 -07:00
Henry de Valence 1d0ebf89c6 zebrad: move seed command into inbound component
Remove the seed command entirely, and make the behavior it provided
(responding to `Request::Peers`) part of the ordinary functioning of the
start command.

The new `Inbound` service should be expanded to handle all request
types.
2020-09-18 18:34:25 -07:00
Jane Lusby ca648ff27c
Enable issue-url feature in color-eyre (#1072)
* Enable issue-url feature in color-eyre

* get version automatically

* and the url!
2020-09-17 15:09:18 -07:00
Henry de Valence 3133214e4f zebrad: use new state API 2020-09-11 13:37:49 -07:00
Henry de Valence 24de90c900 zebrad: tidy sync imports 2020-09-10 09:45:52 -07:00
Henry de Valence 9b6e66c1b9 zebrad: rename Syncer to ChainSync
This name clarifies what is being synced and avoids an agent-noun
construction.
2020-09-10 09:45:52 -07:00
Henry de Valence 0bc79686b8 zebrad: move sync into components module.
Part of #1030.
2020-09-10 09:45:52 -07:00
teor a6d6e65940 fix: fix the flamegraph module comment 2020-09-01 11:40:18 -04:00
Alfredo Garcia 9c387521bd
Print endpoint addresses at startup (#867)
* print tracing and metrics endpoints in startup

* print network address in startup
2020-08-10 12:47:26 -07:00
Henry de Valence a77328ad7c
Refactor tracing components (#834)
* Split tracing component code into modules.

* Repatriate Tracing and simplify config handling.

We upstreamed our Tracing component, expecting not to have to exert fine
control over the tracing settings.  But this turned out not to be the case, and
now that we want to do other things (flamegraphs, journalctl, opentelemetry,
etc), we end up with really awkward code (as in the current flamegraph
handling).

This also makes use of the changes to `init()` to load the config early to pass
configuration data into the components, which avoids the need for the
refactoring in #775.

Finally, we restore support for the `-v` flag when the filter is unset.  Closes #831.

* Disable tracing and metrics endpoints by default.

Closes #660.

* Switch back to upstream Abscissa.

* Integrate flamegraph support into the new Tracing component.

* Pass -v in acceptance tests to get info-level output.

* Clean up acceptance test code.
2020-08-06 10:29:31 -07:00
Jane Lusby 867dd0b475
Setup tracing-flame for use profiling zebrad (#436)
* Setup tracing-flame for use profiling zebrad

* start work on conditional flamegraph generation

* review time!

* update comments

* Update Cargo.toml

* disable default features for inferno

* reorganize

* missing one trait

* Apply suggestions from code review

* graceful shutdown!

* remove special case handling on ctrlc for cleanup

* rename signal fn to better represent its responsibility

* remove unused global hook for flushing flamegraph

* move tracing logic to the right file

* just copy linkerd's signal handling logic

* update book

* make zebrad app drop on shutdown normally

* Update zebrad/src/components/tokio.rs

Co-authored-by: teor <teor@riseup.net>

* Update zebrad/src/application.rs

Co-authored-by: teor <teor@riseup.net>

* Apply suggestions from code review

Co-authored-by: teor <teor@riseup.net>

* cleanup a little

* ooh yea there's an API for that

* setup env-filter for backup subscriber

* document env filter

* document return codes

* forgot to save

* Update book/src/applications/zebrad.md

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2020-08-05 16:35:56 -07:00
teor 050c46388f fix: Open the endpoints after the config is loaded
We get the injected TokioComponent dependency before the config is
loaded, so we can't use it to open the endpoints.

And we can't define after_config, because we use derive(Component).

So we work around these issues by opening the endpoints manually,
from the application's after_config.
2020-07-29 16:03:52 +10:00
teor e7437cc551 feature: Get endpoint addresses from config 2020-07-29 16:03:52 +10:00
Deirdre Connolly 05316dee21 Listen on 0.0.0.0, not 127.0.0.1
Turns out when your node faces the internet directly, it has to listen
to those addresses directly.
2020-06-19 03:46:09 -04:00
Henry de Valence f98cda40f9 Remove unused import. 2020-02-21 06:48:25 -05:00
Henry de Valence 75d3d44fb3 Metrics MVP: add two metrics and export them to Prometheus.
Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>
2020-02-14 20:14:05 -05:00
Henry de Valence 4fcb550aa6 Fix a deadlock in TokioComponent.
The components are accessed by a lock on application state.  When some command
calls block_on to enter an async context, it obtained a write lock on the
entire application state.  This meant that if the application state were
accessed later in an async context, a deadlock would occur.  Instead the
TokioComponent holds an Option<Runtime> now, so that before calling block_on,
the caller can .take() the runtime and release the lock.  Since we only ever
enter an async context once, it's not a problem that the component is then
missing its runtime, as once we are inside of a task we can access the runtime.
2020-01-15 12:06:31 -08:00
Henry de Valence ab3db201ee Change TracingEndpoint to forward to the Abscissa Tracing component. 2020-01-15 12:06:31 -08:00
Tony Arcieri 45eb81a204 Upgrade to Abscissa v0.5 2020-01-15 12:06:31 -08:00
Henry de Valence 2965187b91 Upgrade tokio, futures, hyper to released versions. 2019-12-13 17:42:15 -05:00
Henry de Valence 79c36a979c Use try_bind when building tracing endpoint.
Prior to this commit, the tracing endpoint would attempt to bind the
given address or panic; now, if it is unable to bind the given address
it displays an error but continues running the rest of the application.
This means that we can spin up multiple Zebra instances for load
testing.
2019-09-30 21:30:36 -04:00
Henry de Valence c8a3d47b56 Use tracing::instrument and monitor for messages. 2019-09-23 22:17:12 -04:00
Henry de Valence df7801d623 Temporarily change hyper to git version.
This avoids some crate selection conflicts, but makes some futures
extension traits fall out of order?  This seems to be an issue with
`pin-project` resolved in the git branch of `hyper` (but not yet
released).
2019-09-22 17:27:08 -04:00
Henry de Valence a64a051276 Clean tracing_subscriber deprecation warnings. 2019-09-20 16:02:55 -04:00
Henry de Valence e0cd099487 Fix type with updated tracing-subscriber
An updated tracing-subscriber version changed one of the public types;
because we hardcode the type instead of being generic over S:
Subscriber, this was actually a breaking change.  As noted in the
comment adjacent to this line, we would rather be generic over S, but
this requires fixing a bug in abscissa's proc-macros, so in the meantime
we hardcode the type.
2019-09-18 17:32:06 -04:00
Deirdre Connolly 162b37fe8d Tracing endpoint (#3)
* Add a TracingConfig and some components

Co-authored-by: Deirdre Connolly <deirdre@zfnd.org>

* Restructure, use dependency injection, initialize tracing

* Start a placeholder loop in start command

* Add hyper alpha.1, bump tokio to alpha.4

* Hello world endpoint using async/await from hyper 0.13 alpha

Also cleaned up some linter messages.

Co-authored-by: Henry de Valence <hdevalence@hdevalence.ca>

* Update to tracing_subscriber 0.1

* fmt

* add rust-toolchain

* Remove hyper::Version import

* wip: start filter_handler impl

* Add .rustfmt.toml

* rustfmt

* Tidy up .rustfmt.toml

* Add filter reloading handling.

* bump toolchain

* Remove generated hello world acceptance tests.

These test the behaviour of the autogenerated binary and work as examples of
how to test the behaviour of abscissa binaries.  Since we don't print "Hello
World" any more, they fail, but we don't yet have replacement behaviour to add
tests for, so they're removed for now.

* Clean up config file handling with Option::and_then.
2019-09-09 13:05:42 -07:00