Commit Graph

180 Commits

Author SHA1 Message Date
teor b1e1291f45 Log inbound peer requests at debug
Logging at info was a bit too verbose.

Also add a short log message.
2020-09-10 09:46:53 -07:00
Henry de Valence 24de90c900 zebrad: tidy sync imports 2020-09-10 09:45:52 -07:00
Henry de Valence 9b6e66c1b9 zebrad: rename Syncer to ChainSync
This name clarifies what is being synced and avoids an agent-noun
construction.
2020-09-10 09:45:52 -07:00
Henry de Valence 0bc79686b8 zebrad: move sync into components module.
Part of #1030.
2020-09-10 09:45:52 -07:00
teor adafe1d189 Restart sync after the first failed ObtainTips
The ObtainTips retry was redundant. The timeout wasn't much shorter, but
it made the code and sync logic more complicated.
2020-09-09 15:35:09 -07:00
teor 2a68ef5acb Update the peerset buffer size and sync timeout
Also add a bunch of comments and documentation for network-constrained
nodes, and for testnet.
2020-09-08 12:44:33 -07:00
teor b062a682b0 Refactor "waiting for pending blocks" log 2020-09-08 12:44:33 -07:00
teor e6e859dce2 Tweak sync timeouts
* increase the EWMA default and decay
* increase the block download retries
* increase the request and block download timeouts
* increase the sync timeout
2020-09-08 12:44:33 -07:00
teor ce12d4dadc Add timeouts for tip responses and block verify tasks 2020-09-08 12:44:33 -07:00
teor 379ce5c1b8 Retry obtain and extend tips on failure 2020-09-08 12:44:33 -07:00
Alfredo Garcia 454e75e7c0
Rename old references to BlockHeaderHash and BlockHeight (#1002)
* rename some references

* Apply suggestions from code review

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Co-authored-by: teor <teor@riseup.net>

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
Co-authored-by: teor <teor@riseup.net>
2020-09-04 15:40:48 -07:00
teor 48497d4857
Ignore sync errors when the block is already verified (#980)
* Ignore sync errors when the block is already verified

If we get an error for a block that is already in our state, we don't
need to restart the sync. It was probably a duplicate download.

Also:

Process any ready tasks before reset, so the logs and metrics are
up to date. (But ignore the errors, because we're about to reset.)

Improve sync logging and metrics during the download and verify task.

* Remove duplicate hashes in logs
Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* Log the sync hash span at warn level
Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-09-04 08:13:00 +10:00
teor 437549d8e9
Always drop the final hash in peer responses (#991)
To workaround a zcashd bug that squashes responses together.
2020-09-04 08:09:34 +10:00
teor c770daa51f
If the first ExtendTips hash is bad, discard it and re-check (#992) 2020-09-04 08:08:19 +10:00
Alfredo Garcia 5485f4429a
Add config path to acceptance tests (#946)
* add and apply config mode to get_child

* remove option to read config from current directory

* remove argument from get_child
2020-09-03 13:13:23 -07:00
Jane Lusby ffdec0cb23
Remove in-memory state service (#974)
* Remove in-memory state service

* make the config compatible with toml again

* checkpoint commit to see how much I still have to revert

* back to the starting point...

* remove unused dependency

* reorganize error handling a bit

* need to make a new color-eyre release now

* reorder again because I have problems

* remove unnecessary helpers

* revert changes to config loading

* add back missing space

* Switch to released color-eyre version

* add back missing newline again...

* improve error message on unix when terminated by signal

* add context to last few asserts in acceptance tests

* instrument some of the helpers

* remove accidental extra space

* try to make this compile on windows

* reorg platform specific code

* hide on_disk module and fix broken link
2020-09-01 12:39:04 -07:00
teor 3fdfcb3179 fix: remove old tips that are behind new tips
This change makes sync less reliant on the exact order of ObtainTips and
ExtendTips responses.
2020-09-01 11:42:48 -04:00
teor a6d6e65940 fix: fix the flamegraph module comment 2020-09-01 11:40:18 -04:00
teor 78201b456d feature: Implement checkpoint_sync for checkpoint verification
* add CheckpointList::new_up_to(limit: NetworkUpgrade)
* if checkpoint_sync is false, limit checkpoints to Sapling
* update tests for CheckpointList and chain::init
2020-08-24 15:34:46 +10:00
teor 06f4a59664 feature: Add a checkpoint_sync config option
(The option doesn't do anything yet.)
2020-08-24 15:34:46 +10:00
teor b8e8d4f548 fix: Remove some deeply-nested instrument spans
Closes #923.
2020-08-20 14:52:39 -04:00
Henry de Valence 103b663c40 chain: rename BlockHeight to block::Height 2020-08-17 11:46:34 -07:00
Henry de Valence 61dea90e2f chain: rename BlockHeaderHash to block::Hash
This is the first in a sequence of changes that change the block:: items
to not include Block as a prefix in their name, in accordance with the
Rust API guidelines.
2020-08-17 11:46:34 -07:00
Henry de Valence 948b067808 chain: move Network, NetworkUpgrade to parameters
Also, avoid using star-imports of the enum variants, which pollutes the
namespace.
2020-08-17 11:46:34 -07:00
Henry de Valence 0d1f56ad2f chain: remove utils module
A catch-all utils module can really easily slip into being a place to stash
miscellaneous functions that don't really belong anywhere in particular.
2020-08-17 11:46:34 -07:00
Henry de Valence a79ce97957
Fix sync algorithm. (#887)
* checkpoint: reject older of duplicate verification requests.

If we get a duplicate block verification request, we should drop the older one
in favor of the newer one, because the older request is likely to have been
canceled.  Previously, this code would accept up to four duplicate verification
requests, then fail all subsequent ones.

* sync: add a timeout layer to block requests.

Note that if this timeout is too short, we'll bring down the peer set in a
retry storm.

* sync: restart syncing on error

Restart the syncing process when an error occurs, rather than ignoring it.
Restarting means we discard all tips and start over with a new block locator,
so we can have another chance to "unstuck" ourselves.

* sync: additional debug info

* sync: handle lookahead limit correctly.

Instead of extracting all the completed task results, the previous code pulled
results out until there were fewer tasks than the lookahead limit, then
stopped.  This meant that completed tasks could be left until the limit was
exceeded again.  Instead, extract all completed results, and use the number of
pending tasks to decide whether to extend the tip or wait for blocks to finish.

* network: add debug instrumentation to retry policy

* sync: instrument the spawned task

* sync: streamline ObtainTips/ExtendTips logic & tracing

This change does three things:

1.  It aligns the implementation of ObtainTips and ExtendTips so that they use
the same deduplication method.  This means that when debugging we only have one
deduplication algorithm to focus on.

2.  It streamlines the tracing output to not include information already
included in spans. Both obtain_tips and extend_tips have their own spans
attached to the events, so it's not necessary to add Scope: prefixes in
messages.

3.  It changes the messages to be focused on reporting the actual
events rather than the interpretation of the events (e.g., "got genesis hash in
response" rather than "peer could not extend tip").  The motivation for this
change is that when debugging, the interpretation of events is already known to
be incorrect, in the sense that the mental model of the code (no bug) does not
match its behavior (has bug), so presenting minimally-interpreted events forces
interpretation relative to the actual code.

* sync: hack to work around zcashd behavior

* sync: localize debug statement in extend_tips

* sync: change algorithm to define tips as pairs of hashes.

This is different enough from the existing description that its comments no
longer apply, so I removed them.  A further chunk of work is to change the sync
RFC to document this algorithm.

* sync: reduce block timeout

* state: add resource limits for sled

Closes #888

* sync: add a restart timeout constant

* sync: de-pub constants
2020-08-12 16:48:01 -07:00
Henry de Valence 299afe13df
zebra-network tweaks. (#877)
* network: move gossiped peer selection logic into address book.

* network: return BoxService from init.

* zebrad: add note on why we truncate thegossiped peer list

Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* Remove unused .rustfmt.toml

Many of these options are never actually loaded by our CI because of a channel
mismatch, where they're not applied on stable but only on nightly (see the logs
from a rustfmt job).  This means that we can get different settings when
running `cargo fmt` on the nightly and stable channels, which was causing a CI
failure on this PR.  Reverting back to the default rustfmt settings avoids this
problem and keeps us in line with upstream rustfmt.  There's no loss to us
since we were using the defaults anyways.

Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-08-11 13:07:44 -07:00
teor 2550c44d48
Make sync ignore known hashes (#853)
* fix: Handle known ObtainTips correctly

enumerate never returns a value beyond the end of the vector.

* fix: Ignore known tips in ExtendTips

Some peers send us known tips when we try to extend.

* fix: Ignore known hashes when downloading

Despite all our other checks, we still end up downloading some hashes
multiple times.

* fix: Increase the number of retries

The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.

Instead, just retry the fetches that fail.

* fix: Tweak comments

Co-authored-by: Jane Lusby <jlusby42@gmail.com>

* fix: Cleanup the state_contains interface in Sync

* Fix brackets

Oops

Co-authored-by: Jane Lusby <jlusby42@gmail.com>
2020-08-10 16:17:50 -07:00
Alfredo Garcia 9c387521bd
Print endpoint addresses at startup (#867)
* print tracing and metrics endpoints in startup

* print network address in startup
2020-08-10 12:47:26 -07:00
teor e95358dbe3 fix: Increase the number of retries
The old sync code relied on duplicate block fetches to make progress,
but the last few commits have removed some of those duplicates.

Instead, just retry the fetches that fail.
2020-08-10 18:58:21 +10:00
teor faac50697c feature: Add a verified blocks metrics counter
We have a counter for pending "download and verify" futures. But these
futures are spawned, so they can complete in any order. They can also
complete before we receive their results.
2020-08-10 15:12:08 +10:00
teor 6aeefcee8b fix: Improve sync diagnostics 2020-08-10 15:12:08 +10:00
Henry de Valence 6d1a4b2218
Load config after initializing the Terminal (#848) 2020-08-06 17:22:40 -07:00
Alfredo Garcia c52481c041 fix logs 2020-08-07 09:21:57 +10:00
Jane Lusby 3e9c6f054b
fix log level default for server commands (#840)
* fix log level default for server commands

* remove dbg
2020-08-06 11:23:00 -07:00
Henry de Valence a77328ad7c
Refactor tracing components (#834)
* Split tracing component code into modules.

* Repatriate Tracing and simplify config handling.

We upstreamed our Tracing component, expecting not to have to exert fine
control over the tracing settings.  But this turned out not to be the case, and
now that we want to do other things (flamegraphs, journalctl, opentelemetry,
etc), we end up with really awkward code (as in the current flamegraph
handling).

This also makes use of the changes to `init()` to load the config early to pass
configuration data into the components, which avoids the need for the
refactoring in #775.

Finally, we restore support for the `-v` flag when the filter is unset.  Closes #831.

* Disable tracing and metrics endpoints by default.

Closes #660.

* Switch back to upstream Abscissa.

* Integrate flamegraph support into the new Tracing component.

* Pass -v in acceptance tests to get info-level output.

* Clean up acceptance test code.
2020-08-06 10:29:31 -07:00
Jane Lusby 867dd0b475
Setup tracing-flame for use profiling zebrad (#436)
* Setup tracing-flame for use profiling zebrad

* start work on conditional flamegraph generation

* review time!

* update comments

* Update Cargo.toml

* disable default features for inferno

* reorganize

* missing one trait

* Apply suggestions from code review

* graceful shutdown!

* remove special case handling on ctrlc for cleanup

* rename signal fn to better represent its responsibility

* remove unused global hook for flushing flamegraph

* move tracing logic to the right file

* just copy linkerd's signal handling logic

* update book

* make zebrad app drop on shutdown normally

* Update zebrad/src/components/tokio.rs

Co-authored-by: teor <teor@riseup.net>

* Update zebrad/src/application.rs

Co-authored-by: teor <teor@riseup.net>

* Apply suggestions from code review

Co-authored-by: teor <teor@riseup.net>

* cleanup a little

* ooh yea there's an API for that

* setup env-filter for backup subscriber

* document env filter

* document return codes

* forgot to save

* Update book/src/applications/zebrad.md

Co-authored-by: teor <teor@riseup.net>

Co-authored-by: teor <teor@riseup.net>
2020-08-05 16:35:56 -07:00
Henry de Valence 4a03d76a41
Remove environment variables in favor of documented config options. (#827)
* Load tracing filter only from config and simplify logic.

* Configure the state storage in the config, not an environment variable.

This also changes the config so that the path is always set rather than being
optional, because Zebra always needs a place to store its config.
2020-08-05 11:48:08 -07:00
Henry de Valence 82da4a5326 Remove connect command. 2020-08-04 23:34:45 -07:00
Alfredo Garcia f2d7bb3177
Command execution tests (#690)
* add zebrad acceptance tests
* add custom command test helpers that work with kill
* add and use info event for start and seed commands
* combine conflicting tests into one test case

Co-authored-by: Jane Lusby <jane@zfnd.org>
2020-08-01 16:15:26 +10:00
Alfredo Garcia 617f1d80ef move docs to zebra book 2020-07-29 19:44:21 -07:00
Alfredo Garcia 6297a7cd19 document zebrad enviroment variables 2020-07-29 19:44:21 -07:00
teor 050c46388f fix: Open the endpoints after the config is loaded
We get the injected TokioComponent dependency before the config is
loaded, so we can't use it to open the endpoints.

And we can't define after_config, because we use derive(Component).

So we work around these issues by opening the endpoints manually,
from the application's after_config.
2020-07-29 16:03:52 +10:00
teor e7437cc551 feature: Get endpoint addresses from config 2020-07-29 16:03:52 +10:00
teor 11090dbf91 feature: Separate Mainnet and Testnet state 2020-07-29 01:45:19 -04:00
Alfredo Garcia 5b3c6e4c6c
Port bash checkpoint scripts to zebra-checkpoints single rust binary (#740)
* make zebra-checkpoints
* fix LOOKAHEAD_LIMIT scope
* add a default cli path
* change doc usage text
* add tracing
* move MAX_CHECKPOINT_HEIGHT_GAP to zebra-consensus
* do byte_reverse_hex in a map
2020-07-25 17:53:00 +10:00
Henry de Valence b59cfc49b7 sync: create requests sequentially to respect backpressure.
This seems like a better design on principle but also appears to give a much
nicer sawtooth pattern of queued blocks in the checkpointer and a much smoother
pattern of block requests.
2020-07-24 18:36:00 -04:00
teor 2acfcf3a90
Make the CheckpointVerifier handle partial restarts (#736)
Also put generic bounds on the BlockVerifier struct,
so we get better compilation errors.
2020-07-24 11:47:48 +10:00
teor 77a1fefa1e
Download genesis (#731)
* feature: Add more CheckpointVerifier tracing

* fix: Download the genesis block
2020-07-23 10:56:52 -07:00
Jane Lusby c1a1493159
use dirs crate for default location of state and config (#714)
* use dirs crate for default location of state and config
* panic if a path isn't specified for zebra-state
2020-07-23 21:12:20 +10:00