Update async correctness docs and the async in Zebra RFC (#3243)

* Justify that the ErrorSlot Mutex is deadlock-safe

* Document cancellation safety in the async RFC

* Document task starvation in the async RFC

Co-authored-by: Marek <mail@marek.onl>
This commit is contained in:
teor 2021-12-21 17:10:15 +10:00 committed by GitHub
parent c47ea80529
commit 6814525a7a
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 70 additions and 5 deletions

View File

@ -97,8 +97,10 @@ let res = ready!(this
## Futures-Aware Mutexes
[futures-aware-mutexes]: #futures-aware-mutexes
To avoid hangs or slowdowns, use futures-aware types. For more details, see the
[Futures-Aware Types](#futures-aware-types) section.
To avoid hangs or slowdowns, prefer futures-aware types,
particularly for complex waiting or locking code.
But in some simple cases, [std::sync::Mutex is more efficient](https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use).
For more details, see the [Futures-Aware Types](#futures-aware-types) section.
Zebra's [`Handshake`](https://github.com/ZcashFoundation/zebra/blob/a63c2e8c40fa847a86d00c754fb10a4729ba34e5/zebra-network/src/peer/handshake.rs#L204)
won't block other tasks on its thread, because it uses `futures::lock::Mutex`:
@ -452,6 +454,8 @@ async fn handle_client_request(&mut self, req: InProgressClientRequest) {
[reference-level-explanation]: #reference-level-explanation
The reference section contains in-depth information about concurrency in Zebra:
- [After an await, the rest of the Future might not be run](#cancellation-safe)
- [Task starvation](#starvation)
- [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
- [Futures-Aware Types](#futures-aware-types)
- [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness)
@ -467,6 +471,37 @@ The reference section contains in-depth information about concurrency in Zebra:
Most Zebra designs or code changes will only touch on one or two of these areas.
## After an await, the rest of the Future might not be run
[cancellation-safe]: #cancellation-safe
> Futures can be "canceled" at any await point. Authors of futures must be aware that after an await, the code might not run.
> Futures might be polled to completion causing the code to work. But then many years later, the code is changed and the future might conditionally not be polled to completion which breaks things.
> The burden falls on the user of the future to poll to completion, and there is no way for the lib author to enforce this - they can only document this invariant.
https://github.com/rust-lang/wg-async-foundations/blob/master/src/vision/submitted_stories/status_quo/alan_builds_a_cache.md#-frequently-asked-questions
In particular, [`FutureExt::now_or_never`](https://docs.rs/futures/0.3.17/futures/future/trait.FutureExt.html#method.now_or_never):
- drops the future, and
- doesn't schedule the task for wakeups.
So even if the future or service passed to `now_or_never` is cloned,
the task won't be awoken when it is ready again.
## Task starvation
[starvation]: #starvation
Tokio [tasks are scheduled cooperatively](https://docs.rs/tokio/1.15.0/tokio/task/index.html#what-are-tasks):
> a task is allowed to run until it yields, indicating to the Tokio runtimes scheduler
> that it cannot currently continue executing.
> When a task yields, the Tokio runtime switches to executing the next task.
If a task doesn't yield during a CPU-intensive operation, or a tight loop,
it can starve other tasks on the same thread. This can cause hangs or timeouts.
There are a few different ways to avoid task starvation:
- run the operation on another thread using [`spawn_blocking`](https://docs.rs/tokio/1.15.0/tokio/task/fn.spawn_blocking.html) or [`block_in_place`](https://docs.rs/tokio/1.15.0/tokio/task/fn.block_in_place.html)
- [manually yield using `yield_now`](https://docs.rs/tokio/1.15.0/tokio/task/fn.yield_now.html)
## `Poll::Pending` and Wakeups
[poll-pending-and-wakeups]: #poll-pending-and-wakeups
@ -483,7 +518,8 @@ Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
## Futures-Aware Types
[futures-aware-types]: #futures-aware-types
Use futures-aware types, rather than types which will block the current thread.
Prefer futures-aware types in complex locking or waiting code,
rather than types which will block the current thread.
For example:
- Use `futures::lock::Mutex` rather than `std::sync::Mutex`
@ -497,6 +533,17 @@ If you are unable to use futures-aware types:
- document the correctness of each blocking call
- consider re-designing the code to use `tower::Services`, or other futures-aware types
In some simple cases, `std::sync::Mutex` is correct and more efficient, when:
- the value behind the mutex is just data, and
- the locking behaviour is simple.
In these cases:
> wrap the `Arc<Mutex<...>>` in a struct
> that provides non-async methods for performing operations on the data within,
> and only lock the mutex inside these methods
For more details, see [the tokio documentation](https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use).
## Acquiring Buffer Slots, Mutexes, or Readiness
[acquiring-buffer-slots-mutexes-readiness]: #acquiring-buffer-slots-mutexes-readiness
@ -512,6 +559,11 @@ If a buffer, mutex, future or service has complex readiness dependencies,
schedule those dependencies separate tasks using `tokio::spawn`. Otherwise,
it might deadlock due to a dependency loop within a single executor task.
Carefully read the documentation of the channel methods you call, to check if they lock.
For example, [`tokio::sync::watch::Receiver::borrow`](https://docs.rs/tokio/1.15.0/tokio/sync/watch/struct.Receiver.html#method.borrow)
holds a read lock, so the borrowed data should always be cloned.
Use `Arc` for efficient clones if needed.
In all of these cases:
- make critical sections as short as possible, and
- do not depend on other tasks or locks inside the critical section.
@ -591,7 +643,9 @@ priority (for example, cancellation), you might want to use biased selection.
The [`futures::select!`](https://docs.rs/futures/0.3.13/futures/macro.select.html) and
[`tokio::select!`](https://docs.rs/tokio/0.3.6/tokio/macro.select.html) macros select
ready arguments at random.
ready arguments at random by default.
To poll a `select!` in order, pass `biased;` as the first argument to the macro.
Also consider the `FuturesUnordered` stream for unbiased selection of a large number
of futures. However, this macro and stream require mapping all arguments to the same
@ -650,7 +704,7 @@ Since Zebra's CI all runs on x86 (as of June 2021), our tests get `AcqRel` order
when we specify `Relaxed`. But ARM processors like the Apple M1
[implement weaker memory orderings, including genuinely `Relaxed` access](https://stackoverflow.com/questions/59089084/loads-and-stores-reordering-on-arm#59089757). For more details, see the [hardware reordering](https://doc.rust-lang.org/nomicon/atomics.html#hardware-reordering)
section of the Rust nomicon.
But if a Zebra feature requires atomics:
1. use an `AtomicUsize` with the strongest memory ordering ([`SeqCst`](https://doc.rust-lang.org/nomicon/atomics.html#sequentially-consistent))
2. use a weaker memory ordering, with:

View File

@ -89,6 +89,14 @@ impl PeerError {
/// Error slots are shared between sync and async code. In async code, the error
/// mutex should be held for as short a time as possible. This avoids blocking
/// the async task thread on acquiring the mutex.
///
/// > If the value behind the mutex is just data, its usually appropriate to use a blocking mutex
/// > ...
/// > wrap the `Arc<Mutex<...>>` in a struct
/// > that provides non-async methods for performing operations on the data within,
/// > and only lock the mutex inside these methods
///
/// https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
#[derive(Default, Clone)]
pub struct ErrorSlot(Arc<std::sync::Mutex<Option<SharedPeerError>>>);

View File

@ -468,6 +468,9 @@ where
self.remove(&key);
}
Change::Insert(key, svc) => {
// We add peers as unready, so that we:
// - always do the same checks on every ready peer, and
// - check for any errors that happened right after the handshake
trace!(?key, "got Change::Insert from Discover");
self.remove(&key);
self.push_unready(key, svc);