Update async correctness docs and the async in Zebra RFC (#3243)

* Justify that the ErrorSlot Mutex is deadlock-safe * Document cancellation safety in the async RFC * Document task starvation in the async RFC Co-authored-by: Marek <mail@marek.onl>
2021-12-21 17:10:15 +10:00 · 2021-12-21 17:10:15 +10:00 · 6814525a7a
parent c47ea80529
commit 6814525a7a
3 changed files with 70 additions and 5 deletions
--- a/book/src/dev/rfcs/0011-async-rust-in-zebra.md
+++ b/book/src/dev/rfcs/0011-async-rust-in-zebra.md
@ -97,8 +97,10 @@ let res = ready!(this
 ## Futures-Aware Mutexes
 [futures-aware-mutexes]: #futures-aware-mutexes

-To avoid hangs or slowdowns, use futures-aware types. For more details, see the
-[Futures-Aware Types](#futures-aware-types) section.
+To avoid hangs or slowdowns, prefer futures-aware types,
+particularly for complex waiting or locking code.
+But in some simple cases, [std::sync::Mutex is more efficient](https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use).
+For more details, see the [Futures-Aware Types](#futures-aware-types) section.

 Zebra's [`Handshake`](https://github.com/ZcashFoundation/zebra/blob/a63c2e8c40fa847a86d00c754fb10a4729ba34e5/zebra-network/src/peer/handshake.rs#L204)
 won't block other tasks on its thread, because it uses `futures::lock::Mutex`:
@ -452,6 +454,8 @@ async fn handle_client_request(&mut self, req: InProgressClientRequest) {
 [reference-level-explanation]: #reference-level-explanation

 The reference section contains in-depth information about concurrency in Zebra:
+- [After an await, the rest of the Future might not be run](#cancellation-safe)
+- [Task starvation](#starvation)
 - [`Poll::Pending` and Wakeups](#poll-pending-and-wakeups)
 - [Futures-Aware Types](#futures-aware-types)
 - [Acquiring Buffer Slots, Mutexes, or Readiness](#acquiring-buffer-slots-mutexes-readiness)
@ -467,6 +471,37 @@ The reference section contains in-depth information about concurrency in Zebra:

 Most Zebra designs or code changes will only touch on one or two of these areas.

+## After an await, the rest of the Future might not be run
+[cancellation-safe]: #cancellation-safe
+
+> Futures can be "canceled" at any await point. Authors of futures must be aware that after an await, the code might not run.
+> Futures might be polled to completion causing the code to work. But then many years later, the code is changed and the future might conditionally not be polled to completion which breaks things.
+> The burden falls on the user of the future to poll to completion, and there is no way for the lib author to enforce this - they can only document this invariant.
+
+https://github.com/rust-lang/wg-async-foundations/blob/master/src/vision/submitted_stories/status_quo/alan_builds_a_cache.md#-frequently-asked-questions
+
+In particular, [`FutureExt::now_or_never`](https://docs.rs/futures/0.3.17/futures/future/trait.FutureExt.html#method.now_or_never):
+- drops the future, and
+- doesn't schedule the task for wakeups.
+
+So even if the future or service passed to `now_or_never` is cloned,
+the task won't be awoken when it is ready again.
+
+## Task starvation
+[starvation]: #starvation
+
+Tokio [tasks are scheduled cooperatively](https://docs.rs/tokio/1.15.0/tokio/task/index.html#what-are-tasks):
+> a task is allowed to run until it yields, indicating to the Tokio runtime’s scheduler
+> that it cannot currently continue executing.
+> When a task yields, the Tokio runtime switches to executing the next task.
+
+If a task doesn't yield during a CPU-intensive operation, or a tight loop,
+it can starve other tasks on the same thread. This can cause hangs or timeouts.
+
+There are a few different ways to avoid task starvation:
+- run the operation on another thread using [`spawn_blocking`](https://docs.rs/tokio/1.15.0/tokio/task/fn.spawn_blocking.html) or [`block_in_place`](https://docs.rs/tokio/1.15.0/tokio/task/fn.block_in_place.html)
+- [manually yield using `yield_now`](https://docs.rs/tokio/1.15.0/tokio/task/fn.yield_now.html)
+
 ## `Poll::Pending` and Wakeups
 [poll-pending-and-wakeups]: #poll-pending-and-wakeups

@ -483,7 +518,8 @@ Note: `poll` functions often have a qualifier, like `poll_ready` or `poll_next`.
 ## Futures-Aware Types
 [futures-aware-types]: #futures-aware-types

-Use futures-aware types, rather than types which will block the current thread.
+Prefer futures-aware types in complex locking or waiting code,
+rather than types which will block the current thread.

 For example:
 - Use `futures::lock::Mutex` rather than `std::sync::Mutex`
@ -497,6 +533,17 @@ If you are unable to use futures-aware types:
 - document the correctness of each blocking call
 - consider re-designing the code to use `tower::Services`, or other futures-aware types

+In some simple cases, `std::sync::Mutex` is correct and more efficient, when:
+- the value behind the mutex is just data, and
+- the locking behaviour is simple.
+
+In these cases:
+> wrap the `Arc<Mutex<...>>` in a struct
+> that provides non-async methods for performing operations on the data within,
+> and only lock the mutex inside these methods
+
+For more details, see [the tokio documentation](https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use).
+
 ## Acquiring Buffer Slots, Mutexes, or Readiness
 [acquiring-buffer-slots-mutexes-readiness]: #acquiring-buffer-slots-mutexes-readiness

@ -512,6 +559,11 @@ If a buffer, mutex, future or service has complex readiness dependencies,
 schedule those dependencies separate tasks using `tokio::spawn`. Otherwise,
 it might deadlock due to a dependency loop within a single executor task.

+Carefully read the documentation of the channel methods you call, to check if they lock.
+For example, [`tokio::sync::watch::Receiver::borrow`](https://docs.rs/tokio/1.15.0/tokio/sync/watch/struct.Receiver.html#method.borrow)
+holds a read lock, so the borrowed data should always be cloned.
+Use `Arc` for efficient clones if needed.
+
 In all of these cases:
 - make critical sections as short as possible, and
 - do not depend on other tasks or locks inside the critical section.
@ -591,7 +643,9 @@ priority (for example, cancellation), you might want to use biased selection.

 The [`futures::select!`](https://docs.rs/futures/0.3.13/futures/macro.select.html) and
 [`tokio::select!`](https://docs.rs/tokio/0.3.6/tokio/macro.select.html) macros select
-ready arguments at random.
+ready arguments at random by default.
+
+To poll a `select!` in order, pass `biased;` as the first argument to the macro.

 Also consider the `FuturesUnordered` stream for unbiased selection of a large number
 of futures. However, this macro and stream require mapping all arguments to the same
@ -650,7 +704,7 @@ Since Zebra's CI all runs on x86 (as of June 2021), our tests get `AcqRel` order
 when we specify `Relaxed`. But ARM processors like the Apple M1
 [implement weaker memory orderings, including genuinely `Relaxed` access](https://stackoverflow.com/questions/59089084/loads-and-stores-reordering-on-arm#59089757). For more details, see the [hardware reordering](https://doc.rust-lang.org/nomicon/atomics.html#hardware-reordering)
 section of the Rust nomicon.
- 
+
 But if a Zebra feature requires atomics:
 1. use an `AtomicUsize` with the strongest memory ordering ([`SeqCst`](https://doc.rust-lang.org/nomicon/atomics.html#sequentially-consistent))
 2. use a weaker memory ordering, with:
--- a/zebra-network/src/peer/error.rs
+++ b/zebra-network/src/peer/error.rs
@ -89,6 +89,14 @@ impl PeerError {
 /// Error slots are shared between sync and async code. In async code, the error
 /// mutex should be held for as short a time as possible. This avoids blocking
 /// the async task thread on acquiring the mutex.
+///
+/// > If the value behind the mutex is just data, it’s usually appropriate to use a blocking mutex
+/// > ...
+/// > wrap the `Arc<Mutex<...>>` in a struct
+/// > that provides non-async methods for performing operations on the data within,
+/// > and only lock the mutex inside these methods
+///
+/// https://docs.rs/tokio/1.15.0/tokio/sync/struct.Mutex.html#which-kind-of-mutex-should-you-use
 #[derive(Default, Clone)]
 pub struct ErrorSlot(Arc<std::sync::Mutex<Option<SharedPeerError>>>);

--- a/zebra-network/src/peer_set/set.rs
+++ b/zebra-network/src/peer_set/set.rs
@ -468,6 +468,9 @@ where
                    self.remove(&key);
                }
                Change::Insert(key, svc) => {
+                    // We add peers as unready, so that we:
+                    // - always do the same checks on every ready peer, and
+                    // - check for any errors that happened right after the handshake
                    trace!(?key, "got Change::Insert from Discover");
                    self.remove(&key);
                    self.push_unready(key, svc);