ready-cache: Avoid panic on strange race (#420)

It's been observed that occasionally tower-ready-cache would panic
trying to find an already canceled service in `cancel_pending_txs`
(#415). The source of the race is not entirely clear, but extensive
debugging demonstrated that occasionally a call to `evict` would send on
the `CancelTx` for a service, yet that service would be yielded back
from `pending` in `poll_pending` in a non-`Canceled` state. This
is equivalent to saying that this code may panic:

```rust
async {
  let (tx, rx) = oneshot::channel();
  tx.send(42).unwrap();
  yield_once().await;
  rx.try_recv().unwrap(); // <- may occasionally panic
}
```

I have not been able to demonstrate a self-contained example failing in
this way, but it's the only explanation I have found for the observed
bug. Pinning the entire runtime to one core still produced the bug,
which indicates that it is not a memory ordering issue. Replacing
oneshot with `mpsc::channel(1)` still produced the bug, which indicates
that the bug is not with the implementation of `oneshot`. Logs also
indicate that the `ChannelTx` we send on in `evict()` truly is the same
one associated with the `ChannelRx` polled in `Pending::poll`, so we're
not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a
ready/errored service's `CancelTx` as another signal that a service has
been removed. Specifically, if `poll_pending` finds a service that
returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it
assumes that it must be because the service _was_ canceled, but did not
observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not
fully understand the underlying problem. It _may_ be that a canceled
service could remain in the pending state for a very long time if it
does not become ready _and_ does not see the cancellation signal (so it
returns `Poll::Pending` and is not removed). That, in turn, might cause
an issue if the driver of the `ReadyCache` then chooses to re-use a key
they believe they have evicted. However, any such case _must_ first hit
the panic that exists in the code today, so this is still an improvement
over the status quo.

Fixes #415.
This commit is contained in:
Jon Gjengset 2020-02-24 13:03:43 -05:00 committed by GitHub
parent be156e733d
commit 414e3b0809
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 30 additions and 11 deletions

View File

@ -233,13 +233,28 @@ where
Poll::Ready(None) => return Poll::Ready(Ok(())),
Poll::Ready(Some(Ok((key, svc, cancel_rx)))) => {
trace!("endpoint ready");
let cancel_tx = self
.pending_cancel_txs
.swap_remove(&key)
.expect("missing cancelation");
// Keep track of the cancelation so that it need not be
// recreated after the service is used.
self.ready.insert(key, (svc, (cancel_tx, cancel_rx)));
let cancel_tx = self.pending_cancel_txs.swap_remove(&key);
if let Some(cancel_tx) = cancel_tx {
// Keep track of the cancelation so that it need not be
// recreated after the service is used.
self.ready.insert(key, (svc, (cancel_tx, cancel_rx)));
} else {
// This should not technically be possible. We must have decided to cancel
// a Service (by sending on the CancelTx), yet that same service then
// returns Ready. Since polling a Pending _first_ polls the CancelRx, that
// _should_ always see our CancelTx send. Yet empirically, that isn't true:
//
// https://github.com/tower-rs/tower/issues/415
//
// So, we instead detect the endpoint as canceled at this point. That
// should be fine, since the oneshot is only really there to ensure that
// the Pending is polled again anyway.
//
// We assert that this can't happen in debug mode so that hopefully one day
// we can find a test that triggers this reliably.
debug_assert!(cancel_tx.is_some());
debug!("canceled endpoint removed when ready");
}
}
Poll::Ready(Some(Err(PendingError::Canceled(_)))) => {
debug!("endpoint canceled");
@ -247,10 +262,14 @@ where
// cause this cancellation.
}
Poll::Ready(Some(Err(PendingError::Inner(key, e)))) => {
self.pending_cancel_txs
.swap_remove(&key)
.expect("missing cancelation");
return Err(error::Failed(key, e.into())).into();
let cancel_tx = self.pending_cancel_txs.swap_remove(&key);
if let Some(_) = cancel_tx {
return Err(error::Failed(key, e.into())).into();
} else {
// See comment for the same clause under Ready(Some(Ok)).
debug_assert!(cancel_tx.is_some());
debug!("canceled endpoint removed on error");
}
}
}
}