ready-cache: Avoid panic on strange race (#420)
It's been observed that occasionally tower-ready-cache would panic trying to find an already canceled service in `cancel_pending_txs` (#415). The source of the race is not entirely clear, but extensive debugging demonstrated that occasionally a call to `evict` would send on the `CancelTx` for a service, yet that service would be yielded back from `pending` in `poll_pending` in a non-`Canceled` state. This is equivalent to saying that this code may panic: ```rust async { let (tx, rx) = oneshot::channel(); tx.send(42).unwrap(); yield_once().await; rx.try_recv().unwrap(); // <- may occasionally panic } ``` I have not been able to demonstrate a self-contained example failing in this way, but it's the only explanation I have found for the observed bug. Pinning the entire runtime to one core still produced the bug, which indicates that it is not a memory ordering issue. Replacing oneshot with `mpsc::channel(1)` still produced the bug, which indicates that the bug is not with the implementation of `oneshot`. Logs also indicate that the `ChannelTx` we send on in `evict()` truly is the same one associated with the `ChannelRx` polled in `Pending::poll`, so we're not getting our wires crossed somewhere. It truly is bizarre. This patch resolves the issue by considering a failure to find a ready/errored service's `CancelTx` as another signal that a service has been removed. Specifically, if `poll_pending` finds a service that returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it assumes that it must be because the service _was_ canceled, but did not observe that cancellation signal. As an explanation, this isn't entirely satisfactory, since we do not fully understand the underlying problem. It _may_ be that a canceled service could remain in the pending state for a very long time if it does not become ready _and_ does not see the cancellation signal (so it returns `Poll::Pending` and is not removed). That, in turn, might cause an issue if the driver of the `ReadyCache` then chooses to re-use a key they believe they have evicted. However, any such case _must_ first hit the panic that exists in the code today, so this is still an improvement over the status quo. Fixes #415.
This commit is contained in:
parent
be156e733d
commit
414e3b0809
|
@ -233,13 +233,28 @@ where
|
|||
Poll::Ready(None) => return Poll::Ready(Ok(())),
|
||||
Poll::Ready(Some(Ok((key, svc, cancel_rx)))) => {
|
||||
trace!("endpoint ready");
|
||||
let cancel_tx = self
|
||||
.pending_cancel_txs
|
||||
.swap_remove(&key)
|
||||
.expect("missing cancelation");
|
||||
// Keep track of the cancelation so that it need not be
|
||||
// recreated after the service is used.
|
||||
self.ready.insert(key, (svc, (cancel_tx, cancel_rx)));
|
||||
let cancel_tx = self.pending_cancel_txs.swap_remove(&key);
|
||||
if let Some(cancel_tx) = cancel_tx {
|
||||
// Keep track of the cancelation so that it need not be
|
||||
// recreated after the service is used.
|
||||
self.ready.insert(key, (svc, (cancel_tx, cancel_rx)));
|
||||
} else {
|
||||
// This should not technically be possible. We must have decided to cancel
|
||||
// a Service (by sending on the CancelTx), yet that same service then
|
||||
// returns Ready. Since polling a Pending _first_ polls the CancelRx, that
|
||||
// _should_ always see our CancelTx send. Yet empirically, that isn't true:
|
||||
//
|
||||
// https://github.com/tower-rs/tower/issues/415
|
||||
//
|
||||
// So, we instead detect the endpoint as canceled at this point. That
|
||||
// should be fine, since the oneshot is only really there to ensure that
|
||||
// the Pending is polled again anyway.
|
||||
//
|
||||
// We assert that this can't happen in debug mode so that hopefully one day
|
||||
// we can find a test that triggers this reliably.
|
||||
debug_assert!(cancel_tx.is_some());
|
||||
debug!("canceled endpoint removed when ready");
|
||||
}
|
||||
}
|
||||
Poll::Ready(Some(Err(PendingError::Canceled(_)))) => {
|
||||
debug!("endpoint canceled");
|
||||
|
@ -247,10 +262,14 @@ where
|
|||
// cause this cancellation.
|
||||
}
|
||||
Poll::Ready(Some(Err(PendingError::Inner(key, e)))) => {
|
||||
self.pending_cancel_txs
|
||||
.swap_remove(&key)
|
||||
.expect("missing cancelation");
|
||||
return Err(error::Failed(key, e.into())).into();
|
||||
let cancel_tx = self.pending_cancel_txs.swap_remove(&key);
|
||||
if let Some(_) = cancel_tx {
|
||||
return Err(error::Failed(key, e.into())).into();
|
||||
} else {
|
||||
// See comment for the same clause under Ready(Some(Ok)).
|
||||
debug_assert!(cancel_tx.is_some());
|
||||
debug!("canceled endpoint removed on error");
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
Loading…
Reference in New Issue