From cad38415b210c2c367659ea6fd214fb4eb1ba83a Mon Sep 17 00:00:00 2001
From: Henry de Valence
Date: Mon, 7 Sep 2020 21:24:31 -0700
Subject: [PATCH] network: fix bug in inventory advertisement handling (#1022)

* network: fix bug in inventory advertisement handling

The RFC https://zebra.zfnd.org/dev/rfcs/0003-inventory-tracking.html
described the use of a `broadcast` channel in place of an `mpsc` channel
to get ring-buffer behavior, keeping a bound on the size of the channel
but dropping old entries when the channel is full. However, it didn't
explicitly describe how this works (the `broadcast` channel returns a
`RecvError::Lagged(u64)` to inform receivers that they lost messages),
so the lag-handling wasn't implemented and I didn't notice in review.

Instead, the `?` operator bubbled the lag error all the way up from
`InventoryRegistry::poll_inventory` through `::poll_ready` through
various Tower wrappers to users of the peer set.

The error propagation is bad enough, because it caused client errors
that shouldn't have happened, but there's a worse interaction. The
`Service` contract distinguishes between request errors (from
`Service::call`, scoped to the request) and service errors (from
`Service::poll_ready`, scoped to the service). The `Service` contract
specifies that once a service returns an error from `poll_ready`, the
service can be assumed to have failed permanently.

I believe (but haven't tested or carefully worked through the details)
that this caused various Tower middleware to report the entire peer set
service as permanently failed due to a transient inventory "error" (more
of an indicator than an error), and I suspect that this is the cause of
#1003, where all of the sync component's requests end up failing because
the peer set reported that it failed permanently. I am able to reproduce
#1003 locally before this change and unable to reproduce it locally
after this change, though I have not tested exhaustively.

* network: add metric for dropped inventory advertisements

Co-authored-by: teor
Co-authored-by: teor
---
 .../src/peer_set/inventory_registry.rs | 30 +++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/zebra-network/src/peer_set/inventory_registry.rs b/zebra-network/src/peer_set/inventory_registry.rs
index 4a6f27e4e..e7af48c01 100644
--- a/zebra-network/src/peer_set/inventory_registry.rs
+++ b/zebra-network/src/peer_set/inventory_registry.rs
@@ -64,8 +64,34 @@ impl InventoryRegistry {
             self.rotate();
         }
 
-        while let Poll::Ready(Some((hash, addr))) = Pin::new(&mut self.inv_stream).poll_next(cx)? {
-            self.register(hash, addr)
+        // This module uses a broadcast channel instead of an mpsc channel, even
+        // though there's a single consumer of inventory advertisements, because
+        // the broadcast channel has ring-buffer behavior: when the channel is
+        // full, sending a new message displaces the oldest message in the
+        // channel.
+        //
+        // This is the behavior we want for inventory advertisements, because we
+        // want to have a bounded buffer of unprocessed advertisements, and we
+        // want to prioritize new inventory (which is likely only at a specific
+        // peer) over old inventory (which is likely more widely distributed).
+        //
+        // The broadcast channel reports dropped messages by returning
+        // `RecvError::Lagged`. It's crucial that we handle that error here
+        // rather than propagating it through the peer set's Service::poll_ready
+        // implementation, where reporting a failure means reporting a permanent
+        // failure of the peer set.
+        use broadcast::RecvError;
+        while let Poll::Ready(Some(channel_result)) = Pin::new(&mut self.inv_stream).poll_next(cx) {
+            match channel_result {
+                Ok((hash, addr)) => self.register(hash, addr),
+                Err(RecvError::Lagged(count)) => {
+                    metrics::counter!("pool.inventory.dropped", 1);
+                    tracing::debug!(count, "dropped lagged inventory advertisements");
+                }
+                // This indicates all senders, including the one in the handshaker,
+                // have been dropped, which really is a permanent failure.
+                Err(RecvError::Closed) => return Err(RecvError::Closed.into()),
+            }
         }
 
         Ok(())
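
For reference, the standalone sketch below illustrates the ring-buffer and
lag-reporting behavior that the comment and match arms above rely on. It is
not part of this patch: it is a minimal sketch assuming tokio 1.x (where the
error type lives at `tokio::sync::broadcast::error::RecvError` rather than
`broadcast::RecvError` as in the patched code), and the capacity, message
values, and printed output are illustrative only.

    // Minimal sketch of tokio's broadcast ring-buffer and lag reporting.
    // Assumes tokio 1.x with the "full" feature (or "sync", "macros",
    // and a runtime feature) enabled.
    use tokio::sync::broadcast::{self, error::RecvError};

    #[tokio::main]
    async fn main() {
        // Capacity 4: once four messages are pending, sending a fifth
        // displaces the oldest pending message instead of blocking.
        let (tx, mut rx) = broadcast::channel::<u32>(4);

        // Send more messages than the channel can hold while the receiver
        // is not draining them, so the receiver falls behind.
        for i in 0..10 {
            tx.send(i).expect("the receiver still exists");
        }
        // Drop the sender so the receiver eventually sees `Closed`.
        drop(tx);

        loop {
            match rx.recv().await {
                // Only the newest messages are still buffered.
                Ok(i) => println!("received {}", i),
                // The receiver lagged: `skipped` messages were dropped.
                // As in the patch above, treat this as an indicator
                // rather than a failure of the channel.
                Err(RecvError::Lagged(skipped)) => {
                    println!("dropped {} lagged messages", skipped);
                }
                // All senders are gone and the buffer is drained; this is
                // the genuinely terminal case.
                Err(RecvError::Closed) => break,
            }
        }
    }

A bounded mpsc channel of the same capacity would instead apply backpressure
to the sender once full, which is why the patch keeps the broadcast channel
even though there is only a single consumer.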