From cad38415b210c2c367659ea6fd214fb4eb1ba83a Mon Sep 17 00:00:00 2001
From: Henry de Valence
Date: Mon, 7 Sep 2020 21:24:31 -0700
Subject: [PATCH] network: fix bug in inventory advertisement handling (#1022)

* network: fix bug in inventory advertisement handling

The RFC https://zebra.zfnd.org/dev/rfcs/0003-inventory-tracking.html
described the use of a `broadcast` channel in place of an `mpsc` channel
to get ring-buffer behavior, keeping a bound on the size of the channel
but dropping old entries when the channel is full. However, it didn't
explicitly describe how this works (the `broadcast` channel returns a
`RecvError::Lagged(u64)` to inform receivers that they lost messages),
so the lag-handling wasn't implemented and I didn't notice in review.

Instead, the `?` operator bubbled the lag error all the way up from
`InventoryRegistry::poll_inventory` through `::poll_ready` through
various Tower wrappers to users of the peer set.

The error propagation is bad enough, because it caused client errors
that shouldn't have happened, but there's a worse interaction. The
`Service` contract distinguishes between request errors (from
`Service::call`, scoped to the request) and service errors (from
`Service::poll_ready`, scoped to the service). The `Service` contract
specifies that once a service returns an error from `poll_ready`, the
service can be assumed to have failed permanently.

I believe (but haven't tested or carefully worked through the details)
that this caused various Tower middleware to report the entire peer set
service as permanently failed due to a transient inventory "error" (more
of an indicator than an error), and I suspect that this is the cause of
#1003, where all of the sync component's requests end up failing because
the peer set reported that it failed permanently. I am able to reproduce
#1003 locally before this change and unable to reproduce it locally
after this change, though I have not tested exhaustively.

* network: add metric for dropped inventory advertisements

Co-authored-by: teor
Co-authored-by: teor
---
 .../src/peer_set/inventory_registry.rs | 30 +++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/zebra-network/src/peer_set/inventory_registry.rs b/zebra-network/src/peer_set/inventory_registry.rs
index 4a6f27e4e..e7af48c01 100644
--- a/zebra-network/src/peer_set/inventory_registry.rs
+++ b/zebra-network/src/peer_set/inventory_registry.rs
@@ -64,8 +64,34 @@ impl InventoryRegistry {
             self.rotate();
         }
 
-        while let Poll::Ready(Some((hash, addr))) = Pin::new(&mut self.inv_stream).poll_next(cx)? {
-            self.register(hash, addr)
+        // This module uses a broadcast channel instead of an mpsc channel, even
+        // though there's a single consumer of inventory advertisements, because
+        // the broadcast channel has ring-buffer behavior: when the channel is
+        // full, sending a new message displaces the oldest message in the
+        // channel.
+        //
+        // This is the behavior we want for inventory advertisements, because we
+        // want to have a bounded buffer of unprocessed advertisements, and we
+        // want to prioritize new inventory (which is likely only at a specific
+        // peer) over old inventory (which is likely more widely distributed).
+        //
+        // The broadcast channel reports dropped messages by returning
+        // `RecvError::Lagged`. It's crucial that we handle that error here
+        // rather than propagating it through the peer set's Service::poll_ready
+        // implementation, where reporting a failure means reporting a permanent
+        // failure of the peer set.
+        use broadcast::RecvError;
+        while let Poll::Ready(Some(channel_result)) = Pin::new(&mut self.inv_stream).poll_next(cx) {
+            match channel_result {
+                Ok((hash, addr)) => self.register(hash, addr),
+                Err(RecvError::Lagged(count)) => {
+                    metrics::counter!("pool.inventory.dropped", 1);
+                    tracing::debug!(count, "dropped lagged inventory advertisements");
+                }
+                // This indicates all senders, including the one in the handshaker,
+                // have been dropped, which really is a permanent failure.
+                Err(RecvError::Closed) => return Err(RecvError::Closed.into()),
+            }
         }
 
         Ok(())
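
For reference, the standalone sketch below illustrates the ring-buffer and
lag-reporting behavior that the comment and match arms above rely on. It is
not part of this patch: it is a minimal sketch assuming tokio 1.x (where the
error type lives at `tokio::sync::broadcast::error::RecvError` rather than
`broadcast::RecvError` as in the patched code), and the capacity, message
values, and printed output are illustrative only.

    // Minimal sketch of tokio's broadcast ring-buffer and lag reporting.
    // Assumes tokio 1.x with the "full" feature (or "sync", "macros",
    // and a runtime feature) enabled.
    use tokio::sync::broadcast::{self, error::RecvError};

    #[tokio::main]
    async fn main() {
        // Capacity 4: once four messages are pending, sending a fifth
        // displaces the oldest pending message instead of blocking.
        let (tx, mut rx) = broadcast::channel::<u32>(4);

        // Send more messages than the channel can hold while the receiver
        // is not draining them, so the receiver falls behind.
        for i in 0..10 {
            tx.send(i).expect("the receiver still exists");
        }
        // Drop the sender so the receiver eventually sees `Closed`.
        drop(tx);

        loop {
            match rx.recv().await {
                // Only the newest messages are still buffered.
                Ok(i) => println!("received {}", i),
                // The receiver lagged: `skipped` messages were dropped.
                // As in the patch above, treat this as an indicator
                // rather than a failure of the channel.
                Err(RecvError::Lagged(skipped)) => {
                    println!("dropped {} lagged messages", skipped);
                }
                // All senders are gone and the buffer is drained; this is
                // the genuinely terminal case.
                Err(RecvError::Closed) => break,
            }
        }
    }

A bounded mpsc channel of the same capacity would instead apply backpressure
to the sender once full, which is why the patch keeps the broadcast channel
even though there is only a single consumer.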