Supervisor does not back off tasks that failed in a healthy state.
There are a couple places where we rely on supervisor for
application-level backoff, so we always want back-off. The distinction
is meant to enable runnables to implement their own specific back-off
logic, which we don't, so we can safely ignore it.
Fixes#37
ghstack-source-id: c756381b1b
Pull Request resolved: https://github.com/certusone/wormhole/pull/64
In particular, this fixes a race condition where the Solana devnet would
take longer to deploy than the ETH devnet to deploy and we'd end up
with an outdated guardian set on Solana.
We currently create a Goroutine for every pending resubmission, which
waits and blocks on the channel until solwatch is processing requests
again. This is effectively an unbounded queue. An alternative approach
would be a channel with sufficient capacity plus backoff.
Test Plan: Deployed without solana-devnet, waited for initial guardian
set change VAA to be requeued, then deployed solana-devnet.
The VAA was successfully submitted once the transient error resolved:
```
[...]
21:08:44.712Z ERROR wormhole-guardian-0.supervisor Runnable died {"dn": "root.solwatch", "error": "returned error when NODE_STATE_HEALTHY: failed to receive message from agent: EOF"}
21:08:44.712Z INFO wormhole-guardian-0.supervisor rescheduling supervised node {"dn": "root.solwatch", "backoff": 0.737286432}
21:08:45.451Z INFO wormhole-guardian-0.root.solwatch watching for on-chain events
21:08:50.031Z ERROR wormhole-guardian-0.root.solwatch failed to submit VAA {"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:08:50.031Z ERROR wormhole-guardian-0.root.solwatch requeuing VAA {"error": "rpc error: code = Canceled desc = stream terminated by RST_STREAM with error code: CANCEL", "digest": "79[...]"}
21:09:02.062Z INFO wormhole-guardian-0.root.solwatch submitted VAA {"tx_sig": "4EKmH[...]", "digest": "79[...]"}
```
ghstack-source-id: 1b1d05a4cb
Pull Request resolved: https://github.com/certusone/wormhole/pull/48