Make ReplayStage panic before dumping repeated-repair-attempt slots (#31333)

When ReplayStage repeatedly fails to compute the correct bank hash for a
block after purging and repairing, it panics on the assumption that
something is very wrong and will require human intervention.

If this is the case, there is typically something to be debugged, and
having the slot available locally is valuable. This change performs the
retry check, which may panic, before purging the failing slot.
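
The ordering described above can be illustrated with a minimal, standalone sketch. It is not the validator code: the value of MAX_REPAIR_RETRY_LOOP_ATTEMPTS, the map type behind the counter, and the helpers record_attempt and simulate_failing_slot are all assumptions made for illustration. It only shows that checking the attempt counter before purging leaves the slot in local storage at the moment the limit trips.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Slot = u64;

// Assumed value for illustration only; the real constant is defined in the validator.
const MAX_REPAIR_RETRY_LOOP_ATTEMPTS: usize = 10;

/// Hypothetical helper: bumps the per-slot attempt counter using the same
/// entry/and_modify/or_insert pattern as the diff, and returns Err once the
/// limit is exceeded (the real code panics at that point).
fn record_attempt(counter: &mut BTreeMap<Slot, usize>, slot: Slot) -> Result<usize, usize> {
    let attempt_no = *counter.entry(slot).and_modify(|x| *x += 1).or_insert(1);
    if attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
        Err(attempt_no)
    } else {
        Ok(attempt_no)
    }
}

/// Hypothetical simulation of a slot whose repair never produces the correct
/// hash. Returns the number of purges performed and whether the slot is still
/// present locally when the retry limit is hit.
fn simulate_failing_slot(slot: Slot) -> (usize, bool) {
    let mut counter = BTreeMap::new();
    // Stand-in for the local blockstore: the set of slots currently on disk.
    let mut local_store: BTreeSet<Slot> = BTreeSet::from([slot]);
    let mut purges = 0;
    loop {
        // Check first: if the limit is exceeded, stop before touching storage.
        if record_attempt(&mut counter, slot).is_err() {
            return (purges, local_store.contains(&slot));
        }
        // Purge the slot, then pretend repair fetched it again and replay failed again.
        local_store.remove(&slot);
        purges += 1;
        local_store.insert(slot);
    }
}
```

With the check placed after the purge instead, the slot would already have been removed from the stand-in store when the limit is detected, which is the situation this change avoids.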
steviez 2023-04-25 11:50:47 -05:00 committed by GitHub
parent c9ca6e3461
commit 758bc1ca75
1 changed files with 12 additions and 11 deletions

@@ -1305,6 +1305,18 @@ impl ReplayStage {
 We froze slot {duplicate_slot} with hash {frozen_hash:?} while the cluster hash is {correct_hash}");
 }
+let attempt_no = purge_repair_slot_counter
+    .entry(*duplicate_slot)
+    .and_modify(|x| *x += 1)
+    .or_insert(1);
+if *attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
+    panic!("We have tried to repair duplicate slot: {duplicate_slot} more than {MAX_REPAIR_RETRY_LOOP_ATTEMPTS} times \
+        and are unable to freeze a block with bankhash {correct_hash}, \
+        instead we have a block with bankhash {frozen_hash:?}. \
+        This is most likely a bug in the runtime. \
+        At this point manual intervention is needed to make progress. Exiting");
+}
 Self::purge_unconfirmed_duplicate_slot(
     *duplicate_slot,
     ancestors,
@@ -1317,17 +1329,6 @@ impl ReplayStage {
 dumped.push((*duplicate_slot, *correct_hash));
-let attempt_no = purge_repair_slot_counter
-    .entry(*duplicate_slot)
-    .and_modify(|x| *x += 1)
-    .or_insert(1);
-if *attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
-    panic!("We have tried to repair duplicate slot: {duplicate_slot} more than {MAX_REPAIR_RETRY_LOOP_ATTEMPTS} times \
-        and are unable to freeze a block with bankhash {correct_hash}, \
-        instead we have a block with bankhash {frozen_hash:?}. \
-        This is most likely a bug in the runtime. \
-        At this point manual intervention is needed to make progress. Exiting");
-}
 warn!(
     "Notifying repair service to repair duplicate slot: {}, attempt {}",
     *duplicate_slot, *attempt_no,