Make ReplayStage panic before dumping repeated-repair-attempt slots (#31333)
When ReplayStage repeatedly fails to compute the correct bank hash for a block after purging and repairing it, it panics on the assumption that something is very wrong and human intervention is required. In that case there is typically something to debug, and having the slot available locally is valuable. This change moves the retry check, which may panic, ahead of the purge of the failing slot.
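The ordering described above can be sketched in isolation. This is a minimal illustration, not the ReplayStage code: `next_repair_attempt`, `handle_duplicate_slot`, and the `Vec` standing in for locally stored slots are hypothetical, the constant's value here is an assumption, and the sketch returns an `Err` where the real code panics so that the behavior is easy to check.

```rust
use std::collections::HashMap;

type Slot = u64;

// Assumed value for illustration only; the real constant lives in the validator code.
const MAX_REPAIR_RETRY_LOOP_ATTEMPTS: usize = 3;

/// Bumps the per-slot attempt counter and returns the new attempt number,
/// or an error once the retry budget is exhausted.
/// (The actual change panics at this point instead of returning an error.)
fn next_repair_attempt(
    counter: &mut HashMap<Slot, usize>,
    slot: Slot,
) -> Result<usize, String> {
    let attempt_no = counter.entry(slot).and_modify(|x| *x += 1).or_insert(1);
    if *attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
        return Err(format!(
            "tried to repair duplicate slot {slot} more than {MAX_REPAIR_RETRY_LOOP_ATTEMPTS} times"
        ));
    }
    Ok(*attempt_no)
}

/// Hypothetical handler: the retry check runs first, so when it fails the
/// slot is still present in `local_slots` and can be inspected afterwards.
fn handle_duplicate_slot(
    counter: &mut HashMap<Slot, usize>,
    local_slots: &mut Vec<Slot>,
    slot: Slot,
) -> Result<usize, String> {
    // Check before purging: an early return here leaves `local_slots` untouched.
    let attempt_no = next_repair_attempt(counter, slot)?;
    // Stand-in for purging the slot so that repair can fetch it again.
    local_slots.retain(|s| *s != slot);
    Ok(attempt_no)
}
```

With the check placed after the purge instead, the final failing attempt would remove the slot before reporting the error, which is the situation this commit avoids.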
parent c9ca6e3461
commit 758bc1ca75
@@ -1305,6 +1305,18 @@ impl ReplayStage {
     We froze slot {duplicate_slot} with hash {frozen_hash:?} while the cluster hash is {correct_hash}");
 }
 
+let attempt_no = purge_repair_slot_counter
+    .entry(*duplicate_slot)
+    .and_modify(|x| *x += 1)
+    .or_insert(1);
+if *attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
+    panic!("We have tried to repair duplicate slot: {duplicate_slot} more than {MAX_REPAIR_RETRY_LOOP_ATTEMPTS} times \
+        and are unable to freeze a block with bankhash {correct_hash}, \
+        instead we have a block with bankhash {frozen_hash:?}. \
+        This is most likely a bug in the runtime. \
+        At this point manual intervention is needed to make progress. Exiting");
+}
+
 Self::purge_unconfirmed_duplicate_slot(
     *duplicate_slot,
     ancestors,
@@ -1317,17 +1329,6 @@ impl ReplayStage {
 
 dumped.push((*duplicate_slot, *correct_hash));
 
-let attempt_no = purge_repair_slot_counter
-    .entry(*duplicate_slot)
-    .and_modify(|x| *x += 1)
-    .or_insert(1);
-if *attempt_no > MAX_REPAIR_RETRY_LOOP_ATTEMPTS {
-    panic!("We have tried to repair duplicate slot: {duplicate_slot} more than {MAX_REPAIR_RETRY_LOOP_ATTEMPTS} times \
-        and are unable to freeze a block with bankhash {correct_hash}, \
-        instead we have a block with bankhash {frozen_hash:?}. \
-        This is most likely a bug in the runtime. \
-        At this point manual intervention is needed to make progress. Exiting");
-}
 warn!(
     "Notifying repair service to repair duplicate slot: {}, attempt {}",
     *duplicate_slot, *attempt_no,