fix(state): Avoid panicking on state errors during shutdown (#6828)

* Enable cancel_all_background_work() only on macOS

* Ignore expected "during shutdown" errors, and log other errors

* Disable cancel_all_background_work() but keep the updated docs and error handling

* Add the macOS shutdown crash to the README known issues

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
This commit is contained in:
teor 2023-06-07 09:38:48 +10:00 committed by GitHub
parent 355f1233f5
commit d7b90552f3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 77 additions and 15 deletions

View File

@ -155,6 +155,8 @@ There are a few bugs in Zebra that we're still working on fixing:
- Block download and verification sometimes times out during Zebra's initial sync [#5709](https://github.com/ZcashFoundation/zebra/issues/5709). The full sync still finishes reasonably quickly.
- Rust 1.70 [causes crashes during shutdown on macOS x86_64 (#6812)](https://github.com/ZcashFoundation/zebra/issues/6812). The state cache should stay valid despite the crash.
- No Windows support [#3801](https://github.com/ZcashFoundation/zebra/issues/3801). We used to test with Windows Server 2019, but not any more; see the issue for details.
- Experimental Tor support is disabled until [Zebra upgrades to the latest `arti-client`](https://github.com/ZcashFoundation/zebra/issues/5492). This happened due to a Rust dependency conflict, which could only be resolved by `arti` upgrading to a version of `x25519-dalek` with the dependency fix.

View File

@ -780,18 +780,72 @@ impl DiskDb {
let path = self.path();
debug!(?path, "flushing database to disk");
self.db
.flush()
.expect("unexpected failure flushing SST data to disk");
self.db
.flush_wal(true)
.expect("unexpected failure flushing WAL data to disk");
// These flushes can fail during forced shutdown or during Drop after a shutdown,
// particularly in tests. If they fail, there's nothing we can do about it anyway.
if let Err(error) = self.db.flush() {
let error = format!("{error:?}");
if error.to_ascii_lowercase().contains("shutdown in progress") {
debug!(
?error,
?path,
"expected shutdown error flushing database SST files to disk"
);
} else {
info!(
?error,
?path,
"unexpected error flushing database SST files to disk during shutdown"
);
}
}
if let Err(error) = self.db.flush_wal(true) {
let error = format!("{error:?}");
if error.to_ascii_lowercase().contains("shutdown in progress") {
debug!(
?error,
?path,
"expected shutdown error flushing database WAL buffer to disk"
);
} else {
info!(
?error,
?path,
"unexpected error flushing database WAL buffer to disk during shutdown"
);
}
}
// # Memory Safety
//
// We'd like to call `cancel_all_background_work()` before Zebra exits,
// but when we call it, we get memory, thread, or C++ errors when the process exits.
// (This seems to be a bug in RocksDB: cancel_all_background_work() should wait until
// all the threads have cleaned up.)
//
// # Change History
//
// We've changed this setting multiple times since 2021, in response to new RocksDB
// and Rust compiler behaviour.
//
// We enabled cancel_all_background_work() due to failures on:
// - Rust 1.57 on Linux
//
// We disabled cancel_all_background_work() due to failures on:
// - Rust 1.64 on Linux
//
// We tried enabling cancel_all_background_work() due to failures on:
// - Rust 1.70 on macOS 12.6.5 on x86_64
// but it didn't stop the aborts happening (PR #6820).
//
// There weren't any failures with cancel_all_background_work() disabled on:
// - Rust 1.69 or earlier
// - Linux with Rust 1.70
// And with cancel_all_background_work() enabled or disabled on:
// - macOS 13.2 on aarch64 (M1), native and emulated x86_64, with Rust 1.70
//
// # Detailed Description
//
// We see these kinds of errors:
// ```
// pthread lock: Invalid argument
@ -803,13 +857,26 @@ impl DiskDb {
// signal: 11, SIGSEGV: invalid memory reference
// ```
//
// # Reference
//
// The RocksDB wiki says:
// > Q: Is it safe to close RocksDB while another thread is issuing read, write or manual compaction requests?
// >
// > A: No. The users of RocksDB need to make sure all functions have finished before they close RocksDB.
// > You can speed up the waiting by calling CancelAllBackgroundWork().
//
// https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ
// <https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ>
//
// > rocksdb::DB instances need to be destroyed before your main function exits.
// > RocksDB instances usually depend on some internal static variables.
// > Users need to make sure rocksdb::DB instances are destroyed before those static variables.
//
// <https://github.com/facebook/rocksdb/wiki/Known-Issues>
//
// # TODO
//
// Try re-enabling this code and fixing the underlying concurrency bug.
//
//info!(?path, "stopping background database tasks");
//self.db.cancel_all_background_work(true);
@ -818,14 +885,7 @@ impl DiskDb {
// But Rust's ownership rules make that difficult,
// so we just flush and delete ephemeral data instead.
//
// The RocksDB wiki says:
// > rocksdb::DB instances need to be destroyed before your main function exits.
// > RocksDB instances usually depend on some internal static variables.
// > Users need to make sure rocksdb::DB instances are destroyed before those static variables.
//
// https://github.com/facebook/rocksdb/wiki/Known-Issues
//
// But this implementation doesn't seem to cause any issues,
// This implementation doesn't seem to cause any issues,
// and the RocksDB Drop implementation handles any cleanup.
self.delete_ephemeral();
}