doc(db): update database design for read-only state service (#3843)

* Add a TODO for a history tree concurrent write issue

* Update database design for read-only state service
teor 2022-03-12 10:37:01 +10:00 committed by GitHub
parent ebecfd078c
commit 419770409a
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 64 additions and 20 deletions


@@ -611,7 +611,7 @@ We use the following rocksdb column families:
 | `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Never |
 | `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Never |
 | *Transparent* | | | |
-| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete |
+| `utxo_by_out_loc` | `OutputLocation` | `Output \|\| AddressLocation` | Delete |
 | `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update |
 | `utxo_by_transparent_addr_loc` | `AddressLocation` | `AtLeastOne<OutputLocation>` | Up/Del |
 | `tx_by_transparent_addr_loc` | `AddressLocation` | `AtLeastOne<TransactionLocation>` | Append |
@@ -622,11 +622,11 @@ We use the following rocksdb column families:
 | *Sapling* | | | |
 | `sapling_nullifiers` | `sapling::Nullifier` | `()` | Never |
 | `sapling_anchors` | `sapling::tree::Root` | `()` | Never |
-| `sapling_note_commitment_tree` | `block::Height` | `sapling::tree::NoteCommitmentTree` | Delete |
+| `sapling_note_commitment_tree` | `block::Height` | `sapling::tree::NoteCommitmentTree` | Never |
 | *Orchard* | | | |
 | `orchard_nullifiers` | `orchard::Nullifier` | `()` | Never |
 | `orchard_anchors` | `orchard::tree::Root` | `()` | Never |
-| `orchard_note_commitment_tree` | `block::Height` | `orchard::tree::NoteCommitmentTree` | Delete |
+| `orchard_note_commitment_tree` | `block::Height` | `orchard::tree::NoteCommitmentTree` | Never |
 | *Chain* | | | |
 | `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete |
 | `tip_chain_value_pool` | `()` | `ValueBalance` | Update |
@@ -664,14 +664,47 @@ Derived Formats:
 Each column family handles updates differently, based on its specific consensus rules:
 - Never: Keys are never deleted, values are never updated. The value for each key is inserted once.
 - Delete: Keys can be deleted, but values are never updated. The value for each key is inserted once.
+  - Code called by ReadStateService must ignore deleted keys, or use a read lock.
   - TODO: should we prevent re-inserts of keys that have been deleted?
 - Update: Keys are never deleted, but values can be updated.
+  - Code called by ReadStateService must accept old or new values, or use a read lock.
 - Append: Keys are never deleted, existing values are never updated,
   but sets of values can be extended with more entries.
-- Up/Del: Keys can be deleted, existing entries can be removed,
-  sets of values can be extended with more entries.
+  - Code called by ReadStateService must accept truncated or extended sets, or use a read lock.
+- Up/Del: Keys can be deleted, and values can be added or removed from sets.
+  - Code called by ReadStateService must ignore deleted keys and values,
+    accept truncated or extended sets, and accept old or new values.
+    Or it should use a read lock.
+### RocksDB read locks
+[rocksdb-read-locks]: #rocksdb-read-locks
-Currently, there are no column families that both delete and update keys.
+The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
+column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
+and the current finalized state database.
+The StateService uses RocksDB transactions for each block write.
+So ReadStateService queries that only access a single key or value will always see
+a consistent view of the database.
+If a ReadStateService query only uses column families that have keys and values appended
+(`Never` in the Updates table above), it should ignore extra appended values.
+Most queries do this by default.
+For more complex queries, there are several options:
+Reading across multiple column families:
+1. Ignore deleted values using custom Rust code
+2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot
+Reading a single column family:
+3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf
+4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf
+RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
+### Low-Level Implementation Details
+[rocksdb-low-level]: #rocksdb-low-level
 RocksDB ignores duplicate puts and deletes, preserving the latest values.
 If rejecting duplicate puts or deletes is consensus-critical,
@@ -693,6 +726,7 @@ and merge operators are unreliable (or have undocumented behaviour).
 So they should not be used for consensus-critical checks.
 ### Notes on rocksdb column families
+[rocksdb-column-families]: #rocksdb-column-families
 - The `hash_by_height` and `height_tx_count_by_hash` column families provide a bijection between
   block heights and block hashes. (Since the rocksdb state only stores finalized
@@ -748,32 +782,40 @@ So they should not be used for consensus-critical checks.
   addresses with large UTXO sets. It also stores the `AddressLocation` for each
   address, which allows for efficient lookups.
-- `utxo_by_transparent_addr_loc` stores unspent transparent output locations by address.
-  UTXO locations are appended by each block. If an address lookup discovers a UTXO
-  has been spent in `utxo_by_outpoint`, that UTXO location can be deleted from
-  `utxo_by_transparent_addr_loc`. (We don't do these deletions every time a block is
-  committed, because that requires an expensive full index search.)
+- `utxo_by_transparent_addr_loc` stores unspent transparent output locations
+  by address. UTXO locations are appended by each block.
   This list includes the `AddressLocation`, if it has not been spent.
   (This duplicate data is small, and helps simplify the code.)
+  - When a block write deletes a UTXO from `utxo_by_outpoint`,
+    that UTXO location should be deleted from `utxo_by_transparent_addr_loc`.
+    This is an index optimisation.
 - `tx_by_transparent_addr_loc` stores transaction locations by address.
   This list includes transactions containing spent UTXOs.
   It also includes the `TransactionLocation` from the `AddressLocation`.
   (This duplicate data is small, and helps simplify the code.)
-- Each `*_note_commitment_tree` stores the note commitment tree state
+- The `sprout_note_commitment_tree` stores the note commitment tree state
   at the tip of the finalized state, for the specific pool. There is always
-  a single entry for those; they are indexed by height just to make testing
-  and debugging easier (so for each block committed, the old tree is
-  deleted and a new one is inserted by its new height). Each tree is stored
+  a single entry. Each tree is stored
+  as a "Merkle tree frontier" which is basically a (logarithmic) subset of
+  the Merkle tree nodes as required to insert new items.
+  For each block committed, the old tree is deleted and a new one is inserted
+  by its new height.
+  **TODO:** store the sprout note commitment tree by `()`,
+  to avoid ReadStateService concurrent write issues.
+- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
+  state for every height, for the specific pool. Each tree is stored
   as a "Merkle tree frontier" which is basically a (logarithmic) subset of
   the Merkle tree nodes as required to insert new items.
 - `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
-  state. There is always a single entry for it; it is indexed by height just
-  to make testing and debugging easier. The tree is stored as the set of "peaks"
+  state. There is always a single entry for it. The tree is stored as the set of "peaks"
   of the "Merkle mountain range" tree structure, which is what is required to
   insert new items.
+  **TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.
 - Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment
   tree of a certain block. We only use the keys since we just need the set of anchors,
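
Option 1 in the read locks section ("Ignore deleted values using custom Rust code") can be illustrated with a minimal sketch. It assumes in-memory `BTreeMap`s as stand-ins for the `utxo_by_transparent_addr_loc` and `utxo_by_out_loc` column families. The `FinalizedState` struct, the simplified type aliases, and the `address_utxos` and `sample_state` helpers are hypothetical names for illustration, not Zebra or RocksDB APIs. The query reads the address index first, then looks up each output, and skips any output location that a block write has already deleted.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the real key and value types.
type AddressLocation = u64;
type OutputLocation = u64;
type Output = &'static str;

/// In-memory stand-in for the two column families read by an address UTXO query.
struct FinalizedState {
    utxo_by_transparent_addr_loc: BTreeMap<AddressLocation, Vec<OutputLocation>>,
    utxo_by_out_loc: BTreeMap<OutputLocation, Output>,
}

/// Returns the unspent outputs for `addr`.
///
/// Output locations that are listed in the address index, but are missing
/// from `utxo_by_out_loc`, are treated as deleted (spent) and skipped.
fn address_utxos(state: &FinalizedState, addr: AddressLocation) -> Vec<(OutputLocation, Output)> {
    state
        .utxo_by_transparent_addr_loc
        .get(&addr)
        .into_iter()
        .flatten()
        .filter_map(|loc| state.utxo_by_out_loc.get(loc).map(|out| (*loc, *out)))
        .collect()
}

/// Builds a state where the address index still lists output location 2,
/// but the UTXO at location 2 has already been deleted.
fn sample_state() -> FinalizedState {
    let mut state = FinalizedState {
        utxo_by_transparent_addr_loc: BTreeMap::new(),
        utxo_by_out_loc: BTreeMap::new(),
    };
    state.utxo_by_transparent_addr_loc.insert(7, vec![1, 2, 3]);
    state.utxo_by_out_loc.insert(1, "a");
    state.utxo_by_out_loc.insert(3, "c");
    state
}

fn main() {
    let state = sample_state();
    // The stale entry for location 2 is skipped instead of causing an error.
    println!("{:?}", address_utxos(&state, 7));
}
```

The same shape applies to any query that joins an index column family against a `Delete` column family: a missing key is a normal outcome, not an error.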


@@ -90,9 +90,11 @@ impl DiskWriteBatch {
             self.zs_delete(history_tree_cf, h);
         }
-        // TODO: just store a single history tree, using `()` as the key,
-        // and remove the delete (like the chain value pool balances).
-        // This requires a database version update.
+        // TODO: if we ever need concurrent read-only access to the history tree,
+        // store it by `()`, not height.
+        // Otherwise, the ReadStateService could access a height
+        // that was just deleted by a concurrent StateService write.
+        // This requires a database version update.
         if let Some(history_tree) = history_tree.as_ref() {
             self.zs_insert(history_tree_cf, height, history_tree);
         }
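
The concurrent write issue named in the TODO can be sketched as follows. This is a sequential, in-memory simulation, assuming `BTreeMap`s in place of the `history_tree` column family. The `write_block_by_height`, `write_block_by_unit`, `stale_height_lookup`, and `unit_lookup` functions and the string "trees" are hypothetical and only illustrate the key layout. With a height key, a reader that learned the tip height before a block write can miss the entry. With a `()` key, the single entry is overwritten in place and the lookup always finds a tree.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the real types.
type Height = u32;
type HistoryTree = &'static str;

/// Height-keyed layout: each block write deletes the previous tip's tree,
/// then inserts the new tree at the new height.
fn write_block_by_height(
    cf: &mut BTreeMap<Height, HistoryTree>,
    new_height: Height,
    tree: HistoryTree,
) {
    if let Some(prev) = new_height.checked_sub(1) {
        cf.remove(&prev);
    }
    cf.insert(new_height, tree);
}

/// Unit-keyed layout: each block write overwrites the single entry.
fn write_block_by_unit(cf: &mut BTreeMap<(), HistoryTree>, tree: HistoryTree) {
    cf.insert((), tree);
}

/// A reader learns the tip height, a block write lands, then the reader
/// looks up the tree at the height it learned earlier.
fn stale_height_lookup() -> Option<HistoryTree> {
    let mut cf = BTreeMap::new();
    write_block_by_height(&mut cf, 100, "tree@100");
    let reader_tip: Height = 100;
    // Simulated concurrent write between the reader's two steps.
    write_block_by_height(&mut cf, 101, "tree@101");
    cf.get(&reader_tip).copied()
}

/// The same interleaving with a unit key: the reader has no height to go stale.
fn unit_lookup() -> Option<HistoryTree> {
    let mut cf = BTreeMap::new();
    write_block_by_unit(&mut cf, "tree@100");
    write_block_by_unit(&mut cf, "tree@101");
    cf.get(&()).copied()
}

fn main() {
    println!("by height: {:?}", stale_height_lookup());
    println!("by unit:   {:?}", unit_lookup());
}
```

The unit-keyed reader may see either the old or the new tree, which matches the `Update` rule in the design doc: readers must accept old or new values.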