doc(db): update database design for read-only state service (#3843)
* Add a TODO for a history tree concurrent write issue
* Update database design for read-only state service
parent ebecfd078c
commit 419770409a
@@ -611,7 +611,7 @@ We use the following rocksdb column families:
 | `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Never |
 | `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Never |
 | *Transparent* | | | |
-| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete |
+| `utxo_by_out_loc` | `OutputLocation` | `Output \|\| AddressLocation` | Delete |
 | `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update |
 | `utxo_by_transparent_addr_loc` | `AddressLocation` | `AtLeastOne<OutputLocation>` | Up/Del |
 | `tx_by_transparent_addr_loc` | `AddressLocation` | `AtLeastOne<TransactionLocation>` | Append |
@@ -622,11 +622,11 @@ We use the following rocksdb column families:
 | *Sapling* | | | |
 | `sapling_nullifiers` | `sapling::Nullifier` | `()` | Never |
 | `sapling_anchors` | `sapling::tree::Root` | `()` | Never |
-| `sapling_note_commitment_tree` | `block::Height` | `sapling::tree::NoteCommitmentTree` | Delete |
+| `sapling_note_commitment_tree` | `block::Height` | `sapling::tree::NoteCommitmentTree` | Never |
 | *Orchard* | | | |
 | `orchard_nullifiers` | `orchard::Nullifier` | `()` | Never |
 | `orchard_anchors` | `orchard::tree::Root` | `()` | Never |
-| `orchard_note_commitment_tree` | `block::Height` | `orchard::tree::NoteCommitmentTree` | Delete |
+| `orchard_note_commitment_tree` | `block::Height` | `orchard::tree::NoteCommitmentTree` | Never |
 | *Chain* | | | |
 | `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete |
 | `tip_chain_value_pool` | `()` | `ValueBalance` | Update |
@@ -664,14 +664,47 @@ Derived Formats:
 Each column family handles updates differently, based on its specific consensus rules:
 - Never: Keys are never deleted, values are never updated. The value for each key is inserted once.
 - Delete: Keys can be deleted, but values are never updated. The value for each key is inserted once.
+  - Code called by ReadStateService must ignore deleted keys, or use a read lock.
   - TODO: should we prevent re-inserts of keys that have been deleted?
 - Update: Keys are never deleted, but values can be updated.
+  - Code called by ReadStateService must accept old or new values, or use a read lock.
 - Append: Keys are never deleted, existing values are never updated,
   but sets of values can be extended with more entries.
-- Up/Del: Keys can be deleted, existing entries can be removed,
-  sets of values can be extended with more entries.
+  - Code called by ReadStateService must accept truncated or extended sets, or use a read lock.
+- Up/Del: Keys can be deleted, and values can be added or removed from sets.
+  - Code called by ReadStateService must ignore deleted keys and values,
+    accept truncated or extended sets, and accept old or new values.
+    Or it should use a read lock.
 
-Currently, there are no column families that both delete and update keys.
+### RocksDB read locks
+[rocksdb-read-locks]: #rocksdb-read-locks
+
+The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
+column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
+and the current finalized state database.
+
+The StateService uses RocksDB transactions for each block write.
+So ReadStateService queries that only access a single key or value will always see
+a consistent view of the database.
+
+If a ReadStateService query only uses column families that have keys and values appended
+(`Never` in the Updates table above), it should ignore extra appended values.
+Most queries do this by default.
+
+For more complex queries, there are several options:
+
+Reading across multiple column families:
+1. Ignore deleted values using custom Rust code
+2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot
+
+Reading a single column family:
+3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf
+4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf
+
+RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
+
+### Low-Level Implementation Details
+[rocksdb-low-level]: #rocksdb-low-level
 
 RocksDB ignores duplicate puts and deletes, preserving the latest values.
 If rejecting duplicate puts or deletes is consensus-critical,
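The read-lock option for queries that span several column families can be illustrated with a small std-only model. This is a sketch under stated assumptions, not Zebra's code and not the `rocksdb` crate API: `ModelDb`, `Tables`, `receive`, `spend`, and `address_summary` are hypothetical names, addresses are plain strings, UTXOs are modelled only by their values, and an `RwLock` around two in-memory maps stands in for a database snapshot or read lock.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Hypothetical in-memory stand-ins for two column families.
#[derive(Default)]
struct Tables {
    /// Models an `Update` column family: keys stay, values change.
    balance_by_addr: BTreeMap<String, u64>,
    /// Models an `Up/Del` column family: sets grow and shrink, keys can be deleted.
    utxos_by_addr: BTreeMap<String, Vec<u64>>,
}

/// Hypothetical database wrapper: the lock plays the role of a snapshot.
#[derive(Default)]
struct ModelDb {
    tables: RwLock<Tables>,
}

impl ModelDb {
    /// Writer: both maps change under one write lock, like one block write.
    fn receive(&self, addr: &str, value: u64) {
        let mut guard = self.tables.write().unwrap();
        let t = &mut *guard;
        *t.balance_by_addr.entry(addr.to_string()).or_insert(0) += value;
        t.utxos_by_addr.entry(addr.to_string()).or_default().push(value);
    }

    /// Writer: removes one UTXO, deleting the key when its set becomes empty.
    fn spend(&self, addr: &str, value: u64) -> bool {
        let mut guard = self.tables.write().unwrap();
        let t = &mut *guard;
        let emptied = match t.utxos_by_addr.get_mut(addr) {
            Some(utxos) => match utxos.iter().position(|v| *v == value) {
                Some(pos) => {
                    utxos.remove(pos);
                    utxos.is_empty()
                }
                None => return false,
            },
            None => return false,
        };
        if emptied {
            t.utxos_by_addr.remove(addr);
        }
        if let Some(balance) = t.balance_by_addr.get_mut(addr) {
            *balance -= value;
        }
        true
    }

    /// Reader: holds one read lock for both lookups, so the balance and the
    /// UTXO set come from the same write. A deleted UTXO key is treated as an
    /// empty set, not as an error.
    fn address_summary(&self, addr: &str) -> Option<(u64, Vec<u64>)> {
        let t = self.tables.read().unwrap();
        let balance = *t.balance_by_addr.get(addr)?;
        let utxos = t.utxos_by_addr.get(addr).cloned().unwrap_or_default();
        Some((balance, utxos))
    }
}
```

In this model a reader never sees a balance that disagrees with the UTXO set, because each write finishes before the read lock is granted. A reader that took the lock separately for each lookup would have to accept old or new values and missing keys, as the update rules above describe.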
@@ -693,6 +726,7 @@ and merge operators are unreliable (or have undocumented behaviour).
 So they should not be used for consensus-critical checks.
 
 ### Notes on rocksdb column families
+[rocksdb-column-families]: #rocksdb-column-families
 
 - The `hash_by_height` and `height_tx_count_by_hash` column families provide a bijection between
   block heights and block hashes. (Since the rocksdb state only stores finalized
@@ -748,32 +782,40 @@ So they should not be used for consensus-critical checks.
   addresses with large UTXO sets. It also stores the `AddressLocation` for each
   address, which allows for efficient lookups.
 
-- `utxo_by_transparent_addr_loc` stores unspent transparent output locations by address.
-  UTXO locations are appended by each block. If an address lookup discovers a UTXO
-  has been spent in `utxo_by_outpoint`, that UTXO location can be deleted from
-  `utxo_by_transparent_addr_loc`. (We don't do these deletions every time a block is
-  committed, because that requires an expensive full index search.)
+- `utxo_by_transparent_addr_loc` stores unspent transparent output locations
+  by address. UTXO locations are appended by each block.
   This list includes the `AddressLocation`, if it has not been spent.
   (This duplicate data is small, and helps simplify the code.)
 
+- When a block write deletes a UTXO from `utxo_by_outpoint`,
+  that UTXO location should be deleted from `utxo_by_transparent_addr_loc`.
+  This is an index optimisation.
+
 - `tx_by_transparent_addr_loc` stores transaction locations by address.
   This list includes transactions containing spent UTXOs.
   It also includes the `TransactionLocation` from the `AddressLocation`.
   (This duplicate data is small, and helps simplify the code.)
 
-- Each `*_note_commitment_tree` stores the note commitment tree state
+- The `sprout_note_commitment_tree` stores the note commitment tree state
   at the tip of the finalized state, for the specific pool. There is always
-  a single entry for those; they are indexed by height just to make testing
-  and debugging easier (so for each block committed, the old tree is
-  deleted and a new one is inserted by its new height). Each tree is stored
+  a single entry. Each tree is stored
+  as a "Merkle tree frontier" which is basically a (logarithmic) subset of
+  the Merkle tree nodes as required to insert new items.
+  For each block committed, the old tree is deleted and a new one is inserted
+  by its new height.
+  **TODO:** store the sprout note commitment tree by `()`,
+  to avoid ReadStateService concurrent write issues.
+
+- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
+  state for every height, for the specific pool. Each tree is stored
   as a "Merkle tree frontier" which is basically a (logarithmic) subset of
   the Merkle tree nodes as required to insert new items.
 
 - `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
-  state. There is always a single entry for it; it is indexed by height just
-  to make testing and debugging easier. The tree is stored as the set of "peaks"
+  state. There is always a single entry for it. The tree is stored as the set of "peaks"
   of the "Merkle mountain range" tree structure, which is what is required to
   insert new items.
+  **TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.
 
 - Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment
   tree of a certain block. We only use the keys since we just need the set of anchors,
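The claim that the "peaks" of a Merkle mountain range are all that is needed to insert new items can be sketched with a generic append routine. This is an illustration only, not the ZIP-221 node format or Zebra's `NonEmptyHistoryTree`: `Peak`, `Peaks`, and `combine` are hypothetical names, leaves are plain `u64` values, and the standard library's `DefaultHasher` stands in for the real hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical peak: the root of one perfect subtree.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Peak {
    height: u32,
    digest: u64,
}

/// Stand-in for the real node hash (assumption: any two-input hash works here).
fn combine(left: u64, right: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    (left, right).hash(&mut hasher);
    hasher.finish()
}

/// The stored state: only the peaks, ordered from tallest to shortest.
#[derive(Default)]
struct Peaks {
    peaks: Vec<Peak>,
    leaves: u64,
}

impl Peaks {
    /// Appends a leaf using only the stored peaks: while the last peak has the
    /// same height as the new node, merge them into a taller node.
    fn append(&mut self, leaf: u64) {
        let mut node = Peak { height: 0, digest: leaf };
        while let Some(last) = self.peaks.last().copied() {
            if last.height != node.height {
                break;
            }
            self.peaks.pop();
            node = Peak {
                height: node.height + 1,
                digest: combine(last.digest, node.digest),
            };
        }
        self.peaks.push(node);
        self.leaves += 1;
    }
}
```

In this sketch the number of peaks equals the number of set bits in the leaf count, so the stored state grows logarithmically with the number of appended items.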
@@ -90,9 +90,11 @@ impl DiskWriteBatch {
     self.zs_delete(history_tree_cf, h);
 }
 
-// TODO: just store a single history tree, using `()` as the key,
-// and remove the delete (like the chain value pool balances).
-// This requires a database version update.
+// TODO: if we ever need concurrent read-only access to the history tree,
+// store it by `()`, not height.
+// Otherwise, the ReadStateService could access a height
+// that was just deleted by a concurrent StateService write.
+// This requires a database version update.
 if let Some(history_tree) = history_tree.as_ref() {
     self.zs_insert(history_tree_cf, height, history_tree);
 }
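The race described in the TODO above can be reproduced in a deterministic, single-threaded model. This is a sketch under stated assumptions, not Zebra's storage code: `ByHeight`, `ByUnit`, and the two `racy_read_*` functions are hypothetical, trees are plain strings, and the "concurrent" write is simulated by calling the writer between the reader's two steps.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of a tree stored by height: the old entry is deleted
/// and the new one inserted on every block write.
#[derive(Default)]
struct ByHeight {
    tip: Option<u32>,
    tree_by_height: BTreeMap<u32, String>,
}

impl ByHeight {
    fn write_block(&mut self, height: u32, tree: &str) {
        if let Some(old) = self.tip {
            self.tree_by_height.remove(&old);
        }
        self.tree_by_height.insert(height, tree.to_string());
        self.tip = Some(height);
    }
}

/// Hypothetical model of a tree stored by `()`: one slot, overwritten in place.
#[derive(Default)]
struct ByUnit {
    tree: Option<String>,
}

impl ByUnit {
    fn write_block(&mut self, tree: &str) {
        self.tree = Some(tree.to_string());
    }
}

/// Reader looks up the tip height, a block write lands, then the reader
/// fetches the tree for the height it saw earlier.
fn racy_read_by_height(db: &mut ByHeight, next_height: u32, next_tree: &str) -> Option<String> {
    let seen_tip = db.tip?; // reader step 1
    db.write_block(next_height, next_tree); // simulated concurrent write
    db.tree_by_height.get(&seen_tip).cloned() // reader step 2: key is gone
}

/// Same interleaving with a `()` key: the single read still finds a tree.
fn racy_read_by_unit(db: &mut ByUnit, next_tree: &str) -> Option<String> {
    db.write_block(next_tree); // simulated concurrent write
    db.tree.clone()
}
```

With the height key, the reader's second step misses because its key was deleted. With the `()` key there is only one lookup, so the reader gets either the old or the new tree, which matches the `Update` rule for ReadStateService code.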