diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 77a8a8e35..e0c322759 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -20,7 +20,14 @@ - [Developer Documentation](dev.md) - [Contribution Guide](CONTRIBUTING.md) - [Design Overview](dev/overview.md) + - [Diagrams](dev/diagrams.md) + - [Network Architecture](dev/diagrams/zebra-network.md) + - [Upgrading the State Database](dev/state-db-upgrades.md) - [Zebra versioning and releases](dev/release-process.md) + - [Continuous Integration](dev/continuous-integration.md) + - [Continuous Delivery](dev/continuous-delivery.md) + - [Generating Zebra Checkpoints](dev/zebra-checkpoints.md) + - [Doing Mass Renames](dev/mass-renames.md) - [Zebra RFCs](dev/rfcs.md) - [Pipelinable Block Lookup](dev/rfcs/0001-pipelinable-block-lookup.md) - [Parallel Verification](dev/rfcs/0002-parallel-verification.md) @@ -32,10 +39,4 @@ - [V5 Transaction](dev/rfcs/0010-v5-transaction.md) - [Async Rust in Zebra](dev/rfcs/0011-async-rust-in-zebra.md) - [Value Pools](dev/rfcs/0012-value-pools.md) - - [Diagrams](dev/diagrams.md) - - [Network Architecture](dev/diagrams/zebra-network.md) - - [Continuous Integration](dev/continuous-integration.md) - - [Continuous Delivery](dev/continuous-delivery.md) - - [Generating Zebra Checkpoints](dev/zebra-checkpoints.md) - - [Doing Mass Renames](dev/mass-renames.md) - [API Reference](api.md) diff --git a/book/src/dev/rfcs/0005-state-updates.md b/book/src/dev/rfcs/0005-state-updates.md index e47245ad1..7767975fd 100644 --- a/book/src/dev/rfcs/0005-state-updates.md +++ b/book/src/dev/rfcs/0005-state-updates.md @@ -663,305 +663,7 @@ New `non-finalized` blocks are committed as follows: ## rocksdb data structures [rocksdb]: #rocksdb -rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is -a distinct "tree". 
Keys are sorted using lex order on byte strings, so -integer values should be stored using big-endian encoding (so that the lex -order on byte strings is the numeric ordering). - -Note that the lex order storage allows creating 1-to-many maps using keys only. -For example, the `tx_loc_by_transparent_addr_loc` allows mapping each address -to all transactions related to it, by simply storing each transaction prefixed -with the address as the key, leaving the value empty. Since rocksdb allows -listing all keys with a given prefix, it will allow listing all transactions -related to a given address. - -We use the following rocksdb column families: - -| Column Family | Keys | Values | Changes | -| ---------------------------------- | ---------------------- | ----------------------------- | ------- | -| *Blocks* | | | | -| `hash_by_height` | `block::Height` | `block::Hash` | Create | -| `height_by_hash` | `block::Hash` | `block::Height` | Create | -| `block_header_by_height` | `block::Height` | `block::Header` | Create | -| *Transactions* | | | | -| `tx_by_loc` | `TransactionLocation` | `Transaction` | Create | -| `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Create | -| `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Create | -| *Transparent* | | | | -| `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update | -| `tx_loc_by_transparent_addr_loc` | `AddressTransaction` | `()` | Create | -| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete | -| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()` | Delete | -| *Sprout* | | | | -| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create | -| `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create | -| `sprout_note_commitment_tree` | `block::Height` | `sprout::NoteCommitmentTree` | Delete | -| *Sapling* | | | | -| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create | -| 
`sapling_anchors` | `sapling::tree::Root` | `()` | Create | -| `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create | -| *Orchard* | | | | -| `orchard_nullifiers` | `orchard::Nullifier` | `()` | Create | -| `orchard_anchors` | `orchard::tree::Root` | `()` | Create | -| `orchard_note_commitment_tree` | `block::Height` | `orchard::NoteCommitmentTree` | Create | -| *Chain* | | | | -| `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete | -| `tip_chain_value_pool` | `()` | `ValueBalance` | Update | - -Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`. -Other structures are encoded using `IntoDisk`/`FromDisk`. - -Block and Transaction Data: -- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks) -- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit) -- `TransactionCount`: same as `TransactionIndex` -- `TransactionLocation`: `Height \|\| TransactionIndex` -- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit) -- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian, unsigned (max ~49,000 transfers in the 2 MB block limit) -- `OutputLocation`: `TransactionLocation \|\| OutputIndex` -- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`. - Always has the same value for each address, even if the first output is spent. -- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key -- `AddressUnspentOutput`: `AddressLocation \|\| OutputLocation`, - used instead of a `BTreeSet` value, to improve database performance -- `AddressTransaction`: `AddressLocation \|\| TransactionLocation` - used instead of a `BTreeSet` value, to improve database performance - -We use big-endian encoding for keys, to allow database index prefix searches. 
- -Amounts: -- `Amount`: 64 bits, little-endian, signed -- `ValueBalance`: `[Amount; 4]` - -Derived Formats: -- `*::NoteCommitmentTree`: `bincode` using `serde` -- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation - - -The following figure helps visualizing the address index, which is the most complicated part. -Numbers in brackets are array sizes; bold arrows are compositions (i.e. `TransactionLocation` is the -concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many -maps (i.e. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`, -but also is used to map each `AddressLocation` to multiple `TransactionLocation`s). - -```mermaid -graph TD; - Address -->|"balance_by_transparent_addr
"| AddressBalance; - AddressBalance ==> Amount; - AddressBalance ==> AddressLocation; - AddressLocation ==> FirstOutputLocation; - AddressLocation -.->|"tx_loc_by_transparent_addr_loc
(AddressTransaction[13])"| TransactionLocation; - TransactionLocation ==> Height; - TransactionLocation ==> TransactionIndex; - OutputLocation -->|utxo_by_out_loc| Output; - OutputLocation ==> TransactionLocation; - OutputLocation ==> OutputIndex; - AddressLocation -.->|"utxo_loc_by_transparent_addr_loc
(AddressUnspentOutput[16])"| OutputLocation; - - AddressBalance["AddressBalance[16]"]; - Amount["Amount[8]"]; - Height["Height[3]"]; - Address["Address[21]"]; - TransactionIndex["TransactionIndex[2]"]; - TransactionLocation["TransactionLocation[5]"]; - OutputIndex["OutputIndex[3]"]; - OutputLocation["OutputLocation[8]"]; - FirstOutputLocation["First OutputLocation[8]"]; - AddressLocation["AddressLocation[8]"]; -``` - -### Implementing consensus rules using rocksdb -[rocksdb-consensus-rules]: #rocksdb-consensus-rules - -Each column family handles updates differently, based on its specific consensus rules: -- Create: - - Each key-value entry is created once. - - Keys are never deleted, values are never updated. -- Delete: - - Each key-value entry is created once. - - Keys can be deleted, but values are never updated. - - Code called by ReadStateService must ignore deleted keys, or use a read lock. - - TODO: should we prevent re-inserts of keys that have been deleted? -- Update: - - Each key-value entry is created once. - - Keys are never deleted, but values can be updated. - - Code called by ReadStateService must handle old or new values, or use a read lock. - -We can't do some kinds of value updates, because they cause RocksDB performance issues: -- Append: - - Keys are never deleted. - - Existing values are never updated. - - Sets of values have additional items appended to the end of the set. - - Code called by ReadStateService must handle shorter or longer sets, or use a read lock. -- Up/Del: - - Keys can be deleted. - - Sets of values have items added or deleted (in any position). - - Code called by ReadStateService must ignore deleted keys and values, - accept shorter or longer sets, and accept old or new values. - Or it should use a read lock. - -Avoid using large sets of values as RocksDB keys or values. 
- -### RocksDB read locks -[rocksdb-read-locks]: #rocksdb-read-locks - -The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized -column families it reads. It must also handle overlaps between the cached non-finalized `Chain`, -and the current finalized state database. - -The StateService uses RocksDB transactions for each block write. -So ReadStateService queries that only access a single key or value will always see -a consistent view of the database. - -If a ReadStateService query only uses column families that have keys and values appended -(`Never` in the Updates table above), it should ignore extra appended values. -Most queries do this by default. - -For more complex queries, there are several options: - -Reading across multiple column families: -1. Ignore deleted values using custom Rust code -2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot - -Reading a single column family: -3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf -4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf - -RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate. - -### Low-Level Implementation Details -[rocksdb-low-level]: #rocksdb-low-level - -RocksDB ignores duplicate puts and deletes, preserving the latest values. -If rejecting duplicate puts or deletes is consensus-critical, -check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf) -before putting or deleting any values in a batch. - -Currently, these restrictions should be enforced by code review: -- multiple `zs_insert`s are only allowed on Update column families, and -- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf) - is only allowed on Delete column families. 
- -In future, we could enforce these restrictions by: -- creating traits for Never, Delete, and Update -- doing different checks in `zs_insert` depending on the trait -- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families. - -As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94), -and merge operators are unreliable (or have undocumented behaviour). -So they should not be used for consensus-critical checks. - -### Notes on rocksdb column families -[rocksdb-column-families]: #rocksdb-column-families - -- The `hash_by_height` and `height_by_hash` column families provide a bijection between - block heights and block hashes. (Since the rocksdb state only stores finalized - state, they are actually a bijection). - -- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between - transaction locations and transaction hashes. - -- The `block_header_by_height` column family provides a bijection between block - heights and block header data. There is no corresponding `height_by_block` column - family: instead, hash the block header, and use the hash from `height_by_hash`. - (Since the rocksdb state only stores finalized state, they are actually a bijection). - Similarly, there are no column families that go from transaction data - to transaction locations: hash the transaction and use `tx_loc_by_hash`. - -- Block headers and transactions are stored separately in the database, - so that individual transactions can be accessed efficiently. 
- Blocks can be re-created on request using the following process: - - Look up `height` in `height_by_hash` - - Get the block header for `height` from `block_header_by_height` - - Iterate from `TransactionIndex` 0, - to get each transaction with `height` from `tx_by_loc`, - stopping when there are no more transactions in the block - -- Block headers are stored by height, not by hash. This has the downside that looking - up a block by hash requires an extra level of indirection. The upside is - that blocks with adjacent heights are adjacent in the database, and many - common access patterns, such as helping a client sync the chain or doing - analysis, access blocks in (potentially sparse) height order. In addition, - the fact that we commit blocks in order means we're writing only to the end - of the rocksdb column family, which may help save space. - -- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`, - and chain order within each vector in `utxo_loc_by_transparent_addr_loc` and - `tx_loc_by_transparent_addr_loc`. - -- `TransactionLocation`s are stored as a `(height, index)` pair referencing the - height of the transaction's parent block and the transaction's index in that - block. This would more traditionally be a `(hash, index)` pair, but because - we store blocks by height, storing the height saves one level of indirection. - Transaction hashes can be looked up using `hash_by_tx_loc`. - -- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`, - rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`, - and reconstructed using `hash_by_tx_loc`. - -- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data, - `height: OutputLocation.height`, and - `is_coinbase: OutputLocation.transaction_index == 0` - (coinbase transactions are always the first transaction in a block). 
- -- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s - that are still in `utxo_by_out_loc`. It is cached to improve performance for - addresses with large UTXO sets. It also stores the `AddressLocation` for each - address, which allows for efficient lookups. - -- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations - by address. The address location and UTXO location are stored as a RocksDB key, - so they are in chain order, and get good database performance. - This column family includes also includes the original address location UTXO, - if it has not been spent. - -- When a block write deletes a UTXO from `utxo_by_out_loc`, - that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`. - The deleted UTXO can be removed efficiently, because the UTXO location is part of the key. - This is an index optimisation, which does not affect query results. - -- `tx_loc_by_transparent_addr_loc` stores transaction locations by address. - This list includes transactions containing spent UTXOs. - The address location and transaction location are stored as a RocksDB key, - so they are in chain order, and get good database performance. - This column family also includes the `TransactionLocation` - of the transaction for the `AddressLocation`. - -- The `sprout_note_commitment_tree` stores the note commitment tree state - at the tip of the finalized state, for the specific pool. There is always - a single entry. Each tree is stored - as a "Merkle tree frontier" which is basically a (logarithmic) subset of - the Merkle tree nodes as required to insert new items. - For each block committed, the old tree is deleted and a new one is inserted - by its new height. - **TODO:** store the sprout note commitment tree by `()`, - to avoid ReadStateService concurrent write issues. - -- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree - state for every height, for the specific pool. 
Each tree is stored - as a "Merkle tree frontier" which is basically a (logarithmic) subset of - the Merkle tree nodes as required to insert new items. - -- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized - state. There is always a single entry for it. The tree is stored as the set of "peaks" - of the "Merkle mountain range" tree structure, which is what is required to - insert new items. - **TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues. - -- Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment - tree of a certain block. We only use the keys since we just need the set of anchors, - regardless of where they come from. The exception is `sprout_anchors` which also maps - the anchor to the matching note commitment tree. This is required to support interstitial - treestates, which are unique to Sprout. - **TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the - note commitment tree. This de-duplicates tree state data. But we currently only store one sprout tree by height. - -- The value pools are only stored for the finalized tip. - -- We do not store the cumulative work for the finalized chain, - because the finalized work is equal for all non-finalized chains. - So the additional non-finalized work can be used to calculate the relative chain order, - and choose the best chain. +The current database format is documented in [Upgrading the State Database](../state-db-upgrades.md). ## Committing finalized blocks diff --git a/book/src/dev/state-db-upgrades.md b/book/src/dev/state-db-upgrades.md new file mode 100644 index 000000000..2174aba24 --- /dev/null +++ b/book/src/dev/state-db-upgrades.md @@ -0,0 +1,356 @@ +# Zebra Cached State Database Implementation + +## Upgrading the State Database + +For most state upgrades, we want to modify the database format of the existing database. 
If we +change the major database version, every user needs to re-download and re-verify all the blocks, +which can take days. + +### In-Place Upgrade Goals + +- avoid a full download and rebuild of the state +- the previous state format must be able to be loaded by the new state + - this is checked the first time CI runs on a PR with a new state version. + After the first CI run, the cached state is marked as upgraded, so the upgrade doesn't run + again. If CI fails on the first run, any cached states with that version should be deleted. +- previous zebra versions should be able to load the new format + - this is checked by other PRs running using the upgraded cached state, but only if a Rust PR + runs after the new PR's CI finishes, but before it merges +- best-effort loading of older supported states by newer Zebra versions +- best-effort compatibility between newer states and older supported Zebra versions + +### Design Constraints +[design]: #design + +Upgrades run concurrently with state verification and RPC requests. 
+
+This means that:
+- the state must be able to read the old and new formats
+  - it can't panic if the data is missing
+  - it can't give incorrect results, because that can affect verification or wallets
+  - it can return an error
+  - it can only return an `Option` if the caller handles it correctly
+- multiple upgrades must produce a valid state format
+  - if Zebra is restarted, the format upgrade will run multiple times
+  - if an older Zebra version opens the state, data can be written in an older format
+- the format must be valid before and after each database transaction or API call, because an upgrade can be cancelled at any time
+  - multi-column family changes should be made in database transactions
+  - if you are building a new column family, disable state queries, then enable them once it's done
+  - if each database API call produces a valid format, transactions aren't needed
+
+If there is an upgrade failure, it can panic and tell the user to delete their cached state and re-launch Zebra.
+
+### Implementation Steps
+
+- [ ] update the [database format](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current) in the Zebra docs
+- [ ] increment the state minor version
+- [ ] write the new format in the block write task
+- [ ] update older formats in the format upgrade task
+- [ ] test that the new format works when creating a new state, and updating an older state
+
+See the [upgrade design docs](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design) for more details.
+
+These steps can be copied into tickets.
+
+## Current State Database Format
+[current]: #current
+
+rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is
+a distinct "tree". Keys are sorted using lexicographic order (`[u8].sorted()`) on byte strings, so
+integer values should be stored using big-endian encoding (so that the lex
+order on byte strings is the numeric ordering).
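
The ordering claim above can be checked with a small sketch. It assumes a 3-byte integer key (the same width as the 24-bit `Height` described in this document); the helper names `height_key_be` and `height_key_le` are hypothetical and are not part of Zebra's API.

```rust
/// Encodes a height as 3 big-endian bytes.
/// Hypothetical helper for illustration only.
fn height_key_be(height: u32) -> [u8; 3] {
    assert!(height < (1 << 24), "height must fit in 24 bits");
    let bytes = height.to_be_bytes();
    // Drop the most significant byte, which is always zero here.
    [bytes[1], bytes[2], bytes[3]]
}

/// Encodes a height as 3 little-endian bytes, for contrast.
/// Hypothetical helper for illustration only.
fn height_key_le(height: u32) -> [u8; 3] {
    assert!(height < (1 << 24), "height must fit in 24 bits");
    let bytes = height.to_le_bytes();
    // Drop the most significant byte, which is last in little-endian order.
    [bytes[0], bytes[1], bytes[2]]
}

/// Returns true if byte-wise (lexicographic) key order matches numeric order
/// for this pair of heights under the big-endian encoding.
fn be_order_matches(a: u32, b: u32) -> bool {
    a.cmp(&b) == height_key_be(a).cmp(&height_key_be(b))
}
```

With these helpers, `height_key_be(255) < height_key_be(256)` holds, while the little-endian keys compare the other way round, which is why little-endian keys would break range and prefix scans.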
+ +Note that the lex order storage allows creating 1-to-many maps using keys only. +For example, the `tx_loc_by_transparent_addr_loc` allows mapping each address +to all transactions related to it, by simply storing each transaction prefixed +with the address as the key, leaving the value empty. Since rocksdb allows +listing all keys with a given prefix, it will allow listing all transactions +related to a given address. + +We use the following rocksdb column families: + +| Column Family | Keys | Values | Changes | +| ---------------------------------- | ---------------------- | ----------------------------- | ------- | +| *Blocks* | | | | +| `hash_by_height` | `block::Height` | `block::Hash` | Create | +| `height_by_hash` | `block::Hash` | `block::Height` | Create | +| `block_header_by_height` | `block::Height` | `block::Header` | Create | +| *Transactions* | | | | +| `tx_by_loc` | `TransactionLocation` | `Transaction` | Create | +| `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Create | +| `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Create | +| *Transparent* | | | | +| `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update | +| `tx_loc_by_transparent_addr_loc` | `AddressTransaction` | `()` | Create | +| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete | +| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()` | Delete | +| *Sprout* | | | | +| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create | +| `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create | +| `sprout_note_commitment_tree` | `block::Height` | `sprout::NoteCommitmentTree` | Delete | +| *Sapling* | | | | +| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create | +| `sapling_anchors` | `sapling::tree::Root` | `()` | Create | +| `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create | +| *Orchard* | | | | +| 
`orchard_nullifiers` | `orchard::Nullifier` | `()` | Create | +| `orchard_anchors` | `orchard::tree::Root` | `()` | Create | +| `orchard_note_commitment_tree` | `block::Height` | `orchard::NoteCommitmentTree` | Create | +| *Chain* | | | | +| `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete | +| `tip_chain_value_pool` | `()` | `ValueBalance` | Update | + +Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`. +Other structures are encoded using `IntoDisk`/`FromDisk`. + +Block and Transaction Data: +- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks) +- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit) +- `TransactionCount`: same as `TransactionIndex` +- `TransactionLocation`: `Height \|\| TransactionIndex` +- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit) +- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian, unsigned (max ~49,000 transfers in the 2 MB block limit) +- `OutputLocation`: `TransactionLocation \|\| OutputIndex` +- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`. + Always has the same value for each address, even if the first output is spent. +- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key +- `AddressUnspentOutput`: `AddressLocation \|\| OutputLocation`, + used instead of a `BTreeSet` value, to improve database performance +- `AddressTransaction`: `AddressLocation \|\| TransactionLocation` + used instead of a `BTreeSet` value, to improve database performance + +We use big-endian encoding for keys, to allow database index prefix searches. 
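
As a rough illustration of the key layouts and prefix searches described above, the following sketch builds `OutputLocation` and `AddressUnspentOutput` keys and lists the outputs stored under one address prefix. It uses an in-memory `BTreeSet` as a stand-in for a RocksDB column family, and all function names are hypothetical rather than Zebra's actual API.

```rust
use std::collections::BTreeSet;

/// 8-byte `OutputLocation`: 3-byte height, 2-byte transaction index,
/// 3-byte output index, all big-endian. Hypothetical helper.
fn output_location(height: u32, tx_index: u16, output_index: u32) -> [u8; 8] {
    assert!(height < (1 << 24), "height must fit in 24 bits");
    assert!(output_index < (1 << 24), "output index must fit in 24 bits");
    let h = height.to_be_bytes();
    let t = tx_index.to_be_bytes();
    let o = output_index.to_be_bytes();
    [h[1], h[2], h[3], t[0], t[1], o[1], o[2], o[3]]
}

/// 16-byte `AddressUnspentOutput` key: `AddressLocation || OutputLocation`.
fn address_unspent_output(address_location: [u8; 8], output: [u8; 8]) -> [u8; 16] {
    let mut key = [0u8; 16];
    key[..8].copy_from_slice(&address_location);
    key[8..].copy_from_slice(&output);
    key
}

/// Lists the output locations stored under one address prefix, in chain order.
/// The `BTreeSet` stands in for a key-only column family with empty values.
fn unspent_outputs_for(index: &BTreeSet<[u8; 16]>, address_location: [u8; 8]) -> Vec<[u8; 8]> {
    // The smallest possible key with this prefix is the prefix followed by zeroes.
    let start = address_unspent_output(address_location, [0u8; 8]);
    index
        .range(start..)
        .take_while(|key| key[..8] == address_location[..])
        .map(|key| {
            let mut output = [0u8; 8];
            output.copy_from_slice(&key[8..]);
            output
        })
        .collect()
}
```

Because every component is big-endian, the keys for one address are contiguous and already sorted by height, transaction index, and output index, so no separate sort step is needed.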
+
+Amounts:
+- `Amount`: 64 bits, little-endian, signed
+- `ValueBalance`: `[Amount; 4]`
+
+Derived Formats:
+- `*::NoteCommitmentTree`: `bincode` using `serde`
+- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation
+
+
+The following figure helps visualize the address index, which is the most complicated part.
+Numbers in brackets are array sizes; bold arrows are compositions (e.g. `TransactionLocation` is the
+concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many
+maps (e.g. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`,
+but also is used to map each `AddressLocation` to multiple `TransactionLocation`s).
+
+```mermaid
+graph TD;
+    Address -->|"balance_by_transparent_addr"| AddressBalance;
+    AddressBalance ==> Amount;
+    AddressBalance ==> AddressLocation;
+    AddressLocation ==> FirstOutputLocation;
+    AddressLocation -.->|"tx_loc_by_transparent_addr_loc<br/>(AddressTransaction[13])"| TransactionLocation;
+    TransactionLocation ==> Height;
+    TransactionLocation ==> TransactionIndex;
+    OutputLocation -->|utxo_by_out_loc| Output;
+    OutputLocation ==> TransactionLocation;
+    OutputLocation ==> OutputIndex;
+    AddressLocation -.->|"utxo_loc_by_transparent_addr_loc<br/>(AddressUnspentOutput[16])"| OutputLocation;
+
+    AddressBalance["AddressBalance[16]"];
+    Amount["Amount[8]"];
+    Height["Height[3]"];
+    Address["Address[21]"];
+    TransactionIndex["TransactionIndex[2]"];
+    TransactionLocation["TransactionLocation[5]"];
+    OutputIndex["OutputIndex[3]"];
+    OutputLocation["OutputLocation[8]"];
+    FirstOutputLocation["First OutputLocation[8]"];
+    AddressLocation["AddressLocation[8]"];
+```
+
+### Implementing consensus rules using rocksdb
+[rocksdb-consensus-rules]: #rocksdb-consensus-rules
+
+Each column family handles updates differently, based on its specific consensus rules:
+- Create:
+  - Each key-value entry is created once.
+  - Keys are never deleted, values are never updated.
+- Delete:
+  - Each key-value entry is created once.
+  - Keys can be deleted, but values are never updated.
+  - Code called by ReadStateService must ignore deleted keys, or use a read lock.
+  - TODO: should we prevent re-inserts of keys that have been deleted?
+- Update:
+  - Each key-value entry is created once.
+  - Keys are never deleted, but values can be updated.
+  - Code called by ReadStateService must handle old or new values, or use a read lock.
+
+We can't do some kinds of value updates, because they cause RocksDB performance issues:
+- Append:
+  - Keys are never deleted.
+  - Existing values are never updated.
+  - Sets of values have additional items appended to the end of the set.
+  - Code called by ReadStateService must handle shorter or longer sets, or use a read lock.
+- Up/Del:
+  - Keys can be deleted.
+  - Sets of values have items added or deleted (in any position).
+  - Code called by ReadStateService must ignore deleted keys and values,
+    accept shorter or longer sets, and accept old or new values.
+    Or it should use a read lock.
+
+Avoid using large sets of values as RocksDB keys or values.
+
+### RocksDB read locks
+[rocksdb-read-locks]: #rocksdb-read-locks
+
+The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
+column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
+and the current finalized state database.
+
+The StateService uses RocksDB transactions for each block write.
+So ReadStateService queries that only access a single key or value will always see
+a consistent view of the database.
+
+If a ReadStateService query only uses column families that have keys and values appended
+(`Create` in the Changes column of the table above), it should ignore extra appended values.
+Most queries do this by default.
+
+For more complex queries, there are several options:
+
+Reading across multiple column families:
+1. Ignore deleted values using custom Rust code
+2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot
+
+Reading a single column family:
+3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf
+4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf
+
+RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
+
+### Low-Level Implementation Details
+[rocksdb-low-level]: #rocksdb-low-level
+
+RocksDB ignores duplicate puts and deletes, preserving the latest values.
+If rejecting duplicate puts or deletes is consensus-critical,
+check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf)
+before putting or deleting any values in a batch.
+
+Currently, these restrictions should be enforced by code review:
+- multiple `zs_insert`s are only allowed on Update column families, and
+- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf)
+  is only allowed on Delete column families.
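
The "check before putting or deleting" rule above could look roughly like the following sketch. It deliberately avoids the real `rocksdb` crate and uses an in-memory `BTreeMap` as a stand-in for one column family, so `CheckedColumn`, `insert_once`, `delete_existing`, and `WriteError` are all hypothetical names, not Zebra or RocksDB APIs.

```rust
use std::collections::BTreeMap;

/// Errors returned when a write would be a duplicate put or delete.
#[derive(Debug, PartialEq, Eq)]
enum WriteError {
    DuplicateKey,
    MissingKey,
}

/// Hypothetical in-memory stand-in for a single column family.
#[derive(Default)]
struct CheckedColumn {
    entries: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl CheckedColumn {
    /// Inserts a key once, rejecting a second put instead of silently
    /// overwriting it (the behaviour wanted for Create and Delete columns).
    fn insert_once(&mut self, key: &[u8], value: &[u8]) -> Result<(), WriteError> {
        // Read first, the in-memory analogue of calling `get_cf` before a put.
        if self.entries.contains_key(key) {
            return Err(WriteError::DuplicateKey);
        }
        self.entries.insert(key.to_vec(), value.to_vec());
        Ok(())
    }

    /// Deletes a key, rejecting deletes of keys that are not present
    /// (the behaviour wanted for Delete columns).
    fn delete_existing(&mut self, key: &[u8]) -> Result<(), WriteError> {
        match self.entries.remove(key) {
            Some(_) => Ok(()),
            None => Err(WriteError::MissingKey),
        }
    }
}
```

In a real batch write the existence check and the write are separate database calls, so this pattern is only safe when a single task performs all writes, as the block write task does in the design described here.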
+
+In future, we could enforce these restrictions by:
+- creating traits for Create, Delete, and Update
+- doing different checks in `zs_insert` depending on the trait
+- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families.
+
+As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94),
+and merge operators are unreliable (or have undocumented behaviour).
+So they should not be used for consensus-critical checks.
+
+### Notes on rocksdb column families
+[rocksdb-column-families]: #rocksdb-column-families
+
+- The `hash_by_height` and `height_by_hash` column families provide a bijection between
+  block heights and block hashes. (Since the rocksdb state only stores finalized
+  state, they are actually a bijection).
+
+- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between
+  transaction locations and transaction hashes.
+
+- The `block_header_by_height` column family provides a bijection between block
+  heights and block header data. There is no corresponding `height_by_block` column
+  family: instead, hash the block header, and look up the hash in `height_by_hash`.
+  (Since the rocksdb state only stores finalized state, they are actually a bijection).
+  Similarly, there are no column families that go from transaction data
+  to transaction locations: hash the transaction and use `tx_loc_by_hash`.
+
+- Block headers and transactions are stored separately in the database,
+  so that individual transactions can be accessed efficiently.
+ Blocks can be re-created on request using the following process: + - Look up `height` in `height_by_hash` + - Get the block header for `height` from `block_header_by_height` + - Iterate from `TransactionIndex` 0, + to get each transaction with `height` from `tx_by_loc`, + stopping when there are no more transactions in the block + +- Block headers are stored by height, not by hash. This has the downside that looking + up a block by hash requires an extra level of indirection. The upside is + that blocks with adjacent heights are adjacent in the database, and many + common access patterns, such as helping a client sync the chain or doing + analysis, access blocks in (potentially sparse) height order. In addition, + the fact that we commit blocks in order means we're writing only to the end + of the rocksdb column family, which may help save space. + +- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`, + and chain order within each vector in `utxo_loc_by_transparent_addr_loc` and + `tx_loc_by_transparent_addr_loc`. + +- `TransactionLocation`s are stored as a `(height, index)` pair referencing the + height of the transaction's parent block and the transaction's index in that + block. This would more traditionally be a `(hash, index)` pair, but because + we store blocks by height, storing the height saves one level of indirection. + Transaction hashes can be looked up using `hash_by_tx_loc`. + +- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`, + rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`, + and reconstructed using `hash_by_tx_loc`. + +- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data, + `height: OutputLocation.height`, and + `is_coinbase: OutputLocation.transaction_index == 0` + (coinbase transactions are always the first transaction in a block). 
+ +- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s + that are still in `utxo_by_out_loc`. It is cached to improve performance for + addresses with large UTXO sets. It also stores the `AddressLocation` for each + address, which allows for efficient lookups. + +- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations + by address. The address location and UTXO location are stored as a RocksDB key, + so they are in chain order, and get good database performance. + This column family also includes the original address location UTXO, + if it has not been spent. + +- When a block write deletes a UTXO from `utxo_by_out_loc`, + that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`. + The deleted UTXO can be removed efficiently, because the UTXO location is part of the key. + This is an index optimisation, which does not affect query results. + +- `tx_loc_by_transparent_addr_loc` stores transaction locations by address. + This list includes transactions containing spent UTXOs. + The address location and transaction location are stored as a RocksDB key, + so they are in chain order, and get good database performance. + This column family also includes the `TransactionLocation` + of the transaction for the `AddressLocation`. + +- The `sprout_note_commitment_tree` stores the note commitment tree state + at the tip of the finalized state, for the specific pool. There is always + a single entry. Each tree is stored + as a "Merkle tree frontier" which is basically a (logarithmic) subset of + the Merkle tree nodes as required to insert new items. + For each block committed, the old tree is deleted and a new one is inserted + by its new height. + **TODO:** store the sprout note commitment tree by `()`, + to avoid ReadStateService concurrent write issues. + +- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree + state for every height, for the specific pool.
Each tree is stored + as a "Merkle tree frontier" which is basically a (logarithmic) subset of + the Merkle tree nodes as required to insert new items. + +- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized + state. There is always a single entry for it. The tree is stored as the set of "peaks" + of the "Merkle mountain range" tree structure, which is what is required to + insert new items. + **TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues. + +- Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment + tree of a certain block. We only use the keys since we just need the set of anchors, + regardless of where they come from. The exception is `sprout_anchors` which also maps + the anchor to the matching note commitment tree. This is required to support interstitial + treestates, which are unique to Sprout. + **TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the + note commitment tree. This de-duplicates tree state data. But we currently only store one sprout tree by height. + +- The value pools are only stored for the finalized tip. + +- We do not store the cumulative work for the finalized chain, + because the finalized work is equal for all non-finalized chains. + So the additional non-finalized work can be used to calculate the relative chain order, + and choose the best chain. diff --git a/zebra-state/src/service/finalized_state/disk_format/upgrade.rs b/zebra-state/src/service/finalized_state/disk_format/upgrade.rs index 15c1e0037..9f61855c1 100644 --- a/zebra-state/src/service/finalized_state/disk_format/upgrade.rs +++ b/zebra-state/src/service/finalized_state/disk_format/upgrade.rs @@ -218,6 +218,9 @@ impl DbFormatChange { /// /// If `cancel_receiver` gets a message, or its sender is dropped, /// the format change stops running early. 
+ /// + /// See the format upgrade design docs for more details: + /// // // New format upgrades must be added to the *end* of this method. fn apply_format_upgrade( @@ -259,8 +262,6 @@ impl DbFormatChange { }; // Example format change. - // - // TODO: link to format upgrade instructions doc here // Check if we need to do this upgrade. let database_format_add_format_change_task =