change(doc): Document how to upgrade the database format (#7261)
* Move the state format into a new doc * Add upgrade instructions * Link to the format upgrade docs from the upgrade code * Fix typo Co-authored-by: Marek <mail@marek.onl> --------- Co-authored-by: Marek <mail@marek.onl>
This commit is contained in:
parent
9ebd56092b
commit
512dd9bc5d
|
@ -20,7 +20,14 @@
|
|||
- [Developer Documentation](dev.md)
|
||||
- [Contribution Guide](CONTRIBUTING.md)
|
||||
- [Design Overview](dev/overview.md)
|
||||
- [Diagrams](dev/diagrams.md)
|
||||
- [Network Architecture](dev/diagrams/zebra-network.md)
|
||||
- [Upgrading the State Database](dev/state-db-upgrades.md)
|
||||
- [Zebra versioning and releases](dev/release-process.md)
|
||||
- [Continuous Integration](dev/continuous-integration.md)
|
||||
- [Continuous Delivery](dev/continuous-delivery.md)
|
||||
- [Generating Zebra Checkpoints](dev/zebra-checkpoints.md)
|
||||
- [Doing Mass Renames](dev/mass-renames.md)
|
||||
- [Zebra RFCs](dev/rfcs.md)
|
||||
- [Pipelinable Block Lookup](dev/rfcs/0001-pipelinable-block-lookup.md)
|
||||
- [Parallel Verification](dev/rfcs/0002-parallel-verification.md)
|
||||
|
@ -32,10 +39,4 @@
|
|||
- [V5 Transaction](dev/rfcs/0010-v5-transaction.md)
|
||||
- [Async Rust in Zebra](dev/rfcs/0011-async-rust-in-zebra.md)
|
||||
- [Value Pools](dev/rfcs/0012-value-pools.md)
|
||||
- [Diagrams](dev/diagrams.md)
|
||||
- [Network Architecture](dev/diagrams/zebra-network.md)
|
||||
- [Continuous Integration](dev/continuous-integration.md)
|
||||
- [Continuous Delivery](dev/continuous-delivery.md)
|
||||
- [Generating Zebra Checkpoints](dev/zebra-checkpoints.md)
|
||||
- [Doing Mass Renames](dev/mass-renames.md)
|
||||
- [API Reference](api.md)
|
||||
|
|
|
@ -663,305 +663,7 @@ New `non-finalized` blocks are committed as follows:
|
|||
## rocksdb data structures
|
||||
[rocksdb]: #rocksdb
|
||||
|
||||
rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is
|
||||
a distinct "tree". Keys are sorted using lex order on byte strings, so
|
||||
integer values should be stored using big-endian encoding (so that the lex
|
||||
order on byte strings is the numeric ordering).
|
||||
|
||||
Note that the lex order storage allows creating 1-to-many maps using keys only.
|
||||
For example, the `tx_loc_by_transparent_addr_loc` allows mapping each address
|
||||
to all transactions related to it, by simply storing each transaction prefixed
|
||||
with the address as the key, leaving the value empty. Since rocksdb allows
|
||||
listing all keys with a given prefix, it will allow listing all transactions
|
||||
related to a given address.
|
||||
|
||||
We use the following rocksdb column families:
|
||||
|
||||
| Column Family | Keys | Values | Changes |
|
||||
| ---------------------------------- | ---------------------- | ----------------------------- | ------- |
|
||||
| *Blocks* | | | |
|
||||
| `hash_by_height` | `block::Height` | `block::Hash` | Create |
|
||||
| `height_by_hash` | `block::Hash` | `block::Height` | Create |
|
||||
| `block_header_by_height` | `block::Height` | `block::Header` | Create |
|
||||
| *Transactions* | | | |
|
||||
| `tx_by_loc` | `TransactionLocation` | `Transaction` | Create |
|
||||
| `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Create |
|
||||
| `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Create |
|
||||
| *Transparent* | | | |
|
||||
| `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update |
|
||||
| `tx_loc_by_transparent_addr_loc` | `AddressTransaction` | `()` | Create |
|
||||
| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete |
|
||||
| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()` | Delete |
|
||||
| *Sprout* | | | |
|
||||
| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create |
|
||||
| `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create |
|
||||
| `sprout_note_commitment_tree` | `block::Height` | `sprout::NoteCommitmentTree` | Delete |
|
||||
| *Sapling* | | | |
|
||||
| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create |
|
||||
| `sapling_anchors` | `sapling::tree::Root` | `()` | Create |
|
||||
| `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create |
|
||||
| *Orchard* | | | |
|
||||
| `orchard_nullifiers` | `orchard::Nullifier` | `()` | Create |
|
||||
| `orchard_anchors` | `orchard::tree::Root` | `()` | Create |
|
||||
| `orchard_note_commitment_tree` | `block::Height` | `orchard::NoteCommitmentTree` | Create |
|
||||
| *Chain* | | | |
|
||||
| `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete |
|
||||
| `tip_chain_value_pool` | `()` | `ValueBalance` | Update |
|
||||
|
||||
Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`.
|
||||
Other structures are encoded using `IntoDisk`/`FromDisk`.
|
||||
|
||||
Block and Transaction Data:
|
||||
- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks)
|
||||
- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit)
|
||||
- `TransactionCount`: same as `TransactionIndex`
|
||||
- `TransactionLocation`: `Height \|\| TransactionIndex`
|
||||
- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit)
|
||||
- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian, unsigned (max ~49,000 transfers in the 2 MB block limit)
|
||||
- `OutputLocation`: `TransactionLocation \|\| OutputIndex`
|
||||
- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`.
|
||||
Always has the same value for each address, even if the first output is spent.
|
||||
- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key
|
||||
- `AddressUnspentOutput`: `AddressLocation \|\| OutputLocation`,
|
||||
used instead of a `BTreeSet<OutputLocation>` value, to improve database performance
|
||||
- `AddressTransaction`: `AddressLocation \|\| TransactionLocation`
|
||||
used instead of a `BTreeSet<TransactionLocation>` value, to improve database performance
|
||||
|
||||
We use big-endian encoding for keys, to allow database index prefix searches.
|
||||
|
||||
Amounts:
|
||||
- `Amount`: 64 bits, little-endian, signed
|
||||
- `ValueBalance`: `[Amount; 4]`
|
||||
|
||||
Derived Formats:
|
||||
- `*::NoteCommitmentTree`: `bincode` using `serde`
|
||||
- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation
|
||||
|
||||
|
||||
The following figure helps visualizing the address index, which is the most complicated part.
|
||||
Numbers in brackets are array sizes; bold arrows are compositions (i.e. `TransactionLocation` is the
|
||||
concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many
|
||||
maps (i.e. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`,
|
||||
but also is used to map each `AddressLocation` to multiple `TransactionLocation`s).
|
||||
|
||||
```mermaid
|
||||
graph TD;
|
||||
Address -->|"balance_by_transparent_addr<br/>"| AddressBalance;
|
||||
AddressBalance ==> Amount;
|
||||
AddressBalance ==> AddressLocation;
|
||||
AddressLocation ==> FirstOutputLocation;
|
||||
AddressLocation -.->|"tx_loc_by_transparent_addr_loc<br/>(AddressTransaction[13])"| TransactionLocation;
|
||||
TransactionLocation ==> Height;
|
||||
TransactionLocation ==> TransactionIndex;
|
||||
OutputLocation -->|utxo_by_out_loc| Output;
|
||||
OutputLocation ==> TransactionLocation;
|
||||
OutputLocation ==> OutputIndex;
|
||||
AddressLocation -.->|"utxo_loc_by_transparent_addr_loc<br/>(AddressUnspentOutput[16])"| OutputLocation;
|
||||
|
||||
AddressBalance["AddressBalance[16]"];
|
||||
Amount["Amount[8]"];
|
||||
Height["Height[3]"];
|
||||
Address["Address[21]"];
|
||||
TransactionIndex["TransactionIndex[2]"];
|
||||
TransactionLocation["TransactionLocation[5]"];
|
||||
OutputIndex["OutputIndex[3]"];
|
||||
OutputLocation["OutputLocation[8]"];
|
||||
FirstOutputLocation["First OutputLocation[8]"];
|
||||
AddressLocation["AddressLocation[8]"];
|
||||
```
|
||||
|
||||
### Implementing consensus rules using rocksdb
|
||||
[rocksdb-consensus-rules]: #rocksdb-consensus-rules
|
||||
|
||||
Each column family handles updates differently, based on its specific consensus rules:
|
||||
- Create:
|
||||
- Each key-value entry is created once.
|
||||
- Keys are never deleted, values are never updated.
|
||||
- Delete:
|
||||
- Each key-value entry is created once.
|
||||
- Keys can be deleted, but values are never updated.
|
||||
- Code called by ReadStateService must ignore deleted keys, or use a read lock.
|
||||
- TODO: should we prevent re-inserts of keys that have been deleted?
|
||||
- Update:
|
||||
- Each key-value entry is created once.
|
||||
- Keys are never deleted, but values can be updated.
|
||||
- Code called by ReadStateService must handle old or new values, or use a read lock.
|
||||
|
||||
We can't do some kinds of value updates, because they cause RocksDB performance issues:
|
||||
- Append:
|
||||
- Keys are never deleted.
|
||||
- Existing values are never updated.
|
||||
- Sets of values have additional items appended to the end of the set.
|
||||
- Code called by ReadStateService must handle shorter or longer sets, or use a read lock.
|
||||
- Up/Del:
|
||||
- Keys can be deleted.
|
||||
- Sets of values have items added or deleted (in any position).
|
||||
- Code called by ReadStateService must ignore deleted keys and values,
|
||||
accept shorter or longer sets, and accept old or new values.
|
||||
Or it should use a read lock.
|
||||
|
||||
Avoid using large sets of values as RocksDB keys or values.
|
||||
|
||||
### RocksDB read locks
|
||||
[rocksdb-read-locks]: #rocksdb-read-locks
|
||||
|
||||
The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
|
||||
column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
|
||||
and the current finalized state database.
|
||||
|
||||
The StateService uses RocksDB transactions for each block write.
|
||||
So ReadStateService queries that only access a single key or value will always see
|
||||
a consistent view of the database.
|
||||
|
||||
If a ReadStateService query only uses column families that have keys and values appended
|
||||
(`Never` in the Updates table above), it should ignore extra appended values.
|
||||
Most queries do this by default.
|
||||
|
||||
For more complex queries, there are several options:
|
||||
|
||||
Reading across multiple column families:
|
||||
1. Ignore deleted values using custom Rust code
|
||||
2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot
|
||||
|
||||
Reading a single column family:
|
||||
3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf
|
||||
4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf
|
||||
|
||||
RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
|
||||
|
||||
### Low-Level Implementation Details
|
||||
[rocksdb-low-level]: #rocksdb-low-level
|
||||
|
||||
RocksDB ignores duplicate puts and deletes, preserving the latest values.
|
||||
If rejecting duplicate puts or deletes is consensus-critical,
|
||||
check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf)
|
||||
before putting or deleting any values in a batch.
|
||||
|
||||
Currently, these restrictions should be enforced by code review:
|
||||
- multiple `zs_insert`s are only allowed on Update column families, and
|
||||
- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf)
|
||||
is only allowed on Delete column families.
|
||||
|
||||
In future, we could enforce these restrictions by:
|
||||
- creating traits for Never, Delete, and Update
|
||||
- doing different checks in `zs_insert` depending on the trait
|
||||
- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families.
|
||||
|
||||
As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94),
|
||||
and merge operators are unreliable (or have undocumented behaviour).
|
||||
So they should not be used for consensus-critical checks.
|
||||
|
||||
### Notes on rocksdb column families
|
||||
[rocksdb-column-families]: #rocksdb-column-families
|
||||
|
||||
- The `hash_by_height` and `height_by_hash` column families provide a bijection between
|
||||
block heights and block hashes. (Since the rocksdb state only stores finalized
|
||||
state, they are actually a bijection).
|
||||
|
||||
- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between
|
||||
transaction locations and transaction hashes.
|
||||
|
||||
- The `block_header_by_height` column family provides a bijection between block
|
||||
heights and block header data. There is no corresponding `height_by_block` column
|
||||
family: instead, hash the block header, and use the hash from `height_by_hash`.
|
||||
(Since the rocksdb state only stores finalized state, they are actually a bijection).
|
||||
Similarly, there are no column families that go from transaction data
|
||||
to transaction locations: hash the transaction and use `tx_loc_by_hash`.
|
||||
|
||||
- Block headers and transactions are stored separately in the database,
|
||||
so that individual transactions can be accessed efficiently.
|
||||
Blocks can be re-created on request using the following process:
|
||||
- Look up `height` in `height_by_hash`
|
||||
- Get the block header for `height` from `block_header_by_height`
|
||||
- Iterate from `TransactionIndex` 0,
|
||||
to get each transaction with `height` from `tx_by_loc`,
|
||||
stopping when there are no more transactions in the block
|
||||
|
||||
- Block headers are stored by height, not by hash. This has the downside that looking
|
||||
up a block by hash requires an extra level of indirection. The upside is
|
||||
that blocks with adjacent heights are adjacent in the database, and many
|
||||
common access patterns, such as helping a client sync the chain or doing
|
||||
analysis, access blocks in (potentially sparse) height order. In addition,
|
||||
the fact that we commit blocks in order means we're writing only to the end
|
||||
of the rocksdb column family, which may help save space.
|
||||
|
||||
- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`,
|
||||
and chain order within each vector in `utxo_loc_by_transparent_addr_loc` and
|
||||
`tx_loc_by_transparent_addr_loc`.
|
||||
|
||||
- `TransactionLocation`s are stored as a `(height, index)` pair referencing the
|
||||
height of the transaction's parent block and the transaction's index in that
|
||||
block. This would more traditionally be a `(hash, index)` pair, but because
|
||||
we store blocks by height, storing the height saves one level of indirection.
|
||||
Transaction hashes can be looked up using `hash_by_tx_loc`.
|
||||
|
||||
- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`,
|
||||
rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`,
|
||||
and reconstructed using `hash_by_tx_loc`.
|
||||
|
||||
- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data,
|
||||
`height: OutputLocation.height`, and
|
||||
`is_coinbase: OutputLocation.transaction_index == 0`
|
||||
(coinbase transactions are always the first transaction in a block).
|
||||
|
||||
- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s
|
||||
that are still in `utxo_by_out_loc`. It is cached to improve performance for
|
||||
addresses with large UTXO sets. It also stores the `AddressLocation` for each
|
||||
address, which allows for efficient lookups.
|
||||
|
||||
- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations
|
||||
by address. The address location and UTXO location are stored as a RocksDB key,
|
||||
so they are in chain order, and get good database performance.
|
||||
This column family includes also includes the original address location UTXO,
|
||||
if it has not been spent.
|
||||
|
||||
- When a block write deletes a UTXO from `utxo_by_out_loc`,
|
||||
that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`.
|
||||
The deleted UTXO can be removed efficiently, because the UTXO location is part of the key.
|
||||
This is an index optimisation, which does not affect query results.
|
||||
|
||||
- `tx_loc_by_transparent_addr_loc` stores transaction locations by address.
|
||||
This list includes transactions containing spent UTXOs.
|
||||
The address location and transaction location are stored as a RocksDB key,
|
||||
so they are in chain order, and get good database performance.
|
||||
This column family also includes the `TransactionLocation`
|
||||
of the transaction for the `AddressLocation`.
|
||||
|
||||
- The `sprout_note_commitment_tree` stores the note commitment tree state
|
||||
at the tip of the finalized state, for the specific pool. There is always
|
||||
a single entry. Each tree is stored
|
||||
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
|
||||
the Merkle tree nodes as required to insert new items.
|
||||
For each block committed, the old tree is deleted and a new one is inserted
|
||||
by its new height.
|
||||
**TODO:** store the sprout note commitment tree by `()`,
|
||||
to avoid ReadStateService concurrent write issues.
|
||||
|
||||
- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
|
||||
state for every height, for the specific pool. Each tree is stored
|
||||
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
|
||||
the Merkle tree nodes as required to insert new items.
|
||||
|
||||
- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
|
||||
state. There is always a single entry for it. The tree is stored as the set of "peaks"
|
||||
of the "Merkle mountain range" tree structure, which is what is required to
|
||||
insert new items.
|
||||
**TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.
|
||||
|
||||
- Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment
|
||||
tree of a certain block. We only use the keys since we just need the set of anchors,
|
||||
regardless of where they come from. The exception is `sprout_anchors` which also maps
|
||||
the anchor to the matching note commitment tree. This is required to support interstitial
|
||||
treestates, which are unique to Sprout.
|
||||
**TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the
|
||||
note commitment tree. This de-duplicates tree state data. But we currently only store one sprout tree by height.
|
||||
|
||||
- The value pools are only stored for the finalized tip.
|
||||
|
||||
- We do not store the cumulative work for the finalized chain,
|
||||
because the finalized work is equal for all non-finalized chains.
|
||||
So the additional non-finalized work can be used to calculate the relative chain order,
|
||||
and choose the best chain.
|
||||
The current database format is documented in [Upgrading the State Database](../state-db-upgrades.md).
|
||||
|
||||
## Committing finalized blocks
|
||||
|
||||
|
|
|
@ -0,0 +1,356 @@
|
|||
# Zebra Cached State Database Implementation
|
||||
|
||||
## Upgrading the State Database
|
||||
|
||||
For most state upgrades, we want to modify the database format of the existing database. If we
|
||||
change the major database version, every user needs to re-download and re-verify all the blocks,
|
||||
which can take days.
|
||||
|
||||
### In-Place Upgrade Goals
|
||||
|
||||
- avoid a full download and rebuild of the state
|
||||
- the previous state format must be able to be loaded by the new state
|
||||
- this is checked the first time CI runs on a PR with a new state version.
|
||||
After the first CI run, the cached state is marked as upgraded, so the upgrade doesn't run
|
||||
again. If CI fails on the first run, any cached states with that version should be deleted.
|
||||
- previous zebra versions should be able to load the new format
|
||||
- this is checked by other PRs running using the upgraded cached state, but only if a Rust PR
|
||||
runs after the new PR's CI finishes, but before it merges
|
||||
- best-effort loading of older supported states by newer Zebra versions
|
||||
- best-effort compatibility between newer states and older supported Zebra versions
|
||||
|
||||
### Design Constraints
|
||||
[design]: #design
|
||||
|
||||
Upgrades run concurrently with state verification and RPC requests.
|
||||
|
||||
This means that:
|
||||
- the state must be able to read the old and new formats
|
||||
- it can't panic if the data is missing
|
||||
- it can't give incorrect results, because that can affect verification or wallets
|
||||
- it can return an error
|
||||
- it can only return an `Option` if the caller handles it correctly
|
||||
- multiple upgrades must produce a valid state format
|
||||
- if Zebra is restarted, the format upgrade will run multiple times
|
||||
- if an older Zebra version opens the state, data can be written in an older format
|
||||
- the format must be valid before and after each database transaction or API call, because an upgrade can be cancelled at any time
|
||||
- multi-column family changes should made in database transactions
|
||||
- if you are building new column family, disable state queries, then enable them once it's done
|
||||
- if each database API call produces a valid format, transactions aren't needed
|
||||
|
||||
If there is an upgrade failure, it can panic and tell the user to delete their cached state and re-launch Zebra.
|
||||
|
||||
### Implementation Steps
|
||||
|
||||
- [ ] update the [database format](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current) in the Zebra docs
|
||||
- [ ] increment the state minor version
|
||||
- [ ] write the new format in the block write task
|
||||
- [ ] update older formats in the format upgrade task
|
||||
- [ ] test that the new format works when creating a new state, and updating an older state
|
||||
|
||||
See the [upgrade design docs](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design) for more details.
|
||||
|
||||
These steps can be copied into tickets.
|
||||
|
||||
## Current State Database Format
|
||||
[current]: #current
|
||||
|
||||
rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is
|
||||
a distinct "tree". Keys are sorted using lexographic order (`[u8].sorted()`) on byte strings, so
|
||||
integer values should be stored using big-endian encoding (so that the lex
|
||||
order on byte strings is the numeric ordering).
|
||||
|
||||
Note that the lex order storage allows creating 1-to-many maps using keys only.
|
||||
For example, the `tx_loc_by_transparent_addr_loc` allows mapping each address
|
||||
to all transactions related to it, by simply storing each transaction prefixed
|
||||
with the address as the key, leaving the value empty. Since rocksdb allows
|
||||
listing all keys with a given prefix, it will allow listing all transactions
|
||||
related to a given address.
|
||||
|
||||
We use the following rocksdb column families:
|
||||
|
||||
| Column Family | Keys | Values | Changes |
|
||||
| ---------------------------------- | ---------------------- | ----------------------------- | ------- |
|
||||
| *Blocks* | | | |
|
||||
| `hash_by_height` | `block::Height` | `block::Hash` | Create |
|
||||
| `height_by_hash` | `block::Hash` | `block::Height` | Create |
|
||||
| `block_header_by_height` | `block::Height` | `block::Header` | Create |
|
||||
| *Transactions* | | | |
|
||||
| `tx_by_loc` | `TransactionLocation` | `Transaction` | Create |
|
||||
| `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Create |
|
||||
| `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Create |
|
||||
| *Transparent* | | | |
|
||||
| `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update |
|
||||
| `tx_loc_by_transparent_addr_loc` | `AddressTransaction` | `()` | Create |
|
||||
| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete |
|
||||
| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()` | Delete |
|
||||
| *Sprout* | | | |
|
||||
| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create |
|
||||
| `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create |
|
||||
| `sprout_note_commitment_tree` | `block::Height` | `sprout::NoteCommitmentTree` | Delete |
|
||||
| *Sapling* | | | |
|
||||
| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create |
|
||||
| `sapling_anchors` | `sapling::tree::Root` | `()` | Create |
|
||||
| `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create |
|
||||
| *Orchard* | | | |
|
||||
| `orchard_nullifiers` | `orchard::Nullifier` | `()` | Create |
|
||||
| `orchard_anchors` | `orchard::tree::Root` | `()` | Create |
|
||||
| `orchard_note_commitment_tree` | `block::Height` | `orchard::NoteCommitmentTree` | Create |
|
||||
| *Chain* | | | |
|
||||
| `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete |
|
||||
| `tip_chain_value_pool` | `()` | `ValueBalance` | Update |
|
||||
|
||||
Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`.
|
||||
Other structures are encoded using `IntoDisk`/`FromDisk`.
|
||||
|
||||
Block and Transaction Data:
|
||||
- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks)
|
||||
- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit)
|
||||
- `TransactionCount`: same as `TransactionIndex`
|
||||
- `TransactionLocation`: `Height \|\| TransactionIndex`
|
||||
- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit)
|
||||
- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian, unsigned (max ~49,000 transfers in the 2 MB block limit)
|
||||
- `OutputLocation`: `TransactionLocation \|\| OutputIndex`
|
||||
- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`.
|
||||
Always has the same value for each address, even if the first output is spent.
|
||||
- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key
|
||||
- `AddressUnspentOutput`: `AddressLocation \|\| OutputLocation`,
|
||||
used instead of a `BTreeSet<OutputLocation>` value, to improve database performance
|
||||
- `AddressTransaction`: `AddressLocation \|\| TransactionLocation`
|
||||
used instead of a `BTreeSet<TransactionLocation>` value, to improve database performance
|
||||
|
||||
We use big-endian encoding for keys, to allow database index prefix searches.
|
||||
|
||||
Amounts:
|
||||
- `Amount`: 64 bits, little-endian, signed
|
||||
- `ValueBalance`: `[Amount; 4]`
|
||||
|
||||
Derived Formats:
|
||||
- `*::NoteCommitmentTree`: `bincode` using `serde`
|
||||
- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation
|
||||
|
||||
|
||||
The following figure helps visualizing the address index, which is the most complicated part.
|
||||
Numbers in brackets are array sizes; bold arrows are compositions (i.e. `TransactionLocation` is the
|
||||
concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many
|
||||
maps (i.e. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`,
|
||||
but also is used to map each `AddressLocation` to multiple `TransactionLocation`s).
|
||||
|
||||
```mermaid
|
||||
graph TD;
|
||||
Address -->|"balance_by_transparent_addr<br/>"| AddressBalance;
|
||||
AddressBalance ==> Amount;
|
||||
AddressBalance ==> AddressLocation;
|
||||
AddressLocation ==> FirstOutputLocation;
|
||||
AddressLocation -.->|"tx_loc_by_transparent_addr_loc<br/>(AddressTransaction[13])"| TransactionLocation;
|
||||
TransactionLocation ==> Height;
|
||||
TransactionLocation ==> TransactionIndex;
|
||||
OutputLocation -->|utxo_by_out_loc| Output;
|
||||
OutputLocation ==> TransactionLocation;
|
||||
OutputLocation ==> OutputIndex;
|
||||
AddressLocation -.->|"utxo_loc_by_transparent_addr_loc<br/>(AddressUnspentOutput[16])"| OutputLocation;
|
||||
|
||||
AddressBalance["AddressBalance[16]"];
|
||||
Amount["Amount[8]"];
|
||||
Height["Height[3]"];
|
||||
Address["Address[21]"];
|
||||
TransactionIndex["TransactionIndex[2]"];
|
||||
TransactionLocation["TransactionLocation[5]"];
|
||||
OutputIndex["OutputIndex[3]"];
|
||||
OutputLocation["OutputLocation[8]"];
|
||||
FirstOutputLocation["First OutputLocation[8]"];
|
||||
AddressLocation["AddressLocation[8]"];
|
||||
```
|
||||
|
||||
### Implementing consensus rules using rocksdb
|
||||
[rocksdb-consensus-rules]: #rocksdb-consensus-rules
|
||||
|
||||
Each column family handles updates differently, based on its specific consensus rules:
|
||||
- Create:
|
||||
- Each key-value entry is created once.
|
||||
- Keys are never deleted, values are never updated.
|
||||
- Delete:
|
||||
- Each key-value entry is created once.
|
||||
- Keys can be deleted, but values are never updated.
|
||||
- Code called by ReadStateService must ignore deleted keys, or use a read lock.
|
||||
- TODO: should we prevent re-inserts of keys that have been deleted?
|
||||
- Update:
|
||||
- Each key-value entry is created once.
|
||||
- Keys are never deleted, but values can be updated.
|
||||
- Code called by ReadStateService must handle old or new values, or use a read lock.
|
||||
|
||||
We can't do some kinds of value updates, because they cause RocksDB performance issues:
|
||||
- Append:
|
||||
- Keys are never deleted.
|
||||
- Existing values are never updated.
|
||||
- Sets of values have additional items appended to the end of the set.
|
||||
- Code called by ReadStateService must handle shorter or longer sets, or use a read lock.
|
||||
- Up/Del:
|
||||
- Keys can be deleted.
|
||||
- Sets of values have items added or deleted (in any position).
|
||||
- Code called by ReadStateService must ignore deleted keys and values,
|
||||
accept shorter or longer sets, and accept old or new values.
|
||||
Or it should use a read lock.
|
||||
|
||||
Avoid using large sets of values as RocksDB keys or values.
|
||||
|
||||
### RocksDB read locks
|
||||
[rocksdb-read-locks]: #rocksdb-read-locks
|
||||
|
||||
The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
|
||||
column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
|
||||
and the current finalized state database.
|
||||
|
||||
The StateService uses RocksDB transactions for each block write.
|
||||
So ReadStateService queries that only access a single key or value will always see
|
||||
a consistent view of the database.
|
||||
|
||||
If a ReadStateService query only uses column families that have keys and values appended
|
||||
(`Never` in the Updates table above), it should ignore extra appended values.
|
||||
Most queries do this by default.
|
||||
|
||||
For more complex queries, there are several options:
|
||||
|
||||
Reading across multiple column families:
|
||||
1. Ignore deleted values using custom Rust code
|
||||
2. Take a database snapshot - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot
|
||||
|
||||
Reading a single column family:
|
||||
3. multi_get - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf
|
||||
4. iterator - https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf
|
||||
|
||||
RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
|
||||
|
||||
### Low-Level Implementation Details
|
||||
[rocksdb-low-level]: #rocksdb-low-level
|
||||
|
||||
RocksDB ignores duplicate puts and deletes, preserving the latest values.
|
||||
If rejecting duplicate puts or deletes is consensus-critical,
|
||||
check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf)
|
||||
before putting or deleting any values in a batch.
|
||||
|
||||
Currently, these restrictions should be enforced by code review:
|
||||
- multiple `zs_insert`s are only allowed on Update column families, and
|
||||
- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf)
|
||||
is only allowed on Delete column families.
|
||||
|
||||
In future, we could enforce these restrictions by:
|
||||
- creating traits for Never, Delete, and Update
|
||||
- doing different checks in `zs_insert` depending on the trait
|
||||
- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families.
|
||||
|
||||
As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94),
|
||||
and merge operators are unreliable (or have undocumented behaviour).
|
||||
So they should not be used for consensus-critical checks.
|
||||
|
||||
### Notes on rocksdb column families
|
||||
[rocksdb-column-families]: #rocksdb-column-families
|
||||
|
||||
- The `hash_by_height` and `height_by_hash` column families provide a bijection between
|
||||
block heights and block hashes. (Since the rocksdb state only stores finalized
|
||||
state, they are actually a bijection).
|
||||
|
||||
- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between
|
||||
transaction locations and transaction hashes.
|
||||
|
||||
- The `block_header_by_height` column family provides a bijection between block
|
||||
heights and block header data. There is no corresponding `height_by_block` column
|
||||
family: instead, hash the block header, and use the hash from `height_by_hash`.
|
||||
(Since the rocksdb state only stores finalized state, they are actually a bijection).
|
||||
Similarly, there are no column families that go from transaction data
|
||||
to transaction locations: hash the transaction and use `tx_loc_by_hash`.
|
||||
|
||||
- Block headers and transactions are stored separately in the database,
|
||||
so that individual transactions can be accessed efficiently.
|
||||
Blocks can be re-created on request using the following process:
|
||||
- Look up `height` in `height_by_hash`
|
||||
- Get the block header for `height` from `block_header_by_height`
|
||||
- Iterate from `TransactionIndex` 0,
|
||||
to get each transaction with `height` from `tx_by_loc`,
|
||||
stopping when there are no more transactions in the block
|
||||
|
||||
- Block headers are stored by height, not by hash. This has the downside that looking
|
||||
up a block by hash requires an extra level of indirection. The upside is
|
||||
that blocks with adjacent heights are adjacent in the database, and many
|
||||
common access patterns, such as helping a client sync the chain or doing
|
||||
analysis, access blocks in (potentially sparse) height order. In addition,
|
||||
the fact that we commit blocks in order means we're writing only to the end
|
||||
of the rocksdb column family, which may help save space.
|
||||
|
||||
- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`,
|
||||
and chain order within each vector in `utxo_loc_by_transparent_addr_loc` and
|
||||
`tx_loc_by_transparent_addr_loc`.
|
||||
|
||||
- `TransactionLocation`s are stored as a `(height, index)` pair referencing the
|
||||
height of the transaction's parent block and the transaction's index in that
|
||||
block. This would more traditionally be a `(hash, index)` pair, but because
|
||||
we store blocks by height, storing the height saves one level of indirection.
|
||||
Transaction hashes can be looked up using `hash_by_tx_loc`.
|
||||
|
||||
- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`,
|
||||
rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`,
|
||||
and reconstructed using `hash_by_tx_loc`.
|
||||
|
||||
- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data,
|
||||
`height: OutputLocation.height`, and
|
||||
`is_coinbase: OutputLocation.transaction_index == 0`
|
||||
(coinbase transactions are always the first transaction in a block).
|
||||
|
||||
- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s
|
||||
that are still in `utxo_by_out_loc`. It is cached to improve performance for
|
||||
addresses with large UTXO sets. It also stores the `AddressLocation` for each
|
||||
address, which allows for efficient lookups.
|
||||
|
||||
- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations
|
||||
by address. The address location and UTXO location are stored as a RocksDB key,
|
||||
so they are in chain order, and get good database performance.
|
||||
This column family includes also includes the original address location UTXO,
|
||||
if it has not been spent.
|
||||
|
||||
- When a block write deletes a UTXO from `utxo_by_out_loc`,
|
||||
that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`.
|
||||
The deleted UTXO can be removed efficiently, because the UTXO location is part of the key.
|
||||
This is an index optimisation, which does not affect query results.
|
||||
|
||||
- `tx_loc_by_transparent_addr_loc` stores transaction locations by address.
|
||||
This list includes transactions containing spent UTXOs.
|
||||
The address location and transaction location are stored as a RocksDB key,
|
||||
so they are in chain order, and get good database performance.
|
||||
This column family also includes the `TransactionLocation`
|
||||
of the transaction for the `AddressLocation`.
|
||||
|
||||
- The `sprout_note_commitment_tree` stores the note commitment tree state
|
||||
at the tip of the finalized state, for the specific pool. There is always
|
||||
a single entry. Each tree is stored
|
||||
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
|
||||
the Merkle tree nodes as required to insert new items.
|
||||
For each block committed, the old tree is deleted and a new one is inserted
|
||||
by its new height.
|
||||
**TODO:** store the sprout note commitment tree by `()`,
|
||||
to avoid ReadStateService concurrent write issues.
|
||||
|
||||
- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
|
||||
state for every height, for the specific pool. Each tree is stored
|
||||
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
|
||||
the Merkle tree nodes as required to insert new items.
|
||||
|
||||
- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
|
||||
state. There is always a single entry for it. The tree is stored as the set of "peaks"
|
||||
of the "Merkle mountain range" tree structure, which is what is required to
|
||||
insert new items.
|
||||
**TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.
|
||||
|
||||
- Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment
|
||||
tree of a certain block. We only use the keys since we just need the set of anchors,
|
||||
regardless of where they come from. The exception is `sprout_anchors` which also maps
|
||||
the anchor to the matching note commitment tree. This is required to support interstitial
|
||||
treestates, which are unique to Sprout.
|
||||
**TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the
|
||||
note commitment tree. This de-duplicates tree state data. But we currently only store one sprout tree by height.
|
||||
|
||||
- The value pools are only stored for the finalized tip.
|
||||
|
||||
- We do not store the cumulative work for the finalized chain,
|
||||
because the finalized work is equal for all non-finalized chains.
|
||||
So the additional non-finalized work can be used to calculate the relative chain order,
|
||||
and choose the best chain.
|
|
@ -218,6 +218,9 @@ impl DbFormatChange {
|
|||
///
|
||||
/// If `cancel_receiver` gets a message, or its sender is dropped,
|
||||
/// the format change stops running early.
|
||||
///
|
||||
/// See the format upgrade design docs for more details:
|
||||
/// <https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design>
|
||||
//
|
||||
// New format upgrades must be added to the *end* of this method.
|
||||
fn apply_format_upgrade(
|
||||
|
@ -259,8 +262,6 @@ impl DbFormatChange {
|
|||
};
|
||||
|
||||
// Example format change.
|
||||
//
|
||||
// TODO: link to format upgrade instructions doc here
|
||||
|
||||
// Check if we need to do this upgrade.
|
||||
let database_format_add_format_change_task =
|
||||
|
|
Loading…
Reference in New Issue