change(doc): Document how to upgrade the database format (#7261)

* Move the state format into a new doc

* Add upgrade instructions

* Link to the format upgrade docs from the upgrade code

* Fix typo

Co-authored-by: Marek <mail@marek.onl>

teor 2023-07-20 11:50:25 +10:00 committed by GitHub
parent 9ebd56092b
commit 512dd9bc5d
4 changed files with 367 additions and 307 deletions


@ -20,7 +20,14 @@
- [Developer Documentation](dev.md)
- [Contribution Guide](CONTRIBUTING.md)
- [Design Overview](dev/overview.md)
- [Diagrams](dev/diagrams.md)
- [Network Architecture](dev/diagrams/zebra-network.md)
- [Upgrading the State Database](dev/state-db-upgrades.md)
- [Zebra versioning and releases](dev/release-process.md)
- [Continuous Integration](dev/continuous-integration.md)
- [Continuous Delivery](dev/continuous-delivery.md)
- [Generating Zebra Checkpoints](dev/zebra-checkpoints.md)
- [Doing Mass Renames](dev/mass-renames.md)
- [Zebra RFCs](dev/rfcs.md)
- [Pipelinable Block Lookup](dev/rfcs/0001-pipelinable-block-lookup.md)
- [Parallel Verification](dev/rfcs/0002-parallel-verification.md)
@ -32,10 +39,4 @@
- [V5 Transaction](dev/rfcs/0010-v5-transaction.md)
- [Async Rust in Zebra](dev/rfcs/0011-async-rust-in-zebra.md)
- [Value Pools](dev/rfcs/0012-value-pools.md)
- [Diagrams](dev/diagrams.md)
- [Network Architecture](dev/diagrams/zebra-network.md)
- [Continuous Integration](dev/continuous-integration.md)
- [Continuous Delivery](dev/continuous-delivery.md)
- [Generating Zebra Checkpoints](dev/zebra-checkpoints.md)
- [Doing Mass Renames](dev/mass-renames.md)
- [API Reference](api.md)


@ -663,305 +663,7 @@ New `non-finalized` blocks are committed as follows:
## rocksdb data structures
[rocksdb]: #rocksdb
The current database format is documented in [Upgrading the State Database](../state-db-upgrades.md).
## Committing finalized blocks


@ -0,0 +1,356 @@
# Zebra Cached State Database Implementation
## Upgrading the State Database
For most state upgrades, we want to modify the format of the existing database in place. If we
change the major database version instead, every user needs to re-download and re-verify all the
blocks, which can take days.
### In-Place Upgrade Goals
- avoid a full download and rebuild of the state
- the new state version must be able to load the previous state format
  - this is checked the first time CI runs on a PR with a new state version.
    After the first CI run, the cached state is marked as upgraded, so the upgrade doesn't run
    again. If CI fails on the first run, any cached states with that version should be deleted.
- previous Zebra versions should be able to load the new format
  - this is checked by other PRs that run using the upgraded cached state, but only if a Rust PR
    runs after the new PR's CI finishes and before it merges
- best-effort loading of older supported states by newer Zebra versions
- best-effort compatibility between newer states and older supported Zebra versions
### Design Constraints
[design]: #design
Upgrades run concurrently with state verification and RPC requests.
This means that:
- the state must be able to read the old and new formats
- it can't panic if the data is missing
- it can't give incorrect results, because that can affect verification or wallets
- it can return an error
- it can only return an `Option` if the caller handles it correctly
- multiple upgrades must produce a valid state format
- if Zebra is restarted, the format upgrade will run multiple times
- if an older Zebra version opens the state, data can be written in an older format
- the format must be valid before and after each database transaction or API call, because an upgrade can be cancelled at any time
- multi-column family changes should be made in database transactions
  - if you are building a new column family, disable state queries, then enable them once it's done
  - if each database API call produces a valid format, transactions aren't needed
If there is an upgrade failure, Zebra can panic and tell the user to delete their cached state and re-launch.
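Because an upgrade can be cancelled at any point, upgrade code is typically structured as small batches that each leave the format valid. Here is a minimal sketch of that pattern, assuming a `cancel_receiver` channel like the one passed to `apply_format_upgrade` below; the batch representation and `write_batch` helper are hypothetical, not Zebra's real API:

```rust
use std::sync::mpsc;

/// Hypothetical upgrade error type for this sketch.
enum UpgradeError {
    Cancelled,
    Db(String),
}

/// Runs a format upgrade as a series of small database batches.
/// Each batch leaves the format valid, so cancelling between batches is safe.
fn run_upgrade_batches(
    batches: Vec<Vec<(Vec<u8>, Vec<u8>)>>,
    cancel_receiver: &mpsc::Receiver<()>,
    write_batch: impl Fn(&[(Vec<u8>, Vec<u8>)]) -> Result<(), String>,
) -> Result<(), UpgradeError> {
    for batch in batches {
        // Stop early if we got a cancel message, or the sender was dropped.
        match cancel_receiver.try_recv() {
            Ok(()) | Err(mpsc::TryRecvError::Disconnected) => {
                return Err(UpgradeError::Cancelled);
            }
            Err(mpsc::TryRecvError::Empty) => {}
        }

        // Commit one batch: if Zebra is killed right after this call,
        // the database is still in a valid (partially upgraded) format.
        write_batch(&batch).map_err(UpgradeError::Db)?;
    }

    Ok(())
}
```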
### Implementation Steps
- [ ] update the [database format](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current) in the Zebra docs
- [ ] increment the state minor version
- [ ] write the new format in the block write task
- [ ] update older formats in the format upgrade task
- [ ] test that the new format works when creating a new state, and when upgrading an older state
See the [upgrade design docs](https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design) for more details.
These steps can be copied into tickets.
## Current State Database Format
[current]: #current
rocksdb provides a persistent, thread-safe `BTreeMap<&[u8], &[u8]>`. Each map is
a distinct "tree". Keys are sorted using lexicographic order on byte strings, so
integer values should be stored using big-endian encoding (so that the lexicographic
order on byte strings matches the numeric ordering).

Note that the lexicographic key order allows creating 1-to-many maps using keys only.
For example, the `tx_loc_by_transparent_addr_loc` column family maps each address
to all transactions related to it, by storing each transaction location prefixed
with the address location as the key, and leaving the value empty. Since rocksdb allows
listing all keys with a given prefix, this supports listing all transactions
related to a given address.
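For example, listing the transactions for one address is a prefix scan over keys. This is a minimal sketch using the Rust `rocksdb` crate directly; Zebra's real code goes through its own typed wrappers, so the function and decoding here are illustrative only:

```rust
use rocksdb::{Direction, IteratorMode, DB};

/// Returns the raw `TransactionLocation` key suffixes for one address,
/// by scanning `tx_loc_by_transparent_addr_loc` keys that start with
/// the 8-byte `AddressLocation` prefix. Values in this column family
/// are empty: the key itself is the data.
fn tx_locations_for_address(db: &DB, addr_loc_prefix: &[u8]) -> Vec<Vec<u8>> {
    let cf = db
        .cf_handle("tx_loc_by_transparent_addr_loc")
        .expect("column family exists");

    db.iterator_cf(cf, IteratorMode::From(addr_loc_prefix, Direction::Forward))
        .map(|entry| entry.expect("iteration succeeds").0)
        // Keys are in lexicographic order, so we can stop at the first
        // key that no longer starts with this address location.
        .take_while(|key| key.starts_with(addr_loc_prefix))
        // The rest of the key is the transaction location.
        .map(|key| key[addr_loc_prefix.len()..].to_vec())
        .collect()
}
```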
We use the following rocksdb column families:
| Column Family | Keys | Values | Changes |
| ---------------------------------- | ---------------------- | ----------------------------- | ------- |
| *Blocks* | | | |
| `hash_by_height` | `block::Height` | `block::Hash` | Create |
| `height_by_hash` | `block::Hash` | `block::Height` | Create |
| `block_header_by_height` | `block::Height` | `block::Header` | Create |
| *Transactions* | | | |
| `tx_by_loc` | `TransactionLocation` | `Transaction` | Create |
| `hash_by_tx_loc` | `TransactionLocation` | `transaction::Hash` | Create |
| `tx_loc_by_hash` | `transaction::Hash` | `TransactionLocation` | Create |
| *Transparent* | | | |
| `balance_by_transparent_addr` | `transparent::Address` | `Amount \|\| AddressLocation` | Update |
| `tx_loc_by_transparent_addr_loc` | `AddressTransaction` | `()` | Create |
| `utxo_by_out_loc` | `OutputLocation` | `transparent::Output` | Delete |
| `utxo_loc_by_transparent_addr_loc` | `AddressUnspentOutput` | `()` | Delete |
| *Sprout* | | | |
| `sprout_nullifiers` | `sprout::Nullifier` | `()` | Create |
| `sprout_anchors` | `sprout::tree::Root` | `sprout::NoteCommitmentTree` | Create |
| `sprout_note_commitment_tree` | `block::Height` | `sprout::NoteCommitmentTree` | Delete |
| *Sapling* | | | |
| `sapling_nullifiers` | `sapling::Nullifier` | `()` | Create |
| `sapling_anchors` | `sapling::tree::Root` | `()` | Create |
| `sapling_note_commitment_tree` | `block::Height` | `sapling::NoteCommitmentTree` | Create |
| *Orchard* | | | |
| `orchard_nullifiers` | `orchard::Nullifier` | `()` | Create |
| `orchard_anchors` | `orchard::tree::Root` | `()` | Create |
| `orchard_note_commitment_tree` | `block::Height` | `orchard::NoteCommitmentTree` | Create |
| *Chain* | | | |
| `history_tree` | `block::Height` | `NonEmptyHistoryTree` | Delete |
| `tip_chain_value_pool` | `()` | `ValueBalance` | Update |
Zcash structures are encoded using `ZcashSerialize`/`ZcashDeserialize`.
Other structures are encoded using `IntoDisk`/`FromDisk`.
Block and Transaction Data:
- `Height`: 24 bits, big-endian, unsigned (allows for ~30 years worth of blocks)
- `TransactionIndex`: 16 bits, big-endian, unsigned (max ~23,000 transactions in the 2 MB block limit)
- `TransactionCount`: same as `TransactionIndex`
- `TransactionLocation`: `Height \|\| TransactionIndex`
- `OutputIndex`: 24 bits, big-endian, unsigned (max ~223,000 transfers in the 2 MB block limit)
- transparent and shielded input indexes, and shielded output indexes: 16 bits, big-endian, unsigned (max ~49,000 transfers in the 2 MB block limit)
- `OutputLocation`: `TransactionLocation \|\| OutputIndex`
- `AddressLocation`: the first `OutputLocation` used by a `transparent::Address`.
Always has the same value for each address, even if the first output is spent.
- `Utxo`: `Output`, derives extra fields from the `OutputLocation` key
- `AddressUnspentOutput`: `AddressLocation \|\| OutputLocation`,
used instead of a `BTreeSet<OutputLocation>` value, to improve database performance
- `AddressTransaction`: `AddressLocation \|\| TransactionLocation`,
  used instead of a `BTreeSet<TransactionLocation>` value, to improve database performance
We use big-endian encoding for keys, to allow database index prefix searches.
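To make that concrete, here is a sketch of encoding a `TransactionLocation` key with the widths listed above (3-byte big-endian height, 2-byte big-endian index); the function name is illustrative:

```rust
/// Encodes `TransactionLocation` as `Height || TransactionIndex`:
/// 3 big-endian height bytes followed by 2 big-endian index bytes.
fn transaction_location_bytes(height: u32, tx_index: u16) -> [u8; 5] {
    assert!(height < (1 << 24), "Height must fit in 24 bits");

    let mut bytes = [0u8; 5];
    // Drop the most significant byte of the u32 to get 24 bits, big-endian.
    bytes[0..3].copy_from_slice(&height.to_be_bytes()[1..4]);
    bytes[3..5].copy_from_slice(&tx_index.to_be_bytes());
    bytes
}
```

With this encoding, the keys for height 255 (`00 00 ff …`) sort immediately before the keys for height 256 (`00 01 00 …`); little-endian keys would sort out of numeric order.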
Amounts:
- `Amount`: 64 bits, little-endian, signed
- `ValueBalance`: `[Amount; 4]`
Derived Formats:
- `*::NoteCommitmentTree`: `bincode` using `serde`
- `NonEmptyHistoryTree`: `bincode` using `serde`, using `zcash_history`'s `serde` implementation
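Both derived formats boil down to `bincode` round-trips of `serde` types, roughly as in this sketch; `ExampleTree` is a hypothetical stand-in, and Zebra's real trees plug this into `IntoDisk`/`FromDisk`:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical stand-in for a tree type with a `serde` implementation.
#[derive(Serialize, Deserialize)]
struct ExampleTree {
    nodes: Vec<[u8; 32]>,
}

/// Serialize to the raw bytes stored in the database.
fn to_disk_bytes(tree: &ExampleTree) -> Vec<u8> {
    bincode::serialize(tree).expect("serializing in-memory data never fails")
}

/// Deserialize from raw database bytes.
fn from_disk_bytes(bytes: &[u8]) -> ExampleTree {
    bincode::deserialize(bytes).expect("the database only stores valid trees")
}
```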
The following figure helps with visualizing the address index, which is the most complicated part.
Numbers in brackets are array sizes; bold arrows are compositions (e.g. `TransactionLocation` is the
concatenation of `Height` and `TransactionIndex`); dashed arrows are compositions that are also 1-to-many
maps (e.g. `AddressTransaction` is the concatenation of `AddressLocation` and `TransactionLocation`,
but is also used to map each `AddressLocation` to multiple `TransactionLocation`s).
```mermaid
graph TD;
Address -->|"balance_by_transparent_addr<br/>"| AddressBalance;
AddressBalance ==> Amount;
AddressBalance ==> AddressLocation;
AddressLocation ==> FirstOutputLocation;
AddressLocation -.->|"tx_loc_by_transparent_addr_loc<br/>(AddressTransaction[13])"| TransactionLocation;
TransactionLocation ==> Height;
TransactionLocation ==> TransactionIndex;
OutputLocation -->|utxo_by_out_loc| Output;
OutputLocation ==> TransactionLocation;
OutputLocation ==> OutputIndex;
AddressLocation -.->|"utxo_loc_by_transparent_addr_loc<br/>(AddressUnspentOutput[16])"| OutputLocation;
AddressBalance["AddressBalance[16]"];
Amount["Amount[8]"];
Height["Height[3]"];
Address["Address[21]"];
TransactionIndex["TransactionIndex[2]"];
TransactionLocation["TransactionLocation[5]"];
OutputIndex["OutputIndex[3]"];
OutputLocation["OutputLocation[8]"];
FirstOutputLocation["First OutputLocation[8]"];
AddressLocation["AddressLocation[8]"];
```
### Implementing consensus rules using rocksdb
[rocksdb-consensus-rules]: #rocksdb-consensus-rules
Each column family handles updates differently, based on its specific consensus rules:
- Create:
- Each key-value entry is created once.
- Keys are never deleted, values are never updated.
- Delete:
- Each key-value entry is created once.
- Keys can be deleted, but values are never updated.
- Code called by ReadStateService must ignore deleted keys, or use a read lock.
- TODO: should we prevent re-inserts of keys that have been deleted?
- Update:
- Each key-value entry is created once.
- Keys are never deleted, but values can be updated.
- Code called by ReadStateService must handle old or new values, or use a read lock.
We can't do some kinds of value updates, because they cause RocksDB performance issues:
- Append:
- Keys are never deleted.
- Existing values are never updated.
- Sets of values have additional items appended to the end of the set.
- Code called by ReadStateService must handle shorter or longer sets, or use a read lock.
- Up/Del:
- Keys can be deleted.
- Sets of values have items added or deleted (in any position).
- Code called by ReadStateService must ignore deleted keys and values,
  accept shorter or longer sets, and accept old or new values, or use a read lock.
Avoid using large sets of values as RocksDB keys or values.
### RocksDB read locks
[rocksdb-read-locks]: #rocksdb-read-locks
The read-only ReadStateService needs to handle concurrent writes and deletes of the finalized
column families it reads. It must also handle overlaps between the cached non-finalized `Chain`,
and the current finalized state database.
The StateService uses RocksDB transactions for each block write.
So ReadStateService queries that only access a single key or value will always see
a consistent view of the database.
If a ReadStateService query only uses column families whose entries are only ever created
(`Create` in the Changes column above), it should ignore extra appended values.
Most queries do this by default.
For more complex queries, there are several options:
Reading across multiple column families:
1. Ignore deleted values using custom Rust code
2. Take a [database snapshot](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.snapshot)
Reading a single column family:
3. [`multi_get`](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.multi_get_cf)
4. [`iterator`](https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.iterator_cf)
RocksDB also has read transactions, but they don't seem to be exposed in the Rust crate.
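As a sketch of options 2 and 3 with the `rocksdb` crate (using the `hash_by_height` column family from the table above; the surrounding code is illustrative):

```rust
use rocksdb::DB;

fn consistent_read_examples(db: &DB) {
    let cf = db.cf_handle("hash_by_height").expect("column family exists");

    // Option 2: a snapshot pins a single version of the whole database,
    // so reads through it don't see concurrent block writes.
    let snapshot = db.snapshot();
    let _genesis_hash = snapshot.get_cf(cf, [0u8, 0, 0]).expect("read succeeds");

    // Option 3: multi_get fetches a batch of keys from one column family.
    let heights = [[0u8, 0, 0], [0u8, 0, 1]];
    let _hashes: Vec<_> = db
        .multi_get_cf(heights.iter().map(|key| (cf, key.as_slice())))
        .into_iter()
        .map(|result| result.expect("read succeeds"))
        .collect();
}
```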
### Low-Level Implementation Details
[rocksdb-low-level]: #rocksdb-low-level
RocksDB ignores duplicate puts and deletes, preserving the latest values.
If rejecting duplicate puts or deletes is consensus-critical,
check [`db.get_cf(cf, key)?`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.DBWithThreadMode.html#method.get_cf)
before putting or deleting any values in a batch.
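A minimal sketch of that check, with a hypothetical function name. Note that `get_cf` only sees values already committed to the database, not puts queued earlier in the same unwritten batch:

```rust
use rocksdb::{WriteBatch, DB};

/// Adds a put to `batch` only if `key` is not already in the database,
/// returning an error on consensus-critical duplicates.
fn insert_once(
    db: &DB,
    batch: &mut WriteBatch,
    cf_name: &str,
    key: &[u8],
    value: &[u8],
) -> Result<(), String> {
    let cf = db.cf_handle(cf_name).ok_or("missing column family")?;

    // Check for an existing value before queueing the put.
    if db.get_cf(cf, key).map_err(|e| e.to_string())?.is_some() {
        return Err(format!("duplicate key in {cf_name}"));
    }

    batch.put_cf(cf, key, value);
    Ok(())
}
```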
Currently, these restrictions should be enforced by code review:
- multiple `zs_insert`s are only allowed on Update column families, and
- [`delete_cf`](https://docs.rs/rocksdb/0.16.0/rocksdb/struct.WriteBatch.html#method.delete_cf)
is only allowed on Delete column families.
In future, we could enforce these restrictions by:
- creating traits for Create, Delete, and Update
- doing different checks in `zs_insert` depending on the trait
- wrapping `delete_cf` in a trait, and only implementing that trait for types that use Delete column families.
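A minimal sketch of that trait-based enforcement, where each column family handle is tagged with a marker type; all names here are hypothetical, loosely modelled on Zebra's `zs_insert`:

```rust
use std::marker::PhantomData;

// Marker types for the disciplines in the Changes column above.
struct Create; // each entry is created once: no updates, no deletes
struct Delete; // entries are created once, and may later be deleted
struct Update; // entries may be overwritten, but never deleted

/// Disciplines that allow deleting keys.
trait AllowsDelete {}
impl AllowsDelete for Delete {}

/// Disciplines that allow overwriting existing values.
trait AllowsUpdate {}
impl AllowsUpdate for Update {}

/// A column family handle tagged with its update discipline.
struct TypedColumnFamily<D> {
    name: &'static str,
    discipline: PhantomData<D>,
}

impl<D> TypedColumnFamily<D> {
    /// Inserting a fresh key is allowed for every discipline.
    fn zs_insert_new(&self, _key: &[u8], _value: &[u8]) { /* put_cf */ }
}

impl<D: AllowsUpdate> TypedColumnFamily<D> {
    /// Overwrites only compile for `Update` column families.
    fn zs_update(&self, _key: &[u8], _value: &[u8]) { /* put_cf */ }
}

impl<D: AllowsDelete> TypedColumnFamily<D> {
    /// Deletes only compile for `Delete` column families.
    fn zs_delete(&self, _key: &[u8]) { /* delete_cf */ }
}
```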
As of June 2021, the Rust `rocksdb` crate [ignores the delete callback](https://docs.rs/rocksdb/0.16.0/src/rocksdb/merge_operator.rs.html#83-94),
and merge operators are unreliable (or have undocumented behaviour).
So they should not be used for consensus-critical checks.
### Notes on rocksdb column families
[rocksdb-column-families]: #rocksdb-column-families
- The `hash_by_height` and `height_by_hash` column families provide a bijection between
block heights and block hashes. (Since the rocksdb state only stores finalized
state, they are actually a bijection).
- Similarly, the `tx_loc_by_hash` and `hash_by_tx_loc` column families provide a bijection between
transaction locations and transaction hashes.
- The `block_header_by_height` column family provides a bijection between block
  heights and block header data. There is no corresponding `height_by_block` column
  family: instead, hash the block header, and look the hash up in `height_by_hash`.
  (Since the rocksdb state only stores finalized state, this mapping is actually a bijection.)
Similarly, there are no column families that go from transaction data
to transaction locations: hash the transaction and use `tx_loc_by_hash`.
- Block headers and transactions are stored separately in the database,
so that individual transactions can be accessed efficiently.
Blocks can be re-created on request using the following process
  (sketched in code at the end of this section):
  - Look up `height` in `height_by_hash`
  - Get the block header for `height` from `block_header_by_height`
  - Iterate from `TransactionIndex` 0,
    getting each transaction at `height` from `tx_by_loc`,
    stopping when there are no more transactions in the block
- Block headers are stored by height, not by hash. This has the downside that looking
up a block by hash requires an extra level of indirection. The upside is
that blocks with adjacent heights are adjacent in the database, and many
common access patterns, such as helping a client sync the chain or doing
analysis, access blocks in (potentially sparse) height order. In addition,
the fact that we commit blocks in order means we're writing only to the end
of the rocksdb column family, which may help save space.
- Similarly, transaction data is stored in chain order in `tx_by_loc` and `utxo_by_out_loc`,
and chain order within each vector in `utxo_loc_by_transparent_addr_loc` and
`tx_loc_by_transparent_addr_loc`.
- `TransactionLocation`s are stored as a `(height, index)` pair referencing the
height of the transaction's parent block and the transaction's index in that
block. This would more traditionally be a `(hash, index)` pair, but because
we store blocks by height, storing the height saves one level of indirection.
Transaction hashes can be looked up using `hash_by_tx_loc`.
- Similarly, UTXOs are stored in `utxo_by_out_loc` by `OutputLocation`,
rather than `OutPoint`. `OutPoint`s can be looked up using `tx_loc_by_hash`,
and reconstructed using `hash_by_tx_loc`.
- The `Utxo` type can be constructed from the `OutputLocation` and `Output` data,
`height: OutputLocation.height`, and
`is_coinbase: OutputLocation.transaction_index == 0`
(coinbase transactions are always the first transaction in a block).
- `balance_by_transparent_addr` is the sum of all `utxo_loc_by_transparent_addr_loc`s
that are still in `utxo_by_out_loc`. It is cached to improve performance for
addresses with large UTXO sets. It also stores the `AddressLocation` for each
address, which allows for efficient lookups.
- `utxo_loc_by_transparent_addr_loc` stores unspent transparent output locations
by address. The address location and UTXO location are stored as a RocksDB key,
so they are in chain order, and get good database performance.
This column family also includes the original address location UTXO,
if it has not been spent.
- When a block write deletes a UTXO from `utxo_by_out_loc`,
that UTXO location should be deleted from `utxo_loc_by_transparent_addr_loc`.
The deleted UTXO can be removed efficiently, because the UTXO location is part of the key.
This is an index optimisation, which does not affect query results.
- `tx_loc_by_transparent_addr_loc` stores transaction locations by address.
This list includes transactions containing spent UTXOs.
The address location and transaction location are stored as a RocksDB key,
so they are in chain order, and get good database performance.
This column family also includes the `TransactionLocation`
of the transaction for the `AddressLocation`.
- The `sprout_note_commitment_tree` stores the note commitment tree state
at the tip of the finalized state, for the specific pool. There is always
a single entry. Each tree is stored
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
the Merkle tree nodes as required to insert new items.
For each block committed, the old tree is deleted and a new one is inserted
by its new height.
**TODO:** store the sprout note commitment tree by `()`,
to avoid ReadStateService concurrent write issues.
- The `{sapling, orchard}_note_commitment_tree` stores the note commitment tree
state for every height, for the specific pool. Each tree is stored
as a "Merkle tree frontier" which is basically a (logarithmic) subset of
the Merkle tree nodes as required to insert new items.
- `history_tree` stores the ZIP-221 history tree state at the tip of the finalized
state. There is always a single entry for it. The tree is stored as the set of "peaks"
of the "Merkle mountain range" tree structure, which is what is required to
insert new items.
**TODO:** store the history tree by `()`, to avoid ReadStateService concurrent write issues.
- Each `*_anchors` stores the anchor (the root of a Merkle tree) of the note commitment
tree of a certain block. We only use the keys since we just need the set of anchors,
regardless of where they come from. The exception is `sprout_anchors` which also maps
the anchor to the matching note commitment tree. This is required to support interstitial
treestates, which are unique to Sprout.
**TODO:** store the `Root` hash in `sprout_note_commitment_tree`, and use it to look up the
note commitment tree. This de-duplicates tree state data. But we currently only store one sprout tree by height.
- The value pools are only stored for the finalized tip.
- We do not store the cumulative work for the finalized chain,
because the finalized work is equal for all non-finalized chains.
So the additional non-finalized work can be used to calculate the relative chain order,
and choose the best chain.
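The block re-creation process above, as a minimal sketch using the real column family names but raw bytes; Zebra's real code decodes these into typed block structures:

```rust
use rocksdb::DB;

/// Re-creates a block as its raw serialized parts: the header bytes,
/// followed by each transaction's bytes in `TransactionIndex` order.
fn recreate_block_parts(db: &DB, block_hash: &[u8; 32]) -> Option<Vec<Vec<u8>>> {
    let height_by_hash = db.cf_handle("height_by_hash")?;
    let header_by_height = db.cf_handle("block_header_by_height")?;
    let tx_by_loc = db.cf_handle("tx_by_loc")?;

    // 1. Look up the 3-byte big-endian height key for this hash.
    let height = db.get_cf(height_by_hash, block_hash).ok()??;

    // 2. Fetch the serialized block header at that height.
    let header = db.get_cf(header_by_height, &height).ok()??;
    let mut parts = vec![header];

    // 3. Iterate from TransactionIndex 0 until a transaction is missing.
    for tx_index in 0u16.. {
        // TransactionLocation = Height || TransactionIndex (5 bytes).
        let mut tx_loc = height.clone();
        tx_loc.extend_from_slice(&tx_index.to_be_bytes());

        match db.get_cf(tx_by_loc, &tx_loc).ok()? {
            Some(tx) => parts.push(tx),
            // No more transactions in this block.
            None => break,
        }
    }

    Some(parts)
}
```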


@ -218,6 +218,9 @@ impl DbFormatChange {
///
/// If `cancel_receiver` gets a message, or its sender is dropped,
/// the format change stops running early.
///
/// See the format upgrade design docs for more details:
/// <https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#design>
//
// New format upgrades must be added to the *end* of this method.
fn apply_format_upgrade(
@ -259,8 +262,6 @@ impl DbFormatChange {
};
// Example format change.
//
// TODO: link to format upgrade instructions doc here
// Check if we need to do this upgrade.
let database_format_add_format_change_task =