docs: add `README` for `snapshots` package (#10120)
## Description

Closes: #10085

Adds some broad documentation for the `snapshots` package.

---

### Author Checklist

*All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.*

I have...

- [x] included the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [x] added `!` to the type prefix if API or client breaking change
- [x] targeted the correct branch (see [PR Targeting](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#pr-targeting))
- [x] provided a link to the relevant issue or specification
- [x] followed the guidelines for [building modules](https://github.com/cosmos/cosmos-sdk/blob/master/docs/building-modules)
- [x] included the necessary unit and integration [tests](https://github.com/cosmos/cosmos-sdk/blob/master/CONTRIBUTING.md#testing)
- [x] added a changelog entry to `CHANGELOG.md`
- [x] included comments for [documenting Go code](https://blog.golang.org/godoc)
- [x] updated the relevant documentation or specification
- [x] reviewed "Files changed" and left comments if necessary
- [x] confirmed all CI checks have passed

### Reviewers Checklist

*All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.*

I have...

- [ ] confirmed the correct [type prefix](https://github.com/commitizen/conventional-commit-types/blob/v3.0.0/index.json) in the PR title
- [ ] confirmed `!` in the type prefix if API or client breaking change
- [ ] confirmed all author checklist items have been addressed
- [ ] reviewed state machine logic
- [ ] reviewed API design and naming
- [ ] reviewed documentation is accurate
- [ ] reviewed tests and test coverage
- [ ] manually tested (if applicable)
parent cfc8b4702f
commit fffc2a6ede

# State Sync Snapshotting

The `snapshots` package implements automatic support for Tendermint state sync
in Cosmos SDK-based applications. State sync allows a new node joining a network
to simply fetch a recent snapshot of the application state instead of fetching
and applying all historical blocks. This can reduce the time needed to join the
network by several orders of magnitude (e.g. weeks to minutes), but the node
will not contain historical data from previous heights.

This document describes the Cosmos SDK implementation of the ABCI state sync
interface. For more information on Tendermint state sync in general, see:

* [Tendermint Core State Sync for Developers](https://medium.com/tendermint/tendermint-core-state-sync-for-developers-70a96ba3ee35)
* [ABCI State Sync Spec](https://docs.tendermint.com/master/spec/abci/apps.html#state-sync)
* [ABCI State Sync Method/Type Reference](https://docs.tendermint.com/master/spec/abci/abci.html#state-sync)

## Overview

For an overview of how Cosmos SDK state sync is set up and configured by
developers and end-users, see the
[Cosmos SDK State Sync Guide](https://blog.cosmos.network/cosmos-sdk-state-sync-guide-99e4cf43be2f).

Briefly, the Cosmos SDK takes state snapshots at regular height intervals given
by `state-sync.snapshot-interval` and stores them as binary files in the
filesystem under `<node_home>/data/snapshots/`, with metadata in a LevelDB database
`<node_home>/data/snapshots/metadata.db`. The number of recent snapshots to keep is given by
`state-sync.snapshot-keep-recent`.

Snapshots are taken asynchronously, i.e. new blocks will be applied concurrently
with snapshots being taken. This is possible because IAVL supports querying
immutable historical heights. However, this requires `state-sync.snapshot-interval`
to be a multiple of `pruning-keep-every`, to prevent a height from being removed
while it is being snapshotted.

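The interval constraint above can be checked mechanically. The following is a minimal sketch rather than SDK code: `validateIntervals` is a hypothetical helper name, and treating a value of `0` as "feature disabled" is an assumption made for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// validateIntervals is a hypothetical helper (not an SDK function). It returns
// an error when snapshotInterval is not a multiple of keepEvery.
// Assumption: a value of 0 means the corresponding feature is disabled.
func validateIntervals(snapshotInterval, keepEvery uint64) error {
	if snapshotInterval == 0 || keepEvery == 0 {
		return nil
	}
	if snapshotInterval%keepEvery != 0 {
		return errors.New("snapshot interval must be a multiple of pruning-keep-every")
	}
	return nil
}

func main() {
	fmt.Println(validateIntervals(1000, 500)) // multiple of keep-every: accepted
	fmt.Println(validateIntervals(1000, 300)) // not a multiple: rejected
}
```
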
When a remote node is state syncing, Tendermint calls the ABCI method
`ListSnapshots` to list available local snapshots and `LoadSnapshotChunk` to
load a binary snapshot chunk. When the local node is being state synced,
Tendermint calls `OfferSnapshot` to offer a discovered remote snapshot to the
local application and `ApplySnapshotChunk` to apply a binary snapshot chunk to
the local application. See the resources linked above for more details on these
methods and how Tendermint performs state sync.

The Cosmos SDK does not currently do any incremental verification of snapshots
during restoration, i.e. only after the entire snapshot has been restored will
Tendermint compare the app hash against the trusted hash from the chain. Cosmos
SDK snapshots and chunks do contain hashes as checksums to guard against IO
corruption and non-determinism, but these are not tied to the chain state and
can be trivially forged by an adversary. This was considered out of scope for
the initial implementation, but can be added later without changes to the
ABCI state sync protocol.

## Snapshot Metadata

The ABCI Protobuf type for a snapshot is listed below (refer to the ABCI spec
for field details):

```protobuf
message Snapshot {
  uint64 height   = 1;  // The height at which the snapshot was taken
  uint32 format   = 2;  // The application-specific snapshot format
  uint32 chunks   = 3;  // Number of chunks in the snapshot
  bytes  hash     = 4;  // Arbitrary snapshot hash, equal only if identical
  bytes  metadata = 5;  // Arbitrary application metadata
}
```

Because the `metadata` field is application-specific, the Cosmos SDK uses a
similar type `cosmos.base.snapshots.v1beta1.Snapshot` with its own metadata
representation:

```protobuf
// Snapshot contains Tendermint state sync snapshot info.
message Snapshot {
  uint64   height   = 1;
  uint32   format   = 2;
  uint32   chunks   = 3;
  bytes    hash     = 4;
  Metadata metadata = 5 [(gogoproto.nullable) = false];
}

// Metadata contains SDK-specific snapshot metadata.
message Metadata {
  repeated bytes chunk_hashes = 1; // SHA-256 chunk hashes
}
```

The `format` is currently `1`, defined in `snapshots.types.CurrentFormat`. This
must be increased whenever the binary snapshot format changes, and it may be
useful to support past formats in newer versions.

The `hash` is a SHA-256 hash of the entire binary snapshot, used to guard
against IO corruption and non-determinism across nodes. Note that this is not
tied to the chain state, and can be trivially forged (but Tendermint will always
compare the final app hash against the chain app hash). Similarly, the
`chunk_hashes` are SHA-256 checksums of each binary chunk.

The `metadata` field is Protobuf-serialized before it is placed into the ABCI
snapshot.

## Snapshot Format

The current version `1` snapshot format is a zlib-compressed, length-prefixed
Protobuf stream of `cosmos.base.store.v1beta1.SnapshotItem` messages, split into
chunks at exact 10 MB byte boundaries.

```protobuf
// SnapshotItem is an item contained in a rootmulti.Store snapshot.
message SnapshotItem {
  // item is the specific type of snapshot item.
  oneof item {
    SnapshotStoreItem store = 1;
    SnapshotIAVLItem  iavl  = 2 [(gogoproto.customname) = "IAVL"];
  }
}

// SnapshotStoreItem contains metadata about a snapshotted store.
message SnapshotStoreItem {
  string name = 1;
}

// SnapshotIAVLItem is an exported IAVL node.
message SnapshotIAVLItem {
  bytes key     = 1;
  bytes value   = 2;
  int64 version = 3;
  int32 height  = 4;
}
```

Snapshots are generated by `rootmulti.Store.Snapshot()` as follows:

1. Set up a `protoio.NewDelimitedWriter` that writes length-prefixed serialized
   `SnapshotItem` Protobuf messages.
    1. Iterate over each IAVL store in lexicographical order by store name.
    2. Emit a `SnapshotStoreItem` containing the store name.
    3. Start an IAVL export for the store using
       [`iavl.ImmutableTree.Export()`](https://pkg.go.dev/github.com/tendermint/iavl#ImmutableTree.Export).
    4. Iterate over each IAVL node.
    5. Emit a `SnapshotIAVLItem` for the IAVL node.
2. Pass the serialized Protobuf output stream to a zlib compression writer.
3. Split the zlib output stream into chunks at exact 10 MB boundaries.

Snapshots are restored via `rootmulti.Store.Restore()` as the inverse of the above, using
[`iavl.MutableTree.Import()`](https://pkg.go.dev/github.com/tendermint/iavl#MutableTree.Import)
to reconstruct each IAVL tree.

## Snapshot Storage

Snapshot storage is managed by `snapshots.Store`, with metadata in a `db.DB`
database and binary chunks in the filesystem. Note that this is only used to
store locally taken snapshots that are being offered to other nodes. When the
local node is being state synced, Tendermint will take care of buffering and
storing incoming snapshot chunks before they are applied to the application.

Metadata is generally stored in a LevelDB database at
`<node_home>/data/snapshots/metadata.db`. It contains serialized
`cosmos.base.snapshots.v1beta1.Snapshot` Protobuf messages with a key given by
the concatenation of a key prefix, the big-endian height, and the big-endian
format. Chunk data is stored as regular files under
`<node_home>/data/snapshots/<height>/<format>/<chunk>`.

The `snapshots.Store` API is based on streaming IO, and integrates easily with
the `snapshots.types.Snapshotter` snapshot/restore interface implemented by
`rootmulti.Store`. The `Store.Save()` method stores a snapshot given as a
`<-chan io.ReadCloser` channel of binary chunk streams, and `Store.Load()` loads
the snapshot as a channel of binary chunk streams. These are the same stream
types used by `Snapshotter.Snapshot()` and `Snapshotter.Restore()` to take and
restore snapshots using streaming IO.

The store also provides many other methods such as `List()` to list stored
snapshots, `LoadChunk()` to load a single snapshot chunk, and `Prune()` to prune
old snapshots.

## Taking Snapshots

`snapshots.Manager` is a high-level snapshot manager that integrates a
`snapshots.types.Snapshotter` (i.e. the `rootmulti.Store` snapshot
functionality) and a `snapshots.Store`, providing an API that maps easily onto
the ABCI state sync API. The `Manager` will also make sure only one operation
is in progress at a time, e.g. to prevent multiple snapshots being taken
concurrently.

During `BaseApp.Commit`, once a state transition has been committed, the height
is checked against the `state-sync.snapshot-interval` setting. If the committed
height should be snapshotted, a goroutine `BaseApp.snapshot()` is spawned that
calls `snapshots.Manager.Create()` to create the snapshot.

`Manager.Create()` will do some basic pre-flight checks, and then start
generating a snapshot by calling `rootmulti.Store.Snapshot()`. The chunk stream
is passed into `snapshots.Store.Save()`, which stores the chunks in the
filesystem and records the snapshot metadata in the snapshot database.

Once the snapshot has been generated, `BaseApp.snapshot()` then removes any
old snapshots based on the `state-sync.snapshot-keep-recent` setting.

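The control flow of this section (interval check, single-operation guard, keep-recent pruning) can be modeled with a toy manager. Everything here is hypothetical: the `manager` type, its methods, and `shouldSnapshot` are illustrative names, the snapshot itself is reduced to recording a height, and treating a keep-recent value of `0` as "keep everything" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

var errBusy = errors.New("a snapshot operation is already in progress")

// manager is a toy stand-in for a snapshot manager.
type manager struct {
	mtx     sync.Mutex
	busy    bool     // true while an operation is in progress
	heights []uint64 // heights of stored snapshots
}

// begin marks an operation as started, or fails if one is already running.
func (m *manager) begin() error {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if m.busy {
		return errBusy
	}
	m.busy = true
	return nil
}

// end marks the current operation as finished.
func (m *manager) end() {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	m.busy = false
}

// create records a snapshot at the given height, one operation at a time.
func (m *manager) create(height uint64) error {
	if err := m.begin(); err != nil {
		return err
	}
	defer m.end()
	// Placeholder for the real work of generating and saving a snapshot.
	m.mtx.Lock()
	m.heights = append(m.heights, height)
	m.mtx.Unlock()
	return nil
}

// prune keeps only the keepRecent most recent snapshots and returns how many
// were removed. Assumption: keepRecent == 0 keeps everything.
func (m *manager) prune(keepRecent int) int {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	if keepRecent <= 0 || len(m.heights) <= keepRecent {
		return 0
	}
	sort.Slice(m.heights, func(i, j int) bool { return m.heights[i] > m.heights[j] })
	pruned := len(m.heights) - keepRecent
	m.heights = m.heights[:keepRecent]
	return pruned
}

// shouldSnapshot reports whether a committed height falls on the interval.
func shouldSnapshot(height, interval uint64) bool {
	return interval > 0 && height%interval == 0
}

func main() {
	m := &manager{}
	for height := uint64(1); height <= 50; height++ {
		if !shouldSnapshot(height, 10) {
			continue
		}
		// The text above describes spawning a goroutine at this point; the
		// sketch runs inline to keep the output deterministic.
		if err := m.create(height); err != nil {
			fmt.Println("skipping snapshot:", err)
			continue
		}
		m.prune(2)
	}
	fmt.Println(m.heights) // [50 40]
}
```
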
## Serving Snapshots

When a remote node is discovering snapshots for state sync, Tendermint will
call the `ListSnapshots` ABCI method to list the snapshots present on the
local node. This is dispatched to `snapshots.Manager.List()`, which in turn
dispatches to `snapshots.Store.List()`.

When a remote node is fetching snapshot chunks during state sync, Tendermint
will call the `LoadSnapshotChunk` ABCI method to fetch a chunk from the local
node. This dispatches to `snapshots.Manager.LoadChunk()`, which in turn
dispatches to `snapshots.Store.LoadChunk()`.

## Restoring Snapshots

When the operator has configured the local Tendermint node to run state sync
(see the resources listed in the introduction for details on Tendermint state
sync), it will discover snapshots across the P2P network and offer their
metadata in turn to the local application via the `OfferSnapshot` ABCI call.

`BaseApp.OfferSnapshot()` attempts to start a restore operation by calling
`snapshots.Manager.Restore()`. This may fail, e.g. if the snapshot format is
unknown (it may have been generated by a different version of the Cosmos SDK),
in which case Tendermint will offer other discovered snapshots.

If the snapshot is accepted, `Manager.Restore()` will record that a restore
operation is in progress, and spawn a separate goroutine that runs a synchronous
`rootmulti.Store.Restore()` snapshot restoration which will be fed snapshot
chunks until it is complete.

Tendermint will then start fetching and buffering chunks, providing them in
order via ABCI `ApplySnapshotChunk` calls. These dispatch to
`Manager.RestoreChunk()`, which passes the chunks to the ongoing restore
process, checking if errors have been encountered yet (e.g. due to checksum
mismatches or invalid IAVL data). Once the final chunk is passed,
`Manager.RestoreChunk()` will wait for the restore process to complete before
returning.

Once the restore is completed, Tendermint will then make the `Info` ABCI
call to fetch the app hash, and compare this against the trusted chain app
hash at the snapshot height to verify the restored state. If it matches,
Tendermint goes on to process blocks.
