2021-09-16 08:43:03 -07:00
|
|
|
# State Sync Snapshotting
|
|
|
|
|
|
|
|
The `snapshots` package implements automatic support for Tendermint state sync
|
|
|
|
in Cosmos SDK-based applications. State sync allows a new node joining a network
|
|
|
|
to simply fetch a recent snapshot of the application state instead of fetching
|
|
|
|
and applying all historical blocks. This can reduce the time needed to join the
|
|
|
|
network by several orders of magnitude (e.g. weeks to minutes), but the node
|
|
|
|
will not contain historical data from previous heights.
|
|
|
|
|
|
|
|
This document describes the Cosmos SDK implementation of the ABCI state sync
|
|
|
|
interface, for more information on Tendermint state sync in general see:
|
|
|
|
|
|
|
|
* [Tendermint Core State Sync for Developers](https://medium.com/tendermint/tendermint-core-state-sync-for-developers-70a96ba3ee35)
|
|
|
|
* [ABCI State Sync Spec](https://docs.tendermint.com/master/spec/abci/apps.html#state-sync)
|
|
|
|
* [ABCI State Sync Method/Type Reference](https://docs.tendermint.com/master/spec/abci/abci.html#state-sync)
|
|
|
|
|
|
|
|
## Overview
|
|
|
|
|
|
|
|
For an overview of how Cosmos SDK state sync is set up and configured by
|
|
|
|
developers and end-users, see the
|
|
|
|
[Cosmos SDK State Sync Guide](https://blog.cosmos.network/cosmos-sdk-state-sync-guide-99e4cf43be2f).
|
|
|
|
|
|
|
|
Briefly, the Cosmos SDK takes state snapshots at regular height intervals given
|
|
|
|
by `state-sync.snapshot-interval` and stores them as binary files in the
|
|
|
|
filesystem under `<node_home>/data/snapshots/`, with metadata in a LevelDB database
|
|
|
|
`<node_home>/data/snapshots/metadata.db`. The number of recent snapshots to keep are given by
|
|
|
|
`state-sync.snapshot-keep-recent`.
|
|
|
|
|
|
|
|
Snapshots are taken asynchronously, i.e. new blocks will be applied concurrently
|
|
|
|
with snapshots being taken. This is possible because IAVL supports querying
|
|
|
|
immutable historical heights. However, this requires `state-sync.snapshot-interval`
|
|
|
|
to be a multiple of `pruning-keep-every`, to prevent a height from being removed
|
|
|
|
while it is being snapshotted.
|
|
|
|
|
|
|
|
When a remote node is state syncing, Tendermint calls the ABCI method
|
|
|
|
`ListSnapshots` to list available local snapshots and `LoadSnapshotChunk` to
|
|
|
|
load a binary snapshot chunk. When the local node is being state synced,
|
|
|
|
Tendermint calls `OfferSnapshot` to offer a discovered remote snapshot to the
|
|
|
|
local application and `ApplySnapshotChunk` to apply a binary snapshot chunk to
|
|
|
|
the local application. See the resources linked above for more details on these
|
|
|
|
methods and how Tendermint performs state sync.
|
|
|
|
|
|
|
|
The Cosmos SDK does not currently do any incremental verification of snapshots
|
|
|
|
during restoration, i.e. only after the entire snapshot has been restored will
|
|
|
|
Tendermint compare the app hash against the trusted hash from the chain. Cosmos
|
|
|
|
SDK snapshots and chunks do contain hashes as checksums to guard against IO
|
|
|
|
corruption and non-determinism, but these are not tied to the chain state and
|
|
|
|
can be trivially forged by an adversary. This was considered out of scope for
|
|
|
|
the initial implementation, but can be added later without changes to the
|
|
|
|
ABCI state sync protocol.
|
|
|
|
|
|
|
|
## Snapshot Metadata
|
|
|
|
|
|
|
|
The ABCI Protobuf type for a snapshot is listed below (refer to the ABCI spec
|
|
|
|
for field details):
|
|
|
|
|
|
|
|
```protobuf
|
|
|
|
message Snapshot {
|
|
|
|
uint64 height = 1; // The height at which the snapshot was taken
|
|
|
|
uint32 format = 2; // The application-specific snapshot format
|
|
|
|
uint32 chunks = 3; // Number of chunks in the snapshot
|
|
|
|
bytes hash = 4; // Arbitrary snapshot hash, equal only if identical
|
|
|
|
bytes metadata = 5; // Arbitrary application metadata
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
Because the `metadata` field is application-specific, the Cosmos SDK uses a
|
|
|
|
similar type `cosmos.base.snapshots.v1beta1.Snapshot` with its own metadata
|
|
|
|
representation:
|
|
|
|
|
|
|
|
```protobuf
|
|
|
|
// Snapshot contains Tendermint state sync snapshot info.
|
|
|
|
message Snapshot {
|
|
|
|
uint64 height = 1;
|
|
|
|
uint32 format = 2;
|
|
|
|
uint32 chunks = 3;
|
|
|
|
bytes hash = 4;
|
|
|
|
Metadata metadata = 5 [(gogoproto.nullable) = false];
|
|
|
|
}
|
|
|
|
|
|
|
|
// Metadata contains SDK-specific snapshot metadata.
|
|
|
|
message Metadata {
|
|
|
|
repeated bytes chunk_hashes = 1; // SHA-256 chunk hashes
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
The `format` is currently `1`, defined in `snapshots.types.CurrentFormat`. This
|
|
|
|
must be increased whenever the binary snapshot format changes, and it may be
|
|
|
|
useful to support past formats in newer versions.
|
|
|
|
|
|
|
|
The `hash` is a SHA-256 hash of the entire binary snapshot, used to guard
|
|
|
|
against IO corruption and non-determinism across nodes. Note that this is not
|
|
|
|
tied to the chain state, and can be trivially forged (but Tendermint will always
|
|
|
|
compare the final app hash against the chain app hash). Similarly, the
|
|
|
|
`chunk_hashes` are SHA-256 checksums of each binary chunk.
|
|
|
|
|
|
|
|
The `metadata` field is Protobuf-serialized before it is placed into the ABCI
|
|
|
|
snapshot.
|
|
|
|
|
|
|
|
## Snapshot Format
|
|
|
|
|
|
|
|
The current version `1` snapshot format is a zlib-compressed, length-prefixed
|
|
|
|
Protobuf stream of `cosmos.base.store.v1beta1.SnapshotItem` messages, split into
|
|
|
|
chunks at exact 10 MB byte boundaries.
|
|
|
|
|
|
|
|
```protobuf
|
|
|
|
// SnapshotItem is an item contained in a rootmulti.Store snapshot.
|
|
|
|
message SnapshotItem {
|
|
|
|
// item is the specific type of snapshot item.
|
|
|
|
oneof item {
|
|
|
|
SnapshotStoreItem store = 1;
|
|
|
|
SnapshotIAVLItem iavl = 2 [(gogoproto.customname) = "IAVL"];
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// SnapshotStoreItem contains metadata about a snapshotted store.
|
|
|
|
message SnapshotStoreItem {
|
|
|
|
string name = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
// SnapshotIAVLItem is an exported IAVL node.
|
|
|
|
message SnapshotIAVLItem {
|
|
|
|
bytes key = 1;
|
|
|
|
bytes value = 2;
|
|
|
|
int64 version = 3;
|
|
|
|
int32 height = 4;
|
|
|
|
}
|
|
|
|
```
|
|
|
|
|
|
|
|
Snapshots are generated by `rootmulti.Store.Snapshot()` as follows:
|
|
|
|
|
|
|
|
1. Set up a `protoio.NewDelimitedWriter` that writes length-prefixed serialized
|
|
|
|
`SnapshotItem` Protobuf messages.
|
|
|
|
1. Iterate over each IAVL store in lexicographical order by store name.
|
|
|
|
2. Emit a `SnapshotStoreItem` containing the store name.
|
|
|
|
3. Start an IAVL export for the store using
|
|
|
|
[`iavl.ImmutableTree.Export()`](https://pkg.go.dev/github.com/tendermint/iavl#ImmutableTree.Export).
|
|
|
|
4. Iterate over each IAVL node.
|
|
|
|
5. Emit a `SnapshotIAVLItem` for the IAVL node.
|
|
|
|
2. Pass the serialized Protobuf output stream to a zlib compression writer.
|
|
|
|
3. Split the zlib output stream into chunks at exactly every 10th megabyte.
|
|
|
|
|
|
|
|
Snapshots are restored via `rootmulti.Store.Restore()` as the inverse of the above, using
|
|
|
|
[`iavl.MutableTree.Import()`](https://pkg.go.dev/github.com/tendermint/iavl#MutableTree.Import)
|
|
|
|
to reconstruct each IAVL tree.
|
|
|
|
|
|
|
|
## Snapshot Storage
|
|
|
|
|
|
|
|
Snapshot storage is managed by `snapshots.Store`, with metadata in a `db.DB`
|
|
|
|
database and binary chunks in the filesystem. Note that this is only used to
|
|
|
|
store locally taken snapshots that are being offered to other nodes. When the
|
|
|
|
local node is being state synced, Tendermint will take care of buffering and
|
|
|
|
storing incoming snapshot chunks before they are applied to the application.
|
|
|
|
|
|
|
|
Metadata is generally stored in a LevelDB database at
|
|
|
|
`<node_home>/data/snapshots/metadata.db`. It contains serialized
|
|
|
|
`cosmos.base.snapshots.v1beta1.Snapshot` Protobuf messages with a key given by
|
|
|
|
the concatenation of a key prefix, the big-endian height, and the big-endian
|
|
|
|
format. Chunk data is stored as regular files under
|
|
|
|
`<node_home>/data/snapshots/<height>/<format>/<chunk>`.
|
|
|
|
|
|
|
|
The `snapshots.Store` API is based on streaming IO, and integrates easily with
|
|
|
|
the `snapshots.types.Snapshotter` snapshot/restore interface implemented by
|
|
|
|
`rootmulti.Store`. The `Store.Save()` method stores a snapshot given as a
|
|
|
|
`<- chan io.ReadCloser` channel of binary chunk streams, and `Store.Load()` loads
|
|
|
|
the snapshot as a channel of binary chunk streams -- the same stream types used
|
|
|
|
by `Snapshotter.Snapshot()` and `Snapshotter.Restore()` to take and restore
|
|
|
|
snapshots using streaming IO.
|
|
|
|
|
|
|
|
The store also provides many other methods such as `List()` to list stored
|
|
|
|
snapshots, `LoadChunk()` to load a single snapshot chunk, and `Prune()` to prune
|
|
|
|
old snapshots.
|
|
|
|
|
|
|
|
## Taking Snapshots
|
|
|
|
|
|
|
|
`snapshots.Manager` is a high-level snapshot manager that integrates a
|
|
|
|
`snapshots.types.Snapshotter` (i.e. the `rootmulti.Store` snapshot
|
|
|
|
functionality) and a `snapshots.Store`, providing an API that maps easily onto
|
|
|
|
the ABCI state sync API. The `Manager` will also make sure only one operation
|
|
|
|
is in progress at a time, e.g. to prevent multiple snapshots being taken
|
|
|
|
concurrently.
|
|
|
|
|
|
|
|
During `BaseApp.Commit`, once a state transition has been committed, the height
|
|
|
|
is checked against the `state-sync.snapshot-interval` setting. If the committed
|
|
|
|
height should be snapshotted, a goroutine `BaseApp.snapshot()` is spawned that
|
|
|
|
calls `snapshots.Manager.Create()` to create the snapshot.
|
|
|
|
|
|
|
|
`Manager.Create()` will do some basic pre-flight checks, and then start
|
|
|
|
generating a snapshot by calling `rootmulti.Store.Snapshot()`. The chunk stream
|
|
|
|
is passed into `snapshots.Store.Save()`, which stores the chunks in the
|
|
|
|
filesystem and records the snapshot metadata in the snapshot database.
|
|
|
|
|
|
|
|
Once the snapshot has been generated, `BaseApp.snapshot()` then removes any
|
|
|
|
old snapshots based on the `state-sync.snapshot-keep-recent` setting.
|
|
|
|
|
|
|
|
## Serving Snapshots
|
|
|
|
|
|
|
|
When a remote node is discovering snapshots for state sync, Tendermint will
|
|
|
|
call the `ListSnapshots` ABCI method to list the snapshots present on the
|
|
|
|
local node. This is dispatched to `snapshots.Manager.List()`, which in turn
|
|
|
|
dispatches to `snapshots.Store.List()`.
|
|
|
|
|
|
|
|
When a remote node is fetching snapshot chunks during state sync, Tendermint
|
|
|
|
will call the `LoadSnapshotChunk` ABCI method to fetch a chunk from the local
|
|
|
|
node. This dispatches to `snapshots.Manager.LoadChunk()`, which in turn
|
|
|
|
dispatches to `snapshots.Store.LoadChunk()`.
|
|
|
|
|
|
|
|
## Restoring Snapshots
|
|
|
|
|
|
|
|
When the operator has configured the local Tendermint node to run state sync
|
|
|
|
(see the resources listed in the introduction for details on Tendermint state
|
|
|
|
sync), it will discover snapshots across the P2P network and offer their
|
|
|
|
metadata in turn to the local application via the `OfferSnapshot` ABCI call.
|
|
|
|
|
|
|
|
`BaseApp.OfferSnapshot()` attempts to start a restore operation by calling
|
|
|
|
`snapshots.Manager.Restore()`. This may fail, e.g. if the snapshot format is
|
|
|
|
unknown (it may have been generated by a different version of the Cosmos SDK),
|
2021-10-30 06:43:04 -07:00
|
|
|
in which case Tendermint will offer other discovered snapshots.
|
2021-09-16 08:43:03 -07:00
|
|
|
|
|
|
|
If the snapshot is accepted, `Manager.Restore()` will record that a restore
|
|
|
|
operation is in progress, and spawn a separate goroutine that runs a synchronous
|
|
|
|
`rootmulti.Store.Restore()` snapshot restoration which will be fed snapshot
|
|
|
|
chunks until it is complete.
|
|
|
|
|
|
|
|
Tendermint will then start fetching and buffering chunks, providing them in
|
|
|
|
order via ABCI `ApplySnapshotChunk` calls. These dispatch to
|
|
|
|
`Manager.RestoreChunk()`, which passes the chunks to the ongoing restore
|
|
|
|
process, checking if errors have been encountered yet (e.g. due to checksum
|
|
|
|
mismatches or invalid IAVL data). Once the final chunk is passed,
|
|
|
|
`Manager.RestoreChunk()` will wait for the restore process to complete before
|
|
|
|
returning.
|
|
|
|
|
|
|
|
Once the restore is completed, Tendermint will go on to call the `Info` ABCI
|
|
|
|
call to fetch the app hash, and compare this against the trusted chain app
|
|
|
|
hash at the snapshot height to verify the restored state. If it matches,
|
|
|
|
Tendermint goes on to process blocks.
|