Proposal for RocksDB Compaction optimized for Solana BlockStore (#21358)
---
title: Optimize RocksDB Compaction for Solana BlockStore
---

This document explores RocksDB-based solutions for the Solana BlockStore
issues described in [#16234](https://github.com/solana-labs/solana/issues/16234).

## Background

Solana uses RocksDB as the underlying storage for its blockstore. RocksDB
is an LSM-based key-value store which consists of multiple logical levels,
and data in each level is sorted by key. In such a leveled structure, each
read hits at most one file per level, so the number of levels determines
the read amplification. All other mutable operations, including writes,
deletions, and merge operations, are implemented as append operations and
will eventually create more logical levels, which makes read performance
worse over time.

To keep reads performant, RocksDB periodically reduces the number of
logical levels by running compaction in the background, where part of one
level or multiple whole levels are merged into one. This increases the
number of disk I/Os (write amplification) and the storage (space
amplification) required for storing each entry. In other words, RocksDB
uses compactions to balance [write, space, and read amplifications](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html).

As different workloads have different requirements, RocksDB makes its
options highly configurable. However, this also means its default settings
might not always be suitable. This document focuses on RocksDB's compaction
optimization for Solana's Blockstore.

## Problems

As mentioned in [#16234](https://github.com/solana-labs/solana/issues/16234),
there are several issues in Solana's BlockStore, which runs RocksDB with
level compaction. Here is a quick summary of the issues:

### Long Write Stalls on Shred Insertions

Recall that RocksDB periodically runs background compactions in order to
keep the number of logical levels small enough to reach the target read
amplification. However, when the background compactions cannot keep up
with the write rate, the number of logical levels will eventually exceed
the configured limit. In that case, RocksDB will rate-limit or stop all
writes when the soft or hard threshold is reached.
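
The soft / hard threshold behavior can be sketched as a small decision
function. This is an illustrative model only, not RocksDB's actual
internals; the trigger values in the usage example (20 and 36 level-0
files) mirror RocksDB's default `level0_slowdown_writes_trigger` and
`level0_stop_writes_trigger`, but a real deployment should read them from
its own options.

```rust
// Illustrative model of the write-stall decision: once the number of
// level-0 files crosses the soft limit, writes are rate-limited; once
// it crosses the hard limit, writes stop entirely.
#[derive(Debug, PartialEq)]
enum WriteState {
    Normal,   // accept writes at full speed
    Slowdown, // soft limit reached: rate-limit writes
    Stop,     // hard limit reached: stop writes
}

fn write_state(l0_files: usize, soft_limit: usize, hard_limit: usize) -> WriteState {
    if l0_files >= hard_limit {
        WriteState::Stop
    } else if l0_files >= soft_limit {
        WriteState::Slowdown
    } else {
        WriteState::Normal
    }
}
```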

In [#14586](https://github.com/solana-labs/solana/issues/14586), it is reported
that write stalls in Solana's use case can last 40 minutes. It is also
reported in [#16234](https://github.com/solana-labs/solana/issues/16234) that
writes are slowed down, indicating the underlying RocksDB instance has
reached the soft limit for write stalls.

### Deletions are not Processed in Time

Deletions are processed in the same way as other write operations in RocksDB:
multiple entries associated with the same key are merged or deleted during
background compaction. Although deleted entries are not visible from the
read side right after the deletion is issued, the deleted entries (both the
original data entries and their deletion entries) still occupy disk storage.
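
This append-only deletion behavior can be illustrated with a toy model.
All names here (`Log`, `Entry`, `compact`) are hypothetical; this is a
sketch of the general LSM mechanism, not Blockstore's code.

```rust
use std::collections::BTreeMap;

// Minimal model of LSM deletion: a delete appends a tombstone entry, so
// both the original value and the tombstone occupy space until a
// compaction merges them away. Purely illustrative.
#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Put(Vec<u8>),
    Tombstone,
}

struct Log {
    entries: Vec<(u64, Entry)>, // append-only (key, entry) pairs
}

impl Log {
    fn new() -> Self {
        Log { entries: Vec::new() }
    }
    fn put(&mut self, key: u64, value: Vec<u8>) {
        self.entries.push((key, Entry::Put(value)));
    }
    fn delete(&mut self, key: u64) {
        self.entries.push((key, Entry::Tombstone));
    }
    // Space used before compaction: every appended entry still exists.
    fn entries_on_disk(&self) -> usize {
        self.entries.len()
    }
    // Compaction keeps only the latest entry per key and drops tombstones
    // together with the values they shadow.
    fn compact(&mut self) {
        let mut latest: BTreeMap<u64, Entry> = BTreeMap::new();
        for (k, e) in self.entries.drain(..) {
            latest.insert(k, e);
        }
        self.entries = latest
            .into_iter()
            .filter(|(_, e)| !matches!(e, Entry::Tombstone))
            .collect();
    }
}
```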

TBD: explain how write-key order makes this worse

### High Write I/O from Write Amplification

In addition to write stalls, it is also observed that compactions cause
unwanted high write I/O. With the current design, where level compaction is
configured for BlockStore, there is ~30x write amplification (~10x write
amplification per level, assuming three levels on average).
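
The ~30x estimate is simple arithmetic under the stated assumptions (~10x
per level, three levels on average), sketched below for concreteness:

```rust
// Back-of-the-envelope for the ~30x figure: leveled compaction rewrites
// an entry roughly once per level transition, each costing about the
// level size ratio (~10x by default), over ~3 levels on average.
fn leveled_write_amp(per_level_amp: u64, avg_levels: u64) -> u64 {
    per_level_amp * avg_levels
}
```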

## Current Design

Blockstore stores three types of data in RocksDB: shred data, metadata, and
accounts and transactional data, each in multiple different column
families. For shred insertions, write batches are used to combine several
shred insertions that update both the shred data and metadata column
families, while the column families related to accounts and transactional
data are not involved.

In the current BlockStore, the default level compaction is used for all
column families. As deletions are not processed in time by RocksDB, a
slot-ID based compaction filter with periodic manual compactions is used to
force the deletions to be processed. While this approach can guarantee that
deletions are processed within a specified period of time, which mitigates
the write stall issue, the periodic manual compactions introduce additional
write amplification.

## The Proposed Design

As all the above issues are compaction related, they can be solved with a
proper compaction style and deletion policy. Fortunately, the shred data
column families, ShredData and ShredCode, which contribute 99% of the
storage size in shred insertion, have a unique write workload where
write-keys are mostly monotonically increasing over time. This allows data
to be persisted in sorted order naturally without compaction, and the
deletion policy can be as simple as deleting the oldest file when the
storage size reaches the cleanup trigger.
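
This cleanup trigger can be sketched as a toy eviction loop. `FifoStore`
and its fields are hypothetical names for illustration, not Blockstore's
actual types.

```rust
use std::collections::VecDeque;

// Sketch of the FIFO cleanup policy: SST files are kept in creation
// order, and the oldest file is dropped whenever the total size exceeds
// the cleanup trigger. Illustrative only.
struct FifoStore {
    files: VecDeque<u64>, // file sizes in bytes, oldest first
    max_total_bytes: u64, // cleanup trigger
}

impl FifoStore {
    fn new(max_total_bytes: u64) -> Self {
        FifoStore { files: VecDeque::new(), max_total_bytes }
    }
    fn total_bytes(&self) -> u64 {
        self.files.iter().sum()
    }
    fn add_file(&mut self, bytes: u64) {
        self.files.push_back(bytes);
        // Evict the oldest files until back under the trigger.
        while self.total_bytes() > self.max_total_bytes {
            self.files.pop_front();
        }
    }
}
```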

In the proposed design, we leverage this unique property to aggressively
configure RocksDB to run as few compactions as possible while offering low
read amplification with no write stalls.

### Use FIFO Compaction for Shred Data Column Families

As mentioned above, the shred data column families, ShredData and ShredCode,
which contribute 99% of the storage size in shred insertion, have a unique
write workload where write-keys are mostly monotonically increasing over
time. As a result, after entries are flushed from memory into SST files,
the keys are naturally sorted across the SST files, where each SST file
might have a small overlapping key range with at most two other SST files.
In other words, files are sorted naturally from old to new, which allows us
to use First-In-First-Out compaction, or FIFO Compaction.

FIFO Compaction does not actually compact files. Instead, it simply deletes
the oldest files when the storage size reaches the specified threshold. As
a result, it has a constant write amplification of 1. In addition, as keys
are naturally sorted across the SST files, each read can be answered by
hitting mostly one (or in the boundary case, two) files. This gives us
close to 1 read amplification. As each key is only inserted once, the space
amplification is 1.
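
For illustration, enabling FIFO compaction for a column family might look
like the following with the `rocksdb` Rust crate; the option names come
from that crate's public API, and the size passed in is an arbitrary
example, not a recommended value.

```rust
use rocksdb::{DBCompactionStyle, FifoCompactOptions, Options};

// Sketch: configure a column family for FIFO compaction. Once the total
// SST size exceeds `max_cf_size_bytes`, the oldest file is deleted.
fn fifo_cf_options(max_cf_size_bytes: u64) -> Options {
    let mut opts = Options::default();
    opts.set_compaction_style(DBCompactionStyle::Fifo);
    let mut fifo_opts = FifoCompactOptions::default();
    fifo_opts.set_max_table_files_size(max_cf_size_bytes);
    opts.set_fifo_compaction_options(&fifo_opts);
    opts
}
```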

### Use Current Settings for Metadata Column Families

The second type of column families related to shred insertion is the
metadata column families. These metadata column families contribute ~1% of
the shred insertion data in size. The largest metadata column family here
is the Index column family, which occupies 0.8% of the shred insertion
data.

As these column families only contribute ~1% of the shred insertion data in
size, the current settings (default level compaction with a compaction
filter) should be good enough for now. We can revisit this later if these
metadata column families become the performance bottleneck after we've
optimized the shred data column families.

## Benefits

### No More Write Stalls

Write stalling is RocksDB's mechanism to slow down or stop writes in order
to allow compactions to catch up and keep read amplification low. Luckily,
because keys in the shred data column families are written in mostly
monotonically increasing order, the resulting SST files are naturally
sorted in a way that always keeps read amplification close to 1. As a
result, there is no need to stall writes in order to maintain the read
amplification.

### Deletions are Processed in Time

In FIFO compaction, deletions happen immediately when the size of the
column family reaches the configured trigger. As a result, deletions are
always processed in time, and we don't need to worry about whether RocksDB
picks the correct file to process the deletion: FIFO compaction always
picks the oldest one, which is the correct deletion policy for shred data.

### Low I/O with Minimal Amplification Factors

FIFO Compaction offers constant write amplification, as it does not run any
compactions in the background, but it usually has a large read
amplification because each read must be answered by reading every single
SST file. However, that is not the case for the shred data column families,
because their SST files are naturally sorted: write keys are inserted in
mostly monotonically increasing order without duplication. This gives us a
space amplification of 1 and a read amplification close to 1.

To sum up, if no other manual compaction is issued for quickly picking up
deletions, FIFO Compaction offers the following amplification factors in
Solana's BlockStore use case:

- Write Amplification: 1 (all data is written once without compaction.)
- Read Amplification: < 1.1 (assuming each SST file has a 10% overlapping
  key range with another SST file.)
- Space Amplification: 1 (the same data is never written to more than one
  SST file, and no additional temporary space is required for compaction.)
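
The "< 1.1" read amplification follows from a one-line expectation: with a
10% chance that a key lands in a region where two SST files overlap (and
otherwise hits exactly one file), the expected number of files consulted
per read is 0.9 * 1 + 0.1 * 2 = 1.1. A sketch of that arithmetic:

```rust
// Expected number of SST files consulted per read, given the fraction of
// the key space where two adjacent files overlap. Reads outside an
// overlap hit exactly one file; reads inside an overlap hit two.
fn expected_files_read(overlap_fraction: f64) -> f64 {
    (1.0 - overlap_fraction) * 1.0 + overlap_fraction * 2.0
}
```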

## Migration

Here we discuss Level-to-FIFO and FIFO-to-Level migrations:

### Level to FIFO

Theoretically, FIFO compaction is a superset of all other compaction
styles, as it does not make any assumption about the LSM tree structure.
However, while the migration is theoretically doable, the current RocksDB
implementation does not offer such flexibility.

As a result, the best option is to extend the copy tool in the ledger tool
to also allow specifying the desired compaction style of the output DB.
This approach also ensures that the resulting FIFO-compacted DB can clean
up the SST files in the correct order: the copy tool iterates from smaller
(older) slots to bigger (newer) slots, leaving the resulting SST files in
the correct time order, which allows FIFO compaction to delete the oldest
data just by checking the file creation time during its cleanup process.

### FIFO to Level

While one can open a FIFO-compacted DB using level compaction, the DB will
likely encounter long write stalls. This is because FIFO compaction puts
all files in level 0, and write stalls trigger when the number of level-0
files exceeds the limit, until all the level-0 files are compacted into
other levels.

To avoid these start-up write stalls, a more efficient way to perform the
FIFO-to-Level migration is to run a manual compaction first, then open the
DB.

## Release Plan

As the migration cannot be done smoothly in place in either direction, the
release will be divided into the following steps:

* v0 - merge the FIFO compaction implementation, without visible args.
* v1 - make the args visible, with a big warning stating that you'll lose
  your ledger if you enable it.
* v2 - slow-roll and monitor FIFO compaction, fixing any issues.
* v3 - if needed, add migration support.

In step v1, FIFO will use a different RocksDB directory (something like
rocksdb-v2 or rocksdb-fifo) to ensure that the validator will never mix
two different formats and panic.

## Experiments

### Single Node Benchmark Results

To verify the effectiveness, I ran both 1m-slot and 100m-slot shred
insertion benchmarks on my n2-standard-32 GCP instance (32-core 2.8GHz
CPU, 128GB memory, 2048GB SSD). Each slot contains 25 shreds, and the
shreds are inserted with 8 writers. Here is a summary of the results:

* FIFO-based validator: shred insertion took 13450.8s, 185.8k shreds/s
* Current setting: shred insertion took 30337.2s, 82.4k shreds/s

If we further remove the write lock inside the shred insertion to allow
fully concurrent shred insertion, the proposed FIFO setting can insert
295k shreds/s:

* FIFO + no write lock: shred insertion took 8459.3s, 295.5k shreds/s

The possibility of enabling fully concurrent multi-writer shred insertion
is discussed in #21657.

### Results from Mainnet-Beta

To further understand the performance, I set up two validator instances
joining Mainnet-Beta, one FIFO-based and the other based on the current
setting. The two validators have the same machine spec (24-core 2.8GHz
CPU, 128GB memory, a 768GB SSD for the blockstore, and everything else
stored on a 1024GB SSD). Below are the results.

#### Disk Write Bytes

I first compared the disk write bytes to the blockstore SSD on the two
instances. This number represents how many bytes must be written in order
to store the same amount of logical data, and it reflects the write
amplification factor of the storage.

* FIFO-based validator: 15~20 MB/s
* Current setting: 25~30 MB/s

The results show that the FIFO-based validator writes ~33% less data to
perform the same task compared to the current setting.

#### Compaction Stats on the Data and Coding Shred Column Families

Another data point we have is the RocksDB compaction stats, which tell us
how much resource is spent in compaction. Below are the compaction stats
for data and coding shreds:

* FIFO-based validator: 188.24 GB write, 1.27 MB/s write, 0.00 GB read, 0.00 MB/s read, 870.4 seconds
* Current setting: 719.87 GB write, 4.88 MB/s write, 611.61 GB read, 4.14 MB/s read, 5782.6 seconds

The compaction stats show that the FIFO-based validator is ~6.5x faster in
compacting data and coding shreds, with less than 1/3 of the disk writes.
In addition, there is no disk read involved in FIFO's compaction process.

## Summary

This document proposes a FIFO-compaction-based solution to the blockstore
performance issues in [#16234](https://github.com/solana-labs/solana/issues/16234).
It minimizes read, write, and space amplification factors by leveraging the
unique property of the Solana BlockStore workload, where write-keys are
mostly monotonically increasing over time. Experimental results from the
single-node 100m-slot insertion benchmark indicate that the proposed
solution can insert 185k shreds/s, ~2.25x faster than the current design's
82k shreds/s. Experimental results from Mainnet-Beta also show that the
proposed FIFO-based solution accomplishes the same task with 33% fewer
disk writes than the current design.