---
title: Optimize RocksDB Compaction for Solana BlockStore
---
This document explores RocksDB-based solutions for the Solana BlockStore issues
mentioned in issue [#16234](https://github.com/solana-labs/solana/issues/16234).
## Background
Solana uses RocksDB as the underlying storage for its blockstore. RocksDB
is an LSM-based key-value store that consists of multiple logical levels,
with data in each level sorted by key. In such a leveled structure, each
read hits at most one file per level (this per-level lookup cost is the read
amplification), while all other mutating operations, including writes,
deletions, and merge operations, are implemented as appends and eventually
create more logical levels, which makes read performance worse over time.
To keep reads performant, RocksDB periodically reduces the number of logical
levels by running compactions in the background, where part of a level or
multiple logical levels are merged into one. Compaction increases the disk
I/O (write amplification) and storage (space amplification) required to
store each entry. In other words, RocksDB uses compactions to balance
[write, space, and read amplification](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html).
As different workloads have different requirements, RocksDB makes its options
highly configurable. However, this also means its default settings might not
always be suitable. This document focuses on optimizing RocksDB's compaction
for Solana's Blockstore.
## Problems
As mentioned in [#16234](https://github.com/solana-labs/solana/issues/16234),
there are several issues in Solana's BlockStore, which runs RocksDB with
level compaction. Here's a quick summary of the issues:
### Long Write Stalls on Shred Insertions
Recall that RocksDB periodically runs background compactions to keep the
number of logical levels small and thereby reach the target read
amplification. However, when the background compactions cannot keep up with
the write rate, the number of logical levels will eventually exceed the
configured limit. In that case, RocksDB rate-limits all writes once the soft
threshold is reached and stops them entirely at the hard threshold.
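One pair of such thresholds is RocksDB's level-0 slowdown and stop write
triggers. A minimal sketch with the rust-rocksdb crate (the trigger values
below are illustrative, not the ones BlockStore uses):

```rust
use rocksdb::Options;

fn stall_trigger_options() -> Options {
    let mut opts = Options::default();
    // Start rate-limiting writes once this many level-0 SST files
    // accumulate (the "soft" threshold) ...
    opts.set_level_zero_slowdown_writes_trigger(20);
    // ... and stop writes entirely at this count (the "hard" threshold).
    opts.set_level_zero_stop_writes_trigger(36);
    opts
}
```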
In [#14586](https://github.com/solana-labs/solana/issues/14586), it was reported
that write stalls in Solana's use case can last as long as 40 minutes. It was
also reported in [#16234](https://github.com/solana-labs/solana/issues/16234)
that writes were slowed down, indicating that the underlying RocksDB instance
had reached the soft limit for write stalls.
### Deletions are not Processed in Time
Deletions are processed in the same way as other write operations in RocksDB:
multiple entries associated with the same key are merged / deleted during
background compaction. Although deleted entries are no longer visible to
readers right after the deletion is issued, the deleted entries (both the
original data entries and their deletion markers) still occupy disk storage.
TBD: explain how write-key order makes this worse
### High Write I/O from Write Amplification
In addition to write stalls, it has also been observed that compactions cause
unwanted high write I/O. With the current design, where level compaction is
configured for BlockStore, there is ~30x write amplification (roughly 10x
write amplification per level, assuming three levels on average).
## Current Design
Blockstore stores three types of data in RocksDB: shred data, metadata, and
accounts and transactional data. Each type is stored across multiple column
families. For shred insertions, write batches are used to combine several
shred insertions that update both the shred data and metadata related column
families, while the column families related to accounts and transactions are
not involved.
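As a rough illustration of how a single shred insertion touches multiple
column families atomically, here is a minimal sketch with the rust-rocksdb
crate; the column family names, key/value layout, and the `insert_shred`
helper are illustrative placeholders, not the actual BlockStore API:

```rust
use rocksdb::{WriteBatch, DB};

fn insert_shred(
    db: &DB,
    key: &[u8],
    shred_payload: &[u8],
    index_meta: &[u8],
) -> Result<(), rocksdb::Error> {
    // Illustrative column family names.
    let cf_data = db.cf_handle("data_shred").expect("cf exists");
    let cf_index = db.cf_handle("index").expect("cf exists");

    // A write batch applies all updates atomically in a single write,
    // so shred data and its metadata are always updated together.
    let mut batch = WriteBatch::default();
    batch.put_cf(cf_data, key, shred_payload);
    batch.put_cf(cf_index, key, index_meta);
    db.write(batch)
}
```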
In the current BlockStore, the default level compaction is used for all
column families. Because deletions are not processed in time by RocksDB,
a slot-ID based compaction filter combined with periodic manual compactions
is used to force deletions to be processed. While this approach guarantees
that deletions are processed within a specified period of time, which
mitigates the write stall issue, the periodic manual compactions introduce
additional write amplification.
## The Proposed Design
As all of the above issues are compaction related, they can be solved with a
proper compaction style and deletion policy. Fortunately, the shred data
column families, ShredData and ShredCode, which contribute 99% of the storage
size of shred insertion, have a unique write workload where write-keys are
mostly monotonically increasing over time. This allows data to be persisted
in sorted order naturally without compaction, and the deletion policy can be
as simple as deleting the oldest file when the storage size reaches the
cleanup trigger.
In the proposed design, we leverage this unique property to aggressively
configure RocksDB to run as few compactions as possible while offering low
read amplification and no write stalls.
### Use FIFO Compaction for Shred Data Column Families
As mentioned above, the shred data column families, ShredData and ShredCode,
which contribute 99% of the storage size of shred insertion, have a unique
write workload where write-keys are mostly monotonically increasing over
time. As a result, after entries are flushed from memory into SST files, the
keys are naturally sorted across multiple SST files, where each SST file
overlaps with at most two other SST files over a small key range. In other
words, files are naturally sorted from old to new, which allows us to use
First-In-First-Out compaction, or FIFO Compaction.
FIFO Compaction does not actually compact files. Instead, it simply deletes
the oldest files when the storage size reaches the specified threshold. As a
result, it has a constant write amplification of 1. In addition, as keys are
naturally sorted across multiple SST files, each read can be answered by
hitting mostly only one (or, in the boundary case, two) files. This gives us
a read amplification close to 1. As each key is only inserted once, the space
amplification is also 1.
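For concreteness, here is a minimal sketch of how a column family could be
configured for FIFO compaction with the rust-rocksdb crate; the size
threshold and the `fifo_cf_options` helper are assumptions for illustration,
not a definitive BlockStore configuration:

```rust
use rocksdb::{DBCompactionStyle, FifoCompactOptions, Options};

// Illustrative cleanup trigger: delete the oldest SST files once the
// column family grows beyond this many bytes.
const MAX_CF_SIZE_BYTES: u64 = 250 * 1024 * 1024 * 1024;

fn fifo_cf_options() -> Options {
    let mut fifo_opts = FifoCompactOptions::default();
    fifo_opts.set_max_table_files_size(MAX_CF_SIZE_BYTES);

    let mut cf_opts = Options::default();
    cf_opts.set_compaction_style(DBCompactionStyle::Fifo);
    cf_opts.set_fifo_compaction_options(&fifo_opts);
    cf_opts
}
```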
### Use Current Settings for Metadata Column Families
The second type of column families related to shred insertion is the metadata
column families. These metadata column families contribute ~1% of the shred
insertion data in size. The largest metadata column family here is the Index
column family, which occupies 0.8% of the shred insertion data.
As these column families only contribute ~1% of the shred insertion data in
size, the current settings (default level compaction with the compaction
filter) should be good enough for now. We can revisit this later if the
metadata column families become the performance bottleneck after the shred
data column families have been optimized.
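Putting the two together, the database could be opened with per-column-family
options so that only the shred data column families use FIFO compaction. This
is a hedged sketch with the rust-rocksdb crate; the column family names and
the `fifo_cf_options()` helper from the previous sketch are illustrative:

```rust
use rocksdb::{ColumnFamilyDescriptor, DB, Options};

// Default RocksDB options already use level compaction, so the metadata
// column families simply keep their existing settings.
fn level_cf_options() -> Options {
    Options::default()
}

fn open_blockstore(path: &str) -> Result<DB, rocksdb::Error> {
    let cfs = vec![
        // Shred data column families (illustrative names) use FIFO
        // compaction via the fifo_cf_options() helper sketched above.
        ColumnFamilyDescriptor::new("data_shred", fifo_cf_options()),
        ColumnFamilyDescriptor::new("code_shred", fifo_cf_options()),
        // Metadata column families keep level compaction.
        ColumnFamilyDescriptor::new("index", level_cf_options()),
        // ... remaining metadata column families also keep level compaction.
    ];

    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    db_opts.create_missing_column_families(true);
    DB::open_cf_descriptors(&db_opts, path, cfs)
}
```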
## Benefits
### No More Write Stalls
A write stall is RocksDB's mechanism for slowing down or stopping writes so
that compactions can catch up and keep read amplification low.
Luckily, because keys in the shred data column families are written in mostly
monotonically increasing order, the resulting SST files are naturally sorted,
which keeps read amplification close to 1. As a result, there is no need to
stall writes in order to maintain the read amplification.
### Deletions are Processed in Time
With FIFO compaction, deletions happen immediately when the size of the
column family reaches the configured trigger. As a result, deletions are
always processed in time, and we don't need to worry about whether RocksDB
picks the correct file to process the deletion, as FIFO compaction always
picks the oldest one, which is exactly the correct deletion policy for shred
data.
### Low I/Os with Minimum Amplification Factors
FIFO Compaction offers constant write amplification because it does not run
any background compactions, but it usually comes with a large read
amplification because each read may need to check every single SST file.
However, this is not the case for the shred data column families, because the
SST files are naturally sorted as write keys are inserted in mostly
monotonically increasing order without duplication. This gives us a space
amplification of 1 and a read amplification close to 1.
To sum up, if no other manual compaction is issued for quickly picking up
deletions, FIFO Compaction offers the following amplification factors
in Solana's BlockStore use case:
- Write Amplification: 1 (all data is written once without compaction.)
- Read Amplification: < 1.1 (assuming each SST file has 10% overlapping key
range with another SST file.)
- Space Amplification: 1 (the same data is never written to more than one SST
  file, and no additional temporary space is required for compaction.)
## Migration
Here we discuss Level to FIFO and FIFO to Level migrations:
### Level to FIFO
Theoretically, FIFO compaction is a superset of all other compaction styles,
as it makes no assumption about the LSM tree structure, so opening a
level-compacted DB directly with FIFO compaction is theoretically doable.
However, the current RocksDB implementation does not offer that flexibility.
As the current RocksDB implementation doesn't offer such flexibility, the
best option is to extend the copy tool in the ledger tool so that it can also
specify the desired compaction style of the output DB. This approach also
ensures the resulting FIFO-compacted DB can clean up its SST files in the
correct order: the copy tool iterates from smaller (older) slots to bigger
(newer) slots, so the resulting SST files are generated in the correct time
order, which allows FIFO compaction to delete the oldest data just by
checking the file creation time during its cleanup process.
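The core of such a migration could look like the following hedged sketch with
the rust-rocksdb crate: iterate the source column family in ascending key
order (oldest slots first) and write the entries into a FIFO-configured
destination DB. The column family name and the batch size are illustrative:

```rust
use rocksdb::{DB, IteratorMode, WriteBatch};

fn copy_shred_data(src_db: &DB, dst_db: &DB) -> Result<(), rocksdb::Error> {
    // Illustrative column family name.
    let src_cf = src_db.cf_handle("data_shred").expect("cf exists");
    let dst_cf = dst_db.cf_handle("data_shred").expect("cf exists");

    // Iterating from the start visits keys (and therefore slots) in
    // ascending order, so SST files in the FIFO-compacted destination are
    // created oldest-first.
    let mut batch = WriteBatch::default();
    for (key, value) in src_db.iterator_cf(src_cf, IteratorMode::Start) {
        batch.put_cf(dst_cf, &key, &value);
        if batch.len() >= 1_000 {
            dst_db.write(batch)?;
            batch = WriteBatch::default();
        }
    }
    dst_db.write(batch)
}
```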
### FIFO to Level
While one can open a FIFO-compacted DB using level compaction, the DB will
likely encounter long write stalls. This is because FIFO compaction puts all
files in level 0, and write stalls are triggered when the number of level-0
files exceeds the limit; the stall lasts until all the level-0 files have
been compacted into other levels.
To avoid write stalls at start-up, a more efficient way to perform the FIFO
to Level migration is to run a manual compaction first, and then open the DB
with level compaction.
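A hedged sketch of that manual compaction step with the rust-rocksdb crate is
shown below; the column family names are illustrative:

```rust
use rocksdb::DB;

// Compact the whole key range of each shred data column family so that the
// level-0 files produced by FIFO compaction are pushed down into lower
// levels before the DB is reopened with level compaction.
fn compact_before_level_migration(db: &DB) {
    for cf_name in ["data_shred", "code_shred"] {
        let cf = db.cf_handle(cf_name).expect("cf exists");
        // Passing None for both bounds compacts the entire key range.
        db.compact_range_cf(cf, None::<&[u8]>, None::<&[u8]>);
    }
}
```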
## Release Plan
As the migration cannot be done smoothly in place in either direction, the
release will be divided into the following steps:
* v0 - merge FIFO compaction implementation with visible args.
* v1 - visible args with a big warning stating you'll lose your ledger if you enable it
* v2 - slow-roll and monitor FIFO compaction, fix any issues.
* v3 - if needed, add migration support.
In step v1, FIFO will use a different rocksdb directory (something like
rocksdb-v2 or rocksdb-fifo) to ensure that the validator will never mix
two different formats and panic.
## Experiments
### Single Node Benchmark Results
To verify the effectiveness, I ran both 1m-slot and 100m-slot shred insertion
benchmarks on an n2-standard-32 GCP instance (32-core 2800 MHz CPU, 128 GB
memory, 2048 GB SSD). Each slot contains 25 shreds, and the shreds are
inserted by 8 writers. Here is a summary of the results:
* FIFO based validator: Shred insertion took 13450.8s, 185.8k shreds/s
* Current setting: shred insertion took 30337.2s, 82.4k shreds/s
If we further remove the write lock inside the shred insertion to allow fully
concurrent shred insertion, the proposed FIFO setting can insert 295k shreds/s:
* FIFO + no write lock: Shred insertion took 8459.3s, 295.5k shreds/s
The possibility of enabling fully concurrent multi-writer shred insertion is
discussed in #21657.
### Results from the Mainnet-Beta
To further understand the performance, I set up two validator instances
joining Mainnet-Beta, one using the FIFO-based setting and the other using
the current setting. The two validators have the same machine spec (24-core
2.8 GHz CPU, 128 GB memory, a 768 GB SSD for the blockstore, and everything
else stored on a 1024 GB SSD). Below are the results.
#### Disk Write Bytes
I first compared the disk write bytes on the blockstore SSD for the two
instances. This number represents how many bytes must be written in order to
store the same amount of logical data, and it also reflects the write
amplification factor of the storage.
* FIFO based validator: ~15-20 MB/s
* Current setting: ~25-30 MB/s
The result shows that the FIFO-based validator writes ~33% less data to
perform the same task compared to the current setting.
#### Compaction Stats on Data and Coding Shred Column Families
Another data point is the RocksDB compaction stats, which tell us how much
work is spent on compaction. Below are the compaction stats on the data and
coding shred column families:
* FIFO based validator: 188.24 GB write, 1.27 MB/s write, 0.00 GB read, 0.00 MB/s read, 870.4 seconds
* Current setting: 719.87 GB write, 4.88 MB/s write, 611.61 GB read, 4.14 MB/s read, 5782.6 seconds
The compaction stats show that the FIFO-based validator is ~6.5x faster at
compacting data and coding shreds while performing less than 1/3 of the disk
writes. In addition, no disk reads are involved in FIFO's compaction process.
## Summary
This document proposes a FIFO-compaction-based solution to the performance
issues of the blockstore described in [#16234](https://github.com/solana-labs/solana/issues/16234).
It minimizes read / write / space amplification factors by leveraging the
unique property of the Solana BlockStore workload, where write-keys are mostly
monotonically increasing over time. Experimental results from the single-node
100m-slot insertion benchmark indicate that the proposed solution can insert
185k shreds/s, ~2.25x faster than the current design, which inserts 82k
shreds/s. Experimental results from Mainnet-Beta also show that the proposed
FIFO-based solution can achieve the same task with 33% fewer disk writes
compared to the current design.