Proposal for RocksDB Compaction optimized for Solana BlockStore (#21358)
---
title: Optimize RocksDB Compaction for Solana BlockStore
---

This document explores RocksDB-based solutions for the Solana BlockStore
issues described in [#16234](https://github.com/solana-labs/solana/issues/16234).

## Background

Solana uses RocksDB as the underlying storage for its blockstore. RocksDB
is an LSM-based key-value store which consists of multiple logical levels,
and data in each level is sorted by key. In such a leveled structure, each
read hits at most one file per level, so the number of levels determines
the read amplification. All other mutable operations, including writes,
deletions, and merge operations, are implemented as append operations and
will eventually create more logical levels, which makes read performance
worse over time.

To keep reads performant, RocksDB periodically reduces the number of
logical levels by running compaction in the background, where part of one
level or multiple whole levels are merged into one. This increases the
number of disk I/Os (write amplification) and the storage (space
amplification) required for storing each entry. In other words, RocksDB
uses compactions to balance [write, space, and read amplifications](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html).

As different workloads have different requirements, RocksDB makes its
options highly configurable. However, this also means its default settings
might not always be suitable. This document focuses on RocksDB's compaction
optimization for Solana's Blockstore.

## Problems

As mentioned in [#16234](https://github.com/solana-labs/solana/issues/16234),
there are several issues in Solana's BlockStore, which runs RocksDB with
level compaction. Here is a quick summary of the issues:

### Long Write Stalls on Shred Insertions

Recall that RocksDB periodically runs background compactions in order to
keep the number of logical levels small enough to reach the target read
amplification. However, when the background compactions cannot keep up
with the write rate, the number of logical levels will eventually exceed
the configured limit. In that case, RocksDB will rate-limit or stop all
writes when the soft or hard threshold is reached.
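
The soft / hard threshold behavior can be sketched as a small decision
function. This is an illustrative model only, not RocksDB's actual
internals; the trigger values in the usage example (20 and 36 level-0
files) mirror RocksDB's default `level0_slowdown_writes_trigger` and
`level0_stop_writes_trigger`, but a real deployment should read them from
its own options.

```rust
// Illustrative model of the write-stall decision: once the number of
// level-0 files crosses the soft limit, writes are rate-limited; once
// it crosses the hard limit, writes stop entirely.
#[derive(Debug, PartialEq)]
enum WriteState {
    Normal,   // accept writes at full speed
    Slowdown, // soft limit reached: rate-limit writes
    Stop,     // hard limit reached: stop writes
}

fn write_state(l0_files: usize, soft_limit: usize, hard_limit: usize) -> WriteState {
    if l0_files >= hard_limit {
        WriteState::Stop
    } else if l0_files >= soft_limit {
        WriteState::Slowdown
    } else {
        WriteState::Normal
    }
}
```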

In [#14586](https://github.com/solana-labs/solana/issues/14586), it is reported
that write stalls in Solana's use case can last 40 minutes. It is also
reported in [#16234](https://github.com/solana-labs/solana/issues/16234) that
writes are slowed down, indicating the underlying RocksDB instance has
reached the soft limit for write stalls.

### Deletions are not Processed in Time

Deletions are processed in the same way as other write operations in RocksDB:
multiple entries associated with the same key are merged or deleted during
background compaction. Although deleted entries are not visible from the
read side right after the deletion is issued, the deleted entries (both the
original data entries and their deletion entries) still occupy disk storage.
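
This append-only deletion behavior can be illustrated with a toy model.
All names here (`Log`, `Entry`, `compact`) are hypothetical; this is a
sketch of the general LSM mechanism, not Blockstore's code.

```rust
use std::collections::BTreeMap;

// Minimal model of LSM deletion: a delete appends a tombstone entry, so
// both the original value and the tombstone occupy space until a
// compaction merges them away. Purely illustrative.
#[derive(Clone, Debug, PartialEq)]
enum Entry {
    Put(Vec<u8>),
    Tombstone,
}

struct Log {
    entries: Vec<(u64, Entry)>, // append-only (key, entry) pairs
}

impl Log {
    fn new() -> Self {
        Log { entries: Vec::new() }
    }
    fn put(&mut self, key: u64, value: Vec<u8>) {
        self.entries.push((key, Entry::Put(value)));
    }
    fn delete(&mut self, key: u64) {
        self.entries.push((key, Entry::Tombstone));
    }
    // Space used before compaction: every appended entry still exists.
    fn entries_on_disk(&self) -> usize {
        self.entries.len()
    }
    // Compaction keeps only the latest entry per key and drops tombstones
    // together with the values they shadow.
    fn compact(&mut self) {
        let mut latest: BTreeMap<u64, Entry> = BTreeMap::new();
        for (k, e) in self.entries.drain(..) {
            latest.insert(k, e);
        }
        self.entries = latest
            .into_iter()
            .filter(|(_, e)| !matches!(e, Entry::Tombstone))
            .collect();
    }
}
```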

TBD: explain how write-key order makes this worse

### High Write I/O from Write Amplification

In addition to write stalls, it is also observed that compactions cause
unwanted high write I/O. With the current design, where level compaction is
configured for BlockStore, there is ~30x write amplification (~10x write
amplification per level, assuming three levels on average).
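
The ~30x estimate is simple arithmetic under the stated assumptions (~10x
per level, three levels on average), sketched below for concreteness:

```rust
// Back-of-the-envelope for the ~30x figure: leveled compaction rewrites
// an entry roughly once per level transition, each costing about the
// level size ratio (~10x by default), over ~3 levels on average.
fn leveled_write_amp(per_level_amp: u64, avg_levels: u64) -> u64 {
    per_level_amp * avg_levels
}
```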

## Current Design

Blockstore stores three types of data in RocksDB: shred data, metadata, and
accounts and transactional data, each in multiple different column
families. For shred insertions, write batches are used to combine several
shred insertions that update both the shred data and metadata column
families, while the column families related to accounts and transactional
data are not involved.

In the current BlockStore, the default level compaction is used for all
column families. As deletions are not processed in time by RocksDB, a
slot-ID based compaction filter with periodic manual compactions is used to
force the deletions to be processed. While this approach can guarantee that
deletions are processed within a specified period of time, which mitigates
the write stall issue, the periodic manual compactions introduce additional
write amplification.

## The Proposed Design

As all the above issues are compaction related, they can be solved with a
proper compaction style and deletion policy. Fortunately, the shred data
column families, ShredData and ShredCode, which contribute 99% of the
storage size in shred insertion, have a unique write workload where
write-keys are mostly monotonically increasing over time. This allows data
to be persisted in sorted order naturally without compaction, and the
deletion policy can be as simple as deleting the oldest file when the
storage size reaches the cleanup trigger.
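
This cleanup trigger can be sketched as a toy eviction loop. `FifoStore`
and its fields are hypothetical names for illustration, not Blockstore's
actual types.

```rust
use std::collections::VecDeque;

// Sketch of the FIFO cleanup policy: SST files are kept in creation
// order, and the oldest file is dropped whenever the total size exceeds
// the cleanup trigger. Illustrative only.
struct FifoStore {
    files: VecDeque<u64>, // file sizes in bytes, oldest first
    max_total_bytes: u64, // cleanup trigger
}

impl FifoStore {
    fn new(max_total_bytes: u64) -> Self {
        FifoStore { files: VecDeque::new(), max_total_bytes }
    }
    fn total_bytes(&self) -> u64 {
        self.files.iter().sum()
    }
    fn add_file(&mut self, bytes: u64) {
        self.files.push_back(bytes);
        // Evict the oldest files until back under the trigger.
        while self.total_bytes() > self.max_total_bytes {
            self.files.pop_front();
        }
    }
}
```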

In the proposed design, we leverage this unique property to aggressively
configure RocksDB to run as few compactions as possible while offering low
read amplification with no write stalls.

### Use FIFO Compaction for Shred Data Column Families

As mentioned above, the shred data column families, ShredData and ShredCode,
which contribute 99% of the storage size in shred insertion, have a unique
write workload where write-keys are mostly monotonically increasing over
time. As a result, after entries are flushed from memory into SST files,
the keys are naturally sorted across the SST files, where each SST file
might have a small overlapping key range with at most two other SST files.
In other words, files are sorted naturally from old to new, which allows us
to use First-In-First-Out compaction, or FIFO Compaction.

FIFO Compaction does not actually compact files. Instead, it simply deletes
the oldest files when the storage size reaches the specified threshold. As
a result, it has a constant write amplification of 1. In addition, as keys
are naturally sorted across the SST files, each read can be answered by
hitting mostly one (or in the boundary case, two) files. This gives us
close to 1 read amplification. As each key is only inserted once, the space
amplification is 1.
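
For illustration, enabling FIFO compaction for a column family might look
like the following with the `rocksdb` Rust crate; the option names come
from that crate's public API, and the size passed in is an arbitrary
example, not a recommended value.

```rust
use rocksdb::{DBCompactionStyle, FifoCompactOptions, Options};

// Sketch: configure a column family for FIFO compaction. Once the total
// SST size exceeds `max_cf_size_bytes`, the oldest file is deleted.
fn fifo_cf_options(max_cf_size_bytes: u64) -> Options {
    let mut opts = Options::default();
    opts.set_compaction_style(DBCompactionStyle::Fifo);
    let mut fifo_opts = FifoCompactOptions::default();
    fifo_opts.set_max_table_files_size(max_cf_size_bytes);
    opts.set_fifo_compaction_options(&fifo_opts);
    opts
}
```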

### Use Current Settings for Metadata Column Families

The second type of column families related to shred insertion is the
metadata column families. These metadata column families contribute ~1% of
the shred insertion data in size. The largest metadata column family here
is the Index column family, which occupies 0.8% of the shred insertion
data.

As these column families only contribute ~1% of the shred insertion data in
size, the current settings (default level compaction with a compaction
filter) should be good enough for now. We can revisit this later if these
metadata column families become the performance bottleneck after we've
optimized the shred data column families.

## Benefits

### No More Write Stalls

Write stalling is RocksDB's mechanism to slow down or stop writes in order
to allow compactions to catch up and keep read amplification low. Luckily,
because keys in the shred data column families are written in mostly
monotonically increasing order, the resulting SST files are naturally
sorted in a way that always keeps read amplification close to 1. As a
result, there is no need to stall writes in order to maintain the read
amplification.

### Deletions are Processed in Time

In FIFO compaction, deletions happen immediately when the size of the
column family reaches the configured trigger. As a result, deletions are
always processed in time, and we don't need to worry about whether RocksDB
picks the correct file to process the deletion: FIFO compaction always
picks the oldest one, which is the correct deletion policy for shred data.

### Low I/O with Minimal Amplification Factors

FIFO Compaction offers constant write amplification, as it does not run any
compactions in the background, but it usually has a large read
amplification because each read must be answered by reading every single
SST file. However, that is not the case for the shred data column families,
because their SST files are naturally sorted: write keys are inserted in
mostly monotonically increasing order without duplication. This gives us a
space amplification of 1 and a read amplification close to 1.

To sum up, if no other manual compaction is issued for quickly picking up
deletions, FIFO Compaction offers the following amplification factors in
Solana's BlockStore use case:

- Write Amplification: 1 (all data is written once without compaction.)
- Read Amplification: < 1.1 (assuming each SST file has a 10% overlapping
  key range with another SST file.)
- Space Amplification: 1 (the same data is never written to more than one
  SST file, and no additional temporary space is required for compaction.)
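
The "< 1.1" read amplification follows from a one-line expectation: with a
10% chance that a key lands in a region where two SST files overlap (and
otherwise hits exactly one file), the expected number of files consulted
per read is 0.9 * 1 + 0.1 * 2 = 1.1. A sketch of that arithmetic:

```rust
// Expected number of SST files consulted per read, given the fraction of
// the key space where two adjacent files overlap. Reads outside an
// overlap hit exactly one file; reads inside an overlap hit two.
fn expected_files_read(overlap_fraction: f64) -> f64 {
    (1.0 - overlap_fraction) * 1.0 + overlap_fraction * 2.0
}
```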

## Migration

Here we discuss Level-to-FIFO and FIFO-to-Level migrations:

### Level to FIFO

Theoretically, FIFO compaction is a superset of all other compaction
styles, as it does not make any assumption about the LSM tree structure.
However, while the migration is theoretically doable, the current RocksDB
implementation does not offer such flexibility.

As a result, the best option is to extend the copy tool in the ledger tool
to also allow specifying the desired compaction style of the output DB.
This approach also ensures that the resulting FIFO-compacted DB can clean
up the SST files in the correct order: the copy tool iterates from smaller
(older) slots to bigger (newer) slots, leaving the resulting SST files in
the correct time order, which allows FIFO compaction to delete the oldest
data just by checking the file creation time during its cleanup process.

### FIFO to Level

While one can open a FIFO-compacted DB using level compaction, the DB will
likely encounter long write stalls. This is because FIFO compaction puts
all files in level 0, and write stalls trigger when the number of level-0
files exceeds the limit, until all the level-0 files are compacted into
other levels.

To avoid these start-up write stalls, a more efficient way to perform the
FIFO-to-Level migration is to run a manual compaction first, then open the
DB.

## Release Plan

As the migration cannot be done smoothly in place in either direction, the
release will be divided into the following steps:

* v0 - merge the FIFO compaction implementation, without visible args.
* v1 - make the args visible, with a big warning stating that you'll lose
  your ledger if you enable it.
* v2 - slow-roll and monitor FIFO compaction, fixing any issues.
* v3 - if needed, add migration support.

In step v1, FIFO will use a different RocksDB directory (something like
rocksdb-v2 or rocksdb-fifo) to ensure that the validator will never mix
two different formats and panic.

## Experiments

### Single Node Benchmark Results

To verify the effectiveness, I ran both 1m-slot and 100m-slot shred
insertion benchmarks on my n2-standard-32 GCP instance (32-core 2.8GHz
CPU, 128GB memory, 2048GB SSD). Each slot contains 25 shreds, and the
shreds are inserted with 8 writers. Here is a summary of the results:

* FIFO-based validator: shred insertion took 13450.8s, 185.8k shreds/s
* Current setting: shred insertion took 30337.2s, 82.4k shreds/s

If we further remove the write lock inside the shred insertion to allow
fully concurrent shred insertion, the proposed FIFO setting can insert
295k shreds/s:

* FIFO + no write lock: shred insertion took 8459.3s, 295.5k shreds/s

The possibility of enabling fully concurrent multi-writer shred insertion
is discussed in #21657.

### Results from Mainnet-Beta

To further understand the performance, I set up two validator instances
joining Mainnet-Beta, one FIFO-based and the other based on the current
setting. The two validators have the same machine spec (24-core 2.8GHz
CPU, 128GB memory, a 768GB SSD for the blockstore, and everything else
stored on a 1024GB SSD). Below are the results.

#### Disk Write Bytes

I first compared the disk write bytes to the blockstore SSD on the two
instances. This number represents how many bytes must be written in order
to store the same amount of logical data, and it reflects the write
amplification factor of the storage.

* FIFO-based validator: 15~20 MB/s
* Current setting: 25~30 MB/s

The results show that the FIFO-based validator writes ~33% less data to
perform the same task compared to the current setting.

#### Compaction Stats on the Data and Coding Shred Column Families

Another data point we have is the RocksDB compaction stats, which tell us
how much resource is spent in compaction. Below are the compaction stats
for data and coding shreds:

* FIFO-based validator: 188.24 GB write, 1.27 MB/s write, 0.00 GB read, 0.00 MB/s read, 870.4 seconds
* Current setting: 719.87 GB write, 4.88 MB/s write, 611.61 GB read, 4.14 MB/s read, 5782.6 seconds

The compaction stats show that the FIFO-based validator is ~6.5x faster in
compacting data and coding shreds, with less than 1/3 of the disk writes.
In addition, there is no disk read involved in FIFO's compaction process.

## Summary

This document proposes a FIFO-compaction-based solution to the blockstore
performance issues in [#16234](https://github.com/solana-labs/solana/issues/16234).
It minimizes read, write, and space amplification factors by leveraging the
unique property of the Solana BlockStore workload, where write-keys are
mostly monotonically increasing over time. Experimental results from the
single-node 100m-slot insertion benchmark indicate that the proposed
solution can insert 185k shreds/s, ~2.25x faster than the current design's
82k shreds/s. Experimental results from Mainnet-Beta also show that the
proposed FIFO-based solution accomplishes the same task with 33% fewer
disk writes than the current design.