🤖 AI Summary
KV-separated LSM-trees reduce write amplification but incur severe space amplification, which is especially costly in cost-sensitive deployments. Existing garbage collection (GC) strategies neither adapt to workload characteristics nor account for the non-negligible space overhead of the index LSM-tree. To improve the performance-space trade-off, this paper proposes Scavenger+, which has three key components: (1) an I/O-efficient garbage collection scheme; (2) a space-aware compaction strategy based on compensated size, which targets index-induced space amplification; and (3) a dynamic GC scheduler that adapts to system load. Experimental evaluation shows that, compared to BlobDB, Titan, and TerarkDB, Scavenger+ achieves up to 37% lower space amplification while sustaining high write throughput.
📝 Abstract
Key-Value Stores (KVS) based on log-structured merge-trees (LSM-trees) are widely used in storage systems but face significant challenges, such as high write amplification caused by compaction. Key-value (KV) separated LSM-trees address write amplification but introduce substantial space amplification, a critical concern in cost-sensitive scenarios. Garbage collection (GC) can reduce space amplification, but existing strategies are often inefficient and fail to account for workload characteristics. Moreover, current KV-separated LSM-trees overlook the space amplification caused by the index LSM-tree. In this paper, we systematically analyze the sources of space amplification in KV-separated LSM-trees and propose Scavenger+, which achieves a better performance-space trade-off. Scavenger+ introduces (1) an I/O-efficient garbage collection scheme to reduce I/O overhead, (2) a space-aware compaction strategy based on compensated size to mitigate index-induced space amplification, and (3) a dynamic GC scheduler that adapts to system load to make better use of CPU and storage resources. Extensive experiments demonstrate that Scavenger+ significantly improves write performance and reduces space amplification compared to state-of-the-art KV-separated LSM-trees, including BlobDB, Titan, and TerarkDB.
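To make the two sources of space amplification named in the abstract concrete, the following is a minimal sketch that computes space amplification as total bytes on disk divided by bytes of live data, counting the index LSM-tree and the value store separately. The `StoreUsage` fields, the function names, and the byte counts are hypothetical and chosen for illustration; this is not the accounting used by Scavenger+ or by any of the compared systems.

```python
from dataclasses import dataclass


@dataclass
class StoreUsage:
    """Hypothetical on-disk accounting for a KV-separated store, in bytes."""
    index_live: int     # index LSM-tree entries that are still current
    index_stale: int    # obsolete index entries not yet removed by compaction
    value_live: int     # values still referenced by the index
    value_garbage: int  # overwritten or deleted values not yet reclaimed by GC


def space_amplification(u: StoreUsage) -> float:
    """Return total bytes on disk divided by bytes of live data."""
    live = u.index_live + u.value_live
    if live == 0:
        raise ValueError("space amplification is undefined with no live data")
    total = live + u.index_stale + u.value_garbage
    return total / live


def index_share_of_waste(u: StoreUsage) -> float:
    """Return the fraction of wasted bytes that sit in the index LSM-tree."""
    wasted = u.index_stale + u.value_garbage
    return u.index_stale / wasted if wasted else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: 110 live bytes, 55 wasted bytes.
    usage = StoreUsage(index_live=10, index_stale=5,
                       value_live=100, value_garbage=50)
    print(space_amplification(usage))
    print(index_share_of_waste(usage))
```

Under this accounting, GC of the value store lowers only `value_garbage`, so any stale data held in the index LSM-tree remains as a floor on space amplification until compaction removes it.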