AI Summary
This study addresses the trade-offs among storage footprint, throughput, and energy consumption in large-scale source code archives, where sustainable and efficient storage solutions remain lacking. The authors design and implement a compressed key-value store, systematically quantifying Pareto-optimal configurations under lossless compression that balance space, throughput, and energy efficiency. They further introduce a green benchmarking methodology integrable into CI/CD pipelines. Experimental results demonstrate that high-ratio compression can yield order-of-magnitude improvements in both retrieval throughput and energy efficiency. While data parallelism significantly accelerates processing, its energy-efficiency gains are constrained by the non-linear power characteristics of modern hardware, revealing a scalability bottleneck. This work challenges the conventional assumption of linear correlation between time and energy consumption, offering a new paradigm for green software engineering.
Abstract
Retrieving data from large-scale source code archives is vital for AI training, neural-based software analysis, and information retrieval, to name a few. This paper studies and experiments with the design of a compressed key-value store for indexing large-scale source code datasets, evaluating its trade-offs among three primary computational resources: (compressed) space occupancy, time, and energy efficiency. Extensive experiments on a national high-performance computing infrastructure demonstrate that different compression configurations yield distinct trade-offs, with high compression ratios delivering order-of-magnitude gains in retrieval throughput and energy efficiency. We also study data parallelism and show that, while it significantly improves speed, scaling energy efficiency is harder, reflecting the known non-energy-proportionality of modern hardware and challenging the assumption of a direct time-energy correlation. This work enables automated, energy-aware configuration tuning and standardized green benchmarking deployable in CI/CD pipelines, thus empowering system architects with a spectrum of Pareto-optimal energy-compression-throughput trade-offs and actionable guidelines for building sustainable, efficient storage backends for massive open-source code archives.
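The abstract does not describe the system's implementation. As a toy illustration of the space-throughput trade-off it studies, the sketch below (hypothetical names, standard library only, not the paper's actual store) wraps a dictionary of zlib-compressed values and compares compression levels on synthetic, highly repetitive "source files":

```python
import time
import zlib

# Minimal sketch of a compressed key-value store: each value is kept
# zlib-compressed in memory; the compression level trades space for
# (de)compression work. This is an illustrative toy, not the paper's system.
class CompressedKVStore:
    def __init__(self, level=6):
        self.level = level          # zlib compression level (1 = fast, 9 = small)
        self._data = {}             # key -> compressed bytes

    def put(self, key, value: bytes):
        self._data[key] = zlib.compress(value, self.level)

    def get(self, key) -> bytes:
        return zlib.decompress(self._data[key])

    def compressed_size(self) -> int:
        return sum(len(v) for v in self._data.values())

# Synthetic "source files": repetitive text, which compresses well.
docs = {f"file{i}": b"def f():\n    return 42\n" * 50 for i in range(200)}

for level in (1, 6, 9):
    store = CompressedKVStore(level)
    for k, v in docs.items():
        store.put(k, v)
    t0 = time.perf_counter()
    for k in docs:
        store.get(k)
    elapsed = time.perf_counter() - t0
    print(f"level={level}  bytes={store.compressed_size()}  "
          f"gets/s={len(docs) / elapsed:.0f}")
```

Extending this comparison with an energy measurement (e.g., reading hardware energy counters around the retrieval loop) would reproduce, in miniature, the kind of space-time-energy profiling the paper automates.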