🤖 AI Summary
This study addresses the overlooked trade-off between accuracy gains and environmental costs in ensemble-based recommender systems. Through 93 controlled experiments, it systematically quantifies the balance between predictive performance and energy consumption (translated into CO₂ equivalents) for four ensemble strategies across explicit rating prediction and implicit feedback ranking tasks. The experimental pipeline, built on Surprise and LensKit, integrates whole-system power measurements via EMERS and a smart plug to capture real-world energy use. Results reveal that while ensemble methods yield accuracy improvements of 0.3%–5.7%, they incur substantial energy overheads ranging from 19% to 2,549%. Notably, selective ensembling demonstrates superior efficiency, achieving comparable or better accuracy than full-model averaging while significantly reducing energy consumption and associated carbon emissions.
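The overhead percentages and CO₂-equivalent figures quoted here follow from simple arithmetic, which the sketch below illustrates. It is a minimal sketch, not the authors' pipeline: the function names and the carbon intensity of 250 g CO₂e/kWh are hypothetical, chosen only for illustration, and the energy inputs are the rounded Anime figures from the abstract (0.01 Wh for the single model, 0.21 Wh for the ensemble).

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def energy_to_co2e_mg(energy_wh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy in Wh to mg CO2e for a given grid carbon intensity.

    Wh -> kWh divides by 1000 and g -> mg multiplies by 1000, so the two
    conversion factors cancel numerically; they are kept for readability.
    """
    return energy_wh / 1000.0 * intensity_g_per_kwh * 1000.0


# Rounded energy figures from the abstract (Anime dataset, Surprise pipeline).
single_model_wh = 0.01
ensemble_wh = 0.21

# ASSUMPTION: hypothetical grid carbon intensity, not taken from the paper.
ASSUMED_INTENSITY_G_PER_KWH = 250.0

overhead = relative_change_pct(single_model_wh, ensemble_wh)
emissions = energy_to_co2e_mg(ensemble_wh, ASSUMED_INTENSITY_G_PER_KWH)
print(f"energy overhead: {overhead:.0f}%")
print(f"ensemble emissions: {emissions:.1f} mg CO2e (assumed intensity)")
```

Because the inputs are rounded, the computed overhead differs slightly from the reported 2,005%, which presumably derives from unrounded measurements.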
📝 Abstract
Ensemble methods are frequently used in recommender systems to improve accuracy by combining multiple models. Recent work reports sizable performance gains, but most studies still optimize primarily for accuracy and robustness rather than for energy efficiency. This paper measures the accuracy-energy trade-offs of ensemble techniques relative to strong single models. We run 93 controlled experiments in two pipelines: (1) explicit rating prediction with Surprise (RMSE) and (2) implicit feedback ranking with LensKit (NDCG@10). We evaluate four datasets ranging from 100,000 to 7.8 million interactions (MovieLens 100K, MovieLens 1M, ModCloth, Anime). We compare four ensemble strategies (Average, Weighted, Stacking or Rank Fusion, Top Performers) against baselines and optimized single models. Whole-system energy is measured with EMERS using a smart plug and converted to CO2 equivalents. Across settings, ensembles improve accuracy by 0.3% to 5.7% while increasing energy by 19% to 2,549%. On MovieLens 1M, a Top Performers ensemble improves RMSE by 0.96% at an 18.8% energy overhead over SVD++. On MovieLens 100K, an averaging ensemble improves NDCG@10 by 5.7% with 103% additional energy. On Anime, a Surprise Top Performers ensemble improves RMSE by 1.2% but consumes 2,005% more energy (0.21 vs. 0.01 Wh), increasing emissions from 2.6 to 53.8 mg CO2 equivalents, while LensKit ensembles fail due to memory limits. Overall, selective ensembles are more energy-efficient than exhaustive averaging,