Efficient Content-based Recommendation Model Training via Noise-aware Coreset Selection

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high computational cost of training large-scale recommender systems and the performance degradation caused by noise in user-item interactions under conventional coreset selection methods. The authors propose NaCS, a noise-aware coreset selection framework that, for the first time, integrates label noise correction and uncertainty awareness into coreset construction for recommendation. NaCS employs gradient-driven submodular optimization to select representative samples while incorporating progressive label correction and uncertainty quantification to filter low-confidence interactions. Evaluated across multiple benchmarks, the method achieves 93–95% of the full-data training performance using only 1% of the original data, significantly outperforming existing coreset approaches in terms of efficiency, robustness, and training quality.
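The summary's "gradient-driven submodular optimization" step can be illustrated with a greedy facility-location sketch over per-sample gradient features. This is an illustrative assumption, not NaCS's actual implementation: the function name, the cosine-similarity kernel, and the plain greedy optimizer are all hypothetical stand-ins for whatever objective the paper uses.

```python
import numpy as np

def facility_location_coreset(grads, k):
    """Greedy facility-location selection over per-sample gradient
    features: pick k samples whose gradients best "cover" the full
    set. Illustrative sketch only; NaCS's exact submodular objective
    and optimizer may differ."""
    # Cosine similarity between the gradient vectors of all sample pairs.
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    sim = np.clip(unit @ unit.T, 0.0, None)  # keep the kernel nonnegative
    n = sim.shape[0]
    best = np.zeros(n)   # each point's similarity to its closest selected sample
    selected = []
    for _ in range(k):
        # Marginal gain of adding each candidate j to the selected set.
        gains = np.maximum(sim, best).sum(axis=1) - best.sum()
        gains[selected] = -np.inf  # never re-pick a sample
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[j])
    return selected
```

On toy data with two gradient clusters, the greedy rule picks one representative per cluster, which is the coverage behavior coreset selection relies on.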

📝 Abstract
Content-based recommendation systems (CRSs) utilize content features to predict user-item interactions, serving as essential tools for helping users navigate information-rich web services. However, ensuring the effectiveness of CRSs requires large-scale and even continuous model training to accommodate diverse user preferences, resulting in significant computational costs and resource demands. A promising approach to this challenge is coreset selection, which identifies a small but representative subset of data samples that preserves model quality while reducing training overhead. Yet, the selected coreset is vulnerable to the pervasive noise in user-item interactions, particularly when it is minimally sized. To this end, we propose Noise-aware Coreset Selection (NaCS), a specialized framework for CRSs. NaCS constructs coresets through submodular optimization based on training gradients, while simultaneously correcting noisy labels using a progressively trained model. Meanwhile, we refine the selected coreset by filtering out low-confidence samples through uncertainty quantification, thereby avoiding training with unreliable interactions. Through extensive experiments, we show that NaCS produces higher-quality coresets for CRSs while achieving better efficiency than existing coreset selection techniques. Notably, NaCS recovers 93-95% of full-dataset training performance using merely 1% of the training data. The source code is available at https://github.com/chenxing1999/nacs.
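The abstract's label-correction and uncertainty-filtering steps can be sketched as follows. This is a hedged approximation under stated assumptions: the thresholds, the "flip a label when the model confidently disagrees" rule, and the binary-entropy confidence measure are hypothetical choices for illustration, not the paper's exact formulation.

```python
import numpy as np

def correct_and_filter(labels, probs, flip_thresh=0.9, entropy_thresh=0.6):
    """Illustrative sketch of progressive label correction plus
    uncertainty-based filtering; thresholds are hypothetical.
    labels: 0/1 interaction labels; probs: model-predicted P(label=1)."""
    labels = labels.copy()
    # Correction: flip a label when the model confidently disagrees with it.
    labels[(probs > flip_thresh) & (labels == 0)] = 1
    labels[(probs < 1.0 - flip_thresh) & (labels == 1)] = 0
    # Filtering: drop samples whose binary predictive entropy is high,
    # i.e. low-confidence interactions the coreset should not contain.
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)
                + (1.0 - probs) * np.log(1.0 - probs + eps))
    keep = entropy < entropy_thresh
    return labels, keep
```

A sample near `probs = 0.5` has entropy around `log 2 ≈ 0.69`, so it is filtered out, while confidently predicted samples survive with possibly corrected labels.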
Problem

Research questions and friction points this paper is trying to address.

content-based recommendation
coreset selection
noise
computational efficiency
user-item interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-aware Coreset Selection
Submodular Optimization
Uncertainty Quantification
Content-based Recommendation
Gradient-based Sampling
Hung Vinh Tran
Department of Mathematics, University of Wisconsin Madison
PDE
Tong Chen
The University of Queensland, Brisbane, Queensland, Australia
Hechuan Wen
The University of Queensland, Brisbane, Queensland, Australia
Quoc Viet Hung Nguyen
Griffith University, Gold Coast, Queensland, Australia
Bin Cui
Peking University, Beijing, China
Hongzhi Yin
Professor and ARC Future Fellow, University of Queensland
Recommender System, Graph Learning, Spatial-temporal Prediction, Edge Intelligence, LLM