🤖 AI Summary
Real-world datasets commonly suffer from uneven data quality and sample redundancy. Existing dataset pruning methods largely rely on static heuristics or task-specific metrics, which limits their generalizability and robustness. To address this, we propose a dynamic dataset pruning framework that jointly models task difficulty estimation and cross-modal semantic consistency, to our knowledge the first such integration. Using pre-trained multimodal foundation models (e.g., CLIP, Flamingo), the method derives fine-grained semantic-alignment signals that guide adaptive sample selection. It requires no human annotations or task-specific fine-tuning and supports cross-domain transfer. Experiments on multiple vision and multimodal benchmarks show consistent improvements: +1.2–2.8% accuracy gain, 1.3–1.7× training speedup, and enhanced out-of-distribution robustness. Our approach offers a general, efficient, and scalable paradigm for data-centric learning.
📝 Abstract
Modern deep models are trained on large real-world datasets, where data quality varies and redundancy is common. Data-centric approaches such as dataset pruning have shown promise in improving training efficiency and model performance. However, most existing methods rely on static heuristics or task-specific metrics, limiting their robustness and generalizability across domains. In this work, we introduce a dynamic dataset pruning framework that adaptively selects training samples based on both task-driven difficulty and cross-modality semantic consistency. By incorporating supervision from pretrained multimodal foundation models, our approach captures training dynamics while effectively filtering out uninformative samples. Our work highlights the potential of integrating cross-modality alignment into sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
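
The abstract does not spell out the scoring rule, so the following is only a minimal sketch of how a difficulty signal and a cross-modal consistency signal might be combined for one pruning step. Everything here is an assumption made for illustration: the function names, the use of per-sample loss as the difficulty proxy, the cosine similarity between precomputed image and text embeddings (as a model such as CLIP could provide) as the consistency proxy, and the weighted sum with `alpha`. It is not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_samples(losses, image_embs, text_embs, keep_ratio=0.5, alpha=0.5):
    """Return the sorted indices of the samples to keep for the next epoch.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = alpha * normalized loss + (1 - alpha) * normalized image-text similarity.
    """
    difficulty = min_max(losses)
    consistency = min_max([cosine(i, t) for i, t in zip(image_embs, text_embs)])
    scores = [alpha * d + (1 - alpha) * c for d, c in zip(difficulty, consistency)]
    n_keep = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return sorted(ranked[:n_keep])


# Toy example: sample 2 has a high loss but its image and text embeddings
# disagree, so it is treated as likely noise rather than a useful hard sample.
losses = [0.1, 2.0, 1.5, 0.2]
image_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
kept = select_samples(losses, image_embs, text_embs, keep_ratio=0.5)
```

Calling `select_samples` again each epoch with the current losses would make the selection dynamic rather than a one-off static filter. The toy example shows why a second signal can matter: ranking by loss alone would keep the misaligned sample 2.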