🤖 AI Summary
Food image recognition is hindered by high inter-class similarity, intra-class variability, and the structural complexity of food images. To address this, the study applies NoisyViT, a vision transformer that integrates controllable noise injection into the ViT architecture. During training, the injected noise adjusts the entropy of the system, reducing effective task complexity and enhancing robustness, and the model is fine-tuned with this noise-augmented objective to improve generalization. Evaluated on three benchmarks (Food2K, Food-101, and CNFOOD-241), NoisyViT achieves state-of-the-art Top-1 accuracies of 95.0%, 99.5%, and 96.6%, respectively, surpassing existing methods. These results demonstrate the effectiveness and broad applicability of noise-driven visual representation learning for fine-grained food recognition.
📝 Abstract
Food image recognition is a challenging task in computer vision due to the high variability and complexity of food images. In this study, we investigate the potential of Noisy Vision Transformers (NoisyViT) for improving food classification performance. By introducing noise into the learning process, NoisyViT reduces task complexity and adjusts the entropy of the system, leading to enhanced model accuracy. We fine-tune NoisyViT on three benchmark datasets: Food2K (2,000 categories, ~1M images), Food-101 (101 categories, ~100K images), and CNFOOD-241 (241 categories, ~190K images). The performance of NoisyViT is evaluated against state-of-the-art food recognition models. Our results demonstrate that NoisyViT achieves Top-1 accuracies of 95.0%, 99.5%, and 96.6% on Food2K, Food-101, and CNFOOD-241, respectively, significantly outperforming existing approaches. This study underscores the potential of NoisyViT for dietary assessment, nutritional monitoring, and healthcare applications, paving the way for future advancements in vision-based food computing. Code for reproducing NoisyViT for food recognition is available at NoisyViT_Food.
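The abstract describes injecting noise into the learning process to adjust the entropy of the system. A minimal NumPy sketch of that general idea is shown below; the function names, the noise-strength parameter `alpha`, and the use of additive Gaussian noise on latent features are illustrative assumptions, not the paper's actual implementation, which should be taken from the released NoisyViT_Food code.

```python
import numpy as np

def inject_noise(features, alpha=0.1, rng=None):
    """Add scaled Gaussian noise to latent features during training.

    `alpha` is a hypothetical noise-strength hyperparameter; at
    inference time the features would be used unmodified.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(features.shape)
    return features + alpha * noise

def softmax_entropy(logits):
    """Per-sample Shannon entropy of the softmax distribution.

    Tracking this quantity is one way to observe how noise injection
    shifts the entropy of the model's predictions.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# Toy usage: perturb a batch of token features and inspect prediction entropy.
features = np.random.default_rng(1).standard_normal((4, 768))  # 4 tokens, ViT-Base width
noisy = inject_noise(features, alpha=0.1)
logits = np.random.default_rng(2).standard_normal((4, 101))    # e.g. Food-101 classes
print(softmax_entropy(logits).mean())
```

The sketch separates the noise operator from the entropy measurement so that either can be swapped out; in the actual model the injection would sit inside the transformer's forward pass rather than operate on raw arrays.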