AI Summary
Existing food vision-language models suffer from limited reasoning and generalization capabilities due to the scarcity of high-quality nutritional annotations and reliance on supervised fine-tuning. To address this, this work proposes Food-R1, a unified multitask food vision-language model, and introduces CalorieBench-80K, a large-scale food image benchmark featuring chain-of-thought annotations. The approach integrates chain-of-thought cold-start instruction tuning with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). This strategy substantially enhances the model's performance across diverse food analysis tasks, consistently outperforming strong baselines on both CalorieBench-80K and multiple established benchmarks, thereby demonstrating the efficacy of combining reinforcement learning with multitask learning in food vision-language modeling.
Abstract
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
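As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several responses sampled for the same prompt against that group's own mean and standard deviation, so no separate value network is needed. It is not Food-R1's training code: the function name, the `eps` constant, the use of population standard deviation, and the example reward values are all assumptions made for illustration, and the abstract does not specify the reward functions used.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per response, normalized within its group.

    `rewards` holds the scalar rewards of responses sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one food-image prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones; a group whose rewards are all equal yields zero advantages and therefore no learning signal for that prompt.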