Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

📅 2026-06-03
🤖 AI Summary
Existing food vision-language models have limited reasoning and generalization capabilities because high-quality nutritional annotations are scarce and training relies on supervised fine-tuning. To address this, the work proposes Food-R1, a unified multi-task food vision-language model, and introduces CalorieBench-80K, a large-scale food image benchmark with chain-of-thought annotations. Training combines chain-of-thought cold-start instruction tuning with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). The resulting model consistently outperforms strong baselines on CalorieBench-80K and on multiple established benchmarks across diverse food analysis tasks, which the authors present as evidence that combining reinforcement learning with multi-task learning is effective for food vision-language modeling.
πŸ“ Abstract
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
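To make the GRPO step more concrete, the sketch below shows the group-relative advantage that gives the method its name: several answers are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the Food-R1 training code. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration; the abstract does not specify the reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation is used here; actual
    implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one food-image prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Answers that score above their group average receive a positive advantage and are reinforced, while below-average answers are discouraged, so no separate learned value model is needed to form the baseline.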
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
food analysis
nutritional annotations
reasoning
generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model
Reinforcement Fine-Tuning
Chain-of-Thought
Multi-Task Learning
Calorie Reasoning