Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

📅 2026-06-03

🤖 AI Summary

Existing food vision-language models show limited reasoning and generalization because high-quality nutritional annotations are scarce and training relies mainly on supervised fine-tuning. To address this, the work proposes Food-R1, a unified multi-task food vision-language model, and introduces CalorieBench-80K, a large-scale food image benchmark with curated calorie labels, dietary advice, and chain-of-thought annotations. Training combines chain-of-thought cold-start instruction tuning with reinforcement fine-tuning based on Group Relative Policy Optimization (GRPO). According to the authors, Food-R1 consistently outperforms strong baselines on CalorieBench-80K and on other representative benchmarks, which supports combining reinforcement learning with multi-task learning for food vision-language modeling.

📝 Abstract

Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
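The abstract names GRPO as the reinforcement fine-tuning step but gives no details. As a rough illustration of the core idea only, the sketch below computes group-relative advantages: several answers are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group, so no separate value network is needed. The function name, the example rewards, the use of the population standard deviation, and the `eps` constant are all assumptions made for illustration; the reward design actually used for Food-R1 is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scores for several answers sampled from the SAME prompt.
    eps: small constant (assumed value) that avoids division by zero
         when every answer in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one food-image prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Answers scoring above the group mean get a positive advantage,
# answers below it get a negative one, and the group sums to about zero.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

In a full training loop these advantages would weight the policy-gradient update for each sampled answer, typically alongside a clipped probability ratio and a KL penalty toward a reference model; those parts are omitted here.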

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

food analysis

nutritional annotations

reasoning

generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Model

Reinforcement Fine-Tuning

Chain-of-Thought