🤖 AI Summary
Current vision-language models struggle to estimate a comprehensive micronutrient profile from food images, often refusing to respond or producing hallucinated values. This work proposes dietary recall-driven synthetic supervision: a decade of population-scale 24-hour dietary recall data is repurposed as prompts for text-to-image generation, producing approximately 1.1 million image-caption-nutrient triplets, each labeled with 65 nutrients. To the authors' knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation. The authors fine-tune multimodal large language models, including Qwen3-VL and GLM-4.6V-Flash, on this dataset, yielding the NutriMLLM series. The models are evaluated along four dimensions: refusal rate, hallucination control, usability, and numerical accuracy. On real-world food images, every NutriMLLM variant achieves near-complete coverage across the 65 nutrients, and the largest variant matches or exceeds closed-source models such as GPT-5, Gemini 3, and Claude Sonnet 4.5 in accuracy on most nutrients.
📝 Abstract
Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.
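The four-component evaluation named in the abstract (abstention, hallucination, overall usability, and per-nutrient numerical accuracy) can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so everything below is an assumption made for illustration: an abstention is modeled as a missing (`None`) value, a hallucination as a value outside a fixed plausibility range, usability as the share of values that are neither, and accuracy as mean absolute error over usable values. The function name `score_predictions`, the nutrient keys, and the numeric bounds are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical plausibility bounds per nutrient (illustrative numbers only,
# not taken from the paper or from any reference database).
EXAMPLE_BOUNDS: Dict[str, Tuple[float, float]] = {
    "vitamin_c_mg": (0.0, 2000.0),
    "iron_mg": (0.0, 100.0),
}


def score_predictions(
    preds: List[Dict[str, Optional[float]]],
    truths: List[Dict[str, float]],
    bounds: Dict[str, Tuple[float, float]],
) -> dict:
    """Score model outputs on four assumed components.

    preds  -- one dict per image; None (or a missing key) means the model abstained
    truths -- one dict per image with the reference value for each nutrient
    bounds -- per-nutrient (low, high) plausibility range (an assumed proxy
              for "statistically implausible" values)
    """
    total = abstained = implausible = usable = 0
    abs_errors: Dict[str, List[float]] = {name: [] for name in bounds}

    for pred, truth in zip(preds, truths):
        for name, (low, high) in bounds.items():
            total += 1
            value = pred.get(name)
            if value is None:
                abstained += 1  # component 1: abstention
                continue
            if not (low <= value <= high):
                implausible += 1  # component 2: hallucination (assumed definition)
                continue
            usable += 1  # component 3: usable = answered and plausible
            abs_errors[name].append(abs(value - truth[name]))

    def rate(count: int) -> float:
        return count / total if total else 0.0

    # Component 4: per-nutrient mean absolute error over usable values only.
    mae = {
        name: (sum(errs) / len(errs) if errs else None)
        for name, errs in abs_errors.items()
    }
    return {
        "abstention_rate": rate(abstained),
        "hallucination_rate": rate(implausible),
        "usability_rate": rate(usable),
        "mae": mae,
    }


if __name__ == "__main__":
    demo_preds = [
        {"vitamin_c_mg": 50.0, "iron_mg": None},    # abstains on iron
        {"vitamin_c_mg": 5000.0, "iron_mg": 4.0},   # implausible vitamin C
    ]
    demo_truths = [
        {"vitamin_c_mg": 60.0, "iron_mg": 2.0},
        {"vitamin_c_mg": 30.0, "iron_mg": 3.0},
    ]
    print(score_predictions(demo_preds, demo_truths, EXAMPLE_BOUNDS))
```

Keeping the components separate, as the abstract describes, means a model cannot look accurate simply by answering only the easy nutrients: under these assumed definitions its error would be computed on fewer values, but its abstention rate would rise accordingly.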