Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework

📅 2025-03-11

🤖 AI Summary

Problem: Multimodal large language models (MLLMs) reason unreliably in visual question answering (VQA), and agentic approaches that call external tools risk incorporating unreliable tool outputs. Method: This paper proposes SRICE, a training-free, uncertainty-aware agentic framework. It uses conformal prediction to calibrate the outputs of external vision models, and it selects among region-based reasoning tools across multiple interaction stages using uncertainty estimates derived from the MLLM's own outputs. The method combines external vision models, uncertainty quantification (UQ), autonomous region-of-interest selection, and uncertainty-driven tool selection. Contribution/Results: Across five datasets, SRICE improves on the base MLLM by 4.6% on average, and on some datasets it even outperforms fine-tuning-based methods. The results indicate that reliable, uncertainty-guided tool use substantially improves MLLM reasoning, which underlines the role of calibrated uncertainty in multimodal agents.

📝 Abstract

Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA) but still face challenges in multimodal reasoning. Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. However, CoT-based multimodal reasoning often demands costly data annotation and fine-tuning, while agentic approaches that rely on external tools risk introducing unreliable outputs from those tools. In this paper, we propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework that integrates external vision models with uncertainty quantification (UQ) into an MLLM to address these challenges. Specifically, SRICE guides the inference process by allowing the MLLM to autonomously select regions of interest through multi-stage interactions with the help of external tools. We propose a conformal prediction-based approach to calibrate the output of external tools, and we select the optimal tool by estimating the uncertainty of the MLLM's output. Our experiments show that SRICE improves on the base MLLM by 4.6% on average across five datasets, and on some datasets it even outperforms fine-tuning-based methods, revealing the significance of ensuring reliable tool use in an MLLM agent.
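To make the idea of selecting a tool from the MLLM's output uncertainty concrete, here is a minimal sketch. The abstract does not say which uncertainty measure SRICE uses, so Shannon entropy over the answer distribution is an assumption, and the tool names and the `select_tool` helper are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_tool(answer_dists):
    """Pick the tool whose output leaves the MLLM least uncertain.

    answer_dists maps a (hypothetical) tool name to the MLLM's probability
    distribution over candidate answers when it is given that tool's output.
    Entropy as the uncertainty measure is an assumption of this sketch.
    """
    return min(answer_dists, key=lambda name: entropy(answer_dists[name]))


# Hypothetical distributions over two candidate answers.
dists = {
    "detector": [0.9, 0.1],   # confident answer
    "segmenter": [0.5, 0.5],  # maximally uncertain answer
}
best = select_tool(dists)
```

In this toy input, `select_tool` prefers `detector` because its answer distribution has lower entropy than the uniform one produced with `segmenter`.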

Problem

Research questions and friction points this paper is trying to address.

Improves multimodal reasoning in MLLMs without costly fine-tuning.

Addresses unreliable outputs from external tools in agentic frameworks.

Enhances MLLM performance by integrating external vision models with uncertainty quantification.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates external vision models with uncertainty quantification

Uses conformal prediction for tool output calibration

Autonomously selects regions of interest via multi-stage interactions
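As a rough illustration of how conformal prediction can calibrate a tool's output, the sketch below implements standard split conformal prediction for a classifier-style tool. This is the textbook procedure, not necessarily the variant used in SRICE; the nonconformity score (one minus the probability of the true label) and the function names are assumptions.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the score threshold q_hat from a held-out calibration set.

    cal_probs: list of per-class probability lists from the tool.
    cal_labels: list of true class indices.
    alpha: target miscoverage rate (the sets aim for 1 - alpha coverage).
    """
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = sorted(1.0 - probs[label]
                    for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every label.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """All labels whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]


# Toy calibration data: the true label is always class 0.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
cal_labels = [0, 0, 0, 0]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident = prediction_set([0.7, 0.2, 0.1], q_hat)  # single label
ambiguous = prediction_set([0.5, 0.5, 0.0], q_hat)  # two labels
```

A larger prediction set signals that the tool's output is less trustworthy for that input, which is the kind of signal an agent could use before passing the output on to the MLLM.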
