Taming Object Hallucinations with Verified Atomic Confidence Estimation

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) frequently exhibit hallucinations in object existence, attribute, and relational reasoning, undermining output reliability. To address this, we propose TACO, a framework that decomposes model responses into verifiable atomic queries via **atomic query decomposition**, reduces wording sensitivity through **sentence rewriting**, and estimates confidence via **self-consistency (black-box) or self-confidence (gray-box) aggregation** before refining answers with a language model, all without external visual experts. TACO effectively mitigates systematic biases and consistently outperforms direct prompting and visual contrastive decoding across five benchmarks. It improves both faithfulness, measured by alignment between generated outputs and visual evidence, and confidence calibration. By enabling lightweight, general-purpose, and interpretable hallucination mitigation, TACO establishes a novel paradigm for enhancing MLLM reliability.

📝 Abstract
Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (LLaVA-1.5-7B and CogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
Problem

Research questions and friction points this paper is trying to address.

Mitigating object hallucinations in Multimodal Large Language Models
Reducing errors in object existence, attributes, and relations
Improving confidence calibration without external vision experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes responses into atomic queries for verification
Paraphrases queries to reduce wording sensitivity
Estimates confidence via self-consistency or self-confidence aggregation
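The black-box confidence-estimation step described above can be sketched as a simple vote-counting routine: sample the model's yes/no answer to several paraphrases of one atomic query, then take the fraction of answers agreeing with the majority as the confidence. This is a minimal illustration, not the paper's implementation; the function name and the example answers are hypothetical.

```python
from collections import Counter

def self_consistency_confidence(answers):
    """Aggregate repeated yes/no answers to one atomic query.

    Returns (majority_answer, confidence), where confidence is the
    fraction of sampled answers that agree with the majority vote.
    """
    counts = Counter(a.strip().lower() for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)

# Hypothetical answers sampled from an MLLM over paraphrased variants
# of a single atomic query (e.g. "Is there a dog in the image?").
answer, conf = self_consistency_confidence(["yes", "yes", "no", "yes", "yes"])
# answer == "yes", conf == 0.8
```

A low confidence here would flag the atomic claim for refinement by the language model rather than being asserted directly.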