Mitigating Open-Vocabulary Caption Hallucinations

📅 2023-12-06

🏛️ Conference on Empirical Methods in Natural Language Processing

📈 Citations: 6

✨ Influential: 2

🤖 AI Summary

This work addresses hallucinations in image captioning, that is, generated details that cannot be inferred from the image, in the open-vocabulary setting. Existing methods mostly rely on closed-vocabulary object lists and therefore miss the long tail of hallucinations that occur in practice. The authors introduce OpenCHAIR, an open-vocabulary hallucination evaluation benchmark that is built with generative foundation models and is reported to surpass the similarly sized CHAIR benchmark in both diversity and accuracy. Building on it, they propose MOCHa, a reinforcement learning framework optimized with PPO that needs neither a closed object list nor strong supervision, and whose multi-objective reward explicitly balances fidelity and adequacy. Experiments show that MOCHa improves multiple captioning models on both OpenCHAIR and conventional metrics (CIDEr, SPICE). The authors state that code and models will be released.

📝 Abstract

While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. We will release our code and models.
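To make the quantity being evaluated concrete, here is a minimal sketch of a CHAIR-style per-caption hallucination rate: the fraction of objects mentioned in a caption that are not actually in the image. The abstract does not describe how OpenCHAIR extracts objects or decides whether an object is present, so the `hallucination_rate` function and its `is_present` callable are hypothetical stand-ins for illustration, not the authors' implementation.

```python
from typing import Callable, Iterable


def hallucination_rate(
    mentioned_objects: Iterable[str],
    is_present: Callable[[str], bool],
) -> float:
    """Fraction of mentioned objects judged absent from the image.

    `is_present` is a hypothetical judge. A closed-vocabulary setup would
    check a fixed object list; an open-vocabulary setup needs a judge that
    can handle arbitrary object names.
    """
    objects = list(mentioned_objects)
    if not objects:
        # A caption that mentions no objects has nothing to hallucinate.
        return 0.0
    hallucinated = [obj for obj in objects if not is_present(obj)]
    return len(hallucinated) / len(objects)


# Toy example: the "image" is represented by the set of objects it contains.
visible = {"dog", "frisbee", "grass"}
rate = hallucination_rate(["dog", "frisbee", "leash"], lambda obj: obj in visible)
print(f"{rate:.2f}")  # one of three mentioned objects ("leash") is absent
```

In this toy case the judge is a set lookup, which is exactly the closed-vocabulary limitation the abstract describes: any object outside the set is treated as absent. Swapping in a more general judge changes only the `is_present` argument.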

Problem

Research questions and friction points this paper is trying to address.

Image captions hallucinate spurious details that cannot be inferred from the image

Closed-vocabulary object lists miss the long tail of hallucinations seen in practice

Evaluate and mitigate caption hallucinations in the open-vocabulary setting

Innovation

Methods, ideas, or system contributions that make the work stand out.

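Introduces OpenCHAIR, an open-vocabulary object hallucination benchmark for image captioning

Leverages generative foundation models to evaluate open-vocabulary hallucinations

Proposes MOCHa, a reinforcement learning approach whose multi-objective reward balances fidelity and adequacy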
