Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) often generate image captions that lack fine-grained visual detail and are prone to hallucination, limiting performance in cross-modal tasks. To address this, we propose a training-free divide-then-aggregate framework grounded in Feature-Integration theory. First, images are partitioned into semantic and spatial patches to enhance local visual perception. Second, these local details are hierarchically aggregated into a comprehensive global description. Third, a semantic-level filtering mechanism applied during aggregation suppresses hallucinations and semantic inconsistencies. The framework is plug-and-play with mainstream MLLMs, including LLaVA, GPT-4o, and Claude-3.5-Sonnet, and requires no architectural modification or fine-tuning. Evaluated across multiple benchmarks, the method significantly improves caption detail richness and factual consistency, surpassing state-of-the-art approaches at zero training cost, and the code is fully open-sourced.

📝 Abstract
High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions, many recent studies have employed multimodal large language models (MLLMs). However, current MLLMs often produce captions that lack fine-grained details or suffer from hallucinations, a challenge that persists in both open-source and closed-source models. Inspired by Feature-Integration theory, which suggests that attention must focus on specific regions to integrate visual information effectively, we propose a divide-then-aggregate strategy. Our method first divides the image into semantic and spatial patches to extract fine-grained details, enhancing the model's local perception of the image. These local details are then hierarchically aggregated to generate a comprehensive global description. To address hallucinations and inconsistencies in the generated captions, we apply a semantic-level filtering process during hierarchical aggregation. This training-free pipeline can be applied to both open-source models (LLaVA-1.5, LLaVA-1.6, Mini-Gemini) and closed-source models (Claude-3.5-Sonnet, GPT-4o, GLM-4V-Plus). Extensive experiments demonstrate that our method generates more detailed, reliable captions, advancing multimodal description generation without requiring model retraining. The source code is available at https://github.com/GeWu-Lab/Patch-Matters
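
For intuition, here is a minimal Python sketch of the divide-then-aggregate idea. It is not the authors' implementation: the uniform grid stands in for the paper's semantic and spatial partitioning, the image is assumed to be a PIL-style object, and `mllm.caption`, `mllm.is_consistent`, and `mllm.merge` are hypothetical wrappers around whichever MLLM backend (open- or closed-source) is used.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def spatial_patches(width: int, height: int, grid: int = 2) -> List[Box]:
    """Split an image into a uniform grid of spatial patches (a stand-in for the
    paper's semantic and spatial co-partitioning)."""
    pw, ph = width // grid, height // grid
    return [(c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
            for r in range(grid) for c in range(grid)]


def divide_then_aggregate(image, mllm) -> str:
    """Caption local patches, filter inconsistent details, then aggregate globally."""
    # 1) Divide: enhance local perception by captioning each patch separately.
    patches = [image.crop(box) for box in spatial_patches(image.width, image.height)]
    local_captions = [mllm.caption(p) for p in patches]

    # 2) Filter: drop patch-level claims that conflict with the whole-image caption.
    global_caption = mllm.caption(image)
    kept = [c for c in local_captions if mllm.is_consistent(c, global_caption)]

    # 3) Aggregate: hierarchically merge surviving local details into one description.
    return mllm.merge(global_caption, kept)
```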
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained details in image captions
Reducing hallucinations in multimodal caption generation
Improving local perception without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Divide-then-aggregate strategy for fine details
Semantic and spatial patches enhance local perception
Semantic-level filtering reduces hallucinations
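
The semantic-level filtering step listed above is described only at a high level in the abstract, so the sketch below substitutes a simple sentence-embedding similarity check: patch-level sentences whose similarity to the global caption falls below a threshold are discarded. The embedding model (all-MiniLM-L6-v2) and the threshold are illustrative assumptions, not the authors' choices.

```python
from sentence_transformers import SentenceTransformer, util


def filter_patch_sentences(global_caption: str, patch_sentences: list,
                           threshold: float = 0.3) -> list:
    """Keep only patch-level sentences semantically consistent with the global caption.

    This is an assumed stand-in for the paper's filtering mechanism, based on
    cosine similarity between sentence embeddings.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    anchor = model.encode(global_caption, convert_to_tensor=True)
    kept = []
    for sentence in patch_sentences:
        emb = model.encode(sentence, convert_to_tensor=True)
        if util.cos_sim(anchor, emb).item() >= threshold:
            kept.append(sentence)
    return kept
```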