Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

📅 2024-09-11

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address inefficient inference in knowledge-based visual question answering (KB-VQA) caused by redundant external knowledge fed to large models, this paper proposes a retrieval-augmented KV cache compression mechanism. It retrieves relevant knowledge for each image-question pair and compresses it into a compact KV cache that modulates a frozen multimodal large language model (MLLM), so the MLLM itself needs no fine-tuning. The method balances efficiency and generalization, supporting diverse MLLM architectures and heterogeneous knowledge sources, both textual and multimodal. On the OK-VQA benchmark it achieves 63.92% accuracy, a new state of the art, while reducing inference latency by 22.0%–59.7% relative to RAVQA-v2. To our knowledge, this is the first framework unifying high-accuracy KB-VQA with low computational overhead, cross-model compatibility, and cross-modal knowledge adaptation.

📝 Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance on visual question answering (VQA). In knowledge-based visual question answering (KB-VQA), however, MLLMs may lack the specialized domain knowledge needed to answer questions, so the necessary information must be retrieved from external knowledge sources. Previous works such as Retrieval-Augmented VQA-v2 (RAVQA-v2) focus on using as much input information as possible, such as image-based textual descriptions and retrieved knowledge, to improve performance. They overlook the fact that inference efficiency drops significantly as the number of input tokens grows, which conflicts with the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLMs with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved knowledge for a given image-question pair, generating a compact modulation in the form of a Key-Value (KV) cache to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 63.92% on OK-VQA. Moreover, it reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Extensive experiments show RACC's broad applicability: it is compatible with various off-the-shelf MLLMs and can handle different knowledge sources, including textual and multimodal documents.
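
The core idea in the abstract, replacing a long retrieved context with a short KV cache that a frozen model attends to, can be illustrated with a toy sketch. This is not RACC's implementation: the abstract does not describe the learned compressor, so `mean_pool` below is a stand-in assumption, and all data, dimensions, and function names are hypothetical. The sketch only shows that a frozen attention step can consume a compact key-value prefix in place of the per-token keys and values of the full documents.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def mean_pool(vectors, num_slots):
    """Stand-in compressor (assumption): average contiguous chunks of
    vectors, producing at most num_slots pooled vectors."""
    chunk = math.ceil(len(vectors) / num_slots)
    pooled = []
    for start in range(0, len(vectors), chunk):
        group = vectors[start:start + chunk]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


# Hypothetical toy data: 8 "retrieved document" tokens and 2 "prompt" tokens.
doc_keys = [[float(i), 1.0] for i in range(8)]
doc_values = [[1.0, float(i)] for i in range(8)]
prompt_keys = [[0.5, 0.5], [1.0, 0.0]]
prompt_values = [[0.0, 1.0], [1.0, 0.0]]
query = [0.2, 0.1]

# Compress the 8 document entries into a 2-entry KV prefix.
prefix_keys = mean_pool(doc_keys, 2)
prefix_values = mean_pool(doc_values, 2)

# The attention function itself is unchanged ("frozen"); only its KV input differs.
full = attend(query, doc_keys + prompt_keys, doc_values + prompt_values)
compact = attend(query, prefix_keys + prompt_keys, prefix_values + prompt_values)

# Number of KV entries attended to: full context vs. compact prefix.
print(len(doc_keys + prompt_keys), len(prefix_keys + prompt_keys))
```

In this toy setting the query attends to 4 key-value entries instead of 10. The outputs `full` and `compact` generally differ, which is why a compressor would need to be learned so that the compact cache preserves the information the model needs.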

Problem

Research questions and friction points this paper is trying to address.

Knowledge-based Visual Question Answering

Accuracy

Information Processing Speed

Innovation

Methods, ideas, or system contributions that make the work stand out.

RACC

Multimodal Language Model

Knowledge-based Visual Question Answering
