One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Date: 2026-06-09
Citations: 0
Influential: 0
AI Summary
This work addresses the challenge of deploying retrieval-based question answering systems in resource-constrained settings, where passing raw text and images as evidence leads to excessively long input sequences and high storage overhead. To overcome this, the authors propose Latent Memory, a mechanism that compresses each multimodal evidence item into a single high-dimensional latent token within a unified representation space, enabling joint support for query embedding, retrieval, and generation. The latent tokens are produced by a compact LLM/VLM-based compressor trained end-to-end with reconstruction, contrastive, and distillation objectives, and are fed directly to the generator as prompt inputs. Evaluated on seven text-only QA benchmarks and on multimodal QA benchmarks, the method is competitive with advanced RAG baselines while consuming 3–10× fewer generator tokens, and it achieves the strongest image-grounded QA results on WebQA.
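The retrieval step described above can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the three-dimensional vectors, the item names, and the use of cosine similarity are all assumptions made for the example, and a real system would obtain the vectors from the compressor model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, k=2):
    """Return the ids of the k memory items most similar to the query."""
    ranked = sorted(
        memory,
        key=lambda item_id: cosine(query_vec, memory[item_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical memory: one latent vector ("token") per evidence item,
# whether the original evidence was a passage or an image.
memory = {
    "passage_a": [0.9, 0.1, 0.0],
    "image_b": [0.1, 0.8, 0.1],
    "passage_c": [0.0, 0.2, 0.9],
}

# Hypothetical query embedding in the same latent space.
query_vec = [1.0, 0.2, 0.0]

top_ids = retrieve(query_vec, memory, k=2)

# Each retrieved item contributes exactly one latent token to the
# generator input, instead of its full raw text or image tokens.
generator_inputs = [memory[item_id] for item_id in top_ids]
```

The point of the sketch is that the same vector serves as both the retrieval key and the generator input, so the generator sees `k` tokens of evidence regardless of how long the original items were.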
Abstract
External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text or image form, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs. This results in high token consumption and storage pressure, making such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
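One plausible way to read the unified training objective is as a weighted sum of the three losses. The abstract does not state how the terms are combined, so the weights $\lambda_{\text{rec}}$, $\lambda_{\text{con}}$, $\lambda_{\text{dis}}$ and the additive form below are assumptions made for illustration:

```latex
% Assumed form: the abstract names the three objectives but not how they are weighted.
\mathcal{L}
  = \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}}
  + \lambda_{\text{con}}\,\mathcal{L}_{\text{con}}
  + \lambda_{\text{dis}}\,\mathcal{L}_{\text{dis}}
```

Under this reading, $\mathcal{L}_{\text{rec}}$ would push each latent token to retain the content of its evidence item, $\mathcal{L}_{\text{con}}$ would align query embeddings with the latent tokens of relevant evidence for retrieval, and $\mathcal{L}_{\text{dis}}$ would keep the generator's answers from latent tokens close to its answers from raw evidence.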
Problem

Research questions and friction points this paper is trying to address.

multimodal QA
resource-constrained
external memory
token efficiency
evidence representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Memory
Resource-Constrained QA
Multimodal Compression
Latent Token
End-to-End Training