Task-Focused Memorization for Multimodal Agents

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses information overload in multimodal agents: under continual perception, an agent receives an unbounded stream of observations and must decide which of them are worth remembering for future tasks. The authors propose TaskMem, a reinforcement-learning-based framework that treats memory generation as a learnable memorization policy. TaskMem trains in two phases: Phase One optimizes memory quality under basic fidelity requirements, and Phase Two, run after deployment, tunes an adapter on the base MLLM with a reward model defined from recent environment tasks, steering the policy toward task-relevant content. Built on Qwen3-VL-30B-A3B and evaluated on streaming reformulations of VideoMME, EgoLife, and EgoTempo, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3%, respectively, with questions answered from the agent's memory alone, without access to the raw video.

📝 Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
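The memory-only evaluation setting described in the abstract can be illustrated with a small toy sketch. This is not the authors' code: the names `Task`, `MemoryOnlyAgent`, `run_stream`, and `toy_answer` are hypothetical, observations are plain strings standing in for video segments, and the memorization policy is an ordinary function rather than an MLLM. The sketch only shows the protocol shape: observations arrive in order, questions arrive online, and each answer is computed from the stored memory without access to the raw stream.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    arrival_step: int  # index of the observation after which the question arrives
    question: str
    answer: str


@dataclass
class MemoryOnlyAgent:
    # Hypothetical memorization policy: observation -> memory entry ("" means skip).
    memorize: Callable[[str], str]
    # Hypothetical answer function: it receives the question and the memory only.
    answer: Callable[[str, List[str]], str]
    memory: List[str] = field(default_factory=list)

    def observe(self, observation: str) -> None:
        entry = self.memorize(observation)
        if entry:
            self.memory.append(entry)


def run_stream(agent: MemoryOnlyAgent, observations: List[str], tasks: List[Task]) -> float:
    """Feed observations in order and answer each task when it arrives.

    Returns accuracy over all tasks. Raw observations are never passed to
    the answer function, so the score reflects what the memory retained.
    """
    pending = sorted(tasks, key=lambda t: t.arrival_step)
    correct = 0
    idx = 0
    for step, obs in enumerate(observations):
        agent.observe(obs)
        while idx < len(pending) and pending[idx].arrival_step <= step:
            task = pending[idx]
            if agent.answer(task.question, agent.memory) == task.answer:
                correct += 1
            idx += 1
    return correct / len(tasks) if tasks else 0.0


def toy_answer(question: str, memory: List[str]) -> str:
    # Toy stand-in for a QA model: the question is a keyword, and the answer
    # is the last word of the first memory entry that mentions it.
    for entry in memory:
        if question in entry:
            return entry.split()[-1]
    return ""


observations = ["mug on table", "sky is cloudy", "keys in drawer"]
tasks = [Task(0, "mug", "table"), Task(2, "keys", "drawer")]

keep_all = MemoryOnlyAgent(memorize=lambda o: o, answer=toy_answer)
keep_none = MemoryOnlyAgent(memorize=lambda o: "", answer=toy_answer)
keys_only = MemoryOnlyAgent(memorize=lambda o: o if "keys" in o else "", answer=toy_answer)

acc_all = run_stream(keep_all, observations, tasks)
acc_none = run_stream(keep_none, observations, tasks)
acc_keys = run_stream(keys_only, observations, tasks)
```

In this toy, a policy that keeps everything answers both questions, one that keeps nothing answers neither, and one that keeps only the "keys" observation answers one of two. In a real stream, keeping everything is not an option, which is why the choice of what to retain matters. How TaskMem's reward model is actually built from recent tasks is not specified in the abstract and is not modeled here.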

Problem

Research questions and friction points this paper is trying to address.

multimodal agents

long-term memory

selective memorization

task relevance

memory selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-focused memorization

reinforcement learning

multimodal agents