PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitation of existing large language model–based image search agents, which lack a persistent memory mechanism and consequently struggle to maintain contextual coherence over extended interactions or transfer learned experience across tasks. To overcome this, the authors propose a training-free hierarchical memory system that, for the first time, integrates human cognition–inspired components—working memory, episodic memory, and semantic memory—into multimodal large language models. This architecture enables context-aware multi-step reasoning and facilitates the reuse of past experiences. Evaluated on the DISBench benchmark, the proposed method substantially improves retrieval performance, achieving gains of up to 18.5%, thereby effectively alleviating a critical bottleneck faced by memoryless agents in deep image search scenarios.

📝 Abstract

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.
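To make the three memory tiers concrete, here is a minimal Python sketch of how working, episodic, and semantic memory might be separated in a training-free agent. It is an illustration only, not PhotoCraft's implementation: the abstract does not describe the data structures or retrieval method, so every name here (`HierarchicalMemory`, `Episode`, `note`, `finish_task`, `learn`, `recall`), the working-memory size of 8, and the keyword-overlap retrieval are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One finished search task (hypothetical record format)."""
    query: str
    steps: list
    succeeded: bool


@dataclass
class HierarchicalMemory:
    # Working memory: bounded scratchpad for the current task only (size is an assumption).
    working: deque = field(default_factory=lambda: deque(maxlen=8))
    # Episodic memory: records of completed tasks.
    episodic: list = field(default_factory=list)
    # Semantic memory: reusable lessons keyed by a short topic label.
    semantic: dict = field(default_factory=dict)

    def note(self, step: str) -> None:
        """Record an intermediate reasoning step for the current task."""
        self.working.append(step)

    def finish_task(self, query: str, succeeded: bool) -> None:
        """Archive the current steps as an episode and reset working memory."""
        self.episodic.append(Episode(query, list(self.working), succeeded))
        self.working.clear()

    def learn(self, topic: str, lesson: str) -> None:
        """Store a task-independent lesson."""
        self.semantic[topic] = lesson

    def recall(self, query: str, k: int = 2) -> list:
        """Return up to k past episodes sharing words with the query.

        Keyword overlap stands in for whatever retrieval the real system uses.
        """
        words = set(query.lower().split())
        scored = [(len(words & set(e.query.lower().split())), e) for e in self.episodic]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [episode for _, episode in scored[:k]]


memory = HierarchicalMemory()
memory.note("filter photos by date: 2023-07")
memory.note("filter photos by location: Lisbon")
memory.finish_task("photos from the Lisbon trip in July", succeeded=True)
memory.learn("trips", "narrow by date before location")

# A later, related task can reuse the archived episode.
hits = memory.recall("Lisbon trip dinner photos")
```

In this sketch, working memory is cleared at the end of each task while episodic and semantic entries persist, which is one simple way to separate long-horizon context within a task from experience carried across tasks.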

Problem

Research questions and friction points this paper is trying to address.

Deep Image Search

Memoryless Agents

Multi-step Reasoning

Contextual Cues

Experience Transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Memory

Deep Image Search

Multimodal LLM Agents