AI Summary
Current evaluations of large language models lack long-term semantic consistency, dynamic persona modeling, and the ability to integrate multi-source heterogeneous data such as documents and emails. To address these limitations, this work proposes RHELM, a novel benchmark that establishes the first evaluation framework unifying dynamic evolution, heterogeneous data integration, and long-term memory. Leveraging detailed user profiles and a LOOP (pLan-rOllout-evOlve-Prune) mechanism, RHELM generates temporally coherent multi-scenario dialogues with precise alignment between external data and user event trajectories. The framework defines 27 memory-related attributes spanning seven categories of complex queries. Experimental results reveal that existing models exhibit significant deficiencies in aggregating multi-source information and reasoning within realistic contexts, thereby demonstrating RHELM's challenge and practical relevance.
Abstract
In existing memory benchmarks for Large Language Models (LLMs), the evaluated dialogue sessions often lack long-term semantic consistency, and the underlying personas tend to be flat and static. Furthermore, in real-world scenarios, interactions between users and assistants involve more diverse, heterogeneous data streams, such as documents and emails. These shortcomings significantly limit the realism and effectiveness of current evaluations. To address these limitations, we introduce RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). Driven by meticulously crafted user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we construct realistic dialogues across diverse interaction scenarios that exhibit dynamic temporal evolution and long-term coherence. Crucially, these dialogues are deeply integrated with heterogeneous external sources synchronized with the user's temporal event trajectory. The resulting benchmark encompasses challenging question-answer pairs spanning seven inquiry types, with each question mapping to at least one of 27 critical memory characteristics that we identify as essential yet underexplored in current research. Comprehensive experiments across full-context models, retrieval-augmented generation (RAG) methods, and representative memory frameworks reveal that contemporary approaches still exhibit critical weaknesses in complex, real-world settings, particularly in multi-source aggregation and real-world contextual reasoning.