RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
By: Haonan Bian, Zhiyuan Yao, Sen Hu, and more
Potential Business Impact:
Helps AI remember long-running projects so it can finish them.
As Large Language Models (LLMs) evolve from static dialogue interfaces into autonomous general agents, effective memory is paramount for ensuring long-term consistency. However, existing benchmarks focus primarily on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions in which agents must track evolving goals. To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios and uses natural user queries for evaluation. We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing long-term project states and the dynamic context dependencies inherent in real-world projects. Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).
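The abstract describes cross-session, project-oriented dialogues whose memory and schedule state evolve over time. As an illustrative sketch only (the class and field names below are hypothetical and not taken from the RealMemBench release), here is one way such a project record might be represented:

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical data model for a project-oriented, cross-session dialogue with
# evolving memory and schedule state. Names are illustrative, not from RealMem.

@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str

@dataclass
class Session:
    session_id: int
    turns: List[Turn] = field(default_factory=list)

@dataclass
class ProjectMemory:
    goals: List[str] = field(default_factory=list)          # evolving project goals
    schedule: Dict[str, str] = field(default_factory=dict)  # milestone -> due date
    facts: List[str] = field(default_factory=list)          # accumulated project facts

    def update(self, session: Session) -> None:
        """Toy update rule: record every user utterance as a project fact."""
        for turn in session.turns:
            if turn.speaker == "user":
                self.facts.append(f"s{session.session_id}: {turn.text}")

# Minimal usage: two sessions of one project, with memory carried across them.
memory = ProjectMemory(goals=["launch beta"], schedule={"beta": "2025-06-01"})
s1 = Session(1, [Turn("user", "Move the beta launch to July."),
                 Turn("agent", "Noted, beta moved to July.")])
s2 = Session(2, [Turn("user", "What is our current beta date?")])
for s in (s1, s2):
    memory.update(s)
print(memory.facts)
```

A benchmark query in session 2 would then probe whether the agent's memory system surfaces the schedule change made in session 1, which is the kind of long-term, state-tracking behavior the abstract says current systems struggle with.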
Similar Papers
EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory
Computation and Language
Tests how well computers remember long talks.
Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents
Computation and Language
Helps AI remember conversations with pictures.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Computation and Language
Helps AI agents remember and learn from past tasks.