Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Published: August 13, 2025 | arXiv ID: 2508.09736v2

By: Lin Long, Yichen He, Wentao Ye and more

BigTech Affiliations: ByteDance

Potential Business Impact:

Lets robots remember and learn like people.

We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent
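
To make the entity-centric memory and multi-turn retrieval described in the abstract more concrete, here is a minimal toy sketch. It is not the authors' implementation: the names `EntityMemory`, `MemoryStore`, and `answer_with_memory` are hypothetical, memories are plain strings rather than multimodal features, keyword overlap stands in for real retrieval, and the per-turn queries are supplied by the caller instead of being generated by a trained policy model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityMemory:
    """All memories attached to one entity (hypothetical structure)."""
    episodic: List[str] = field(default_factory=list)  # concrete observed events
    semantic: List[str] = field(default_factory=list)  # general knowledge accumulated over time


class MemoryStore:
    """Toy entity-centric store; keyword overlap is an assumed stand-in for retrieval."""

    def __init__(self) -> None:
        self.entities: Dict[str, EntityMemory] = {}

    def add(self, entity: str, text: str, kind: str = "episodic") -> None:
        if kind not in ("episodic", "semantic"):
            raise ValueError(f"unknown memory kind: {kind}")
        mem = self.entities.setdefault(entity, EntityMemory())
        getattr(mem, kind).append(text)

    def search(self, query: str, top_k: int = 2) -> List[str]:
        """Return up to top_k memories sharing the most words with the query."""
        words = set(query.lower().split())
        scored = []
        for entity, mem in self.entities.items():
            for kind in ("episodic", "semantic"):
                for text in getattr(mem, kind):
                    haystack = set(f"{entity} {text}".lower().split())
                    overlap = len(words & haystack)
                    if overlap:
                        scored.append((overlap, f"[{entity}/{kind}] {text}"))
        # Stable sort: ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:top_k]]


def answer_with_memory(store: MemoryStore, queries: List[str], max_turns: int = 3) -> List[str]:
    """Run one retrieval round per query (at most max_turns), collecting unique evidence."""
    evidence: List[str] = []
    for query in queries[:max_turns]:
        for hit in store.search(query):
            if hit not in evidence:
                evidence.append(hit)
    return evidence
```

In this sketch, each entity keeps its episodic and semantic entries together, so a question about one person or object pulls both observed events and accumulated knowledge in a single lookup. Each element of `queries` plays the role of one reasoning turn.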

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

CV and Pattern Recognition

Helps robots remember and learn from videos.

13 Aug 2025 2

91%

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

CV and Pattern Recognition

Lets computers understand very long videos better.

2 Dec 2025 1

90%

Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents

Computation and Language

Helps AI remember conversations with pictures.

7 Jan 2026 1

Country of Origin

🇨🇳 China

Repos / Data Links

github.com github.com

Page Count

47 pages