HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
By: Yueqian Lin, Qinsi Wang, Hancheng Ye, and more
Potential Business Impact:
Helps computers remember and understand long videos better.
Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results show that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
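The abstract's pipeline of adaptive temporal segmentation, short-to-long-term consolidation, and cross-modal retrieval can be illustrated with a minimal sketch. All function names, the boundary rule, and the mean-vector summary below are illustrative assumptions, not the paper's actual implementation; it assumes frames and queries already live in a shared embedding space.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_stream(frames, boundary_threshold=0.7):
    # Adaptive temporal segmentation (hypothetical rule): start a new
    # segment whenever consecutive frame features diverge sharply.
    segments, current = [], [frames[0]]
    for prev, frame in zip(frames, frames[1:]):
        if cosine(prev, frame) < boundary_threshold:
            segments.append(current)
            current = []
        current.append(frame)
    segments.append(current)
    return segments

def consolidate(segment):
    # Short-to-long-term consolidation: collapse perceptual detail
    # into one semantic summary vector (here, simply the mean).
    n = len(segment)
    return [sum(col) / n for col in zip(*segment)]

def retrieve(query, long_term):
    # Cross-modal associative retrieval: rank consolidated memories by
    # similarity to a query embedded in the same shared space.
    return max(range(len(long_term)), key=lambda i: cosine(query, long_term[i]))

# Toy stream with two visually distinct scenes.
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
segments = segment_stream(stream)                 # two episodic segments
memories = [consolidate(s) for s in segments]     # long-term semantic store
best = retrieve([0.0, 1.0], memories)             # index of best-matching scene
```

In this toy run the stream splits into two segments at the scene change, and a query resembling the second scene retrieves the second consolidated memory. The real system operates on learned audiovisual embeddings rather than hand-built vectors.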
Similar Papers
HEMA: A Hippocampus-Inspired Extended Memory Architecture for Long-Context AI Conversations
Computation and Language
Lets computers remember very long conversations.
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
CV and Pattern Recognition
Lets computers understand very long videos better.
A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
CV and Pattern Recognition
Makes computers understand mixed information better.