MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
By: Jihao Zhao, Zhiyuan Ji, Simin Niu, and more
Potential Business Impact:
Computers understand books like people do.
The traditional RAG paradigm, which typically comprehends only the text chunks relevant to a received query, inherently restricts both the depth of knowledge internalization and reasoning capability. To address this limitation, our research transforms text processing in RAG from passive chunking into proactive understanding, defining this process as document memory extraction, with the objective of simulating human cognitive processes during reading. Building on this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and to train small language models (SLMs) to proactively explore and construct document memories. MoM first instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It then employs a multi-path sampling and multi-perspective evaluation mechanism, with purpose-built metrics for chunk clarity and extraction completeness, to select the optimal document memories. Additionally, to instill deeper human-like reading abilities when training the SLMs, we incorporate a reverse reasoning strategy that deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging the diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, grounded in a theoretical proof from the perspective of probabilistic modeling. Extensive experiments across three distinct domains demonstrate that the MoM framework not only resolves the text chunking challenges of existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
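The abstract describes a multi-path sampling and multi-perspective evaluation step but gives no implementation details. The sketch below is a toy illustration of that selection pattern, not the paper's method: every name here (DocumentMemory, sample_memory, chunk_clarity, extraction_completeness, select_best_memory) is a hypothetical stand-in, and simple string heuristics substitute for the LLM sampling passes and the paper's actual clarity/completeness metrics.

```python
import random
from dataclasses import dataclass


@dataclass
class DocumentMemory:
    outline: str        # expert-style logical outline of the document
    chunks: list[str]   # structured chunks guided by the outline
    core_content: str   # extracted core content


def sample_memory(document: str, rng: random.Random) -> DocumentMemory:
    """Toy stand-in for one LLM sampling path: split on sentences with a
    randomly chosen chunk size, mimicking one candidate memory extraction."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    size = rng.randint(2, 4)
    chunks = [". ".join(sentences[i:i + size])
              for i in range(0, len(sentences), size)]
    outline = "\n".join(f"{i + 1}. {c[:40]}..." for i, c in enumerate(chunks))
    core = " ".join(c.split()[0] for c in chunks if c)  # crude "core content"
    return DocumentMemory(outline, chunks, core)


def chunk_clarity(memory: DocumentMemory) -> float:
    """Toy clarity proxy: reward evenly sized chunks (low length variance)."""
    if not memory.chunks:
        return 0.0
    lengths = [len(c) for c in memory.chunks]
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return 1.0 / (1.0 + variance)


def extraction_completeness(memory: DocumentMemory, document: str) -> float:
    """Toy completeness proxy: fraction of document words kept in the chunks."""
    kept = set(" ".join(memory.chunks).split())
    total = set(document.split())
    return len(kept & total) / max(len(total), 1)


def select_best_memory(document: str, n_paths: int = 5,
                       seed: int = 0) -> DocumentMemory:
    """Multi-path sampling + multi-perspective evaluation: sample several
    candidate memories and keep the one with the best combined score."""
    rng = random.Random(seed)
    candidates = [sample_memory(document, rng) for _ in range(n_paths)]
    return max(candidates,
               key=lambda m: chunk_clarity(m) + extraction_completeness(m, document))
```

A real implementation would replace sample_memory with an LLM prompted to act as a domain expert, and the two scoring functions with the comprehensive metrics the paper designs for chunk clarity and extraction completeness.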
Similar Papers
MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Computation and Language
Makes AI understand and use information better.
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
Computation and Language
Helps computers understand all parts of documents.
MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments
Machine Learning (CS)
Helps AI build more varied virtual worlds.