MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
By: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, and more
Potential Business Impact:
Makes AI systems split and retrieve text more accurately, so they understand and use information better.
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
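The central mechanism described above, a chunker that emits a structured list of regular expressions which are then applied to extract chunks from the source text, might be sketched as follows. This is an illustrative reconstruction, not the paper's actual implementation; the function name, example patterns, and fallback behavior are all assumptions.

```python
import re

def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Apply an ordered list of chunking regexes (as an MoC-style
    chunker might generate) to carve the text into chunks.
    Any trailing unmatched text is kept as a final chunk."""
    chunks = []
    remaining = text
    for pattern in patterns:
        match = re.search(pattern, remaining, flags=re.DOTALL)
        if match:
            chunks.append(match.group(0).strip())
            # Advance past the matched span so patterns apply in order.
            remaining = remaining[match.end():]
    if remaining.strip():
        chunks.append(remaining.strip())
    return chunks

# Hypothetical regexes a chunker might emit for a two-section document.
patterns = [r"Section 1\..*?(?=Section 2\.)", r"Section 2\..*"]
doc = "Section 1. RAG overview. Section 2. Chunking methods."
print(extract_chunks(doc, patterns))  # two chunks, one per section
```

Having the LLM emit compact patterns rather than the chunks themselves keeps generation cheap: the model produces a short structured list once, and the inexpensive regex pass does the actual extraction.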
Similar Papers
Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation
Information Retrieval
Helps computers use more information without getting confused.
HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
Computation and Language
Improves AI's ability to find and use information.
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding
Machine Learning (CS)
Helps computers understand complex documents better.