The Chonkers Algorithm: Content-Defined Chunking with Strict Guarantees on Size and Locality
By: Benjamin Berger
Potential Business Impact:
Makes computer files smaller and easier to update.
This paper presents the Chonkers algorithm, a novel content-defined chunking method that provides strict guarantees on chunk size and edit locality simultaneously. Unlike existing approaches such as Rabin fingerprinting and anchor-based methods, Chonkers achieves bounded propagation of edits and precise control over chunk sizes. I describe the algorithm's layered structure, theoretical guarantees, and implementation considerations, and I introduce the Yarn datatype, a deduplicated, merge-tree-based string representation that benefits from Chonkers' strict guarantees.
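For readers new to the topic, the following is a minimal sketch of the kind of baseline the abstract contrasts against: a generic rolling-hash content-defined chunker. It is not the Chonkers algorithm, which the abstract does not specify. The function name `split_chunks`, the parameters `window`, `mask`, `min_size`, and `max_size`, and the polynomial hash (a simple stand-in for a true Rabin fingerprint) are all assumptions chosen for illustration.

```python
# Illustrative baseline only: a generic rolling-hash chunker, NOT Chonkers.
# All names and default values here are hypothetical.

BASE = 257
MOD = (1 << 61) - 1


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                 min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Split data where a hash of the last `window` bytes matches `mask`."""
    chunks = []
    start = 0  # offset where the current chunk begins
    h = 0      # polynomial hash of the most recent `window` bytes
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window

    for i, byte in enumerate(data):
        if i >= window:
            # Remove the contribution of the byte that slides out.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD

        length = i + 1 - start
        content_cut = length >= min_size and (h & mask) == mask
        forced_cut = length >= max_size
        if content_cut or forced_cut:
            chunks.append(data[start:i + 1])
            start = i + 1

    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than min_size
    return chunks
```

In this sketch the size limits are enforced relative to the previous boundary, so an edit can shift later boundaries until the chunker resynchronizes. That tension between size control and edit locality is the problem the abstract says Chonkers addresses with strict guarantees on both.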