Revisiting Data Compression with Language Modeling
By: Chen-Han Tsai
Potential Business Impact:
Makes files smaller using AI language models.
In this report, we investigate the potential use of large language models (LLMs) for data compression. Previous works have demonstrated promising results in applying LLMs to compress not only text but also a wide range of multi-modal data. Despite the favorable performance achieved, several practical questions remain that pose a challenge to replacing existing data compression algorithms with LLMs. In this work, we explore different methods for achieving a lower adjusted compression rate using LLMs as data compressors. In comparison to previous works, we achieve a new state-of-the-art (SOTA) adjusted compression rate of around $18\%$ on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs in compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel at compressing data in text-dominant domains, they remain competitive on non-natural text sequences when configured appropriately.
Similar Papers
Compression Laws for Large Language Models
Computation and Language
Makes big AI models smaller and faster.
Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems
Information Retrieval
Makes small AI models work like big ones.
Learning to Compress: Unlocking the Potential of Large Language Models for Text Representation
Computation and Language
Makes computers understand writing better for searching.