Revisiting Data Compression with Language Modeling

Published: January 6, 2026 | arXiv ID: 2601.02875v1

By: Chen-Han Tsai

Potential Business Impact:

Uses large language models to compress files more effectively.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

In this report, we investigate the potential of large language models (LLMs) for data compression. Previous works have demonstrated promising results in applying LLMs to compress not only text but also a wide range of multi-modal data. Despite the favorable performance achieved, several practical questions still stand in the way of replacing existing data compression algorithms with LLMs. In this work, we explore different methods for achieving a lower adjusted compression rate with LLMs as data compressors. Compared to previous works, we achieve a new state-of-the-art (SOTA) adjusted compression rate of around $18\%$ on the enwik9 dataset without additional model training. Furthermore, we explore the use of LLMs for compressing non-English data, code data, and byte stream sequences. We show that while LLMs excel at compressing data in text-dominant domains, their performance on non-natural text sequences remains competitive when configured appropriately.
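The abstract does not define the metric, but in prior LLM-compression work the adjusted compression rate typically charges the model's own size against the output, i.e. (compressed bits + model bits) / raw bits. The sketch below illustrates that accounting alongside the ideal code length an arithmetic coder attains under a probabilistic model; the `next_token_prob` callable and the toy uniform model are illustrative assumptions, not the paper's method.

```python
import math

def ideal_compressed_bits(tokens, next_token_prob):
    """Ideal arithmetic-coding length under a probabilistic model:
    the sum of -log2 P(token | preceding context) over the sequence."""
    bits = 0.0
    context = []
    for tok in tokens:
        p = next_token_prob(context, tok)  # P(tok | context), in (0, 1]
        bits += -math.log2(p)
        context.append(tok)
    return bits

def adjusted_compression_rate(compressed_bits, model_bits, raw_bits):
    """Charge the model's own size against the compressed output."""
    return (compressed_bits + model_bits) / raw_bits

# Toy check: a uniform model over 256 byte values needs exactly 8 bits
# per byte, so the adjusted rate is 1.0 and no compression is gained.
uniform = lambda context, tok: 1.0 / 256
data = list(b"hello world")
raw_bits = 8 * len(data)
c_bits = ideal_compressed_bits(data, uniform)
print(adjusted_compression_rate(c_bits, model_bits=0, raw_bits=raw_bits))  # 1.0
```

A real LLM compressor would replace `next_token_prob` with the model's softmax output and drive an actual arithmetic coder with it; how much of the model's parameter count the numerator must include is precisely what makes the training-free setting attractive.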

Page Count
18 pages

Category
Computer Science:
Computation and Language