Optical Context Compression Is Just (Bad) Autoencoding
By: Ivan Yee Lee, Cheng Yang, Taylor Berg-Kirkpatrick
Potential Business Impact:
Shows that simple text compression can match or beat picture-based compression for helping computers handle long documents.
DeepSeek-OCR demonstrates that rendered text can be reconstructed with high fidelity from a small number of vision tokens. This finding has sparked excitement about vision-based context compression for language models. But the evaluation stops at reconstruction; whether these representations help language modeling remains untested. We test two assumptions implicit in the optical-compression narrative: that vision-based compression provides unique advantages for text reconstruction from compressed representations, and that DeepSeek-OCR's reconstruction results are evidence that vision-based compression will be useful for language modeling. Comparing DeepSeek-OCR's vision encoder against simple alternatives--parameter-free mean pooling and a learned hierarchical encoder--we find that these simple approaches match or surpass vision for reconstruction at matched compression ratios, and outperform it for language modeling--where vision-based compression fails to beat truncation. The excitement around optical context compression outpaces the evidence. Code and checkpoints are available at https://github.com/ivnle/bad-autoencoding
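To make the parameter-free baseline concrete, here is a minimal sketch of mean pooling as context compression: consecutive token embeddings are grouped and averaged, so N tokens become N / ratio "compressed" vectors at a chosen compression ratio. This is an illustrative assumption of how such a baseline could be implemented, not the authors' code (see the linked repository for that); the function name, shapes, and ratio below are made up for the example.

```python
import torch

def mean_pool_compress(token_embeds: torch.Tensor, ratio: int) -> torch.Tensor:
    """Average every `ratio` consecutive token embeddings into one vector.

    token_embeds: (batch, seq_len, dim) with seq_len divisible by ratio.
    Returns: (batch, seq_len // ratio, dim).
    """
    batch, seq_len, dim = token_embeds.shape
    assert seq_len % ratio == 0, "pad or truncate so seq_len divides evenly"
    # Group neighboring tokens, then average each group into one compressed token.
    chunks = token_embeds.view(batch, seq_len // ratio, ratio, dim)
    return chunks.mean(dim=2)

# Example: 16x compression of 512 token embeddings down to 32 vectors.
x = torch.randn(2, 512, 768)
print(mean_pool_compress(x, ratio=16).shape)  # torch.Size([2, 32, 768])
```

The compressed vectors would then stand in for the full context on the language-model side, which is the setting in which the paper compares mean pooling, a learned hierarchical encoder, and the vision encoder at matched compression ratios.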
Similar Papers
DeepSeek-OCR: Contexts Optical Compression
CV and Pattern Recognition
Reads tiny text in pictures super fast.
Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
Computation and Language
Makes computers read text from pictures better.
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Computation and Language
Makes computers understand super long texts better.