Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
By: Yunhao Liang, Ruixuan Ying, Bo Li, and more
Potential Business Impact:
Makes computers read text from pictures better.
DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode more than ten times as many text tokens as the visual tokens it receives as input. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: "Visual merit or linguistic crutch: which drives DeepSeek-OCR's performance?" By employing sentence-level and word-level semantic corruption, we isolate the model's intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR's performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods are significantly more robust to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on language priors, exacerbating hallucination risks. Context stress testing also reveals total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR's capability boundaries and offers essential insights for future optimization of the vision-text compression paradigm. We release all data, results, and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.
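The semantic-corruption protocol is the heart of the study: if the text rendered into the image has had its linguistic statistics destroyed, any remaining recognition accuracy must come from visual reading rather than language priors. Below is a minimal sketch of what word-level and sentence-level corruption could look like; the function names and shuffling scheme are illustrative assumptions, not the authors' released scripts (see the linked repository for those).

import random

def word_level_corruption(text: str, seed: int = 0) -> str:
    # Destroy within-word structure by shuffling the characters of each
    # word while keeping word boundaries intact (one plausible scheme;
    # the paper's exact corruption method may differ).
    rng = random.Random(seed)
    corrupted = []
    for word in text.split():
        chars = list(word)
        rng.shuffle(chars)
        corrupted.append("".join(chars))
    return " ".join(corrupted)

def sentence_level_corruption(text: str, seed: int = 0) -> str:
    # Destroy sentence-level structure by shuffling word order across the
    # passage, so n-gram and syntactic priors no longer help prediction.
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

if __name__ == "__main__":
    sample = "Optical compression maps long text into a small number of visual tokens."
    print(word_level_corruption(sample))
    print(sentence_level_corruption(sample))

In such a setup, the corrupted strings would then be rendered to images and fed through each OCR system; a model that leans on language priors degrades far more on the corrupted renders than a purely visual recognizer, which is the gap the study measures.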
Similar Papers
DeepSeek-OCR: Contexts Optical Compression
CV and Pattern Recognition
Reads tiny text in pictures super fast.
Optical Context Compression Is Just (Bad) Autoencoding
CV and Pattern Recognition
Makes computers understand pictures better than before.
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Computation and Language
Makes computers understand super long texts better.