From Press to Pixels: Evolving Urdu Text Recognition

Published: May 20, 2025 | arXiv ID: 2505.13943v2

By: Samee Arif, Sualeha Farid

Potential Business Impact:

Helps computers read old Urdu newspapers.

Business Areas:
Image Recognition, Data and Analytics, Software

This paper introduces an end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers, addressing challenges posed by complex multi-column layouts, low-resolution scans, and the stylistic variability of the Nastaliq script. Our system comprises four modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. We fine-tune YOLOv11x for segmentation, achieving 0.963 precision for articles and 0.970 for columns. A SwinIR-based super-resolution model boosts LLM text recognition accuracy by 25-70%. We also introduce the Urdu Newspaper Benchmark (UNB), a manually annotated dataset for Urdu OCR. Using UNB and the OpenITI corpus, we compare traditional CNN+RNN-based OCR models with modern LLMs. Gemini-2.5-Pro achieves the best performance with a WER of 0.133. We further analyze LLM outputs via insertion, deletion, and substitution error breakdowns, as well as character-level confusion analysis. Finally, we show that fine-tuning on just 500 samples yields a 6.13% WER improvement, highlighting the adaptability of LLMs for Urdu OCR.
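The abstract reports word error rate (WER) together with an insertion, deletion, and substitution breakdown. As a minimal sketch of how such a breakdown can be computed, the following uses a standard word-level edit-distance alignment. The function name `wer_breakdown` and the whitespace tokenization are illustrative assumptions; the paper's own scoring script and any text normalization it applies are not described here.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word-level WER with substitution/deletion/insertion counts.

    Assumption: tokens are obtained by splitting on whitespace; the
    paper may normalize or tokenize Urdu text differently.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    n, m = len(ref), len(hyp)
    # dp[i][j] = (total_cost, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)   # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)   # insert every hypothesis word

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, ins = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, ins)
            c, s, d, ins = dp[i - 1][j]
            delete = (c + 1, s, d + 1, ins)
            c, s, d, ins = dp[i][j - 1]
            insert = (c + 1, s, d, ins + 1)
            dp[i][j] = min(substitute, delete, insert, key=lambda t: t[0])

    cost, subs, dels, inss = dp[n][m]
    return {
        "wer": cost / n,
        "substitutions": subs,
        "deletions": dels,
        "insertions": inss,
    }


if __name__ == "__main__":
    # Placeholder tokens stand in for Urdu words.
    print(wer_breakdown("a b c d", "a x c"))
```

WER here is the total number of word edits divided by the number of reference words, so it can exceed 1.0 when the hypothesis contains many insertions. The same dynamic program applied to characters instead of words gives a character error rate.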

Country of Origin
🇺🇸 United States

Page Count
10 pages

Category
Computer Science:
Computer Vision and Pattern Recognition