
Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Published: May 16, 2025 | arXiv ID: 2505.11177v1

By: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar and more

Potential Business Impact:

Reads and understands text from pictures in multiple languages, including English, Hindi, and Tamil.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, then passes it through a pipeline built on large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules provide sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (regular expressions) to improve document comprehension. Delivered through an accessible Gradio interface, the system demonstrates a real-world application of libraries, models, and APIs to close the language gap and improve access to information in image media across different linguistic environments.
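
The stage order described in the abstract (OCR, translation, abstractive summarization, re-translation, plus regex date extraction) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `run_pipeline`, `extract_dates`, the English pivot language, and the date patterns are all assumptions introduced here, and the OCR, translation, and summarization stages are passed in as plain callables so the sketch runs without Tesseract or an LLM API. The sentiment and topic modules are omitted.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical date patterns; the paper does not list the ones it uses.
_MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
          r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
_PATTERNS = [
    r"\d{4}-\d{2}-\d{2}",                  # 2025-05-16
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",      # 16/05/2025 or 16-05-25
    _MONTH + r"\s+\d{1,2},\s+\d{4}",       # May 16, 2025
    r"\d{1,2}\s+" + _MONTH + r"\s+\d{4}",  # 16 May 2025
]
DATE_RE = re.compile(r"\b(?:" + "|".join(_PATTERNS) + r")\b")


def extract_dates(text: str) -> List[str]:
    """Return every substring that matches one of the assumed date patterns."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


@dataclass
class PipelineResult:
    extracted_text: str  # raw OCR output in the source language
    summary: str         # summary re-translated into the target language
    dates: List[str]     # dates found in the pivot-language text


def run_pipeline(
    image_path: str,
    ocr: Callable[[str, str], str],             # (image_path, lang) -> text
    translate: Callable[[str, str, str], str],  # (text, src, tgt) -> text
    summarize: Callable[[str], str],            # text -> shorter text
    source_lang: str,
    target_lang: str,
    pivot_lang: str = "en",                     # assumption: summarize in English
) -> PipelineResult:
    text = ocr(image_path, source_lang)
    pivot_text = translate(text, source_lang, pivot_lang)
    pivot_summary = summarize(pivot_text)
    summary = translate(pivot_summary, pivot_lang, target_lang)
    return PipelineResult(text, summary, extract_dates(pivot_text))


# Stand-in stages for a dry run; real ones would wrap an OCR engine and an LLM API.
def stub_ocr(image_path: str, lang: str) -> str:
    return "Meeting held on 16/05/2025. Budget approved."


def stub_translate(text: str, src: str, tgt: str) -> str:
    return text if src == tgt else f"[{src}->{tgt}] {text}"


def stub_summarize(text: str) -> str:
    return text.split(". ")[0] + "."
```

Calling `run_pipeline("scan.png", stub_ocr, stub_translate, stub_summarize, "hi", "ta")` exercises the full stage order with the stand-ins. In a real setup the `ocr` callable could wrap something like `pytesseract.image_to_string(image, lang=...)`, and the other two callables would wrap whichever LLM client is in use.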

Digitization of Document and Information Extraction using OCR

CV and Pattern Recognition

Reads messy and typed papers perfectly.

11 Jun 2025

89%

Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

CV and Pattern Recognition

Translates handwritten legal papers instantly.

19 Dec 2025

89%

Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil

Computation and Language

Helps computers read Tamil and Sinhala text.

24 Jul 2025


Country of Origin

🇮🇳 India

Page Count

9 pages
