Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Published: May 16, 2025 | arXiv ID: 2505.11177v1

By: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar and more

Potential Business Impact:

Reads, summarizes, and translates text from pictures in multiple languages, including English, Hindi, and Tamil.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then a pipeline involving large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for better document comprehension. Made available in an accessible Gradio interface, the current research shows a real-world application of libraries, models, and APIs to close the language gap and enhance access to information in image media across different linguistic environments.
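
The stage order described in the abstract (OCR, translation, summarization, re-translation, plus date extraction) can be illustrated with a minimal sketch. This is not the authors' code: the names `run_pipeline`, `extract_dates`, `PipelineResult`, and `DATE_PATTERN` are hypothetical, the use of English as the intermediate language is an assumption, and the date pattern is only an illustrative guess, since the listing says nothing more than "Regex". The OCR, translation, and summarization stages are passed in as plain functions, so the sketch runs without Tesseract or a Gemini API key.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical pattern (assumption): numeric dates such as 16/05/2025 or
# 2025-05-16. The patterns actually used in the paper are not given here.
DATE_PATTERN = re.compile(
    r"\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{4}|\d{4}-\d{2}-\d{2})\b"
)


def extract_dates(text: str) -> List[str]:
    """Return every substring of `text` that matches DATE_PATTERN."""
    return DATE_PATTERN.findall(text)


@dataclass
class PipelineResult:
    extracted_text: str       # raw OCR output in the source language
    summary: str              # summary in the intermediate language
    translated_summary: str   # summary re-translated into the target language
    dates: List[str]          # dates found in the raw OCR output


def run_pipeline(
    image_path: str,
    target_lang: str,
    ocr: Callable[[str], str],
    translate: Callable[[str, str], str],
    summarize: Callable[[str], str],
    pivot_lang: str = "en",   # assumption: summarize in English
) -> PipelineResult:
    """Chain the stages in the order the abstract lists them."""
    text = ocr(image_path)                     # 1. image -> text
    pivot_text = translate(text, pivot_lang)   # 2. cross-lingual translation
    summary = summarize(pivot_text)            # 3. abstractive summarization
    translated = translate(summary, target_lang)  # 4. re-translation
    return PipelineResult(text, summary, translated, extract_dates(text))
```

In a real setup, `ocr` might wrap `pytesseract.image_to_string(Image.open(path), lang="hin")`, while `translate` and `summarize` would wrap prompts sent to an LLM API. Passing the stages in as callables keeps the orchestration testable with stubs and makes it easy to swap any single component.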

Country of Origin
🇮🇳 India

Page Count
9 pages

Category
Computer Science:
Computation and Language