Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline
By: Hrishit Madhavi, Jacob Cherian, Yuvraj Khamkar, and more
Potential Business Impact:
Reads and understands text from pictures in multiple languages.
This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, and then applies a pipeline of large language model API calls (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules add sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) for richer document comprehension. Exposed through an accessible Gradio interface, the system demonstrates a real-world application of libraries, models, and APIs that closes the language gap and improves access to information in image media across different linguistic environments.
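The abstract names the main building blocks (Tesseract OCR, Gemini for translation and summarization, regex date extraction, a Gradio front end). The sketch below is a minimal, illustrative reconstruction of such a pipeline, not the authors' code: the Gemini model name, the prompts, the `process_document` function, and the date pattern are all assumptions, and the sentiment-analysis and topic-classification modules are omitted for brevity.

```python
# Minimal sketch of the described OCR -> translate -> summarize pipeline.
# Assumptions: Tesseract language packs for eng/hin/tam are installed,
# GOOGLE_API_KEY is set, and the Gemini model name and prompts are illustrative.
import os
import re

import google.generativeai as genai
import gradio as gr
import pytesseract
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
llm = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

# Simple dd/mm/yyyy-style date pattern; a stand-in for the paper's regex module.
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")


def process_document(image_path: str, target_language: str = "English") -> str:
    """OCR the image, then translate, summarize, and re-translate via the LLM."""
    # 1. OCR: extract text in English, Hindi, or Tamil.
    raw_text = pytesseract.image_to_string(Image.open(image_path), lang="eng+hin+tam")

    # 2. Cross-lingual translation to a pivot language (English).
    english = llm.generate_content(f"Translate to English:\n{raw_text}").text

    # 3. Abstractive summarization of the translated text.
    summary = llm.generate_content(f"Summarize in 3 sentences:\n{english}").text

    # 4. Re-translation of the summary into the requested target language.
    final = llm.generate_content(f"Translate to {target_language}:\n{summary}").text

    # 5. Rule-based date extraction on the raw OCR output.
    dates = DATE_PATTERN.findall(raw_text)
    return f"{final}\n\nDates found: {', '.join(dates) if dates else 'none'}"


# Accessible Gradio front end, mirroring the interface described in the paper.
demo = gr.Interface(
    fn=process_document,
    inputs=[gr.Image(type="filepath"), gr.Textbox(value="English", label="Target language")],
    outputs=gr.Textbox(label="Output"),
)

if __name__ == "__main__":
    demo.launch()
```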
Similar Papers
Digitization of Document and Information Extraction using OCR
CV and Pattern Recognition
Reads messy and typed papers accurately.
Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
CV and Pattern Recognition
Translates handwritten legal papers instantly.
Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Computation and Language
Helps computers read Tamil and Sinhala text.