Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
By: Parshva Dhilankumar Patel
Potential Business Impact:
Reads invoice tables automatically for faster work.
This paper presents the design and development of an OCR-powered pipeline for extracting tables from invoices. The system uses Tesseract OCR for text recognition, followed by custom post-processing logic that detects, aligns, and extracts structured tabular data from scanned invoice documents. The approach combines dynamic preprocessing, table boundary detection, and row-column mapping, and is tuned for noisy and non-standard invoice formats. The resulting pipeline significantly improves the accuracy and consistency of data extraction, supporting real-world use cases such as automated financial workflows and digital archiving.
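The abstract does not spell out how row-column mapping is done, so the following is only a minimal sketch of one common way to approach it, not the author's implementation. It assumes per-word bounding boxes of the kind Tesseract reports (for example through pytesseract's `image_to_data`), clusters words into rows by vertical centre, and derives columns from horizontal whitespace gaps. The `Word` type, the function names, and the `y_tol` and `min_gap` thresholds are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word box (hypothetical type; fields mirror typical OCR output)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def _cy(w: Word) -> float:
    """Vertical centre of a word box."""
    return w.top + w.height / 2


def group_rows(words: List[Word], y_tol: float = 0.5) -> List[List[Word]]:
    """Cluster words into rows: a word joins the current row if its vertical
    centre lies within y_tol * height of that row's mean centre."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=_cy):
        if rows and abs(_cy(w) - mean(_cy(x) for x in rows[-1])) <= y_tol * w.height:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def find_columns(words: List[Word], min_gap: int = 15) -> List[Tuple[int, int]]:
    """Merge horizontal word spans; a gap of at least min_gap pixels
    between merged spans is treated as a column separator."""
    cols: List[List[int]] = []
    for lo, hi in sorted((w.left, w.left + w.width) for w in words):
        if cols and lo - cols[-1][1] < min_gap:
            cols[-1][1] = max(cols[-1][1], hi)
        else:
            cols.append([lo, hi])
    return [(lo, hi) for lo, hi in cols]


def to_table(words: List[Word], y_tol: float = 0.5, min_gap: int = 15) -> List[List[str]]:
    """Map word boxes onto a row-by-column grid of cell strings."""
    if not words:
        return []
    cols = find_columns(words, min_gap)
    table = []
    for row in group_rows(words, y_tol):
        cells: List[List[str]] = [[] for _ in cols]
        for w in sorted(row, key=lambda w: w.left):
            cx = w.left + w.width / 2
            # Every word span lies inside exactly one merged column span.
            idx = next(i for i, (lo, hi) in enumerate(cols) if lo <= cx <= hi)
            cells[idx].append(w.text)
        table.append([" ".join(c) for c in cells])
    return table


# Made-up boxes for a three-column invoice table (illustrative only).
sample = [
    Word("Item", 10, 10, 40, 12), Word("Qty", 200, 11, 30, 12), Word("Price", 300, 10, 45, 12),
    Word("Blue", 10, 40, 35, 12), Word("Widget", 50, 41, 55, 12),
    Word("2", 205, 40, 8, 12), Word("9.50", 300, 40, 35, 12),
    Word("Bolt", 10, 70, 30, 12), Word("10", 203, 71, 16, 12), Word("0.25", 300, 70, 35, 12),
]
table = to_table(sample)
```

Gap-based column detection is simple but assumes columns are separated by clear whitespace across the whole table; invoices with cells that span columns or with tightly packed text would need a different strategy, such as detecting ruling lines.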