Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices

Published: July 9, 2025 | arXiv ID: 2507.07029v1

By: Parshva Dhilankumar Patel

Potential Business Impact:

Reads invoice tables automatically for faster work.

Business Areas:
Image Recognition, Data and Analytics, Software

This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
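The row-column mapping step mentioned in the abstract can be illustrated with a minimal sketch. This is not the author's implementation, which the abstract does not describe in detail. It assumes word-level bounding boxes are already available (Tesseract's word-level output, for example via `pytesseract.image_to_data`, reports `left`, `top`, `width`, `height`, and `text` for each word) and that the first detected row is a header with one word per column. The names `Word`, `group_rows`, `to_table`, `y_tol`, and `x_tol` are hypothetical, as are the sample coordinates.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word box (hypothetical container, pixel coordinates)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(words, y_tol=10):
    """Cluster words into rows by comparing vertical centres."""
    def centre(w):
        return w.top + w.height / 2

    rows = []
    for w in sorted(words, key=centre):
        # Join the current row if the centre is close to the row's first word.
        if rows and abs(centre(w) - rows[-1]["cy"]) <= y_tol:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": centre(w), "words": [w]})
    # Within each row, read words left to right.
    return [sorted(r["words"], key=lambda w: w.left) for r in rows]


def to_table(words, y_tol=10, x_tol=15):
    """Map words to cells, using header left edges as column anchors.

    Assumption: the first row is a header with exactly one word per column.
    """
    rows = group_rows(words, y_tol)
    if not rows:
        return []
    anchors = [w.left for w in rows[0]]
    table = []
    for row in rows:
        cells = [""] * len(anchors)
        for w in row:
            # Rightmost anchor at or left of the word, allowing x_tol of jitter.
            idx = max(bisect_right(anchors, w.left + x_tol) - 1, 0)
            cells[idx] = (cells[idx] + " " + w.text).strip()
        table.append(cells)
    return table


# Made-up boxes standing in for OCR output on a small invoice table.
sample_words = [
    Word("Item", 50, 100, 40, 20), Word("Qty", 300, 100, 30, 20),
    Word("Price", 450, 102, 50, 20),
    Word("Blue", 50, 140, 40, 20), Word("Widget", 100, 141, 60, 20),
    Word("2", 305, 140, 10, 20), Word("9.99", 452, 139, 40, 20),
    Word("Bolt", 52, 180, 40, 20), Word("10", 300, 181, 20, 20),
    Word("0.50", 450, 180, 40, 20),
]
table = to_table(sample_words)
for row in table:
    print(row)
```

Anchoring columns on the header keeps multi-word cells such as "Blue Widget" together, but it would misplace right-aligned values that start well to the left of their header; a production pipeline for noisy invoices would likely need a more robust column model.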

Page Count
17 pages

Category
Computer Science:
Computer Vision and Pattern Recognition