
Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices

Published: July 9, 2025 | arXiv ID: 2507.07029v1

By: Parshva Dhilankumar Patel

Potential Business Impact:

Reads invoice tables automatically for faster work.

Business Areas:

Image Recognition, Data and Analytics, Software

This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
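The row-column mapping step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes OCR has already produced word boxes with `left`, `top`, `width`, and `height` fields (the fields pytesseract's `image_to_data` reports), and it assumes the first detected row is a header with exactly one word per column. The names `Word`, `group_rows`, `column_edges`, and `to_table` are hypothetical, as is the `tol` tolerance.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Word:
    """One OCR word box (hypothetical container for illustration)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(words, tol=0.6):
    """Cluster words into rows by vertical centre.

    A word joins the current row when its centre lies within
    tol * median word height of that row's running mean centre.
    """
    if not words:
        return []
    line_h = median(w.height for w in words)
    rows = []
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        cy = w.top + w.height / 2
        if rows and abs(cy - rows[-1]["cy"]) <= tol * line_h:
            row = rows[-1]
            row["words"].append(w)
            # Update the running mean of the row's vertical centre.
            row["cy"] += (cy - row["cy"]) / len(row["words"])
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, order words left to right.
    return [sorted(r["words"], key=lambda w: w.left) for r in rows]


def column_edges(header):
    """Column boundaries: midpoints of the gaps between header words.

    Assumption: the header row has exactly one word per column.
    """
    return [(a.left + a.width + b.left) / 2 for a, b in zip(header, header[1:])]


def to_table(words):
    """Map word boxes to a list of rows, each a list of cell strings."""
    rows = group_rows(words)
    if not rows:
        return []
    edges = column_edges(rows[0])
    table = []
    for row in rows:
        cells = [[] for _ in range(len(edges) + 1)]
        for w in row:
            cx = w.left + w.width / 2
            # Column index = number of boundaries left of the word centre.
            cells[sum(cx > e for e in edges)].append(w.text)
        table.append([" ".join(c) for c in cells])
    return table


# Hand-written sample boxes standing in for OCR output.
sample = [
    Word("Item", 10, 10, 40, 12), Word("Qty", 200, 11, 30, 12),
    Word("Price", 300, 9, 50, 12),
    Word("Blue", 10, 40, 40, 12), Word("widget", 55, 41, 60, 12),
    Word("2", 205, 40, 10, 12), Word("9.50", 305, 42, 40, 12),
]
print(to_table(sample))
```

Real invoices often have multi-word headers, merged cells, or skewed scans, so a production pipeline would need the preprocessing and table boundary detection the abstract describes before a mapping step like this one.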

Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

Computation and Language

Reads and understands text from pictures in any language.

16 May 2025 1

87%

Digitization of Document and Information Extraction using OCR

CV and Pattern Recognition

Reads messy and typed papers perfectly.

11 Jun 2025 0

86%

A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports

Computation and Language

Reads checkboxes on paper forms automatically.

28 Apr 2025 1


Page Count

17 pages
