Generating Synthetic Invoices via Layout-Preserving Content Replacement
By: Bevin V, Ananthakrishnan P V, Ragesh KR and more
Potential Business Impact:
Creates fake invoices for training AI.
The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.
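The data flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the `Token` record, the field labels, `fake_value` (a random stand-in for the LLM call), and `erase_boxes` (a toy fill on a 2D grid in place of real inpainting) are all hypothetical names chosen for illustration. OCR and font-matched text rendering are omitted; the sketch only shows how selected fields can be swapped while every bounding box is carried through unchanged to the JSON output.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class Token:
    """One OCR result: a field label (hypothetical), its text, and its box."""
    field: str
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


def fake_value(field, rng):
    """Stand-in for the LLM: return plausible text for a known field, else None."""
    if field == "invoice_number":
        return f"INV-{rng.randint(10000, 99999)}"
    if field == "total":
        return f"{rng.randint(10, 9999)}.{rng.randint(0, 99):02d}"
    if field == "vendor":
        return rng.choice(["Acme Supplies", "Northwind Traders", "Globex Ltd"])
    return None


def replace_fields(tokens, fields, seed=0):
    """Replace text for the selected fields; boxes are copied unchanged."""
    rng = random.Random(seed)
    result = []
    for tok in tokens:
        new_text = fake_value(tok.field, rng) if tok.field in fields else None
        result.append(Token(tok.field, new_text or tok.text, tok.box))
    return result


def erase_boxes(image, boxes, background=0):
    """Toy 'inpainting': return a copy of a 2D grid with each box filled.

    Boxes are (left, top, right, bottom) with right and bottom exclusive.
    A real pipeline would use an inpainting model or algorithm here.
    """
    out = [row[:] for row in image]
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                out[y][x] = background
    return out


def to_json(tokens):
    """Serialize tokens as the structured data paired with the new image."""
    return json.dumps([asdict(tok) for tok in tokens], indent=2)


if __name__ == "__main__":
    source = [
        Token("invoice_number", "INV-00123", (10, 10, 90, 24)),
        Token("label", "Total:", (10, 40, 50, 54)),
        Token("total", "42.00", (60, 40, 100, 54)),
    ]
    synthetic = replace_fields(source, {"invoice_number", "total"}, seed=1)
    print(to_json(synthetic))
```

Keeping the boxes fixed and changing only the text is what lets the structured output stay aligned with the rendered image in this sketch; a real system would also need to check that the new text fits inside the original box.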
Similar Papers
An Efficient Deep Learning-Based Approach to Automating Invoice Document Validation
CV and Pattern Recognition
Checks bills automatically, even messy ones.
Invoice Information Extraction: Methods and Performance Evaluation
Artificial Intelligence
Reads important info from bills automatically.
Invoice Information Extraction: Methods and Performance Evaluation
Artificial Intelligence
Reads bills and finds important money details.