Generating Synthetic Invoices via Layout-Preserving Content Replacement

Published: August 4, 2025 | arXiv ID: 2508.03754v1

By: Bevin V, Ananthakrishnan P V, Ragesh KR and more

Potential Business Impact:

Generates realistic synthetic invoices for training document-processing AI.

The performance of machine learning models for automated invoice processing is critically dependent on large-scale, diverse datasets. However, the acquisition of such datasets is often constrained by privacy regulations and the high cost of manual annotation. To address this, we present a novel pipeline for generating high-fidelity, synthetic invoice documents and their corresponding structured data. Our method first utilizes Optical Character Recognition (OCR) to extract the text content and precise spatial layout from a source invoice. Select data fields are then replaced with contextually realistic, synthetic content generated by a large language model (LLM). Finally, we employ an inpainting technique to erase the original text from the image and render the new, synthetic text in its place, preserving the exact layout and font characteristics. This process yields a pair of outputs: a visually realistic new invoice image and a perfectly aligned structured data file (JSON) reflecting the synthetic content. Our approach provides a scalable and automated solution to amplify small, private datasets, enabling the creation of large, varied corpora for training more robust and accurate document intelligence models.
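The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Field` record, the `synthesize_value` stand-in for the LLM call, the `replace_fields` helper, and the sample invoice values are all hypothetical. The OCR step is assumed to have already produced labeled text boxes, and the inpainting and rendering steps are represented only as a list of erase/draw instructions per bounding box.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class Field:
    label: str   # semantic field name (hypothetical), e.g. "invoice_number"
    text: str    # text as recognized by OCR
    box: tuple   # (x, y, width, height) in pixels


def synthesize_value(field, rng):
    """Stand-in for the LLM call (assumption): keep the shape of the text,
    swapping digits for digits and letters for letters."""
    out = []
    for ch in field.text:
        if ch.isdigit():
            out.append(str(rng.randint(0, 9)))
        elif ch.isalpha():
            new = rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
            out.append(new if ch.isupper() else new.lower())
        else:
            out.append(ch)  # punctuation and spaces stay in place
    return "".join(out)


def replace_fields(fields, targets, seed=0):
    """Replace the targeted fields and return (render edits, JSON annotation)."""
    rng = random.Random(seed)
    edits, new_fields = [], []
    for f in fields:
        if f.label in targets:
            new_text = synthesize_value(f, rng)
            # An inpainting/rendering stage would erase the old text
            # inside `box` and draw the new text at the same position.
            edits.append({"box": list(f.box), "erase": f.text, "draw": new_text})
            new_fields.append(Field(f.label, new_text, f.box))
        else:
            new_fields.append(f)
    annotation = json.dumps(
        {f.label: {"text": f.text, "box": list(f.box)} for f in new_fields},
        indent=2,
    )
    return edits, annotation


# Made-up OCR output for illustration only.
ocr_fields = [
    Field("vendor", "Acme Corp", (40, 30, 180, 24)),
    Field("invoice_number", "INV-00417", (420, 30, 120, 20)),
    Field("total", "1,284.50", (420, 610, 90, 20)),
]
edits, annotation = replace_fields(ocr_fields, targets={"invoice_number", "total"})
print(annotation)
```

Because the annotation is built from the same records that drive the render edits, the JSON and the image stay aligned by construction, which mirrors the paired output the abstract describes.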

Page Count
9 pages

Category
Computer Science:
Computer Vision and Pattern Recognition