Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing
By: David Berghaus, Armin Berger, Lars Hillebrand, and more
Potential Business Impact:
Helps computers read invoices and pull out their key data automatically.
This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and the open-source Gemma 3) on three diverse, openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing, which relies on the models' multi-modal capabilities, and a structured parsing approach that first converts documents to Markdown. Results show that native image processing generally outperforms the structured approach, with performance varying across model types and document characteristics. This benchmark provides guidance for selecting appropriate models and processing strategies for automated document systems. Our code is available online.
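The two strategies can be sketched as follows. This is an illustrative outline only, not the paper's actual pipeline: the function names, prompt text, and request schema below are assumptions, and real API payloads (OpenAI, Gemini, etc.) differ in detail.

```python
import base64

# Hypothetical zero-shot extraction prompt (not the paper's exact wording).
ZERO_SHOT_PROMPT = (
    "Extract the invoice number, date, vendor, and total amount "
    "from the following invoice. Respond with JSON only."
)


def build_image_request(image_bytes: bytes, model: str) -> dict:
    """Strategy 1: send the invoice image directly to a multi-modal model."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": ZERO_SHOT_PROMPT},
                # Image is base64-encoded inline, a common pattern in
                # multi-modal chat APIs.
                {"type": "image", "data": base64.b64encode(image_bytes).decode()},
            ],
        }],
    }


def build_markdown_request(markdown_text: str, model: str) -> dict:
    """Strategy 2: convert the document to Markdown first (e.g. via an
    OCR/layout parser), then send only the resulting text."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": ZERO_SHOT_PROMPT + "\n\n" + markdown_text,
        }],
    }


# Example usage with dummy data:
img_req = build_image_request(b"\x89PNG-dummy-bytes", model="gemma-3")
md_req = build_markdown_request("| Item | Price |\n|---|---|\n| Widget | 9.99 |",
                                model="gemma-3")
```

The practical trade-off the benchmark probes: the image path preserves layout cues (tables, stamps, logos) that a Markdown conversion may flatten or lose, while the text path works with cheaper text-only models and smaller payloads.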
Similar Papers
On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools
Information Retrieval
Helps computers understand complex charts and tables.
Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography
CV and Pattern Recognition
Computers can now identify religious pictures.
Can Multi-modal (reasoning) LLMs detect document manipulation?
CV and Pattern Recognition
Spots faked or manipulated documents using computer vision and reasoning.