Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing
By: David Berghaus, Armin Berger, Lars Hillebrand, and more
Potential Business Impact:
Helps computers read invoices and pull out their key data automatically.
This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and the open-source Gemma 3) on three diverse, openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing, which relies on the models' multi-modal capabilities, and a structured parsing approach that first converts documents to Markdown. Results show that native image processing generally outperforms the structured approach, with performance varying across model types and document characteristics. This benchmark provides guidance for selecting appropriate models and processing strategies for automated document systems. Our code is available online.
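The two strategies can be sketched as follows. This is an illustrative outline only, not the paper's actual pipeline: the function names, prompt text, and request schema below are assumptions, and real API payloads (OpenAI, Gemini, etc.) differ in detail.

```python
import base64

# Hypothetical zero-shot extraction prompt (not the paper's exact wording).
ZERO_SHOT_PROMPT = (
    "Extract the invoice number, date, vendor, and total amount "
    "from the following invoice. Respond with JSON only."
)


def build_image_request(image_bytes: bytes, model: str) -> dict:
    """Strategy 1: send the invoice image directly to a multi-modal model."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": ZERO_SHOT_PROMPT},
                # Image is base64-encoded inline, a common pattern in
                # multi-modal chat APIs.
                {"type": "image", "data": base64.b64encode(image_bytes).decode()},
            ],
        }],
    }


def build_markdown_request(markdown_text: str, model: str) -> dict:
    """Strategy 2: convert the document to Markdown first (e.g. via an
    OCR/layout parser), then send only the resulting text."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": ZERO_SHOT_PROMPT + "\n\n" + markdown_text,
        }],
    }


# Example usage with dummy data:
img_req = build_image_request(b"\x89PNG-dummy-bytes", model="gemma-3")
md_req = build_markdown_request("| Item | Price |\n|---|---|\n| Widget | 9.99 |",
                                model="gemma-3")
```

The practical trade-off the benchmark probes: the image path preserves layout cues (tables, stamps, logos) that a Markdown conversion may flatten or lose, while the text path works with cheaper text-only models and smaller payloads.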
Similar Papers
On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools
Information Retrieval
Helps computers understand complex charts and tables.
Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography
CV and Pattern Recognition
Computers can now identify religious pictures.
Can Multi-modal (reasoning) LLMs detect document manipulation?
CV and Pattern Recognition
Spots faked or manipulated documents using computer vision and reasoning.