Metadata-Driven Retrieval-Augmented Generation for Financial Question Answering
By: Michail Dadopoulos, Anestis Ladas, Stratos Moschidis and more
Potential Business Impact:
Helps computers understand long financial papers better.
Retrieval-Augmented Generation (RAG) struggles on long, structured financial filings where relevant evidence is sparse and cross-referenced. This paper presents a systematic investigation of advanced metadata-driven RAG techniques, proposing and evaluating a novel, multi-stage RAG architecture that leverages LLM-generated metadata. We introduce an indexing pipeline that creates contextually rich document chunks, and we benchmark a spectrum of enhancements on the FinanceBench dataset, including pre-retrieval filtering, post-retrieval reranking, and enriched embeddings. Our results reveal that while a powerful reranker is essential for precision, the most significant performance gains come from embedding chunk metadata directly with the chunk text ("contextual chunks"). Our proposed optimal architecture combines LLM-driven pre-retrieval optimizations with these contextual embeddings to achieve superior performance. Additionally, we present a custom metadata reranker that offers a cost-effective alternative to commercial solutions, highlighting a practical trade-off between peak performance and operational efficiency. This study provides a blueprint for building robust, metadata-aware RAG systems for financial document analysis.
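To make the "contextual chunks" idea concrete, here is a minimal sketch of embedding metadata together with chunk text, combined with a simple pre-retrieval metadata filter. It is an illustration under stated assumptions, not the authors' pipeline: the `Chunk` class, the metadata fields (`company`, `year`), and the functions `contextual_text`, `embed`, and `retrieve` are all hypothetical names, the metadata is hand-written rather than LLM-generated, and a toy bag-of-words vector stands in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Hypothetical metadata fields; the paper generates metadata with an LLM.
    metadata: dict = field(default_factory=dict)


def contextual_text(chunk):
    """Prepend the metadata to the chunk text so both are embedded together."""
    header = " | ".join(f"{k}: {v}" for k, v in sorted(chunk.metadata.items()))
    return f"{header}\n{chunk.text}" if header else chunk.text


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, filters=None, top_k=3):
    """Filter on exact metadata matches, then rank by contextual similarity."""
    filters = filters or {}
    # Pre-retrieval filtering: drop chunks whose metadata does not match.
    candidates = [
        c for c in chunks
        if all(c.metadata.get(k) == v for k, v in filters.items())
    ]
    q = embed(query)
    # Rank on the metadata-enriched text rather than the raw chunk text.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(q, embed(contextual_text(c))),
        reverse=True,
    )
    return ranked[:top_k]
```

In this sketch, a query such as "Acme 2022 revenue" can be matched to the right chunk even when the chunk body never names the company or the year, because those terms enter the vector through the metadata header. A post-retrieval reranker, which the abstract describes as essential for precision, would be applied to the output of `retrieve` and is not shown here.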
Similar Papers
Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models
Computation and Language
Answers money questions using company reports.
Retrieval Augmented Generation (RAG) for Fintech: Agentic Design and Evaluation
Artificial Intelligence
Helps computers understand tricky money words better.
Insight-RAG: Enhancing LLMs with Insight-Driven Augmentation
Computation and Language
Helps computers find better answers from many texts.