DocDancer: Towards Agentic Document-Grounded Information Seeking
By: Qintong Zhang, Xinjie Lv, Jialong Wu and more
Potential Business Impact:
Helps computers find answers in long documents.
Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source document agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data are evaluated on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, and the results demonstrate their effectiveness. Further analysis provides valuable insights into agentic tool design and synthetic data.
Similar Papers
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
Machine Learning (CS)
Answers questions using text and pictures together.
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
CV and Pattern Recognition
Helps computers understand videos with text.
DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections
Computation and Language
Helps computers answer questions from many science papers.