Benchmarking Table Extraction from Heterogeneous Scientific Extraction Documents
By: Marijan Soric, Cécile Gracianne, Ioana Manolescu and more
Potential Business Impact:
Helps computers understand tables in messy documents.
Table Extraction (TE) consists of extracting tables from PDF documents into a structured format that can be processed automatically. While numerous TE tools exist, the variety of methods and techniques makes it difficult for users to choose an appropriate one. We propose a novel benchmark for assessing end-to-end TE methods (from PDF to the final table). We contribute an analysis of TE evaluation metrics and the design of a rigorous evaluation process, which allows scoring each TE sub-task as well as end-to-end TE, and captures model uncertainty. Along with a prior dataset, our benchmark comprises two new heterogeneous datasets of 37k samples. We run our benchmark on diverse models, including off-the-shelf libraries, software tools, large vision language models, and approaches based on computer vision. The results demonstrate that TE remains challenging: current methods suffer from a lack of generalizability when facing heterogeneous data, and from limitations in robustness and interpretability.
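The abstract does not say which metrics the benchmark uses, so the following is only an illustrative sketch of what scoring an extracted table against a reference could look like. It assumes tables are given as lists of rows and uses exact matching of (row, column, text) cells to compute precision, recall, and F1. The function name `cell_f1` and the sample tables are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cell_f1(predicted, reference):
    """Return (precision, recall, f1) for a predicted table vs. a reference.

    Assumption: a cell counts as correct only if its row index, column
    index, and whitespace-normalized lowercase text all match exactly.
    This is an illustrative metric, not the benchmark's own.
    """
    def cells(table):
        # Represent each cell as (row index, column index, normalized text).
        return Counter(
            (r, c, " ".join(str(text).split()).lower())
            for r, row in enumerate(table)
            for c, text in enumerate(row)
        )

    pred, ref = cells(predicted), cells(reference)
    matched = sum((pred & ref).values())  # cells present in both tables
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: one of six cells was misread during extraction.
reference = [["Sample", "pH"], ["A1", "7.2"], ["A2", "6.8"]]
predicted = [["Sample", "pH"], ["A1", "7.2"], ["A2", "6.3"]]
precision, recall, f1 = cell_f1(predicted, reference)
```

Exact positional matching is deliberately strict: a single shifted row or column makes many cells count as wrong, which is one reason a sub-task-level breakdown like the one the abstract describes can be more informative than a single end-to-end score.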
Similar Papers
PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
CV and Pattern Recognition
Helps computers find and understand tables in documents.
Extracting Information from Scientific Literature via Visual Table Question Answering Models
Information Retrieval
Helps computers understand science tables for answers.
ExtracTable: Human-in-the-Loop Transformation of Scientific Corpora into Structured Knowledge
Digital Libraries
Helps scientists quickly find facts in papers.