DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
By: Nicholas Popovič, Ashish Kangen, Tim Schopf, and more
Potential Business Impact:
Teaches computers to find facts in long texts.
Large, high-quality annotated corpora for document-level entity and relation extraction remain scarce, which makes zero-shot and few-shot settings especially important. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Using this approach, we produce a synthetic dataset of over 5,000 Wikipedia abstracts with approximately 59,000 entities and 30,000 relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. We find that in-context joint entity and relation extraction at the document level remains a challenging task, even for state-of-the-art large language models.
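The retrieval-based in-context learning step described in the abstract, finding synthetic demonstrations similar to the target document and prepending them to the prompt, can be illustrated with a short sketch. The example below is a minimal, hypothetical implementation and not the authors' code: the embedding model (`all-MiniLM-L6-v2`), the `Demonstration` structure, and the prompt format are illustrative assumptions; only the general idea of retrieving demonstrations from a pre-built synthetic database at inference time comes from the abstract.

```python
# Minimal sketch of retrieval-based in-context learning for document-level
# joint entity and relation extraction. Assumes a pre-built demonstration
# database of synthetic examples (e.g. LLM-annotated Wikipedia abstracts).
from dataclasses import dataclass
import json

import numpy as np
from sentence_transformers import SentenceTransformer


@dataclass
class Demonstration:
    text: str             # synthetic document (e.g. a Wikipedia-style abstract)
    entities: list[dict]   # e.g. [{"mention": "...", "type": "..."}]
    relations: list[dict]  # e.g. [{"head": "...", "relation": "...", "tail": "..."}]


class DemoRetriever:
    """Retrieves the synthetic demonstrations most similar to a target document."""

    def __init__(self, demos: list[Demonstration], model_name: str = "all-MiniLM-L6-v2"):
        self.demos = demos
        self.encoder = SentenceTransformer(model_name)
        # Pre-compute normalized embeddings for all demonstration documents.
        self.demo_embeddings = self.encoder.encode(
            [d.text for d in demos], normalize_embeddings=True
        )

    def retrieve(self, document: str, k: int = 3) -> list[Demonstration]:
        query = self.encoder.encode([document], normalize_embeddings=True)[0]
        # With normalized vectors, cosine similarity is a plain dot product.
        scores = self.demo_embeddings @ query
        top_k = np.argsort(scores)[::-1][:k]
        return [self.demos[i] for i in top_k]


def build_prompt(document: str, demos: list[Demonstration]) -> str:
    """Assembles a few-shot prompt from retrieved demonstrations (format is illustrative)."""
    parts = ["Extract all entities and relation triples from the document as JSON.\n"]
    for d in demos:
        parts.append(
            f"Document:\n{d.text}\n"
            f"Output:\n{json.dumps({'entities': d.entities, 'relations': d.relations})}\n"
        )
    parts.append(f"Document:\n{document}\nOutput:\n")
    return "\n".join(parts)
```

At inference time, the assembled prompt would be passed to the LLM of choice; the paper uses a reasoning-optimized model, whose identity and exact prompting details are not covered by this sketch.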
Similar Papers
Zero-Shot Document-Level Biomedical Relation Extraction via Scenario-based Prompt Design in Two-Stage with LLM
Neural and Evolutionary Computing
Helps computers find health facts without human work.
GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
Computation and Language
Helps computers understand new information without human help.
Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers
Computation and Language
Finds where research data is used automatically.