GuideX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
By: Neil De La Fuente, Oscar Sainz, Iker García-Ferrero and more
Potential Business Impact:
Helps computers extract information from text in new domains without human-labeled training data.
Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points when combined with it. Models trained on GUIDEX also demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
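The gains above are reported in F1 points, that is, F1 on a 0 to 100 scale. As a rough illustration of how such a score is commonly computed for Named Entity Recognition, the sketch below calculates micro-averaged F1 over entity spans. It assumes exact-match scoring on (start, end, label) tuples; the function name `entity_f1` and the toy data are hypothetical, and the paper's actual evaluation protocol may differ.

```python
def entity_f1(gold, pred):
    """Micro-averaged F1 over entity spans.

    gold and pred are parallel lists with one set of (start, end, label)
    tuples per sentence. A predicted span counts as correct only when its
    boundaries and label both match a gold span exactly (an assumption).
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)

    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: three gold entities, three predictions, one with a wrong label.
gold = [{(0, 2, "PERSON"), (5, 6, "ORG")}, {(1, 3, "LOC")}]
pred = [{(0, 2, "PERSON"), (5, 6, "LOC")}, {(1, 3, "LOC")}]

score = entity_f1(gold, pred)
print(f"F1 = {100 * score:.1f} points")
```

Under this convention, a gain of 7 F1 points corresponds to a change such as 55.0 to 62.0 on the 0 to 100 scale.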
Similar Papers
Synthesized Annotation Guidelines are Knowledge-Lite Boosters for Clinical Information Extraction
Computation and Language
Computers learn to find medical facts automatically.
DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations
Computation and Language
Teaches computers to find facts in long texts.
ELTEX: A Framework for Domain-Driven Synthetic Data Generation
Computation and Language
Teaches computers to be experts in specific topics.