Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
By: Oded Ovadia, Meni Brief, Rachel Lemberg, and more
Potential Business Impact:
Teaches AI new facts without forgetting old ones.
While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
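The abstract describes the approach only at a high level: turn a limited corpus into information-dense synthetic instruction data with a relatively small language model, then inject the knowledge through pure instruction tuning rather than raw continual pre-training. The sketch below is a minimal illustration of that general pipeline under stated assumptions, not the paper's actual implementation; the helper `small_lm_generate`, the prompt template, the chunk size, and the "Contoso Robotics" example are all hypothetical placeholders.

```python
# Minimal sketch of a synthetic instruction-data pipeline in the spirit of
# Knowledge-Instruct: convert a small domain corpus into instruction/response
# pairs, then fine-tune on those pairs instead of raw continual pre-training.
# NOTE: small_lm_generate, the prompt wording, and the chunking parameters are
# illustrative assumptions, not the paper's specification.

import json
import textwrap
from typing import Dict, Iterable, List


def chunk_corpus(documents: Iterable[str], max_chars: int = 2000) -> List[str]:
    """Split documents into passages small enough to fit a generator prompt."""
    chunks = []
    for doc in documents:
        for start in range(0, len(doc), max_chars):
            chunks.append(doc[start:start + max_chars])
    return chunks


def small_lm_generate(prompt: str) -> str:
    """Placeholder for a call to a relatively small instruction-tuned LM.

    In practice this would call whatever generator is available (e.g. a locally
    hosted model); here it returns a canned response so the sketch runs
    end to end.
    """
    return json.dumps([
        {"instruction": "When was Contoso Robotics founded?",
         "response": "Contoso Robotics was founded in 2021."}
    ])


def make_instruction_pairs(passage: str) -> List[Dict[str, str]]:
    """Ask the generator for information-dense Q/A pairs grounded in a passage."""
    prompt = textwrap.dedent(f"""\
        Read the passage and write question/answer pairs that together cover
        every distinct fact it states. Reply with a JSON list of objects with
        "instruction" and "response" fields.

        Passage:
        {passage}
        """)
    try:
        return json.loads(small_lm_generate(prompt))
    except json.JSONDecodeError:
        return []  # skip malformed generations rather than failing the run


if __name__ == "__main__":
    corpus = ["Contoso Robotics, founded in 2021, builds warehouse robots ..."]
    dataset = [pair for passage in chunk_corpus(corpus)
               for pair in make_instruction_pairs(passage)]
    # The resulting pairs would then feed standard supervised instruction
    # tuning, rather than next-token training on the raw corpus.
    print(json.dumps(dataset, indent=2))
```

The design point this sketch is meant to convey is that the new facts reach the model in instruction/response form, which is what lets the method preserve instruction-following behavior while still memorizing the corpus; the exact prompting and filtering strategy is left to the paper itself.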
Similar Papers
IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation
Artificial Intelligence
Teaches AI new things without forgetting old skills.
Learning Dynamics in Continual Pre-Training for Large Language Models
Computation and Language
Predicts how well AI learns new tasks.
What Does Loss Optimization Actually Teach, If Anything? Knowledge Dynamics in Continual Pre-training of LLMs
Computation and Language
Makes AI remember new facts without forgetting old ones.