ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
By: Jian Chen, Jinbao Tian, Yankui Li, and more
Potential Business Impact:
Helps computers understand building plans better.
Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.
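The pipeline the abstract describes (LLM-generated explanations, continued masked-language-model pre-training of RoBERTa, then NER fine-tuning) maps onto standard tooling. Below is a minimal sketch using the Hugging Face transformers and datasets libraries; the checkpoint (roberta-base), corpus file name (cote_corpus.txt), label count, and hyperparameters are illustrative assumptions, not details taken from the paper or its repository.

```python
# Minimal sketch of an ARCE-style pipeline: continued MLM pre-training on an
# LLM-generated explanation corpus ("Cote"), then NER fine-tuning.
# File names and hyperparameters below are illustrative assumptions.
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Step 1: incremental pre-training on the explanation corpus,
# assumed here to be a plain-text file with one explanation per line.
cote = load_dataset("text", data_files={"train": "cote_corpus.txt"})["train"]
cote = cote.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="arce-mlm", num_train_epochs=3),
    train_dataset=cote,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("arce-mlm")
tokenizer.save_pretrained("arce-mlm")

# Step 2: fine-tune the adapted encoder for NER (token classification) on the
# labelled AEC dataset; num_labels depends on the dataset's tag scheme.
ner_model = AutoModelForTokenClassification.from_pretrained("arce-mlm", num_labels=9)
# ... fine-tune with Trainer on the token-labelled AEC data as usual.
```

The key design point this sketch illustrates is that the generated explanations feed a standard MLM objective rather than a bespoke one, so the only change from ordinary domain-adaptive pre-training is the provenance of the corpus.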
Similar Papers
From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Computation and Language
Teaches computers to be smart in special subjects.
ARC-Encoder: learning compressed text representations for large language models
Computation and Language
Makes AI understand more text with less work.