Domain-Adapted Pre-trained Language Models for Implicit Information Extraction in Crash Narratives
By: Xixi Wang, Jordanka Kovaceva, Miguel Costa and more
Potential Business Impact:
Helps analysts extract crash details from free-text reports.
Free-text crash narratives recorded in real-world crash databases have been shown to play a significant role in improving traffic safety. However, large-scale analyses remain difficult to implement because there are no documented tools that can batch-process the unstructured, non-standardized text written by authors with diverse experience and attention to detail. In recent years, Transformer-based pre-trained language models (PLMs), such as Bidirectional Encoder Representations from Transformers (BERT) and large language models (LLMs), have demonstrated strong capabilities across a variety of natural language processing tasks. These models can extract explicit facts from crash narratives, but their performance declines on inference-heavy tasks such as Crash Type identification, which can involve nearly 100 categories. Moreover, relying on closed LLMs through external APIs raises privacy concerns for sensitive crash data, and these black-box tools often underperform due to limited domain knowledge. Motivated by these challenges, we study whether compact open-source PLMs can support reasoning-intensive extraction from crash narratives. We target two challenging objectives: 1) identifying the Manner of Collision for a crash, and 2) identifying the Crash Type for each vehicle involved in the crash event, both from real-world crash narratives. To bridge domain gaps, we inject task-specific knowledge by fine-tuning LLMs with Low-Rank Adaptation (LoRA) and by fine-tuning BERT. Experiments on the authoritative real-world Crash Investigation Sampling System (CISS) dataset demonstrate that our fine-tuned compact models outperform strong closed LLMs, such as GPT-4o, while requiring only minimal training resources. Further analysis reveals that the fine-tuned PLMs capture richer narrative details and even correct some mislabeled annotations in the dataset.
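The abstract's key efficiency claim rests on Low-Rank Adaptation: instead of updating a full weight matrix, LoRA trains two small low-rank factors whose product is added to the frozen pre-trained weight. The sketch below illustrates that mechanism in plain NumPy under assumed toy dimensions and rank; it is not the paper's implementation, and all shapes, values, and the `lora_forward` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer dimensions, LoRA rank, and scaling factor.
d_out, d_in, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight (not updated)

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the pre-trained one.
A = rng.normal(scale=0.01, size=(r, d_in))
B = np.zeros((d_out, r))

def lora_forward(x, W, A, B, r, alpha):
    """Adapted forward pass: y = W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
y0 = lora_forward(x, W, A, B, r, alpha)
assert np.allclose(y0, W @ x)        # zero-initialized B => output unchanged

# After (hypothetical) training, B is nonzero; the learned update B @ A has
# rank at most r, so only r * (d_in + d_out) parameters are stored per
# adapted matrix instead of d_in * d_out.
B = rng.normal(size=(d_out, r))
y1 = lora_forward(x, W, A, B, r, alpha)
```

This is why the paper can report "minimal training resources": for a large model, the trainable factors `A` and `B` are orders of magnitude smaller than the frozen weights they adapt.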
Similar Papers
Improving Crash Data Quality with Large Language Models: Evidence from Secondary Crash Narratives in Kentucky
Computation and Language
Finds hidden car crash causes in police reports.
Domain Adaptation of LLMs for Process Data
Computation and Language
Helps computers predict what happens next in a process.
Improving Narrative Classification and Explanation via Fine Tuned Language Models
Computation and Language
Finds hidden messages and explains them clearly.