Evaluating CxG Generalisation in LLMs via Construction-Based NLI Fine-Tuning
By: Tom Mackintosh, Harish Tayyar Madabushi, Claire Bonial
Potential Business Impact:
Helps computers understand sentence structure better.
We probe large language models' ability to learn deep form-meaning mappings as defined by construction grammars. We introduce ConTest-NLI, a benchmark of 80k sentences covering eight English constructions, ranging from highly lexicalized to highly schematic. Our pipeline generates diverse synthetic NLI triples via templating and a model-in-the-loop filter, supplemented by human validation to ensure both challenge and label reliability. Zero-shot tests on leading LLMs reveal a 24-point drop in accuracy from naturalistic (88%) to adversarial data (64%), with schematic patterns proving hardest. Fine-tuning on a subset of ConTest-NLI yields improvements of up to 9%, yet our results highlight persistent abstraction gaps in current LLMs and offer a scalable framework for evaluating construction-informed learning.
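To make the generation pipeline concrete, the sketch below shows one way templated NLI triples for a single schematic construction could be produced and screened. The comparative-correlative template, the toy lexicon, and the filtering criterion are all illustrative assumptions for this sketch, not the authors' actual templates, data, or filtering model.

```python
import random

# Hypothetical template for one schematic construction, the English
# comparative correlative ("the Xer ..., the Yer ...").  The paper's real
# templates, construction inventory, and filtering model are not shown here.
TEMPLATE = {
    "premise": "The {adj1}er the {noun1}, the {adj2}er the {noun2}.",
    # Entailed hypothesis: paraphrases the scalar dependency the construction encodes.
    "entailment": "As the {noun1} becomes {adj1}er, the {noun2} becomes {adj2}er.",
    # Adversarial hypothesis: breaks the dependency, so the label flips.
    "contradiction": "As the {noun1} becomes {adj1}er, the {noun2} does not become {adj2}er.",
}

# Toy lexicon; adjective stems are chosen so that "-er" attaches cleanly.
LEXICON = {
    "adj1": ["loud", "long", "dark"],
    "adj2": ["strong", "quiet", "light"],
    "noun1": ["music", "meeting", "night"],
    "noun2": ["coffee", "crowd", "street"],
}


def sample_triples(n, seed=0):
    """Generate n (premise, hypothesis, label) triples by filling the template."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n):
        slots = {slot: rng.choice(words) for slot, words in LEXICON.items()}
        label = rng.choice(["entailment", "contradiction"])
        triples.append({
            "premise": TEMPLATE["premise"].format(**slots),
            "hypothesis": TEMPLATE[label].format(**slots),
            "label": label,
        })
    return triples


def model_in_the_loop_filter(triples, predict):
    """Retain triples the screening model gets wrong, i.e. candidate adversarial items.

    `predict` stands in for the LLM used to screen generated items; any callable
    mapping (premise, hypothesis) -> label works.  This is one plausible filtering
    criterion, not necessarily the one used for ConTest-NLI.
    """
    return [t for t in triples if predict(t["premise"], t["hypothesis"]) != t["label"]]


if __name__ == "__main__":
    data = sample_triples(6)
    # Trivial stand-in model that always predicts "entailment"; the filter then
    # keeps only the items it misclassifies.
    hard = model_in_the_loop_filter(data, lambda p, h: "entailment")
    for t in hard:
        print(t["label"], "|", t["premise"], "->", t["hypothesis"])
```

In a full pipeline, the retained items would additionally pass through human validation before entering the benchmark, as the abstract describes.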
Similar Papers
NeurIPS 2023 LLM Efficiency Fine-tuning Competition
Computation and Language
Makes AI smarter by cleaning its learning data.
MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
Computation and Language
Makes AI understand sentences better, even when words change.
Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training
Computation and Language
Makes AI understand language better, even tricky parts.