Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data

Published: June 16, 2025 | arXiv ID: 2506.14064v1

By: Iona Carslaw, Sivan Milton, Nicolas Navarre and more

Potential Business Impact:

Finds complex sentence parts in real writing.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.
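To make the idea of combining a constituency parse with heuristics concrete, here is a minimal sketch in Python. It is not the authors' tool: the abstract does not spell out their heuristics, so the single rule used here (an SBAR node that is a direct child of a VP counts as an embedded clause of that VP's verb) is an assumption chosen for illustration. The function names `parse_bracketed`, `leaves`, and `embedded_clauses` are hypothetical, and the input is assumed to be a Penn Treebank style bracketed parse produced by some external parser.

```python
from typing import List, Tuple

# A tree is (label, children); each child is another tree or a token string.


def parse_bracketed(text: str):
    """Read one Penn Treebank style bracketed parse into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(tree) -> List[str]:
    """Return the words under a node, left to right."""
    words: List[str] = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def embedded_clauses(tree) -> List[Tuple[str, str]]:
    """Illustrative heuristic (an assumption, not the paper's rule set):
    an SBAR directly under a VP is an embedded clause of that VP's verb."""
    label, children = tree
    subtrees = [c for c in children if not isinstance(c, str)]
    found: List[Tuple[str, str]] = []
    if label == "VP":
        verbs = [leaves(c)[0] for c in subtrees if c[0].startswith("VB")]
        if verbs:
            for c in subtrees:
                if c[0] == "SBAR":
                    found.append((verbs[0], " ".join(leaves(c))))
    for c in subtrees:
        found.extend(embedded_clauses(c))
    return found


EXAMPLE = (
    "(S (NP (PRP She)) (VP (VBZ knows) "
    "(SBAR (IN that) (S (NP (PRP he)) (VP (VBD left))))))"
)
print(embedded_clauses(parse_bracketed(EXAMPLE)))
```

A real pipeline would need many more rules than this one (for example, for relative clauses, adjuncts, and infinitival complements), which is presumably why the authors validate their tool against the hand-annotated GECS set.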

A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

Computation and Language

Lets computers sort words for language study.

14 Oct 2025 0

86%

Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

Computation and Language

Organizes texts into a family tree of ideas.

29 Dec 2025 1

86%

Scalable Multi-phase Word Embedding Using Conjunctive Propositional Clauses

Machine Learning (CS)

Helps computers understand words better, even long sentences.

31 Jan 2025 0

Country of Origin

🇬🇧 United Kingdom

Repos / Data Links

github.com huggingface.co

Page Count

11 pages
