LIME: Making LLM Data More Efficient with Linguistic Metadata Embeddings
By: Sebastian Sztwiertnia, Felix Friedrich, Kristian Kersting, and more
Potential Business Impact:
Teaches computers language faster and better.
Pre-training decoder-only language models relies on vast amounts of high-quality data, yet the availability of such data is increasingly reaching its limits. While metadata is commonly used to create and curate these datasets, its potential as a direct training signal remains under-explored. We challenge this status quo and propose LIME (Linguistic Metadata Embeddings), a method that enriches token embeddings with metadata capturing syntax, semantics, and contextual properties. LIME substantially improves pre-training efficiency. Specifically, it adapts up to 56% faster to the training data distribution, while introducing only 0.01% additional parameters at negligible compute overhead. Beyond efficiency, LIME improves tokenization, leading to remarkably stronger language modeling capabilities and generative task performance. These benefits persist across model scales (500M to 2B). In addition, we develop a variant with shifted metadata, LIME+1, that can guide token generation. Given prior metadata for the next token, LIME+1 improves reasoning performance by up to 38% and arithmetic accuracy by up to 35%.
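The abstract does not spell out how the metadata is fused into the model, so the following is only a minimal sketch of the general idea of enriching token embeddings with linguistic metadata embeddings. The feature choice (part-of-speech tags), vocabulary sizes, class names, and the additive fusion are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch: enrich token embeddings with a small linguistic-metadata
# embedding table. All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class MetadataEnrichedEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_meta_classes: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Extra table for metadata (e.g., POS tags). Because
        # num_meta_classes << vocab_size, it adds only a tiny
        # fraction of additional parameters.
        self.meta_emb = nn.Embedding(num_meta_classes, d_model)

    def forward(self, token_ids: torch.Tensor, meta_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, meta_ids: (batch, seq_len)
        # Fuse by addition (an assumed fusion choice for this sketch).
        return self.token_emb(token_ids) + self.meta_emb(meta_ids)

# Usage with a hypothetical 50k-token vocabulary and 32 metadata classes.
emb = MetadataEnrichedEmbedding(vocab_size=50_000, d_model=512, num_meta_classes=32)
tokens = torch.randint(0, 50_000, (2, 16))
tags = torch.randint(0, 32, (2, 16))
print(emb(tokens, tags).shape)  # torch.Size([2, 16, 512])
```

For the LIME+1 variant described in the abstract, the metadata ids would be shifted so that each position sees the metadata of the next token, letting the metadata guide generation; that shift is not shown in the sketch above.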
Similar Papers
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
Computation and Language
Makes AI learn much faster with extra clues.
LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
Information Retrieval
Recommends things faster, even with lots of choices.