Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity

Published: January 7, 2026 | arXiv ID: 2601.04283v1

By: Nikolay Yudin

Potential Business Impact:

Helps language models do modular addition reliably, even when the same problem appears at a different position in the text or in different wording.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, focusing on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions ("position shift") or presented under out-of-distribution natural-language templates ("template OOD"). Using a disjoint-pair split over all ordered operand pairs (a, b) with modulus p = 97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) a position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.
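To make the setup in the abstract concrete, here is a minimal data-generation sketch in Python. It is not the author's code: the template strings, the `[` and `]` boundary markers, the space padding used to shift positions, the 80/20 split ratio, and the linear curriculum schedule are all assumptions chosen for illustration. It shows a disjoint split over the 97 × 97 ordered pairs, rendering with a position shift, a curriculum that widens the range of shifts over training, and several variants of one example that share a target (the raw material for a consistency objective, which is not implemented here).

```python
import random

P = 97  # modulus stated in the abstract

# Hypothetical templates; the paper's actual templates are not given here.
TEMPLATES = [
    "{a}+{b} mod {p}=",
    "What is {a} plus {b} modulo {p}?",
    "Compute ({a} + {b}) % {p}:",
]


def disjoint_pair_split(p=P, train_frac=0.8, seed=0):
    """Shuffle all ordered pairs (a, b) and cut them into disjoint train/test sets."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    cut = int(train_frac * len(pairs))  # train_frac is an assumed ratio
    return pairs[:cut], pairs[cut:]


def render(a, b, template, p=P, shift=0, markers=("[", "]")):
    """Return (prompt, target). `shift` left-pads with spaces so the expression
    starts at a later absolute character position; `markers` are assumed
    boundary tokens wrapped around the expression."""
    body = template.format(a=a, b=b, p=p)
    prompt = " " * shift + markers[0] + body + markers[1]
    return prompt, str((a + b) % p)


def max_shift_at(step, total_steps, final_max_shift):
    """Assumed linear position curriculum: the allowed shift grows from 0 to final_max_shift."""
    frac = min(1.0, step / total_steps)
    return int(final_max_shift * frac)


def variants(a, b, k, max_shift, rng):
    """Sample k renderings of one pair with random template and shift; targets are identical."""
    return [
        render(a, b, rng.choice(TEMPLATES), shift=rng.randint(0, max_shift))
        for _ in range(k)
    ]


if __name__ == "__main__":
    train, test = disjoint_pair_split()
    rng = random.Random(0)
    a, b = train[0]
    for prompt, target in variants(a, b, k=3, max_shift=max_shift_at(500, 1000, 32), rng=rng):
        print(repr(prompt), "->", target)
```

Under these assumptions, a position-shift evaluation would render held-out pairs with shifts larger than any seen in training, and a template-OOD evaluation would render them with templates excluded from `TEMPLATES`.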

Mitigating Coordinate Prediction Bias from Positional Encoding Failures

CV and Pattern Recognition

Helps computers find exact spots in pictures.

25 Oct 2025

From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Machine Learning (CS)

Teaches computers to learn and remember better.

21 Dec 2025

Impact of Positional Encoding: Clean and Adversarial Rademacher Complexity for Transformers under In-Context Regression

Machine Learning (Stat)

Makes AI models less accurate and more easily fooled.

10 Dec 2025

Page Count

12 pages
