Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
By: Enze Pan
Potential Business Impact:
Tests AI when rules unexpectedly change.
We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts. Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under held-out-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently replicated. We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what "uncertainty reduction" objectives can and cannot guarantee under rule shifts.
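The identities referenced in point (iii) can be checked numerically on a toy example. The sketch below (an illustration, not code from the paper; the joint distribution over a latent rule R and an observation O is made up) verifies that the entropy reduction H(R) - H(R|O) equals both the mutual information I(R;O) and the observation-averaged posterior KL divergence E_O[KL(p(R|O) || p(R))]:

```python
import math

# Toy check of the identity H(R) - H(R|O) = I(R;O) = E_O[KL(p(R|O) || p(R))]
# for a discrete latent rule R and observation O.
# The joint distribution below is arbitrary illustrative data.
rules, obs = ['r1', 'r2', 'r3'], ['o1', 'o2']
p_joint = {('r1', 'o1'): 0.20, ('r1', 'o2'): 0.05,
           ('r2', 'o1'): 0.10, ('r2', 'o2'): 0.25,
           ('r3', 'o1'): 0.05, ('r3', 'o2'): 0.35}

p_r = {r: sum(p_joint[(r, o)] for o in obs) for r in rules}   # marginal p(r)
p_o = {o: sum(p_joint[(r, o)] for r in rules) for o in obs}   # marginal p(o)

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Entropy reduction: H(R) - H(R|O), with H(R|O) = sum_o p(o) H(R|O=o)
H_R = entropy(p_r.values())
H_R_given_O = sum(p_o[o] * entropy(p_joint[(r, o)] / p_o[o] for r in rules)
                  for o in obs)
entropy_reduction = H_R - H_R_given_O

# Mutual information I(R;O) computed directly from the joint
I_RO = sum(p_joint[(r, o)] * math.log(p_joint[(r, o)] / (p_r[r] * p_o[o]))
           for r in rules for o in obs)

# Expected posterior KL: E_O[KL(p(R|O) || p(R))]
exp_post_kl = sum(p_o[o] * sum((p_joint[(r, o)] / p_o[o]) *
                               math.log((p_joint[(r, o)] / p_o[o]) / p_r[r])
                               for r in rules)
                  for o in obs)

assert math.isclose(entropy_reduction, I_RO)
assert math.isclose(entropy_reduction, exp_post_kl)
```

Because the three quantities coincide, an agent objective phrased as "reduce posterior entropy over the rule" is equivalent, in expectation, to maximizing information gained about the rule; as the abstract notes, this equivalence by itself does not guarantee anything about rules outside the training distribution.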
Similar Papers
Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning
Machine Learning (CS)
Teaches computers to solve puzzles by guessing rules.
Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy
Machine Learning (CS)
Helps robots learn new tasks without retraining.