Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning
By: Enze Pan
Potential Business Impact:
Tests AI when rules unexpectedly change.
We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts. Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under held-out-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently replicated. We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what "uncertainty reduction" objectives can and cannot guarantee under rule shifts.
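The identities referenced in point (iii) can be checked numerically on a toy example. The sketch below (an illustration, not code from the paper; the joint distribution over a latent rule R and an observation O is made up) verifies that the entropy reduction H(R) - H(R|O) equals both the mutual information I(R;O) and the observation-averaged posterior KL divergence E_O[KL(p(R|O) || p(R))]:

```python
import math

# Toy check of the identity H(R) - H(R|O) = I(R;O) = E_O[KL(p(R|O) || p(R))]
# for a discrete latent rule R and observation O.
# The joint distribution below is arbitrary illustrative data.
rules, obs = ['r1', 'r2', 'r3'], ['o1', 'o2']
p_joint = {('r1', 'o1'): 0.20, ('r1', 'o2'): 0.05,
           ('r2', 'o1'): 0.10, ('r2', 'o2'): 0.25,
           ('r3', 'o1'): 0.05, ('r3', 'o2'): 0.35}

p_r = {r: sum(p_joint[(r, o)] for o in obs) for r in rules}   # marginal p(r)
p_o = {o: sum(p_joint[(r, o)] for r in rules) for o in obs}   # marginal p(o)

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Entropy reduction: H(R) - H(R|O), with H(R|O) = sum_o p(o) H(R|O=o)
H_R = entropy(p_r.values())
H_R_given_O = sum(p_o[o] * entropy(p_joint[(r, o)] / p_o[o] for r in rules)
                  for o in obs)
entropy_reduction = H_R - H_R_given_O

# Mutual information I(R;O) computed directly from the joint
I_RO = sum(p_joint[(r, o)] * math.log(p_joint[(r, o)] / (p_r[r] * p_o[o]))
           for r in rules for o in obs)

# Expected posterior KL: E_O[KL(p(R|O) || p(R))]
exp_post_kl = sum(p_o[o] * sum((p_joint[(r, o)] / p_o[o]) *
                               math.log((p_joint[(r, o)] / p_o[o]) / p_r[r])
                               for r in rules)
                  for o in obs)

assert math.isclose(entropy_reduction, I_RO)
assert math.isclose(entropy_reduction, exp_post_kl)
```

Because the three quantities coincide, an agent objective phrased as "reduce posterior entropy over the rule" is equivalent, in expectation, to maximizing information gained about the rule; as the abstract notes, this equivalence by itself does not guarantee anything about rules outside the training distribution.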
Similar Papers
Toward a Metrology for Artificial Intelligence: Hidden-Rule Environments and Reinforcement Learning
Machine Learning (CS)
Teaches computers to solve puzzles by guessing rules.
Out-of-Distribution Generalization with a SPARC: Racing 100 Unseen Vehicles with a Single Policy
Machine Learning (CS)
Helps robots learn new tasks without retraining.