Scaling Bidirectional Spans and Span Violations in Attention Mechanism
By: Jongwook Kim, Sangheon Yun, Sukjin Yoon
Potential Business Impact:
Makes AI learn faster by fixing its thinking.
The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong empirical evidence that the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on $0^{\text{th}}$-order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes.
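As a rough illustration of the kind of decomposition the abstract describes (the paper does not publish code here), the sketch below splits a gradient vector into the component lying in the span of a reference basis (the "parallel span") and the orthogonal remainder (the "span violation"), then rescales the two parts separately. The function name, the use of an orthogonal projector, and the scaling coefficients `alpha` and `beta` are assumptions for illustration, not the authors' implementation.

```python
import torch

def decompose_and_scale(grad, basis, alpha=1.1, beta=1.0):
    """Sketch only: split `grad` into its component inside span(basis) and the
    orthogonal residual, then rescale each part before it is used as the
    learning signal. Shapes and names are hypothetical.

    grad:  (d,) gradient vector for one token/head
    basis: (d, k) matrix whose columns span the reference subspace
    alpha: scale applied to the parallel-span component
    beta:  scale applied to the orthogonal violation component
    """
    # Orthogonal projector onto span(basis): P = B B^+
    projector = basis @ torch.linalg.pinv(basis)
    parallel = projector @ grad          # component inside the span
    violation = grad - parallel          # orthogonal "violation" residual
    return alpha * parallel + beta * violation

# Example usage with toy sizes.
g = torch.randn(64)
B = torch.randn(64, 8)
g_scaled = decompose_and_scale(g, B, alpha=1.05)
```

In this reading, setting `alpha > 1` while leaving `beta = 1` corresponds to emphasizing the parallel-span direction of the gradient, which is the flavor of selective scaling the abstract reports as most effective; how the basis is chosen from the attention computation is specific to the paper and not reproduced here.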
Similar Papers
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Machine Learning (CS)
Makes AI models remember more without using more memory.
ScaleFormer: Span Representation Cumulation for Long-Context Transformer
Computation and Language
Lets computers understand long stories better.
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
CV and Pattern Recognition
Makes AI art creation much faster.