Rethinking the long-range dependency in Mamba/SSM and transformer models
By: Cong Ma, Kayvan Najarian
Potential Business Impact:
Makes computers remember longer, like brains do.
Long-range dependency is one of the most desired properties of recent sequence models such as state-space models (particularly Mamba) and transformer models. New model architectures are being actively developed and benchmarked for prediction tasks that require long-range dependency. However, the capability of these models to capture long-range dependencies has not been investigated from a theoretical perspective, which hinders systematic improvement on this aspect. In this work, we mathematically define long-range dependency using the derivative of hidden states with respect to past inputs, and we compare the capability of SSM and transformer models to model long-range dependency based on this definition. We show that the long-range dependency of SSM decays exponentially with the sequence length, which aligns with the exponential decay of the memory function in RNNs. In contrast, the attention mechanism used in transformers is more flexible and is not constrained to exponential decay, so in theory it could perform better at modeling long-range dependency given sufficient training data, computing resources, and proper training. To combine the flexible long-range dependency of the attention mechanism with the computational efficiency of SSM, we propose a new formulation for the hidden state update in SSM and prove its stability when the input data follow a standard Gaussian distribution.
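The derivative-based definition can be illustrated with a minimal numerical sketch. Assuming a linear time-invariant SSM of the form h_t = A h_{t-1} + B x_t (the matrices, dimensions, and stability scaling below are illustrative choices, not taken from the paper), the Jacobian of the final hidden state with respect to a past input is ∂h_T/∂x_t = A^(T-t) B, and its norm decays exponentially in the lag T - t whenever the spectral radius of A is below 1:

```python
# Minimal sketch of exponential decay of long-range dependency in a linear SSM.
# Assumption: h_t = A h_{t-1} + B x_t with a stable state matrix A; the
# dimensions and random matrices here are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, seq_len = 8, 64

# Build a stable state matrix by rescaling a random matrix to spectral radius 0.9.
A = rng.standard_normal((d, d))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((d, 1))

# For this recurrence, d h_T / d x_t = A^(T-t) B, so track ||A^lag B|| as the lag grows.
jac = B.copy()
for lag in range(seq_len):
    if lag % 8 == 0 or lag == seq_len - 1:
        print(f"lag {lag:2d}  ||dh_T/dx_t|| = {np.linalg.norm(jac):.3e}")
    jac = A @ jac
```

Running this prints a Jacobian norm that shrinks roughly geometrically with the lag, which is the exponential decay the abstract attributes to SSMs; an attention score between positions t and T carries no such built-in decay.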
Similar Papers
Leveraging State Space Models in Long Range Genomics
Genomics
Helps computers understand long DNA codes better.
When recalling in-context, Transformers are not SSMs
Machine Learning (CS)
Makes AI better at remembering and understanding.
Characterizing the Behavior of Training Mamba-based State Space Models on GPUs
Machine Learning (CS)
Makes AI faster at understanding long texts.