Small transformer architectures for task switching
By: Claudius Gros
Potential Business Impact:
Helps AI switch tasks better, like a smart student.
The rapid progress of large-scale generative AI is largely based on the attention mechanism. Conversely, it is non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework, models work on ongoing token sequences, with the current task determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task-switching reference model based on finite-domain arithmetic, which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). Transformers, long short-term memory recurrent networks (LSTMs), and plain multi-layer perceptrons (MLPs) all achieve similar, but only modest, prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translationally invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. The combination of these two is found to be the only model able to achieve a considerable performance level, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when qualitatively different formulations are compared in task-switching settings.
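To make the setup concrete, here is a minimal sketch of how an IARC-style task-switching token stream could be generated. It assumes a small finite domain of size N, one control token per subtask, and a fixed switching probability; the subtask definitions (in particular 'context'), the control-token encoding, and all names are illustrative assumptions rather than the paper's exact specification.

```python
import random

# Minimal sketch (assumptions, not the paper's exact setup): a toy generator
# for an IARC-style task-switching token stream over a finite number domain.

N = 10                       # size of the finite domain, tokens 0 .. N-1 (assumption)
CONTROL = {                  # one control token per subtask (assumption)
    "INC": N,                # increment:    y_t = (x_t + 1) mod N
    "ADD": N + 1,            # addition:     y_t = (x_t + x_{t-1}) mod N
    "REV": N + 2,            # reverse copy: y_t = x_{t-1}
    "CTX": N + 3,            # context:      target tied to the active control token (assumption)
}
P_SWITCH = 0.15              # probability of interspersing a control token (assumption)

def generate_stream(length, seed=0):
    """Yield (input_token, target_token) pairs for an ongoing task-switching stream."""
    rng = random.Random(seed)
    task = "INC"
    prev = rng.randrange(N)
    for _ in range(length):
        if rng.random() < P_SWITCH:             # stochastically switch the active task
            task = rng.choice(list(CONTROL))
            yield CONTROL[task], CONTROL[task]  # control tokens are simply echoed (assumption)
            continue
        x = rng.randrange(N)
        if task == "INC":
            y = (x + 1) % N
        elif task == "ADD":
            y = (x + prev) % N
        elif task == "REV":
            y = prev
        else:                                    # "CTX"
            y = CONTROL[task]
        yield x, y
        prev = x

if __name__ == "__main__":
    for pair in generate_stream(20):
        print(pair)
```

A model trained on such a stream must infer the currently active rule from the most recent control token, which is the property that distinguishes this benchmark from ordinary next-token prediction.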
Similar Papers
When recalling in-context, Transformers are not SSMs
Machine Learning (CS)
Makes AI better at remembering and understanding.
Efficient Inter-Task Attention for Multitask Transformer Models
CV and Pattern Recognition
Makes smart computers learn many things faster.
Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures
Machine Learning (CS)
Helps computers remember more, like humans.