Mamba-2 audio captioning: design space exploration and analysis
By: Taehan Lee, Jaehan Jung, Hyukjun Lee
Potential Business Impact:
A model that listens to sounds and describes them in words.
We present an audio captioning model built on the Mamba-2 large language model (LLM) backbone, a state-of-the-art (SOTA) state-space model (SSM). We systematically explore the design space of LLM sizes, LoRA ranks, and connector designs, leveraging Mamba-2's linear-time complexity with respect to sequence length. Across benchmarks, our models achieve strong captioning performance compared with larger language models trained on the same dataset, despite using fewer parameters. For the first time, we conduct an in-depth analysis of how the number of LLM parameters, audio encoder fine-tuning strategies, audio feature diversity, and feature reduction or expansion techniques affect performance.
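To make the pipeline the abstract describes concrete, here is a minimal sketch of an encoder-connector-LLM captioner in PyTorch. Everything here is an assumption for illustration: the class name `AudioCaptioner`, the dimensions, and the specific connector (average pooling for feature reduction followed by a linear projection) are hypothetical, not the paper's actual connector designs or LoRA configuration.

```python
import torch
import torch.nn as nn


class AudioCaptioner(nn.Module):
    """Hypothetical sketch: audio encoder -> connector -> Mamba-2 LLM backbone."""

    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 enc_dim: int = 768, llm_dim: int = 2048, pool_stride: int = 4):
        super().__init__()
        self.audio_encoder = audio_encoder  # e.g., a frozen or LoRA-tuned encoder
        self.llm = llm                      # Mamba-2 backbone (LoRA adapters assumed)
        # Connector (assumed design): pool over time to reduce the feature
        # sequence, then project into the LLM's embedding space.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, waveform: torch.Tensor, caption_embeds: torch.Tensor):
        feats = self.audio_encoder(waveform)                      # (B, T, enc_dim)
        feats = self.pool(feats.transpose(1, 2)).transpose(1, 2)  # (B, T/stride, enc_dim)
        prefix = self.proj(feats)                                 # (B, T/stride, llm_dim)
        # Prepend the projected audio features to the caption token embeddings
        # and let the backbone model the joint sequence.
        return self.llm(torch.cat([prefix, caption_embeds], dim=1))
```

One design note this sketch highlights: because Mamba-2 scales linearly with sequence length, the pooling stride trades audio detail against prefix length rather than against a quadratic attention cost, which is part of why the feature reduction and expansion comparison in the paper is interesting.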