AudioToolAgent: An Agentic Framework for Audio-Language Models
By: Gijs Wijngaard, Elia Formisano, Michel Dumontier
Potential Business Impact:
Lets computers understand and answer questions about sounds.
Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent, which accesses tool adapters for audio question answering and speech-to-text. The agent selects tools, asks follow-up questions, and compares outputs for verification. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Monte Carlo sampling of Shapley values across 374 configurations identifies effective agent-tool combinations. The modular design allows new tools to be integrated and removes the need for training data and the associated training costs. Code and reproduction materials are available at: github.com/GLJS/AudioToolAgent
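The Shapley analysis mentioned in the abstract can be illustrated with a minimal sketch of permutation-based Monte Carlo estimation. This is not the authors' code: the tool names, the toy accuracy function, and every number in it are hypothetical placeholders chosen for illustration, and the paper's actual sampling procedure may differ.

```python
import random
from typing import Callable, Dict, FrozenSet, List

def monte_carlo_shapley(
    tools: List[str],
    value: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate each tool's Shapley value by averaging its marginal
    contribution over randomly sampled orderings of the tools."""
    rng = random.Random(seed)
    totals = {tool: 0.0 for tool in tools}
    for _ in range(num_permutations):
        order = list(tools)
        rng.shuffle(order)
        coalition: set = set()
        previous = value(frozenset(coalition))
        for tool in order:
            coalition.add(tool)
            current = value(frozenset(coalition))
            # Marginal contribution of `tool` given the tools added before it.
            totals[tool] += current - previous
            previous = current
    return {tool: totals[tool] / num_permutations for tool in tools}

# Hypothetical per-tool accuracy gains (made up, not results from the paper).
TOY_GAINS = {"qa_model_a": 0.10, "qa_model_b": 0.05, "speech_to_text": 0.08}

def toy_accuracy(coalition: FrozenSet[str]) -> float:
    """Toy stand-in for benchmark accuracy of an agent with these tools."""
    score = 0.50 + sum(TOY_GAINS[tool] for tool in coalition)
    # Assumed synergy: the agent can cross-check a QA answer against a transcript.
    if {"qa_model_a", "speech_to_text"} <= coalition:
        score += 0.04
    return score

phi = monte_carlo_shapley(list(TOY_GAINS), toy_accuracy)
for tool, contribution in phi.items():
    print(f"{tool}: {contribution:.3f}")
```

In a real study, `toy_accuracy` would be replaced by an actual benchmark run for each tool subset, which is why sampling orderings is preferable to enumerating every subset when evaluations are expensive.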
Similar Papers
Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
Sound
Helps computers understand sounds better using special tools.
AutoTool: Efficient Tool Selection for Large Language Model Agents
Artificial Intelligence
Makes smart computer helpers work faster and cheaper.
LALM-Eval: An Open-Source Toolkit for Holistic Evaluation of Large Audio Language Models
Sound
Tests AI that understands sounds faster and better.