FOCAL: A Novel Benchmarking Technique for Multi-modal Agents
By: Aditya Choudhary, Anupam Purwar
With recent advancements in reasoning capabilities, tool calling via MCP servers, and Audio Language Models (ALMs), the development and integration of multi-modal agents (with voice and text support) has come to the industry forefront. Cascading pipelines for voice agents still play a central role in the industry owing to the superior reasoning capabilities facilitated by LLMs. However, cascading pipelines often suffer from error propagation through the pipeline. We propose FOCAL, a framework to benchmark end-to-end reasoning, component-wise error propagation, and error analysis for both automated and human-assisted testing of multi-modal agents (voice-to-voice and text input). We also introduce two novel metrics, viz. Reasoning and Semantic scores, to evaluate the efficacy of an agent in holding meaningful conversations in voice mode.
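The abstract does not define the Semantic score, but a common way to realize such a metric is embedding similarity between the agent's response and a reference answer. Below is a minimal sketch under that assumption; the encoder model, the `semantic_score` function, and the mapping of cosine similarity to a 0-1 range are illustrative choices, not FOCAL's actual implementation.

```python
# Minimal sketch of a semantic-similarity score for agent responses.
# NOTE: illustrative assumption only, not FOCAL's published metric;
# the embedding model and the 0-1 rescaling are hypothetical choices.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def semantic_score(reference: str, response: str) -> float:
    """Score how semantically close a response is to a reference answer,
    mapping cosine similarity from [-1, 1] to [0, 1]."""
    ref_emb, resp_emb = _model.encode([reference, response], convert_to_tensor=True)
    cos = util.cos_sim(ref_emb, resp_emb).item()
    return (cos + 1.0) / 2.0

if __name__ == "__main__":
    # Paraphrased responses should score close to 1.0.
    print(semantic_score(
        "Your order ships tomorrow and arrives in two days.",
        "The package leaves tomorrow; expect delivery within two days.",
    ))
```

One design point worth noting: scoring against a reference answer rather than the user query makes the metric applicable at any stage of a cascading pipeline, which matches the paper's focus on component-wise error propagation.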
Similar Papers
Spoken Conversational Agents with Large Language Models
Computation and Language
Lets computers understand and talk like people.
MultiVox: Benchmarking Voice Assistants for Multimodal Interactions
Multimedia
Lets voice helpers understand feelings and sights.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Computer Vision and Pattern Recognition
Helps computers understand who speaks in videos.