Training-Free Multimodal Large Language Model Orchestration

Published: August 6, 2025 | arXiv ID: 2508.10016v2

By: Tianyu Xie, Yuhang Wu, Yongdong Luo and more

Potential Business Impact:

Lets AI understand and talk using pictures and words.

Different Multimodal Large Language Models (MLLMs) cannot be directly integrated into a unified multimodal input-output system. Previous work has treated training as unavoidable because of challenges in modal alignment, Text-to-Speech efficiency, and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration (MLLM Orchestration), an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, and that skips unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, improves performance by up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduces latency by 10.3%, and significantly enhances interpretability through explicit orchestration processes.
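To make the controller-and-routing idea from the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Orchestrator` class, the keyword rule in `route`, and the three lambda "specialists" are all hypothetical stand-ins. In the paper the routing decision is made by a controller LLM, and the memory is a cross-modal system rather than the plain dictionary used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Orchestrator:
    """Toy controller: picks one specialist per request and caches its output."""
    specialists: Dict[str, Callable[[str], str]]
    memory: Dict[str, str] = field(default_factory=dict)
    calls: List[str] = field(default_factory=list)

    def route(self, request: str) -> str:
        # Assumption: a keyword rule stands in for the controller LLM's decision.
        text = request.lower()
        if "image" in text or "picture" in text:
            return "vision"
        if "say" in text or "speak" in text:
            return "speech"
        return "text"

    def handle(self, request: str) -> str:
        # Reuse a remembered answer instead of calling the specialist again.
        if request in self.memory:
            return self.memory[request]
        name = self.route(request)
        self.calls.append(name)  # record which specialist was actually invoked
        result = self.specialists[name](request)
        self.memory[request] = result
        return result


def make_demo() -> Orchestrator:
    # Hypothetical specialists: each just tags the request with its own name.
    return Orchestrator(specialists={
        "vision": lambda r: f"[vision] {r}",
        "speech": lambda r: f"[speech] {r}",
        "text": lambda r: f"[text] {r}",
    })
```

Calling `handle` twice with the same request invokes the specialist only once. This loosely mirrors the abstract's point about avoiding unnecessary modality calls, though a real system would need a far richer notion of context than exact string matching.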

Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges

Machine Learning (CS)

AI helps machines talk and work together better.

23 Oct 2025 2

90%

A Survey of Generative Categories and Techniques in Multimodal Large Language Models

Multimedia

Computers can now create pictures, music, and videos.

29 May 2025 0

90%

Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Computation and Language

Helps computers understand how people *really* talk.

23 Apr 2025 1

Country of Origin

🇨🇳 China

Page Count

18 pages
