ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Published: January 15, 2026 | arXiv ID: 2601.10323v1

By: Xueyun Tian, Wei Li, Bingbing Xu, and more

Recent omni-multimodal large language models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation, ensuring precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming-format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
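
To make the "speak head" idea concrete, here is a minimal sketch of how a lightweight trigger head, decoupled from the generation head, might sit on top of a streaming backbone. The class and function names (SpeakHead, streaming_step, backbone) and the threshold logic are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SpeakHead(nn.Module):
    """Hypothetical lightweight binary head that decides, at each streaming
    step, whether the assistant should initiate a response. It is separate
    from the language-model head that generates the reply, so the trigger
    decision does not compete with token generation."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Probability that the model should start speaking now,
        # computed from the hidden state of the newest multimodal unit.
        return torch.sigmoid(self.classifier(last_hidden_state))


def streaming_step(backbone, speak_head, unit_embeddings, threshold=0.5):
    """One streaming decision step (sketch): encode the latest synchronized
    audio-video unit, then ask the speak head whether to respond.
    Full response generation is only invoked when the head fires."""
    hidden = backbone(unit_embeddings)        # (batch, seq_len, hidden_dim)
    p_speak = speak_head(hidden[:, -1, :])    # read out the newest position
    return p_speak.item() > threshold
```

In this sketch, the backbone consumes synchronized audio-video units (dense audio features aligned to each discrete frame) and the speak head reads only the latest hidden state, so the trigger decision adds negligible compute per step.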

Category
Computer Science:
Computer Vision and Pattern Recognition