PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding

Published: January 6, 2026 | arXiv ID: 2601.02927v1

By: Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera

Potential Business Impact:

Helps computers detect and explain unusual events in video.

Business Areas:
Image Recognition, Data and Analytics, Software

Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations without relying on instruction tuning, frame-level annotations, external modules, or dense processing, making it an efficient and practical solution for real-world applications.
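
The coarse scoring stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption for illustration: the function names (`cosine`, `anomaly_score`), the split into "normal" and "anomalous" anchor sets, the softmax with a `temperature` parameter, and the toy 2-D vectors standing in for embeddings that would in practice come from the MLLM's encoders. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anomaly_score(frame_emb, normal_anchors, anomalous_anchors, temperature=0.1):
    """Hypothetical frame score: softmax mass assigned to the anomalous anchors.

    Returns a value in [0, 1]; higher means the frame embedding lies closer
    to the anomalous textual anchors than to the normal ones.
    """
    sims_normal = [cosine(frame_emb, a) for a in normal_anchors]
    sims_anomalous = [cosine(frame_emb, a) for a in anomalous_anchors]
    # Subtract the maximum similarity before exponentiating for stability.
    peak = max(sims_normal + sims_anomalous)
    mass_normal = sum(math.exp((s - peak) / temperature) for s in sims_normal)
    mass_anomalous = sum(math.exp((s - peak) / temperature) for s in sims_anomalous)
    return mass_anomalous / (mass_anomalous + mass_normal)


# Toy stand-ins for text-anchor and frame embeddings (assumed, not real data).
normal_anchors = [[1.0, 0.0]]      # e.g. an anchor describing ordinary activity
anomalous_anchors = [[0.0, 1.0]]   # e.g. an anchor describing an abnormal event
frames = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]

scores = [anomaly_score(f, normal_anchors, anomalous_anchors) for f in frames]
for index, score in enumerate(scores):
    print(f"frame {index}: anomaly score {score:.4f}")
```

In this toy setup only the last frame points toward the anomalous anchor, so it receives the highest score. In the system the abstract describes, the anchors themselves would be text optimized by the APE framework, and the MLLM-based refinement stage would then contextualize the coarse scores.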

Country of Origin
🇩🇰 🇪🇸 Denmark, Spain

Page Count
22 pages

Category
Computer Science:
Computer Vision and Pattern Recognition