Uncovering and Mitigating Transient Blindness in Multimodal Model Editing
By: Xiaoqi Han, Ru Li, Ran Yi, and more
Potential Business Impact:
Fixes wrong facts in AI that both sees and reads.
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscuring overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, which uncovers a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visual inputs. Token analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
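The locality-aware adversarial losses are described only at a high level in the abstract. Below is a minimal sketch of how such an objective might look, assuming a PyTorch VQA-style model and cached pre-edit outputs for the random-image, no-image, and consistent-image locality sets; the function names, model signature, and KL-based formulation are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch only: the exact loss in the paper is not given in the
# abstract. Here the edit objective is combined with KL-based locality
# penalties on random-image, no-image, and consistent-image inputs.
import torch
import torch.nn.functional as F

def locality_aware_loss(model, edit_batch, locality_batches, lambdas):
    """Combine the edit objective with locality penalties that keep
    post-edit predictions close to the pre-edit ones on unrelated inputs."""
    # Editing objective: make the model produce the corrected answer.
    edit_logits = model(edit_batch["image"], edit_batch["question"])
    loss = F.cross_entropy(edit_logits, edit_batch["target"])

    # Locality terms: KL divergence between the edited model's outputs and
    # cached pre-edit outputs for each locality data type.
    for name, batch in locality_batches.items():
        post_logits = model(batch["image"], batch["question"])
        pre_log_probs = batch["pre_edit_log_probs"]  # cached before editing
        kl = F.kl_div(
            F.log_softmax(post_logits, dim=-1),
            pre_log_probs,
            log_target=True,
            reduction="batchmean",
        )
        loss = loss + lambdas[name] * kl
    return loss
```

In this sketch, larger weights on the no-image and consistent-image terms would push back against transient blindness by penalizing edits that ignore the visual input on unrelated queries.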
Similar Papers
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models
CV and Pattern Recognition
Updates AI's knowledge without retraining.
MultiMedEdit: A Scenario-Aware Benchmark for Evaluating Knowledge Editing in Medical VQA
Artificial Intelligence
Helps AI learn new medical facts from pictures.
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Computation and Language
Teaches AI to forget private or bad information.