
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Published: June 8, 2025 | arXiv ID: 2506.07227v1

By: Tianyi Bai, Yuxuan Fan, Jiantao Qiu and more

Potential Business Impact:

Teaches computers to see tiny differences in pictures.

Business Areas:

Video Editing, Content and Publishing, Media and Entertainment, Video

Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.
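The abstract does not give the exact form of the feature-level consistency loss, so the following is only a minimal sketch of one plausible reading: penalize the cosine distance between the visual embeddings of an original image and its minimally edited counterpart, and add that penalty to the usual SFT loss. The function names, the choice of cosine distance, and the `weight` parameter are all assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def consistency_loss(emb_original: Sequence[float],
                     emb_edited: Sequence[float]) -> float:
    """Hypothetical consistency term: cosine distance between the two
    embeddings. It is 0 when they point in the same direction and grows
    as they diverge. (Assumed form; the paper may use a different one.)"""
    return 1.0 - cosine_similarity(emb_original, emb_edited)


def total_loss(sft_loss: float,
               emb_original: Sequence[float],
               emb_edited: Sequence[float],
               weight: float = 0.1) -> float:
    """Combine a precomputed SFT loss with the weighted consistency term.
    `weight` is an illustrative hyperparameter, not a value from the paper."""
    return sft_loss + weight * consistency_loss(emb_original, emb_edited)
```

In a real training loop the embeddings would come from the model's vision encoder and the computation would run on tensors in an autodiff framework; plain Python lists are used here only to keep the sketch self-contained.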

Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

CV and Pattern Recognition

Teaches AI to spot and fix fake medical image descriptions.

11 May 2025

Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization

Computation and Language

Fixes AI summaries of doctor talks to be true.

31 May 2025

FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

CV and Pattern Recognition

Edits pictures precisely from your words.

25 Mar 2025


Country of Origin

🇭🇰 🇨🇳 Hong Kong, China

Repos / Data Links

github.com

Page Count

27 pages
