
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring

Published: June 10, 2025 | arXiv ID: 2506.08429v1

By: Mingjie Xu, Andrew Estornell, Hongzheng Yang and more

Potential Business Impact:

Cleans up computer vision data for better understanding.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more comprehensive visual language datasets. However, the effectiveness of VLMs is highly dependent on large-scale, high-quality datasets that ensure precise recognition and accurate reasoning. Two key challenges hinder progress: (1) noisy alignments between images and the corresponding text, which lead to misinterpretation, and (2) ambiguous or misleading text, which obscures visual content. To address these challenges, we propose SCALE (Single modality data quality and Cross modality Alignment Evaluation), a novel quality-driven data selection pipeline for VLM instruction tuning datasets. Specifically, SCALE integrates a cross-modality assessment framework that first assigns each data entry to its appropriate vision-language task, then generates general and task-specific captions (covering scenes, objects, style, etc.), and finally evaluates the alignment, clarity, task rarity, text coherence, and image clarity of each entry based on the generated captions. We reveal that: (1) current unimodal quality assessment methods evaluate one modality while overlooking the rest, which can underestimate samples essential for specific tasks and discard the lower-quality instances that help build model robustness; and (2) appropriately generated image captions provide an efficient way to transfer the image-text multimodal task into a unified text modality.
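
As a rough illustration of the selection step the abstract describes, the sketch below scores each entry on the five named criteria and keeps the top-ranked fraction. The abstract does not say how the scores are produced or combined, so everything here is an assumption for illustration rather than SCALE's actual method: the `Entry` fields, the 0-to-1 score scale, the inverse-frequency `task_rarity`, the unweighted mean in `overall_score`, and the `select_top` helper are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Entry:
    """One instruction-tuning sample with hypothetical scores in [0, 1]."""
    entry_id: str
    task: str              # vision-language task assigned to the entry
    alignment: float       # image-text alignment, judged via generated captions
    clarity: float
    text_coherence: float
    image_clarity: float


def task_rarity(entries):
    """Assumed rarity measure: 1 minus the share of entries with the same task."""
    counts = Counter(e.task for e in entries)
    total = len(entries)
    return {task: 1.0 - n / total for task, n in counts.items()}


def overall_score(entry, rarity):
    """Assumed aggregation: unweighted mean of the five criteria."""
    parts = [
        entry.alignment,
        entry.clarity,
        rarity[entry.task],
        entry.text_coherence,
        entry.image_clarity,
    ]
    return sum(parts) / len(parts)


def select_top(entries, keep_fraction):
    """Keep the highest-scoring fraction of entries (always at least one)."""
    rarity = task_rarity(entries)
    ranked = sorted(entries, key=lambda e: overall_score(e, rarity), reverse=True)
    keep = max(1, round(len(entries) * keep_fraction))
    return ranked[:keep]


# Toy data: two entries for a common task, one for a rarer task.
samples = [
    Entry("a", "vqa", 0.9, 0.8, 0.9, 0.9),
    Entry("b", "vqa", 0.2, 0.3, 0.4, 0.5),
    Entry("c", "ocr", 0.7, 0.7, 0.7, 0.7),
]
print([e.entry_id for e in select_top(samples, keep_fraction=0.67)])
```

In this toy setup the rarity term lifts the single `ocr` entry relative to the more common `vqa` entries, which loosely mirrors the abstract's point that scoring one modality alone can undervalue samples that matter for specific tasks.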

Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

Artificial Intelligence

Helps computers understand pictures and words better.

11 Apr 2025 2

91%

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

CV and Pattern Recognition

Helps computers summarize videos and text together.

14 Apr 2025 2

91%

Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation

CV and Pattern Recognition

Helps computers explain why they give an image a score.

3 Jun 2025 0


Page Count

18 pages

