Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation
By: Jiaer Xia, Bingkui Tong, Yuhang Zang, and more
Potential Business Impact:
Teaches computers to understand charts and tables better.
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.
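The abstract only sketches the method at a high level, so the following is a minimal, illustrative sketch of what a GCoT-style bootstrapping loop could look like: sample CoT traces that attach a bounding box to each reasoning step, keep only traces whose boxes stay grounded in the image, and re-sample the rest. The function names (query_mllm, bootstrap_gcot), the IoU-based grounding check, and all parameters are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of GCoT-style data bootstrapping, based only on the abstract.
# The MLLM call, the grounding check, and all names are assumptions.
from dataclasses import dataclass


@dataclass
class Step:
    text: str                       # one reasoning step
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) region the step refers to


def query_mllm(image_path: str, question: str) -> list[Step]:
    """Hypothetical stub: ask a pre-trained MLLM for CoT steps, each paired
    with a bounding box that grounds the claim in the input image."""
    raise NotImplementedError


def iou(a, b) -> float:
    """Intersection-over-union of two boxes, used here as the grounding check."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def bootstrap_gcot(samples, rounds: int = 3, min_iou: float = 0.5):
    """Collect CoT traces whose steps remain grounded in the image.

    `samples` is an iterable of (image_path, question, reference_boxes), where
    reference_boxes are regions assumed to contain the relevant evidence
    (e.g., chart cells). Traces with poorly grounded steps are re-sampled.
    """
    dataset = []
    for image_path, question, reference_boxes in samples:
        for _ in range(rounds):
            steps = query_mllm(image_path, question)
            grounded = all(
                any(iou(s.box, ref) >= min_iou for ref in reference_boxes)
                for s in steps
            )
            if grounded:  # keep only faithful traces as fine-tuning data
                dataset.append((image_path, question, steps))
                break
    return dataset
```

The resulting grounded traces would then serve as the data-efficient fine-tuning set described in the abstract; the actual verification and bootstrapping criteria are detailed in the paper itself.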
Similar Papers
Grounded Chain-of-Thought for Multimodal Large Language Models
CV and Pattern Recognition
Makes AI understand pictures without making things up.
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
Computation and Language
Helps AI "think step-by-step" to solve harder problems.