In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding
By: Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, and more
Potential Business Impact:
Helps computers understand any kind of chart.
Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: first, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types; second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of the underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark that evaluates not only question-answering at different levels but also understanding of the underlying data. Experimental results demonstrate that ChartScope significantly enhances comprehension across a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.
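The abstract describes synthesizing paired chart-data examples for pre-training. As an illustration only (not ChartScope's actual pipeline), the following minimal Python sketch shows how one might render a random chart with matplotlib and pair it with its raw data and a simple QA item; the chart types, file names, and QA template here are hypothetical assumptions.

```python
# Illustrative sketch only: a minimal example of synthesizing a paired
# (chart image, underlying data, QA) sample. This is NOT the paper's
# pipeline; names and templates below are hypothetical.
import json
import random

import matplotlib.pyplot as plt


def synthesize_pair(chart_type: str, out_prefix: str) -> dict:
    """Render a random chart and save its underlying data alongside it."""
    categories = [f"cat_{i}" for i in range(5)]
    values = [round(random.uniform(1, 100), 1) for _ in categories]

    fig, ax = plt.subplots()
    if chart_type == "bar":
        ax.bar(categories, values)
    elif chart_type == "line":
        ax.plot(categories, values, marker="o")
    elif chart_type == "pie":
        ax.pie(values, labels=categories)
    ax.set_title(f"Synthetic {chart_type} chart")
    fig.savefig(f"{out_prefix}.png")
    plt.close(fig)

    # Pair the rendered image with its raw data and a simple QA item, so a
    # model could be supervised on chart-data alignment as well as reasoning.
    record = {
        "image": f"{out_prefix}.png",
        "chart_type": chart_type,
        "data": dict(zip(categories, values)),
        "qa": {
            "question": "Which category has the highest value?",
            "answer": max(zip(categories, values), key=lambda kv: kv[1])[0],
        },
    }
    with open(f"{out_prefix}.json", "w") as f:
        json.dump(record, f, indent=2)
    return record


if __name__ == "__main__":
    for i, chart_type in enumerate(["bar", "line", "pie"]):
        synthesize_pair(chart_type, out_prefix=f"sample_{i}")
```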
Similar Papers
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Computation and Language
Helps computers understand charts better.
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
Computation and Language
Lets computers check charts faster and cheaper.
Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis
CV and Pattern Recognition
Helps doctors understand cancer treatment images better.