Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration

Published: September 12, 2025 | arXiv ID: 2509.10059v1

By: Yue Zhou, Litong Feng, Mengcheng Lan and more

Potential Business Impact:

Tests if drones can do math from pictures.

Business Areas:

Image Recognition, Data and Analytics, Software

Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV)-based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflect real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing trustworthy UAV-based VLMs in real-world applications. The code and datasets will be released at https://github.com/VisionXLab/avi-math.

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

CV and Pattern Recognition

Tests if computers can do math with pictures.

24 Apr 2025 1

91%

MathSight: A Benchmark Exploring Have Vision-Language Models Really Seen in University-Level Mathematical Reasoning?

CV and Pattern Recognition

Tests if computers *really* see math problems.

28 Nov 2025 0

90%

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

CV and Pattern Recognition

Helps computers solve math problems from videos.

5 Jun 2025 0

Repos / Data Links

github.com

Page Count

17 pages
