RECODE: Reasoning Through Code Generation for Visual Question Answering
By: Junhong Shen, Mu Cai, Bo Hu and more
Potential Business Impact:
Makes computers understand charts by turning them into code.
Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
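The generate, select, and refine loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a "program" is just a list of bar heights, that `render` trivially executes it, and that `propose` and `refine` stand in for calls to an MLLM. All function names and parameters here are hypothetical.

```python
import random
from typing import List

Image = List[float]    # stand-in for pixel data: one value per bar
Program = List[float]  # stand-in for plotting code: the bar heights it draws


def render(program: Program) -> Image:
    """Hypothetical renderer: 'executing' the program yields its bar heights."""
    return list(program)


def critic(target: Image, candidate: Image) -> float:
    """Score a reconstruction by mean absolute difference (lower is better)."""
    return sum(abs(t - c) for t, c in zip(target, candidate)) / len(target)


def propose(target: Image, rng: random.Random, noise: float = 1.0) -> Program:
    """Stand-in for an MLLM's first guess: a noisy reading of the image."""
    return [t + rng.uniform(-noise, noise) for t in target]


def refine(program: Program, target: Image, rate: float = 0.5) -> Program:
    """Stand-in for a refinement step: move toward the target using the render diff."""
    rendered = render(program)
    return [p + rate * (t - r) for p, t, r in zip(program, target, rendered)]


def derender(target: Image, n_candidates: int = 4, n_rounds: int = 3,
             seed: int = 0) -> Program:
    """Generate candidates, keep the most faithful one, then refine it."""
    rng = random.Random(seed)
    candidates = [propose(target, rng) for _ in range(n_candidates)]
    best = min(candidates, key=lambda p: critic(target, render(p)))
    for _ in range(n_rounds):
        refined = refine(best, target)
        # Accept a refinement only if the critic judges it more faithful.
        if critic(target, render(refined)) < critic(target, render(best)):
            best = refined
    return best


if __name__ == "__main__":
    chart = [3.0, 7.0, 5.0]
    program = derender(chart)
    # Once the chart is a symbolic program, a question such as
    # "which bar is tallest?" becomes an exact computation.
    tallest = max(range(len(program)), key=program.__getitem__)
    print("reconstruction error:", critic(chart, render(program)))
    print("tallest bar index:", tallest)
```

In this sketch the critic serves both as the selector among candidates and as the acceptance test for each refinement. That mirrors the idea that reconstruction fidelity can be checked by re-rendering rather than trusted from perception alone.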