Understanding Multi-Agent Reasoning with Large Language Models for Cartoon VQA
By: Tong Wu, Thanet Markchom
Potential Business Impact:
Helps computers understand cartoon questions better.
Visual Question Answering (VQA) for stylised cartoon imagery presents challenges, such as interpreting exaggerated visual abstraction and narrative-driven context, that are not adequately addressed by standard large language models (LLMs) trained on natural images. To investigate this issue, a multi-agent LLM framework is introduced, specifically designed for VQA tasks on cartoon imagery. The proposed architecture consists of three specialised agents: a visual agent, a language agent, and a critic agent, which work collaboratively to support structured reasoning by integrating visual cues and narrative context. The framework was systematically evaluated on two cartoon-based VQA datasets: Pororo and Simpsons. Experimental results provide a detailed analysis of how each agent contributes to the final prediction, offering a deeper understanding of LLM-based multi-agent behaviour in cartoon VQA and multimodal inference.
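The collaborative loop the abstract describes can be pictured as a small pipeline: the visual agent extracts cues, the language agent drafts an answer, and the critic agent vets it. The sketch below is purely illustrative; the paper does not publish an implementation, so all class names, method signatures, and the stubbed agent logic here are hypothetical stand-ins for the underlying models.

```python
# Hypothetical sketch of the three-agent cartoon-VQA loop described in the
# abstract. Each agent is mocked with a simple stub; in the real framework
# these would wrap vision and language models.

class VisualAgent:
    def describe(self, image_id: str) -> str:
        # Stand-in for a vision model that turns exaggerated cartoon
        # visuals into textual scene cues.
        return f"scene cues extracted from {image_id}"

class LanguageAgent:
    def answer(self, question: str, visual_cues: str) -> str:
        # Stand-in for an LLM that combines the question with the visual
        # cues and narrative context to draft an answer.
        return f"draft answer to '{question}' using [{visual_cues}]"

class CriticAgent:
    def accept(self, answer: str) -> bool:
        # Stand-in for a critic that checks whether the draft is grounded
        # in the visual evidence before it becomes the final prediction.
        return "scene cues" in answer

def cartoon_vqa(image_id: str, question: str, max_rounds: int = 2) -> str:
    visual, language, critic = VisualAgent(), LanguageAgent(), CriticAgent()
    cues = visual.describe(image_id)
    answer = language.answer(question, cues)
    for _ in range(max_rounds):
        if critic.accept(answer):
            break
        # Critic rejected the draft: re-query the visual agent and revise.
        cues = visual.describe(image_id)
        answer = language.answer(question, cues)
    return answer

print(cartoon_vqa("pororo_ep01_frame3", "Who is holding the fish?"))
```

The loop structure (draft, critique, revise) is one common reading of "work collaboratively to support structured reasoning"; the actual coordination protocol among the three agents is detailed in the paper itself.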
Similar Papers
Towards Faithful Reasoning in Comics for Small MLLMs
CV and Pattern Recognition
Helps computers understand funny comics and jokes.
A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering
Artificial Intelligence
Helps doctors understand X-rays better and faster.
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Artificial Intelligence
Makes AI understand pictures and facts better.