CogGuide: Human-Like Guidance for Zero-Shot Omni-Modal Reasoning
By: Zhou-Peng Shou, Zhi-Qiang You, Fang Wang, and more
Potential Business Impact:
Helps computers understand pictures and words better.
To address shortcut reasoning and insufficient contextual understanding in the complex cross-modal reasoning of multimodal large models, this paper proposes a zero-shot multimodal reasoning component guided by human-like cognitive strategies centered on an "intent sketch". The component is a plug-and-play pipeline of three modules, an Intent Perceiver, a Strategy Generator, and a Strategy Selector, which together make an explicit "understand, plan, select" cognitive process. By generating and filtering "intent sketch" strategies that guide the final reasoning, it requires no parameter fine-tuning and transfers across models through in-context engineering alone. An information-theoretic analysis shows that this process can reduce conditional entropy and improve information utilization, thereby suppressing unintended shortcut reasoning. Experiments on IntentBench, WorldSense, and Daily-Omni confirm the method's generality and robustness: relative to their respective baselines, the complete three-module scheme yields consistent improvements across different reasoning engines and pipeline combinations, with gains of up to roughly 9.51 percentage points, demonstrating the practical value and portability of the "intent sketch" reasoning component in zero-shot settings.
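The three-module pipeline lends itself to a compact sketch. The following Python outline is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a generic `llm(prompt)` callable standing in for whatever multimodal reasoning engine is plugged in, and the prompt wording, module interfaces, and selection rule are all hypothetical.

```python
# Minimal sketch of the "intent sketch" pipeline (hypothetical interfaces;
# the actual prompts, engines, and selection criteria are described in the paper).

from typing import Callable, List

LLM = Callable[[str], str]  # stand-in for any reasoning engine exposed as text-in/text-out


def intent_perceiver(llm: LLM, question: str, context: str) -> str:
    """'Understand' step: summarize the latent intent behind the question."""
    return llm(
        f"Context:\n{context}\n\nQuestion: {question}\n"
        "Briefly describe what the asker really wants to know."
    )


def strategy_generator(llm: LLM, question: str, intent: str, n: int = 3) -> List[str]:
    """'Plan' step: propose several candidate intent-sketch strategies."""
    return [
        llm(
            f"Intent: {intent}\nQuestion: {question}\n"
            f"Propose reasoning strategy #{i + 1}: which cues to inspect and in what order."
        )
        for i in range(n)
    ]


def strategy_selector(llm: LLM, question: str, strategies: List[str]) -> str:
    """'Select' step: pick the strategy best aligned with the perceived intent."""
    joined = "\n".join(f"[{i}] {s}" for i, s in enumerate(strategies))
    choice = llm(
        f"Question: {question}\nCandidate strategies:\n{joined}\n"
        "Reply with the index of the most suitable strategy."
    )
    idx = int("".join(ch for ch in choice if ch.isdigit()) or 0)
    return strategies[min(idx, len(strategies) - 1)]


def answer_with_intent_sketch(llm: LLM, question: str, context: str) -> str:
    """Full zero-shot pipeline: understand -> plan -> select -> reason."""
    intent = intent_perceiver(llm, question, context)
    strategies = strategy_generator(llm, question, intent)
    best = strategy_selector(llm, question, strategies)
    # The selected sketch is prepended in-context; no parameters are fine-tuned.
    return llm(
        f"Context:\n{context}\nIntent: {intent}\nStrategy: {best}\n"
        f"Now answer the question step by step: {question}"
    )
```

Because the component lives entirely in the prompt, the cross-model transfer claimed in the paper would correspond here to swapping the `llm` callable for a different engine without retraining anything.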
Similar Papers
Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation
Human-Computer Interaction
Helps robots understand what you mean and point at.
GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
CV and Pattern Recognition
Helps computers understand pictures and solve problems.