Revealing Multimodal Causality with Large Language Models
By: Jin Li , Shoujin Wang , Qi Zhang and more
Potential Business Impact:
Finds hidden causes in mixed-up information.
Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.
Similar Papers
Causal MAS: A Survey of Large Language Model Architectures for Discovery and Effect Estimation
Artificial Intelligence
Helps AI understand why things happen.
Towards Multi-modal Graph Large Language Model
Machine Learning (CS)
Teaches computers to understand many kinds of connected information.
Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Computation and Language
Helps computers understand how people *really* talk.