Transformers Provably Learn Directed Acyclic Graphs via Kernel-Guided Mutual Information
By: Yuan Cheng, Yu Huang, Zhe Xiong, and more
Potential Business Impact:
Helps computers find hidden connections in data.
Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs) -- which involve multiple parents per node -- remains challenging, primarily due to the difficulty of designing training objectives that enable different attention heads to separately learn distinct parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric: the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework, where each head is associated with a distinct marginal transition kernel to model diverse parent-child dependencies effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention score patterns at convergence. In addition, when particularizing the $f$-divergence to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
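To make the setup concrete, below is a minimal, illustrative PyTorch sketch, not the paper's actual KG-MI objective: a single-layer attention module with one head per parent slot is trained by gradient ascent on a simple log-likelihood surrogate (standing in for the KL-instantiated objective), after which the converged per-head attention scores are read off as a candidate adjacency matrix. The toy $K$-parent DAG, the noisy modular-sum generating kernel, and all names here are assumptions made purely for illustration.

```python
# Illustrative sketch only (assumptions throughout): recover a toy K-parent
# DAG by reading attention scores of a single-layer, K-head attention model
# trained with gradient ascent on a log-likelihood surrogate.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N, K, VOCAB, D = 8, 2, 3, 16      # nodes, parents per node, alphabet, embed dim
BATCH, STEPS, LR = 256, 400, 0.05

# Fixed toy DAG: node i (i >= K) draws K distinct parents from {0, ..., i-1}.
parents = {i: sorted(torch.randperm(i)[:K].tolist()) for i in range(K, N)}

def sample_batch(b):
    """Each child symbol is a noisy sum (mod VOCAB) of its parents' symbols."""
    x = torch.randint(0, VOCAB, (b, N))
    for i in range(K, N):
        clean = sum(x[:, p] for p in parents[i]) % VOCAB
        noise = torch.rand(b) < 0.1
        x[:, i] = torch.where(noise, torch.randint(0, VOCAB, (b,)), clean)
    return x

# One attention head per parent slot; scores over earlier positions only.
E = torch.nn.Embedding(VOCAB, D)
Q = torch.nn.Parameter(torch.randn(K, D, D) * 0.1)
out = torch.nn.Linear(K * D, VOCAB)
opt = torch.optim.Adam(list(E.parameters()) + [Q] + list(out.parameters()), lr=LR)

mask = torch.tril(torch.ones(N, N), diagonal=-1)  # causal: attend to predecessors

def attention(x):
    h = E(x)                                       # (b, N, D)
    heads, scores = [], []
    for k in range(K):
        logits = (h @ Q[k]) @ h.transpose(1, 2)    # (b, N, N)
        logits = logits.masked_fill(mask == 0, -1e9)
        a = F.softmax(logits, dim=-1)
        heads.append(a @ h)
        scores.append(a)
    return torch.cat(heads, dim=-1), scores

for _ in range(STEPS):
    x = sample_batch(BATCH)
    ctx, _ = attention(x)
    logp = F.log_softmax(out(ctx), dim=-1)
    # Gradient *ascent* on the average log-likelihood of each non-root node.
    ll = logp[:, K:].gather(-1, x[:, K:, None]).mean()
    opt.zero_grad(); (-ll).backward(); opt.step()

# Read a candidate adjacency matrix off the converged attention scores.
with torch.no_grad():
    _, scores = attention(sample_batch(512))
    for i in range(K, N):
        found = sorted({s[:, i].mean(0).argmax().item() for s in scores})
        print(f"node {i}: true parents {parents[i]}, top attention {found}")
```

On the toy data above, each head's batch-averaged attention row for a node tends to concentrate on one of its parents, which is the qualitative pattern the abstract describes for the KL instantiation; the paper's guarantees concern its KG-MI objective, not this surrogate.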
Similar Papers
Information Gradient for Directed Acyclic Graphs: A Score-based Framework for End-to-End Mutual Information Maximization
Information Theory
Helps computers learn to send and receive information better.
Efficient Knowledge Tracing Leveraging Higher-Order Information in Integrated Graphs
Machine Learning (CS)
Makes online learning faster and cheaper.
Attention Beyond Neighborhoods: Reviving Transformer for Graph Clustering
Machine Learning (CS)
Helps computers group similar things by looking at connections.