TreeDiff: AST-Guided Code Generation with Diffusion LLMs
By: Yiming Zeng, Jinghan Cao, Zexin Li, and more
Potential Business Impact:
Helps computers write correct computer code.
Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model's ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
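The core idea of the abstract, corrupting syntactically meaningful spans derived from AST subtrees rather than masking individual tokens at random, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `ast_span_corrupt`, the `<MASK>` token, and the choice of candidate nodes are assumptions made for the example; it uses Python's standard `ast` module to pick a subtree and mask the exact source span it covers.

```python
import ast
import random

def ast_span_corrupt(source: str, mask_token: str = "<MASK>", seed: int = 0) -> str:
    """Mask the source span covered by a randomly chosen AST subtree.

    Hypothetical sketch of syntax-aware corruption: instead of masking
    random tokens, we corrupt one grammatical unit (statement or
    expression) so the denoiser must reconstruct it whole.
    """
    rng = random.Random(seed)
    tree = ast.parse(source)
    # Statements and expressions carry precise source positions,
    # so their subtrees map cleanly back to character spans.
    candidates = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.stmt, ast.expr))
    ]
    node = rng.choice(candidates)
    span = ast.get_source_segment(source, node)
    # Replace the chosen span with a single mask token; a diffusion
    # model would be trained to denoise (reconstruct) this span.
    return source.replace(span, mask_token, 1)

code = "def add(a, b):\n    return a + b\n"
corrupted = ast_span_corrupt(code)
```

Because the corrupted region always aligns with a grammatical boundary (e.g. a whole `return` statement or expression), the training signal respects the hierarchical structure the abstract emphasizes, rather than cutting through the middle of a syntactic unit as random token masking can.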
Similar Papers
Syntax-Guided Diffusion Language Models with User-Integrated Personalization
Computation and Language
Writes stories that sound like you wrote them.
Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation
Software Engineering
Makes computers write code much faster and better.
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
Computation and Language
Fixes computer mistakes when writing stories.