Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis
By: Naizhu Jin, Zhong Li, Guang Yang, and more
Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, evaluating five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark with 12 programming languages, and six models from 7B to 480B parameters, using conditional mutual information $I(Y;C|X)$ as a conceptual lens. Our results show that externally guided CoT consistently outperforms direct generation, with structured methods improving Pass@1 by 5--12\% on average while using substantially fewer tokens than reflective reasoning, and that CoT benefits depend on language type systems and model capacity. We further find that reasoning \emph{quality} is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
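For concreteness, the conditional-mutual-information lens measures how much a chain of thought $C$ reduces uncertainty about the generated code $Y$ beyond what the prompt $X$ already provides: $I(Y;C|X) = H(Y|X) - H(Y|X,C)$, which is zero when the reasoning carries no additional information about the solution. The Pass@1 figures quoted above are presumably computed with the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); the exact evaluation harness used by the authors is an assumption here, but a minimal sketch is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are functionally correct, passes the unit tests."""
    if n - c < k:  # every size-k draw must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 pass the tests -> Pass@1 estimate of 0.3
print(pass_at_k(n=10, c=3, k=1))
```

Averaging this quantity over all benchmark problems yields the Pass@1 scores that the reported 5--12\% improvements refer to.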