Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code
By: Mootez Saad, Boqi Chen, José Antonio Hernández López and more
Potential Business Impact:
Computers struggle to fix messy code on their own.
Large language models (LLMs) are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family (14B, 32B, 70B) under varying levels of guidance, from simple Verification to Guided and Open-ended Generation, while varying contextual noise by injecting distractor elements. While models exhibit a solid baseline understanding of both concepts in ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over 50%. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation. However, this resilience also fails when all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing cognitive shortcutting for coupling versus a more exhaustive (yet still failing) analysis for cohesion. To summarize, while LLMs can provide reliable assistance for recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program understanding capabilities.
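The abstract reports results as F1 scores but does not say how they are computed. The Python sketch below shows one plausible reading, assuming a set-based F1 between the elements a model flags and the expected elements, and assuming the reported drop is relative to the clean-context score. The function name, the module names, and both answer sets are hypothetical and are not taken from the paper.

```python
def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1 between the flagged items and the expected items."""
    if not predicted or not expected:
        return 0.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth: modules a class is actually coupled to.
expected = {"OrderService", "PaymentGateway", "AuditLog"}

# Hypothetical model answers (illustrative only, not results from the paper).
clean_answer = {"OrderService", "PaymentGateway", "AuditLog"}
noisy_answer = {"OrderService", "StringUtils", "DateHelper"}  # two distractors

clean_f1 = f1_score(clean_answer, expected)
noisy_f1 = f1_score(noisy_answer, expected)

# Relative drop in F1 when distractors mislead the answer.
relative_drop = (clean_f1 - noisy_f1) / clean_f1

print(f"clean F1 = {clean_f1:.2f}, noisy F1 = {noisy_f1:.2f}, "
      f"relative drop = {relative_drop:.0%}")
```

In this toy case, replacing two of the three correct modules with distractors lowers both precision and recall to one third, so the F1 score falls by about two thirds. This illustrates how a drop of more than 50% can arise; it does not reproduce the study's measurements.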
Similar Papers
Holistic Evaluation of State-of-the-Art LLMs for Code Generation
Software Engineering
Makes computers write better, error-free code.
Understanding the Role of Large Language Models in Software Engineering: Evidence from an Industry Survey
Software Engineering
Helps coders write better programs faster.
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Software Engineering
Makes computers better at understanding language and code.