
Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code

Published: November 25, 2025 | arXiv ID: 2511.20933v1

By: Mootez Saad, Boqi Chen, José Antonio Hernández López and more

Potential Business Impact:

Computers struggle to fix messy code on their own.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family (14B, 32B, 70B) under varying levels of guidance, from simple Verification to Guided and Open-ended Generation, while varying contextual noise by injecting distractor elements. While models exhibit a solid baseline understanding of both concepts in ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over 50%. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation. However, this resilience also fails when all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing cognitive shortcutting for coupling versus a more exhaustive (yet still failing) analysis for cohesion. To summarize, while LLMs can provide reliable assistance for recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program understanding capabilities.
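
To make the reported F1 drop concrete, here is a minimal sketch of how an open-ended coupling answer could be scored against a known ground truth. The set-based F1, the module names, and both example answers are assumptions made for illustration; the abstract does not describe the paper's actual scoring procedure or data.

```python
def f1_score(predicted: set[str], actual: set[str]) -> float:
    """Set-based F1 between predicted items and ground-truth items."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def relative_drop(clean: float, noisy: float) -> float:
    """Fractional decrease going from the clean score to the noisy score."""
    if clean == 0:
        raise ValueError("clean score must be non-zero")
    return (clean - noisy) / clean


# Hypothetical ground truth: modules the fragment is actually coupled to.
actual_coupled = {"OrderService", "PaymentGateway", "InventoryStore"}

# Hypothetical model answers, without and with injected distractor elements.
clean_answer = {"OrderService", "PaymentGateway", "InventoryStore"}
noisy_answer = {"OrderService", "Logger", "MetricsClient", "StringUtils"}

clean_f1 = f1_score(clean_answer, actual_coupled)
noisy_f1 = f1_score(noisy_answer, actual_coupled)
drop = relative_drop(clean_f1, noisy_f1)

print(f"clean F1 = {clean_f1:.3f}, noisy F1 = {noisy_f1:.3f}, drop = {drop:.1%}")
```

In this made-up case the model latches onto distractors and misses two real dependencies, so precision and recall both fall and the relative drop exceeds one half, which is the kind of collapse the abstract describes for noisy, open-ended coupling tasks.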

Holistic Evaluation of State-of-the-Art LLMs for Code Generation

Software Engineering

Makes computers write better, error-free code.

19 Dec 2025 1

91%

Understanding the Role of Large Language Models in Software Engineering: Evidence from an Industry Survey

Software Engineering

Helps coders write better programs faster.

19 Dec 2025 0

91%

Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models

Software Engineering

Makes computers better at understanding language and code.

4 Dec 2025 1


Country of Origin

🇸🇪 🇨🇦 🇪🇸 Sweden, Spain, Canada

Page Count

21 pages
