Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML
By: Yash Mundhra, Max Valk, Maliheh Izadi
Potential Business Impact:
Helps computers write complex factory code.
Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML's proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
Similar Papers
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Software Engineering
Makes computers better at understanding language and code.
An Experimental Study of Real-Life LLM-Proposed Performance Improvements
Software Engineering
Computers write faster code, but humans write best.
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications
Software Engineering
Computers can't always tell if code matches instructions.