Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
By: Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam
Potential Business Impact:
Finds made-up statements in AI-written descriptions of code changes.
Language models have shown strong capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, whose format is structurally complex and context-dependent, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical code-change-to-natural-language generation tasks: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. While commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics contribute effectively to hallucination detection, showing promise for inference-time detection. All code and data will be released upon acceptance.
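The abstract's idea of combining several weak metric signals into one stronger detector can be sketched as a simple weighted vote. The metric names, weights, and threshold below are illustrative assumptions, not the paper's actual setup or results.

```python
# Hedged sketch: combining per-sample detection signals into a single
# hallucination score. All names, weights, and the threshold are
# hypothetical; the paper's actual metrics and combination differ.
from dataclasses import dataclass


@dataclass
class MetricScores:
    similarity: float   # overlap with the code change, 0..1 (higher = more grounded)
    confidence: float   # model's mean token probability, 0..1
    attribution: float  # share of feature-attribution mass on the input code, 0..1


def hallucination_score(m: MetricScores,
                        weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination: low grounding on each signal raises the score."""
    return (weights[0] * (1 - m.similarity)
            + weights[1] * (1 - m.confidence)
            + weights[2] * (1 - m.attribution))


def is_hallucination(m: MetricScores, threshold: float = 0.5) -> bool:
    return hallucination_score(m) > threshold


# A generation all three signals agree is well grounded:
ok = MetricScores(similarity=0.8, confidence=0.9, attribution=0.7)
# A generation every signal flags:
bad = MetricScores(similarity=0.2, confidence=0.3, attribution=0.1)
print(is_hallucination(ok), is_hallucination(bad))  # False True
```

In practice the weights and threshold would be fit on labeled data (e.g. with a small classifier) rather than hand-picked, which is what makes the combined detector outperform any single metric.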
Similar Papers
A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI
Software Engineering
Fixes computer code mistakes made by AI.
Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges
Software Engineering
Finds and fixes mistakes in computer code.
Hallucination in LLM-Based Code Generation: An Automotive Case Study
Software Engineering
Helps computers write car software correctly.