VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination
By: Zeng Wang, Minghao Shao, Jitendra Bhandari, et al.
Potential Business Impact:
Reveals whether the benchmarks used to judge AI-generated hardware (Verilog) code have leaked into model training data, so reported LLM chip-design results can be trusted.
Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination - where benchmark data inadvertently leaks into pre-training or fine-tuning datasets - raise questions about the validity of these evaluations. While this issue is known and already limits the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), in baseline and fine-tuned models (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs for code quality vs. fairness (i.e., reducing contamination toward unbiased benchmarking).
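To make one of the two detection methods named above concrete, below is a minimal sketch of the Min-K% Prob membership signal applied to a Verilog snippet: score a benchmark text by the average log-probability of its k% least-likely tokens under the model; unusually high scores suggest the text was seen during training. The model choice (GPT-2, one of the models the study covers), the value of k, the snippet, and the absence of a calibrated threshold are illustrative assumptions, not the paper's exact configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # illustrative; the study also covers larger commercial and open-source LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    """Average log-probability of the k% least-likely tokens of `text`.

    A high score means even the model's "most surprising" tokens are not very
    surprising, which is evidence the text appeared in the training data.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned to each actual next token given its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the k% lowest-probability tokens and average them.
    n = max(1, int(token_log_probs.numel() * k))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

# Score a made-up VerilogEval-style reference solution; in practice the score is
# compared against a threshold calibrated on text known to be unseen by the model.
verilog_snippet = """module top_module(input a, input b, output out);
  assign out = a & b;
endmodule"""
print(f"Min-K% Prob score: {min_k_prob(verilog_snippet):.3f}")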
Similar Papers
Rethinking the effects of data contamination in Code Intelligence
Software Engineering
Examines how data contamination skews evaluations of code intelligence models.
Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation
Artificial Intelligence
Studies contamination in multimodal reasoning LLMs and proposes dynamic evaluation.
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
Software Engineering
Benchmarks the reasoning of code LLMs dynamically to control for data contamination.