C-ing Clearly: Enhanced Binary Code Explanations using C code
By: Teodor Poncu, Ioana Pintilie, Marius Dragoi and more
Potential Business Impact:
Teaches computers to understand low-level assembly code by using the matching C code.
Large Language Models (LLMs) typically perform better on coding tasks in high-level programming languages than in lower-level ones such as assembly. We propose a synthetic data generation method named C-ing Clearly, which leverages the corresponding C code to enhance an LLM's understanding of assembly. By fine-tuning on data generated through our method, we demonstrate improved LLM performance on binary code summarization and vulnerability detection. The gains are consistent across different LLM families and model sizes.
Similar Papers
On Code-Induced Reasoning in LLMs
Computation and Language
Code's structure helps computers think better than its meaning.
Strengthening Programming Comprehension in Large Language Models through Code Generation
Software Engineering
Teaches computers to understand code better.
LLM-CSEC: Empirical Evaluation of Security in C/C++ Code Generated by Large Language Models
Artificial Intelligence
Finds security problems in computer code made by AI.