LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages
By: Patrick Diehl, Nojoud Nader, Maxim Moraru, and more
Potential Business Impact:
AI writes simple code well, but needs human help for hard jobs.
The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between these languages. Our analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code, along with the quality of the automatically generated code, documentation, and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, requiring considerable manual corrections. We identify key limitations and suggest areas for future improvement to better leverage AI-driven automation in scientific computing workflows.
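The evaluation loop the abstract describes, prompting the model and then checking whether the output compiles, runs, and produces correct results, can be illustrated with a minimal harness. The sketch below is not the authors' benchmark code: the prompt, the reference output, and the query_model stub are illustrative assumptions, and a real setup would replace the stub with an actual call to Llama 2-70B.

```python
"""Minimal generate -> compile -> run -> check sketch for one test problem."""
import subprocess
import tempfile
from pathlib import Path

# Illustrative prompt and expected result; not taken from the paper.
PROMPT = "Write a C program that prints the sum of the integers 1..100."
REFERENCE_OUTPUT = "5050"


def query_model(prompt: str) -> str:
    # Placeholder: swap in a real call to Llama 2-70B (e.g. via an
    # inference server). A fixed solution is returned here so the
    # sketch runs end to end.
    return (
        "#include <stdio.h>\n"
        "int main(void) {\n"
        "    int s = 0;\n"
        "    for (int i = 1; i <= 100; ++i) s += i;\n"
        "    printf(\"%d\\n\", s);\n"
        "    return 0;\n"
        "}\n"
    )


def evaluate(source: str) -> dict:
    """Compile the generated C source, run it, and compare its output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.c"
        exe = Path(tmp) / "candidate"
        src.write_text(source)

        compile_res = subprocess.run(
            ["gcc", str(src), "-o", str(exe)],
            capture_output=True, text=True,
        )
        if compile_res.returncode != 0:
            return {"compiles": False, "runs": False, "correct": False}

        run_res = subprocess.run(
            [str(exe)], capture_output=True, text=True, timeout=10,
        )
        runs = run_res.returncode == 0
        correct = runs and run_res.stdout.strip() == REFERENCE_OUTPUT
        return {"compiles": True, "runs": runs, "correct": correct}


if __name__ == "__main__":
    code = query_model(PROMPT)
    print(evaluate(code))
```

In a full benchmark this loop would be repeated over many test problems and languages, with the compile, run, and correctness flags aggregated into the kind of statistics the paper reports.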
Similar Papers
ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
Artificial Intelligence
Helps computers write code from new science papers.
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications
Software Engineering
Lets anyone write computer programs with plain English.
Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Software Engineering
Makes computers better at understanding language and code.