Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
By: Gunjan Das, Paheli Bhattacharya, Rishabh Gupta
Potential Business Impact:
Makes computers better at understanding language and code.
Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision, yielding measurable performance gains even on non-coding tasks, in contrast to general-purpose models such as Mistral-7B and Llama-3-8B.
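To make the CoNaLa code-explanation setup concrete, here is a minimal sketch of how one might prompt a model on CoNaLa-style intent/snippet pairs using standard Hugging Face tooling. This is not the authors' evaluation harness: the dataset config, prompt wording, and the CodeLlama checkpoint are assumptions chosen for illustration.

```python
# Hypothetical probe of a code-specialized LLM on CoNaLa code-explanation prompts.
# Assumes the public "neulab/conala" dataset (curated split) and a CodeLlama
# instruct checkpoint; neither is confirmed as the paper's exact setup.
from datasets import load_dataset
from transformers import pipeline

# CoNaLa pairs natural-language intents with short Python snippets.
conala = load_dataset("neulab/conala", "curated", split="test")

# Any evaluated model could be substituted here (e.g., Mistral-7B, Llama-3-8B).
generator = pipeline("text-generation", model="codellama/CodeLlama-7b-Instruct-hf")

for example in conala.select(range(3)):
    prompt = (
        "Explain what the following Python snippet does in one sentence.\n"
        f"Code: {example['snippet']}\n"
        "Explanation:"
    )
    output = generator(prompt, max_new_tokens=64, do_sample=False)
    print(output[0]["generated_text"])
```

The generated explanations could then be scored against the dataset's rewritten intents with any text-similarity metric, which is one plausible way to compare natural-language and code-specialized models on this task.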
Similar Papers
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence
Software Engineering
Helps computers write computer programs from words.
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence
Software Engineering
Helps computers write computer programs from words.