Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
By: Arina Caraus, Alessio Buscemi, Sumit Kumar, and more
Potential Business Impact:
Tests open-source AI models on technical telecom questions.
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
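The abstract names three scoring stages: lexical overlap, semantic similarity, and LLM-as-a-judge with score variance as a reliability signal. Below is a minimal sketch of that pipeline, assuming ROUGE-L as the lexical metric and the all-MiniLM-L6-v2 sentence-transformer for embeddings (the paper's exact metric and model choices are not stated here); judge_score is a hypothetical placeholder for a call to a judge model.

```python
"""Sketch of the evaluation stages described in the abstract.

Assumptions not confirmed by the paper: ROUGE-L for the lexical metric,
all-MiniLM-L6-v2 for semantic similarity, and five repeated judge calls
for score variance. The Q-A pair is an illustrative telecom example.
"""
import statistics

from rouge_score import rouge_scorer                          # pip install rouge-score
from sentence_transformers import SentenceTransformer, util   # pip install sentence-transformers

reference = ("The cyclic prefix in OFDM absorbs multipath delay spread, "
             "turning linear convolution with the channel into circular convolution.")
candidate = ("OFDM adds a cyclic prefix so multipath echoes do not cause "
             "inter-symbol interference; the channel then acts as a circular convolution.")

# 1. Lexical metric: ROUGE-L F1 between the reference answer and the model answer.
lexical = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = lexical.score(reference, candidate)["rougeL"].fmeasure

# 2. Semantic similarity: cosine similarity of sentence embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb, cand_emb = embedder.encode([reference, candidate], convert_to_tensor=True)
semantic = util.cos_sim(ref_emb, cand_emb).item()

# 3. LLM-as-a-judge: rate the same answer several times and use the
#    variance across runs as a judgment-reliability signal.
def judge_score(question: str, reference: str, answer: str) -> float:
    # Hypothetical wrapper; replace with a prompt to your judge model
    # that returns a correctness score in [0, 1].
    raise NotImplementedError

try:
    judge_runs = [judge_score("Explain the OFDM cyclic prefix.", reference, candidate)
                  for _ in range(5)]
except NotImplementedError:
    judge_runs = [0.8, 0.9, 0.8, 0.7, 0.9]  # placeholder scores for illustration only

print(f"ROUGE-L F1:            {rouge_l:.3f}")
print(f"Semantic similarity:   {semantic:.3f}")
print(f"Judge mean / variance: {statistics.mean(judge_runs):.3f} / "
      f"{statistics.pvariance(judge_runs):.3f}")
```

Running the three stages over all 105 question-answer pairs, once per model, would yield per-model distributions like those the abstract summarizes: semantic fidelity and judge-rated correctness favoring Gemma, lexical consistency slightly favoring DeepSeek.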
Similar Papers
Evaluation of LLMs for mathematical problem solving
Artificial Intelligence
Evaluates how well AI solves math problems.
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
Digital Libraries
Helps smaller AI judge research papers as well as big AI.
Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects
Computation and Language
Compares AI models for better text, code, and image use.