Evaluating Open-Source Large Language Models for Technical Telecom Question Answering
By: Arina Caraus, Alessio Buscemi, Sumit Kumar, and more
Potential Business Impact:
Tests open-source AI models on technical telecom questions.
Large Language Models (LLMs) have shown remarkable capabilities across various fields. However, their performance in technical domains such as telecommunications remains underexplored. This paper evaluates two open-source LLMs, Gemma 3 27B and DeepSeek R1 32B, on factual and reasoning-based questions derived from advanced wireless communications material. We construct a benchmark of 105 question-answer pairs and assess performance using lexical metrics, semantic similarity, and LLM-as-a-judge scoring. We also analyze consistency, judgment reliability, and hallucination through source attribution and score variance. Results show that Gemma excels in semantic fidelity and LLM-rated correctness, while DeepSeek demonstrates slightly higher lexical consistency. Additional findings highlight current limitations in telecom applications and the need for domain-adapted models to support trustworthy Artificial Intelligence (AI) assistants in engineering.
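The abstract names three scoring stages: lexical overlap, semantic similarity, and LLM-as-a-judge with score variance as a reliability signal. Below is a minimal sketch of that pipeline, assuming ROUGE-L as the lexical metric and the all-MiniLM-L6-v2 sentence-transformer for embeddings (the paper's exact metric and model choices are not stated here); judge_score is a hypothetical placeholder for a call to a judge model.

```python
"""Sketch of the evaluation stages described in the abstract.

Assumptions not confirmed by the paper: ROUGE-L for the lexical metric,
all-MiniLM-L6-v2 for semantic similarity, and five repeated judge calls
for score variance. The Q-A pair is an illustrative telecom example.
"""
import statistics

from rouge_score import rouge_scorer                          # pip install rouge-score
from sentence_transformers import SentenceTransformer, util   # pip install sentence-transformers

reference = ("The cyclic prefix in OFDM absorbs multipath delay spread, "
             "turning linear convolution with the channel into circular convolution.")
candidate = ("OFDM adds a cyclic prefix so multipath echoes do not cause "
             "inter-symbol interference; the channel then acts as a circular convolution.")

# 1. Lexical metric: ROUGE-L F1 between the reference answer and the model answer.
lexical = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = lexical.score(reference, candidate)["rougeL"].fmeasure

# 2. Semantic similarity: cosine similarity of sentence embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb, cand_emb = embedder.encode([reference, candidate], convert_to_tensor=True)
semantic = util.cos_sim(ref_emb, cand_emb).item()

# 3. LLM-as-a-judge: rate the same answer several times and use the
#    variance across runs as a judgment-reliability signal.
def judge_score(question: str, reference: str, answer: str) -> float:
    # Hypothetical wrapper; replace with a prompt to your judge model
    # that returns a correctness score in [0, 1].
    raise NotImplementedError

try:
    judge_runs = [judge_score("Explain the OFDM cyclic prefix.", reference, candidate)
                  for _ in range(5)]
except NotImplementedError:
    judge_runs = [0.8, 0.9, 0.8, 0.7, 0.9]  # placeholder scores for illustration only

print(f"ROUGE-L F1:            {rouge_l:.3f}")
print(f"Semantic similarity:   {semantic:.3f}")
print(f"Judge mean / variance: {statistics.mean(judge_runs):.3f} / "
      f"{statistics.pvariance(judge_runs):.3f}")
```

Running the three stages over all 105 question-answer pairs, once per model, would yield per-model distributions like those the abstract summarizes: semantic fidelity and judge-rated correctness favoring Gemma, lexical consistency slightly favoring DeepSeek.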
Similar Papers
Evaluation of LLMs for mathematical problem solving
Artificial Intelligence
Evaluates how well AI solves math problems.
Can Small and Reasoning Large Language Models Score Journal Articles for Research Quality and Do Averaging and Few-shot Help?
Digital Libraries
Helps smaller AI judge research papers as well as big AI.
Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects
Computation and Language
Compares AI models for better text, code, and image use.