Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
By: Ranul Dayarathne, Uvini Ranaweera, Upeksha Ganegoda
Potential Business Impact:
Helps AI answer questions more truthfully and accurately.
Retrieval-Augmented Generation (RAG) is emerging as a powerful technique for enhancing generative AI models by reducing hallucination. The growing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing how different LLMs perform on question-answering (QA) across diverse domains. This study compares four open-source LLMs (Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct, and Orca-mini-v3-7b) and OpenAI's widely used GPT-3.5 on QA tasks over the computer science literature, with RAG support. Binary questions are evaluated with accuracy and precision, while long-answer questions are evaluated with cosine similarity alongside rankings from a human expert and from Google's Gemini model. GPT-3.5 paired with RAG answers both binary and long-answer questions effectively, reaffirming its status as an advanced LLM. Among the open-source models, Mistral AI's Mistral-7b-instruct paired with RAG outperforms the rest on both question types, while Orca-mini-v3-7b records the shortest average response latency and Meta's LLaMa2-7b-chat the longest. The findings underscore that, given adequate infrastructure, open-source LLMs can keep pace with proprietary models such as GPT-3.5.
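To make the evaluation setup concrete, the sketch below shows one plausible way these metrics could be computed. It is an illustrative example only: it assumes scikit-learn for the binary-question metrics, and the toy vectors stand in for sentence embeddings of a reference answer and an LLM-generated answer; the paper's actual evaluation code and embedding model are not specified here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# --- Binary (yes/no) questions: accuracy and precision ---
# Toy gold labels vs. model predictions (1 = "yes", 0 = "no").
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
print("accuracy: ", accuracy_score(gold, pred))   # fraction of questions answered correctly
print("precision:", precision_score(gold, pred))  # correct "yes" answers / predicted "yes" answers

# --- Long-answer questions: cosine similarity of answer embeddings ---
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors stand in for embeddings of the reference answer and the
# generated answer; any sentence-embedding model could produce these.
reference_emb = np.array([0.21, 0.70, 0.12])
generated_emb = np.array([0.25, 0.61, 0.15])
print("cosine similarity:", cosine_sim(reference_emb, generated_emb))
```

A score near 1.0 indicates the generated long answer is semantically close to the reference; the human-expert and Gemini rankings described above complement this automatic measure with judgment-based comparisons.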
Similar Papers
Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study
Artificial Intelligence
Makes AI answers for school more truthful.
Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
Computation and Language
Helps computers answer questions from manuals better.
Knowledge-Graph Based RAG System Evaluation Framework
Computation and Language
Tests AI answers by checking the reasoning behind them.