Advancing Academic Chatbots: Evaluation of Non-Traditional Outputs
By: Nicole Favero, Francesca Salute, Daniel Hardt
Potential Business Impact:
Helps AI create better presentations and scripts.
Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's LLaMA 3 70B (open-weight) and OpenAI's GPT-4o mini (API-based). QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT-4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT-4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.
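To make the "hybrid keyword-semantic search" idea behind Advanced RAG concrete, here is a minimal sketch of how such a retriever can be assembled. This is not the paper's implementation: the libraries (rank_bm25, sentence-transformers), the embedding model name, the mixing weight alpha, and the function hybrid_retrieve are all illustrative assumptions.

```python
# Hedged sketch of a hybrid keyword-semantic retriever ("Advanced RAG"-style).
# Assumes rank_bm25 and sentence-transformers; not the authors' actual code.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


def hybrid_retrieve(query, documents, top_k=3, alpha=0.5):
    """Blend BM25 keyword scores with dense cosine similarity."""
    # Keyword channel: BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    kw_scores = np.array(bm25.get_scores(query.lower().split()))

    # Semantic channel: cosine similarity of normalized sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(documents, normalize_embeddings=True)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    sem_scores = doc_emb @ q_emb

    # Min-max normalize each channel, then mix with weight alpha.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    scores = alpha * norm(kw_scores) + (1 - alpha) * norm(sem_scores)
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(documents[i], float(scores[i])) for i in ranked]


if __name__ == "__main__":
    docs = [
        "Graph RAG builds a knowledge graph over course documents.",
        "Advanced RAG combines keyword and semantic search for retrieval.",
        "Slide decks and podcast scripts are non-traditional academic outputs.",
    ]
    for doc, score in hybrid_retrieve("How does hybrid retrieval work?", docs):
        print(f"{score:.3f}  {doc}")
```

The retrieved passages would then be passed as grounding context to the generating model (e.g., GPT-4o mini or LLaMA 3 70B) for QA, slide, or podcast generation.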
Similar Papers
Aligning LLMs for the Classroom with Knowledge-Based Retrieval -- A Comparative RAG Study
Artificial Intelligence
Makes AI answers for school more truthful.
Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
Computation and Language
Helps AI answer questions more truthfully and accurately.
Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs
Artificial Intelligence
Helps AI answer science questions accurately.