REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
By: Priyanka Mudgal
Potential Business Impact:
Automatically grades how well AI-generated summaries capture what matters in computer logs, without needing human-written reference summaries.
Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and distinguishes model outputs more effectively than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.
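To make the LLM-as-judge idea concrete, here is a minimal sketch of reference-free scoring in the spirit of REFLEX. It assumes an OpenAI-style chat completion API; the model name, prompt wording, 1-5 scale, and JSON output format are illustrative choices, not the paper's exact protocol.

```python
# Minimal reference-free LLM-as-judge sketch (illustrative, not the paper's exact setup).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIMENSIONS = ["relevance", "informativeness", "coherence"]

PROMPT_TEMPLATE = """You are evaluating a summary of a system log.

Log excerpt:
{log}

Candidate summary:
{summary}

Rate the summary from 1 (poor) to 5 (excellent) on each dimension:
relevance, informativeness, coherence. Reply with JSON only, e.g.
{{"relevance": 4, "informativeness": 3, "coherence": 5}}."""


def judge_summary(log_text: str, summary: str, model: str = "gpt-4o-mini") -> dict:
    """Score a log summary on each quality dimension without a reference summary."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judgments help keep scores stable
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(log=log_text, summary=summary),
        }],
    )
    # Assumes the model follows the JSON-only instruction; a robust version
    # would strip code fences or retry on parse errors.
    scores = json.loads(response.choices[0].message.content)
    return {dim: scores[dim] for dim in DIMENSIONS}


if __name__ == "__main__":
    log = "2024-03-01 12:00:01 ERROR db: connection timeout after 30s (retry 3/3)"
    cand = "The database repeatedly failed to connect due to timeouts."
    print(judge_summary(log, cand))
```

Because no gold reference enters the prompt, the same procedure can score summaries of any log source; averaging per-dimension scores over a dataset then yields a system-level comparison in place of ROUGE or BLEU.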
Similar Papers
A Critical Study of Automatic Evaluation in Sign Language Translation
Computation and Language
Helps computers judge sign language videos better.
Learning from Self Critique and Refinement for Faithful LLM Summarization
Computation and Language
Teaches AI to write summaries without making things up.
Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews
Computation and Language
Helps AI understand what makes a science paper good.