Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation
By: Aris Hofmann, Inge Vejsbjerg, Dhaval Salwala, and more
Potential Business Impact:
Makes AI tests easier to understand and compare.
We present Auto-BenchmarkCard, a workflow for generating validated descriptions of AI benchmarks. Benchmark documentation is often incomplete or inconsistent, making it difficult to interpret and compare benchmarks across tasks or domains. Auto-BenchmarkCard addresses this gap by combining multi-agent data extraction from heterogeneous sources (e.g., Hugging Face, Unitxt, academic papers) with LLM-driven synthesis. A validation phase evaluates factual accuracy through atomic entailment scoring using the FactReasoner tool. This workflow has the potential to promote transparency, comparability, and reusability in AI benchmark reporting, enabling researchers and practitioners to better navigate and evaluate benchmark choices.
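The abstract outlines a three-stage pipeline: per-source extraction agents gather benchmark metadata, an LLM synthesizes a card from the merged metadata, and a validation phase scores the card's atomic factual claims against the source evidence. The sketch below is a minimal illustration of that flow, not the paper's implementation: the helper names (extract_metadata, synthesize_card, score_claims) are hypothetical, and a trivial string-match scorer stands in for the FactReasoner-based entailment step.

```python
# Hypothetical sketch of the Auto-BenchmarkCard workflow described above.
# All names are illustrative placeholders, not the paper's actual API; the
# string-match scorer is a stand-in for FactReasoner's atomic entailment scoring.
from dataclasses import dataclass, field

@dataclass
class BenchmarkCard:
    name: str
    fields: dict = field(default_factory=dict)   # e.g. task, domain, size, license
    claims: list = field(default_factory=list)   # atomic statements to validate

def extract_metadata(source_docs: dict) -> dict:
    """Merge metadata pulled by per-source agents (Hugging Face, Unitxt, papers)."""
    merged = {}
    for _source, record in source_docs.items():
        for key, value in record.items():
            merged.setdefault(key, value)  # first source to report a field wins
    return merged

def synthesize_card(name: str, metadata: dict) -> BenchmarkCard:
    """Stand-in for the LLM-driven synthesis step: turn merged metadata into a card."""
    card = BenchmarkCard(name=name, fields=metadata)
    # Each field becomes one atomic claim that the validation phase can check.
    card.claims = [f"{name} {key} is {value}" for key, value in metadata.items()]
    return card

def score_claims(card: BenchmarkCard, evidence: str) -> dict:
    """Placeholder entailment scorer: 1.0 if the claim's value appears in the evidence."""
    return {
        claim: 1.0 if str(card.fields[key]) in evidence else 0.0
        for key, claim in zip(card.fields, card.claims)
    }

if __name__ == "__main__":
    sources = {
        "huggingface": {"task": "question answering", "language": "English"},
        "paper": {"task": "question answering", "size": "12k examples"},
    }
    card = synthesize_card("ExampleQA", extract_metadata(sources))
    scores = score_claims(
        card,
        evidence="ExampleQA is an English question answering benchmark of 12k examples",
    )
    print(scores)
```

In the actual workflow, low-scoring claims would flag fields of the generated card for correction or human review before the card is published.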
Similar Papers
AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search
Machine Learning (CS)
Creates high-quality training data for AI without human help.
ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation
Computation and Language
Makes AI tests harder to cheat on.
IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch
Artificial Intelligence
Helps computers prove math problems from Olympiads.