BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
By: Eunsu Kim, Haneul Yoo, Guijin Son, and more
Potential Business Impact:
Organizes AI tests so models can be checked on the specific skills that matter.
As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform evaluations tailored to specific needs or domains, despite the growing importance of domain-specific models in areas such as math or code. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs more effectively. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 303K questions across 38 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering a critical infrastructure for advancing LLM evaluation research.
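The workflow the abstract describes is an aggregation pipeline: benchmarks are pooled into one repository, each question is tagged with a domain, and users build custom evaluation sets by filtering on those tags before scoring a model per domain. The sketch below illustrates that idea only; the file name, field names ("question", "answer", "domain"), and the exact-match scorer are illustrative assumptions, not BenchHub's actual schema or API.

# Minimal sketch of domain-filtered evaluation over a unified benchmark pool.
# Assumed (hypothetical) schema: each JSONL record has "question", "answer",
# and "domain" fields; the pool is a local export of the repository.
import json
from collections import defaultdict

def load_pool(path):
    """Load all benchmark questions from a JSONL dump of the repository."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_eval_set(pool, domains, per_domain=100):
    """Select up to `per_domain` questions for each requested domain."""
    buckets = defaultdict(list)
    for item in pool:
        if item["domain"] in domains and len(buckets[item["domain"]]) < per_domain:
            buckets[item["domain"]].append(item)
    return buckets

def evaluate(model_fn, buckets):
    """Score a model per domain with simple exact-match accuracy."""
    scores = {}
    for domain, items in buckets.items():
        correct = sum(
            model_fn(it["question"]).strip() == it["answer"].strip() for it in items
        )
        scores[domain] = correct / len(items) if items else 0.0
    return scores

if __name__ == "__main__":
    pool = load_pool("benchhub_pool.jsonl")      # hypothetical export path
    buckets = build_eval_set(pool, {"math", "code"})
    print(evaluate(lambda q: "42", buckets))     # stub model for illustration

Reporting per-domain scores, rather than a single pooled number, is what surfaces the kind of domain-specific performance gaps the paper's experiments highlight.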
Similar Papers
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
Artificial Intelligence
Tests AI to find better science ideas.
YourBench: Easy Custom Evaluation Sets for Everyone
Computation and Language
Creates custom tests for AI, fast and cheap.
FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
Software Engineering
Tests how well AI writes code for websites.