PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
By: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, and more
Potential Business Impact:
Tests AI on real-world law and money problems.
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both the legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contribute tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the economic impact associated with each prompt and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in model reliability for professional adoption.
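The abstract does not spell out the exact scoring formula, but a common convention for rubric-based benchmarks of this kind is to grade each model response against its task's criteria and report the mean fraction of criteria satisfied. The Python sketch below illustrates that idea under this assumption; the class, function, and example criteria are illustrative, not the authors' released code.

from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Cites the controlling statute for the jurisdiction"
    satisfied: bool    # verdict from an expert or automated grader

def task_score(criteria: list[RubricCriterion]) -> float:
    """Fraction of a task's rubric criteria that the model response satisfies."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)

def benchmark_score(tasks: list[list[RubricCriterion]]) -> float:
    """Mean per-task rubric score across a benchmark split (e.g. a Hard subset)."""
    return sum(task_score(t) for t in tasks) / len(tasks)

# Toy example: two tasks with graded criteria
tasks = [
    [RubricCriterion("Identifies the governing law", True),
     RubricCriterion("Flags the statute-of-limitations issue", False)],
    [RubricCriterion("Computes the correct discount rate", True),
     RubricCriterion("States assumptions transparently", True),
     RubricCriterion("Avoids unsupported return projections", False)],
]
print(f"Benchmark score: {benchmark_score(tasks):.2f}")  # prints 0.58 for this toy data

Under this scheme, the reported top scores of 0.39 (Finance) and 0.37 (Legal) would mean leading models satisfy well under half of the expert-curated criteria on the Hard subsets.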
Similar Papers
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Computation and Language
Tests AI on hard professional jobs.
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
CV and Pattern Recognition
Tests AI's smart thinking on hard problems.
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
Computation and Language
Teaches computers to solve tricky money math problems.