LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)
By: Rongge Xu, Hui Dai, Yiming Fu, and more
Large language models (LLMs) have made rapid progress in formal theorem proving, yet current benchmarks under-measure the kind of abstraction and library-mediated reasoning that organizes modern mathematics. In parallel with FATE's emphasis on frontier algebra, we introduce LeanCat, a Lean benchmark for category-theoretic formalization -- a unifying language for mathematical structure and a core layer of modern proof engineering -- that serves as a stress test of structural, interface-level reasoning. Part I: 1-Categories contains 100 fully formalized statement-level tasks, curated into topic families and three difficulty tiers through a grading process that combines LLM assistance with human review. The best model solves 8.25% of tasks at pass@1 (32.50%/4.17%/0.00% on Easy/Medium/High) and 12.00% at pass@4 (50.00%/4.76%/0.00%). We also evaluate LeanBridge, which uses LeanExplore to search Mathlib, and observe consistent gains over single-model baselines. LeanCat is intended as a compact, reusable checkpoint for tracking both AI and human progress toward reliable, research-level formalization in Lean.
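To make the task format concrete, here is a minimal sketch of what a statement-level Lean task in this style might look like. The statement (every split monomorphism is a monomorphism) is a hypothetical illustration drawn from standard 1-category material in Mathlib, not an actual LeanCat task; in the statement-level setting, the model is given the full statement and must replace the `sorry` with a proof that the Lean kernel accepts.

    import Mathlib.CategoryTheory.EpiMono

    open CategoryTheory

    universe v u

    -- Hypothetical task in the statement-level style (not from LeanCat):
    -- the statement is fixed, and a submission must replace `sorry`
    -- with a complete proof.
    example {C : Type u} [Category.{v} C] {X Y : C} (f : X ⟶ Y)
        [IsSplitMono f] : Mono f := by
      sorry

For the reported metrics, assuming the standard unbiased estimator of Chen et al. (2021), pass@k = E[1 - C(n - c, k) / C(n, k)], where n is the number of proof attempts sampled per task and c is the number that type-check; pass@1 is then simply the average per-attempt success rate.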