CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
By: Jiacheng Guo, Suozhi Huang, Zixin Yao, and more
Potential Business Impact:
Tests how well AI agents retrieve and forecast information in fast-moving crypto markets.
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
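As a rough illustration of the four-quadrant taxonomy described above, the following Python sketch shows one way such benchmark tasks and per-quadrant scoring could be represented. The class names, fields, and scoring function here are illustrative assumptions for exposition, not CryptoBench's actual implementation.

from dataclasses import dataclass
from enum import Enum
from collections import defaultdict

class Quadrant(Enum):
    # The four task categories described in the paper
    SIMPLE_RETRIEVAL = "simple_retrieval"
    COMPLEX_RETRIEVAL = "complex_retrieval"
    SIMPLE_PREDICTION = "simple_prediction"
    COMPLEX_PREDICTION = "complex_prediction"

@dataclass
class Task:
    question: str        # expert-written question for the current month
    quadrant: Quadrant   # which of the four categories the task falls into
    ground_truth: str    # resolved answer, known at grading time

def score_by_quadrant(results: list[tuple[Task, bool]]) -> dict[Quadrant, float]:
    """Aggregate per-quadrant accuracy, e.g. to expose a model's
    retrieval-prediction imbalance across the four categories."""
    totals, correct = defaultdict(int), defaultdict(int)
    for task, is_correct in results:
        totals[task.quadrant] += 1
        correct[task.quadrant] += int(is_correct)
    return {q: correct[q] / totals[q] for q in totals}

Reporting accuracy per quadrant rather than as a single aggregate is what allows the comparison of retrieval versus prediction performance that the paper highlights.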
Similar Papers
AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Cryptography and Security
Helps computers understand cyber threats better.
Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks
Computational Engineering, Finance, and Science
Tests AI on real-world financial research tasks and finds big gaps.