
The Forecast Critic: Leveraging Large Language Models for Poor Forecast Identification

Published: December 12, 2025 | arXiv ID: 2512.12059v1

By: Luke Bhan, Hanyu Zhang, Andrew Gordon Wilson and more

Monitoring forecasting systems is critical for customer satisfaction, profitability, and operational efficiency in large-scale retail businesses. We propose The Forecast Critic, a system that leverages Large Language Models (LLMs) for automated forecast monitoring, taking advantage of their broad world knowledge and strong "reasoning" capabilities. As a prerequisite for this, we systematically evaluate the ability of LLMs to assess time series forecast quality, focusing on three key questions: (1) Can LLMs be deployed to perform forecast monitoring and identify obviously unreasonable forecasts? (2) Can LLMs effectively incorporate unstructured exogenous features to assess what a reasonable forecast looks like? (3) How does performance vary across model sizes and reasoning capabilities, measured across state-of-the-art LLMs? We present three experiments covering both synthetic and real-world forecasting data. Our results show that LLMs can reliably detect and critique poor forecasts, such as those plagued by temporal misalignment, trend inconsistencies, and spike errors. The best-performing model we evaluated achieves an F1 score of 0.88, somewhat below human-level performance (F1 score: 0.97). We also demonstrate that multi-modal LLMs can effectively incorporate unstructured contextual signals to refine their assessment of the forecast. Models correctly identify missing or spurious promotional spikes when provided with historical context about past promotions (F1 score: 0.84). Lastly, we demonstrate that these techniques succeed in identifying inaccurate forecasts on the real-world M5 time series dataset, with unreasonable forecasts having an sCRPS at least 10% higher than that of reasonable forecasts. These findings suggest that LLMs, even without domain-specific fine-tuning, may provide a viable and scalable option for automated forecast monitoring and evaluation.
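The F1 scores quoted in the abstract treat forecast monitoring as binary classification: each forecast is either reasonable or unreasonable, and the critic's flags are compared with ground-truth labels. A minimal sketch of that metric follows, assuming label 1 means "unreasonable forecast". The function name `f1_score` and the example labels are hypothetical and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (assumed here: 1 = forecast flagged as unreasonable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but actually reasonable
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed unreasonable forecasts
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight forecasts (illustrative only, not data from the paper).
truth = [1, 1, 1, 0, 0, 0, 1, 0]   # ground truth: 1 = unreasonable
critic = [1, 1, 0, 0, 1, 0, 1, 0]  # critic's verdicts
score = f1_score(truth, critic)
print(f"F1 = {score:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a critic that flags every forecast or none of them scores poorly, which suits a monitoring setting where both false alarms and misses carry a cost.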

Pitfalls in Evaluating Language Model Forecasters

Machine Learning (CS)

Makes AI predictions more trustworthy and accurate.

31 May 2025

Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking

Machine Learning (CS)

Models guess future events better with more facts.

23 Nov 2025

Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs

Risk Management

Helps predict company money health better.

24 Jul 2024
