QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Jeremy Qin; Maksym Andriushchenko

arxiv: 2604.15859 · v1 · submitted 2026-04-17 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-10 08:53 UTC · model grok-4.3


keywords LLM evaluation · prediction intervals · quantitative forecasting · model calibration · uncertainty quantification · benchmarking · overconfidence


The pith

No evaluated LLM reaches the 90% coverage target for prediction intervals in quantitative forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QuantSightBench to test large language models on generating prediction intervals for numerical forecasts over continuous quantities. Existing evaluations use only binary or multiple-choice formats, yet real decisions in economics, public health, and demographics require calibrated estimates with explicit uncertainty. Across 11 frontier and open-weight models, none reaches the 90% coverage target: the strongest results are 79.1% (Gemini 3.1 Pro), 76.4% (Grok 4), and 75.3% (GPT-5.4), and all models are overconfident at extreme magnitudes. Prediction intervals demand scale awareness and consistency across confidence levels, exposing limitations that point estimates miss. The work therefore positions calibrated uncertainty quantification as a capability that current models lack.

Core claim

We propose prediction intervals as a natural and rigorous interface for evaluating LLM quantitative forecasting. To assess this capability, we introduce QuantSightBench and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
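Empirical coverage and sharpness can both be computed directly from the submitted intervals and the resolved values. The following is a minimal sketch under stated assumptions: the `Forecast` record, the two function names, and the demo data are hypothetical, and relative width is only one possible sharpness proxy (the benchmark's exact sharpness metric is not specified here).

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    lower: float  # lower bound of the prediction interval
    upper: float  # upper bound of the prediction interval
    truth: float  # resolved ground-truth value


def empirical_coverage(forecasts):
    """Fraction of forecasts whose interval contains the ground truth."""
    hits = sum(1 for f in forecasts if f.lower <= f.truth <= f.upper)
    return hits / len(forecasts)


def mean_relative_width(forecasts):
    """Average of (upper - lower) / |truth|, skipping zero-valued truths."""
    widths = [(f.upper - f.lower) / abs(f.truth) for f in forecasts if f.truth != 0]
    return sum(widths) / len(widths)


# Made-up illustrative data, not taken from QuantSightBench.
demo = [
    Forecast(90, 110, 100),
    Forecast(40, 60, 75),
    Forecast(1_000, 3_000, 2_000),
    Forecast(5, 6, 5.5),
]
coverage = empirical_coverage(demo)
width = mean_relative_width(demo)
print(f"coverage={coverage:.2f}, mean relative width={width:.3f}")
```

A model that claims 90% intervals is well calibrated when `coverage` is close to 0.90; a lower value with narrow widths is the overconfidence pattern described above.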

What carries the argument

QuantSightBench benchmark, which uses prediction intervals to test empirical coverage, interval sharpness, and calibration across magnitudes for continuous quantitative forecasts.
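One standard way to score coverage and sharpness jointly is the interval score (sometimes called the Winkler score), which charges the interval's width plus a penalty proportional to how far the truth falls outside it. The sketch below uses that textbook definition; it is an assumption that it resembles the benchmark's metric, since the MLIS definition is not given here.

```python
def interval_score(lower, upper, truth, alpha=0.1):
    """Interval score for a central (1 - alpha) prediction interval; lower is better."""
    width = upper - lower
    # Misses are penalised by 2/alpha times the distance to the nearest bound.
    below = (2.0 / alpha) * (lower - truth) if truth < lower else 0.0
    above = (2.0 / alpha) * (truth - upper) if truth > upper else 0.0
    return width + below + above


# Made-up values: the same 90% interval scored against a hit and a miss.
score_hit = interval_score(90, 110, 100)
score_miss = interval_score(90, 110, 120)
print(score_hit, score_miss)
```

Because the miss penalty scales with 1/alpha, a narrow interval that misses is punished more heavily at higher confidence levels, which is why the score rewards intervals that are both sharp and honest.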

If this is right

All tested models exhibit systematic overconfidence when expressing uncertainty in numerical forecasts.
Calibration accuracy declines sharply for forecasts involving extreme or large-magnitude values.
Prediction intervals provide a stricter and more informative evaluation than point estimates for numerical reasoning.
No frontier model yet demonstrates reliable calibration for continuous quantitative forecasting tasks.
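The claim that calibration declines at extreme magnitudes can be checked by bucketing questions by the order of magnitude of the ground truth and computing coverage per bucket. This is a rough sketch only: the bucketing by `floor(log10(|truth|))`, the function name, and the demo records are assumptions, not the paper's procedure.

```python
import math
from collections import defaultdict


def coverage_by_magnitude(records):
    """Coverage per order of magnitude for (lower, upper, truth) records."""
    buckets = defaultdict(lambda: [0, 0])  # exponent -> [hits, total]
    for lower, upper, truth in records:
        if truth == 0:
            continue  # log10 is undefined at zero; skipped in this sketch
        exponent = math.floor(math.log10(abs(truth)))
        buckets[exponent][1] += 1
        if lower <= truth <= upper:
            buckets[exponent][0] += 1
    return {e: hits / total for e, (hits, total) in sorted(buckets.items())}


# Made-up illustrative records, not benchmark data.
demo_records = [
    (200, 300, 250),
    (500, 700, 800),
    (2e6, 4e6, 3.2e6),
    (1e9, 2e9, 7.5e9),
    (4e9, 6e9, 9.1e9),
]
by_magnitude = coverage_by_magnitude(demo_records)
print(by_magnitude)
```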

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world applications that rely on LLM-generated forecasts would benefit from post-hoc calibration or human review to correct overconfidence.
Future model training could incorporate explicit objectives that reward proper interval calibration rather than point accuracy alone.
The observed pattern suggests that extending the benchmark to additional domains would likely reveal similar calibration gaps.
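As one illustration of the post-hoc calibration mentioned above, intervals can be stretched about their median by a factor estimated on already-resolved questions, in the style of split conformal prediction. This technique is swapped in for illustration and is not taken from the paper; the function names and data are hypothetical, and the approach assumes the resolved questions are exchangeable with the new ones.

```python
import math


def widening_factor(calibration, target=0.9):
    """Factor by which to stretch intervals about their median.

    calibration: list of (lower, median, upper, truth) with lower < median < upper.
    """
    scores = []
    for lower, median, upper, truth in calibration:
        # Score = distance of truth from the median, in units of the half-width.
        if truth >= median:
            scores.append((truth - median) / (upper - median))
        else:
            scores.append((median - truth) / (median - lower))
    scores.sort()
    rank = math.ceil((len(scores) + 1) * target)  # finite-sample corrected rank
    if rank > len(scores):
        return math.inf  # too few calibration points for this target
    return scores[rank - 1]


def widen(lower, median, upper, factor):
    """Stretch both half-widths of an interval by the same factor."""
    return median - factor * (median - lower), median + factor * (upper - median)


# Made-up resolved questions: identical stated intervals, varying outcomes.
truths = [100, 95, 110, 130, 70, 105, 125, 90, 140]
calibration = [(80, 100, 120, t) for t in truths]
factor = widening_factor(calibration, target=0.9)
adjusted = widen(80, 100, 120, factor)
print(factor, adjusted)
```

A factor above 1 indicates the stated intervals were too narrow on the resolved set; the cost of restoring coverage this way is wider, less sharp intervals.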

Load-bearing premise

The tasks and ground-truth values in QuantSightBench are representative of real-world quantitative forecasting and measured without error or selection bias.

What would settle it

Finding that any current or future model consistently achieves at least 90% empirical coverage across the full range of QuantSightBench tasks, including at extreme magnitudes, would directly contradict the reported shortfall.
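Whether a measured coverage is statistically distinguishable from 90% depends on how many questions were scored. The sketch below uses the Wilson score interval for a binomial proportion; the count of 1000 questions is a hypothetical placeholder, since the benchmark size is not stated here.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical: 791 covered out of 1000 questions (79.1% coverage).
low, high = wilson_interval(791, 1000)
reaches_target = high >= 0.90
print(f"95% interval for coverage: [{low:.3f}, {high:.3f}], reaches 90%: {reaches_target}")
```

With a much smaller question set the interval widens, so the same point estimate could fail to rule out the 90% target.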

Figures

Figures reproduced from arXiv: 2604.15859 by Jeremy Qin, Maksym Andriushchenko.

**Figure 2.** Relative interval width vs. mean relative error under the agentic setting. A positive…

**Figure 3.** Coverage vs. MLIS under the agentic setting. Models in the upper-left corner…

**Figure 4.** Coverage and MLIS by agentic retrieval iterations across all models. Questions…

**Figure 5.** Calibration, interval width, and MLIS across target confidence levels (80%, 90%, …)

**Figure 6.** Coverage and MLIS across the zero-shot, context, and agentic prompt settings for…

**Figure 7.** Coverage broken down by the magnitude of the ground truth value for all models.

**Figure 8.** MLIS broken down by the magnitude of the ground truth value for all models.

**Figure 9.** Forecasting example with GPT-5.1.

A.5 Prompt Details. Below, we show the full prompt template used in the agentic setting. Agentic Forecasting Prompt. System: You are a calibrated forecasting assistant with access to a news article search tool. Decide whether to search for relevant articles to inform your prediction. Match the units specified in the resolution criteria. User: You are a forecasting expert task…

**Figure 10.** Agentic forecasting prompt template. The model receives a question with…
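The agentic prompt template asks models to answer in the form `<lower>NUMBER</lower> <median>NUMBER</median> <upper>NUMBER</upper>`. A response in that format could be parsed roughly as follows; the regular expression, the function name, and the handling of thousands separators and unordered values are assumptions for illustration, not the authors' parser.

```python
import re

# Matches e.g. <lower>1,200</lower>, <median>1500</median>, <upper>2.1e3</upper>.
_TAG = re.compile(
    r"<(lower|median|upper)>\s*([-+]?\d[\d,]*\.?\d*(?:[eE][-+]?\d+)?)\s*</\1>"
)


def parse_interval(text):
    """Extract (lower, median, upper) floats from a tagged model response.

    Returns None if a tag is missing or the values are not ordered.
    """
    found = {name: float(value.replace(",", "")) for name, value in _TAG.findall(text)}
    if set(found) != {"lower", "median", "upper"}:
        return None
    lower, median, upper = found["lower"], found["median"], found["upper"]
    if not lower <= median <= upper:
        return None
    return lower, median, upper


reply = "My estimate: <lower>1,200</lower> <median>1500</median> <upper>2.1e3</upper>"
parsed = parse_interval(reply)
print(parsed)
```

Returning `None` for malformed or unordered answers keeps parsing failures separate from genuine coverage misses; how such failures are scored is a design choice this sketch leaves open.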

Original abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
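Internal consistency across confidence levels has a simple testable form: for a single question, the interval at a higher confidence level should contain the interval at every lower level. A minimal sketch of that check follows; the function name, the dictionary layout, and the 95% level are assumptions for illustration.

```python
def is_nested(intervals_by_level):
    """True if each higher-confidence interval contains every lower-confidence one.

    intervals_by_level: dict mapping a confidence level to a (lower, upper) tuple.
    """
    levels = sorted(intervals_by_level)
    # Checking adjacent levels is enough because containment is transitive.
    for narrow_level, wide_level in zip(levels, levels[1:]):
        n_lo, n_hi = intervals_by_level[narrow_level]
        w_lo, w_hi = intervals_by_level[wide_level]
        if not (w_lo <= n_lo and n_hi <= w_hi):
            return False
    return True


# Made-up answers to one question at three confidence levels.
consistent = is_nested({0.80: (95, 105), 0.90: (90, 110), 0.95: (85, 115)})
inconsistent = is_nested({0.80: (90, 110), 0.90: (95, 105)})
print(consistent, inconsistent)
```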

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

QuantSightBench pushes LLM eval toward proper prediction intervals for continuous forecasts and shows clear shortfalls in coverage, but the methods behind the benchmark stay too thin to judge the results.


The paper's core move is to replace point estimates or binary questions with prediction intervals for quantitative forecasting tasks. That shift makes sense for domains where decisions rest on ranges rather than single numbers. The authors evaluate 11 frontier and open-weight models, report that none reaches 90% coverage, and note that performance drops further on extreme values. Top scores range from 75.3% to 79.1% for GPT-5.4, Grok 4, and Gemini 3.1 Pro. The emphasis on calibration across magnitudes is a reasonable direction to explore.

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation


The paper introduces QuantSightBench and reports empirical coverage and calibration results for 11 LLMs on quantitative forecasting tasks using prediction intervals. All load-bearing claims consist of direct statistical comparisons between model outputs and held-out external ground-truth values. No equations, derivations, fitted parameters, ansatzes, or self-citations are invoked to produce the headline findings; the evaluation protocol is independent of the model responses being measured and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities are described in the abstract; the contribution is an empirical benchmark and model evaluation.

pith-pipeline@v0.9.0 · 5523 in / 1142 out tokens · 23953 ms · 2026-05-10T08:53:48.505427+00:00 · methodology

