Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

Ezra Karger; Jaeho Lee; Nick Merrill

arxiv: 2605.22672 · v2 · pith:DF6FBB35new · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 05:19 UTC · model grok-4.3

keywords: inverse scaling, LLM forecasting, distributional forecasts, superlinear growth, tail risk, regime change, forecast calibration, epidemiology

The pith

More capable language models produce worse forecasts on superlinear growth problems with tail risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether greater capability in language models improves or harms their performance on certain forecasting tasks. It finds that for time series with superlinear growth and a risk of sudden regime change, stronger models produce less accurate distributional forecasts, that is, forecasts of the full range of possible future values. The error concentrates in the upper quantiles, which more capable models push upward by extrapolating past growth too aggressively, while the lower quantiles stay put. This inverse scaling appears both in controlled epidemic simulations and in real data on COVID-19, measles, housing markets, and hyperinflation. Scoring that checks only whether a single threshold is crossed misses the problem and can even make the more capable models look better.

Core claim

More capable language models make worse distributional forecasts than less capable ones when the underlying time series exhibit superlinear growth and tail risk of regime change, with the error arising from upward shifts in the upper quantiles of the forecast.

What carries the argument

The per-quantile decomposition that reveals the concentration of forecast errors in the upper tail for more capable models.
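To make the decomposition concrete, here is a minimal Python sketch of a per-quantile pinball-loss comparison. The Figure 2 caption names pinball loss as the metric being decomposed; the quantile levels, forecast values, and function names below are illustrative assumptions and are not taken from the paper.

```python
def pinball_loss(level, q_pred, y):
    """Pinball (quantile) loss of one predicted quantile q_pred at the given level."""
    diff = y - q_pred
    # Under-prediction is weighted by `level`, over-prediction by (1 - level).
    return level * diff if diff >= 0 else (level - 1.0) * diff


def per_quantile_losses(levels, preds, y):
    """Return {level: loss}, so errors can be compared quantile by quantile."""
    return {lv: pinball_loss(lv, q, y) for lv, q in zip(levels, preds)}


# Hypothetical numbers: a regime change held the realized value to 100.
levels = [0.1, 0.5, 0.9]
y_true = 100.0
conservative = [60.0, 110.0, 180.0]  # stand-in for a less capable model
aggressive = [60.0, 130.0, 400.0]    # same lower tail, upper tail pushed up

loss_cons = per_quantile_losses(levels, conservative, y_true)
loss_aggr = per_quantile_losses(levels, aggressive, y_true)

for lv in levels:
    print(lv, round(loss_cons[lv], 3), round(loss_aggr[lv], 3))
```

In this toy case the two forecasts tie at the 0.1 quantile and the gap widens toward the 0.9 quantile, which is the shape of failure the decomposition is meant to expose.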

If this is right

Both larger model scale and additional post-training increase the severity of the upper-tail overestimation.
Providing domain knowledge does not reliably improve the calibration of these forecasts.
Single-threshold metrics common in benchmarks miss the tail degradation and can reverse the observed capability-accuracy trend.
Continuous measures of forecast accuracy are required to detect and address these failures.
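The contrast between threshold scoring and continuous scoring can be sketched in a few lines of Python. This is an illustration under stated assumptions: the Brier score on a single exceedance threshold stands in for "single-threshold metrics", CRPS is computed with the standard sample-based estimator, and every number and name below is hypothetical rather than drawn from the paper's data.

```python
def threshold_brier(samples, y, cutoff):
    """Brier score for the binary event 'value exceeds cutoff'."""
    p = sum(1 for x in samples if x > cutoff) / len(samples)
    outcome = 1.0 if y > cutoff else 0.0
    return (p - outcome) ** 2


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2


# Hypothetical forecasts for a realized value of 120 and a cutoff of 100.
y_true, cutoff = 120.0, 100.0
weak = [80.0, 90.0, 110.0, 130.0, 150.0]       # stand-in for a less capable model
strong = [110.0, 150.0, 300.0, 600.0, 1200.0]  # upper tail extrapolated far upward

brier_weak = threshold_brier(weak, y_true, cutoff)
brier_strong = threshold_brier(strong, y_true, cutoff)
crps_weak = crps_from_samples(weak, y_true)
crps_strong = crps_from_samples(strong, y_true)

print("Brier:", brier_weak, brier_strong)
print("CRPS: ", crps_weak, crps_strong)
```

On these made-up samples the threshold score favors the forecast with the stretched upper tail, while CRPS penalizes it heavily, so the two metrics rank the same outputs in opposite order.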

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This inverse scaling may indicate that current training paradigms encourage excessive extrapolation in uncertain growth scenarios.
The results highlight the importance of developing evaluation methods that account for tail risks in real-world applications.
Similar patterns could be tested in other domains involving accelerating processes, such as technology adoption curves.

Load-bearing premise

The synthetic SIR epidemics, linear controls, and real-world examples like COVID-19 and hyperinflation adequately represent the general class of superlinear growth forecasting problems with regime change tail risks.
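For readers unfamiliar with the setup, the following Python sketch shows one way to generate an SIR epidemic with an intervention-driven decline alongside a linear control with a downward jump. It is not the paper's generator: the discrete daily update, the parameter values, the way the intervention lowers the transmission rate, and the shape of the linear control's drop are all assumptions made for illustration.

```python
def simulate_sir(days, beta, gamma, n=1_000_000, i0=10.0,
                 intervention_day=None, beta_after=None):
    """Discrete-time SIR; returns the infected count after each daily step."""
    s, i = n - i0, i0
    infected = []
    for t in range(days):
        # Assumption: the intervention simply lowers the transmission rate.
        active = intervention_day is not None and t >= intervention_day
        b = beta_after if active else beta
        new_inf = b * s * i / n
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
        infected.append(i)
    return infected


def linear_with_drop(days, slope, jump_day, drop_fraction):
    """Linear growth that falls to a fraction of its peak at jump_day (assumed shape)."""
    peak = slope * jump_day
    return [slope * (t + 1) if t < jump_day
            else peak * drop_fraction + slope * (t - jump_day)
            for t in range(days)]


sir = simulate_sir(days=60, beta=0.3, gamma=0.1,
                   intervention_day=40, beta_after=0.05)
linear = linear_with_drop(days=60, slope=100.0, jump_day=40, drop_fraction=0.5)

print("SIR peak day:", sir.index(max(sir)))
print("Linear before/after jump:", linear[39], linear[40])
```

Both series share a regime change on the same day; only the SIR series grows superlinearly before it, which is the contrast the matched control is meant to isolate.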

What would settle it

Observing no degradation in upper-tail accuracy for more capable models when evaluated on a new set of superlinear growth time series with regime risks.

Figures

Figures reproduced from arXiv: 2605.22672 by Ezra Karger, Jaeho Lee, Nick Merrill.

**Figure 2.** Upper-tail predictions drive the inverse scaling, in both domains. Per-quantile pinball-loss decomposition. Top: FBSim disruptable templates, N=28 (apples-to-apples panel; same model set as Appendix …

**Figure 3.** Time series with superlinear growth and tail-risk of regime change trigger the inverse scaling. Top: Ground-truth series shown to models as history (black) with continuations not shown (gray). Left: SIR epidemic (log scale): exponential growth then intervention-driven decline. Right: linear growth with the same downward-jump structure. Bottom: CRPS vs. ECI at h=210. On SIR data, more capable models produce…

**Figure 4.** Domain knowledge has inconsistent effects across domains. Naming the domain rescues positive scaling on COVID-19 (ρ: −0.49 → +0.39), substantially attenuates it on housing (∆ρ=+0.86), measles (∆ρ=+0.36), and SIR (∆ρ=+0.24), but has essentially no effect on hyperinflation (∆ρ=+0.00). Red dots: unlabeled numbers (inverse-scaling baseline). Orange crosses: “the current trend may or may not continue.” Green …

**Figure 5.** Across-horizon evolution of the capability–accuracy relationship. Spearman ρ between model capability (Epoch Capabilities Index) and forecast accuracy vs. horizon, sign-flipped so positive = positive scaling, negative = inverse scaling. Top: FBSim, pooled across all six question templates (H1–H7 = game turns). Bottom: pre-vaccine US measles case counts, 1928–1962, pooled across all 35 seasons. In both doma…
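The sign-flipped Spearman ρ described in the Figure 5 caption can be illustrated with a small Python sketch. The capability scores and loss values are invented for the example, and the helper names are hypothetical; only the convention (negate the rank correlation between capability and a loss so that positive means positive scaling) comes from the caption.

```python
import math


def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def scaling_rho(capability, loss):
    """Spearman rho between capability and loss, negated so positive = positive scaling."""
    return -pearson(average_ranks(capability), average_ranks(loss))


# Hypothetical capability scores and CRPS values (lower CRPS is better).
eci = [100.0, 110.0, 120.0, 130.0, 140.0]
crps = [12.0, 15.0, 14.0, 40.0, 90.0]

rho = scaling_rho(eci, crps)
print("sign-flipped Spearman rho:", round(rho, 3))
```

A negative value here means loss rises with capability, i.e. inverse scaling under the caption's convention.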

Original abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability–accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that more capable LLMs produce worse distributional forecasts on time series exhibiting superlinear growth and tail risk of regime change. This inverse scaling is shown on the new contamination-free FBSim benchmark, synthetic SIR epidemics with matched linear controls, and real-world datasets (COVID-19, measles, housing markets, hyperinflation). A per-quantile decomposition localizes the failure to upward shifts in the upper tail; within-family ablations on Llama-3.1 attribute the effect to both scale and post-training. Single-threshold metrics miss the cost and can reverse the sign of the capability-accuracy relationship.

Significance. If the central empirical pattern holds after methodological clarification, the result would be significant for LLM evaluation and deployment in forecasting. It shows that capability can exacerbate errors precisely where tail risks matter most, with direct implications for epidemiology and finance. The release of FBSim, the per-quantile analysis, and the demonstration that threshold metrics can mask the liability are concrete contributions that could influence how future forecasting benchmarks are designed.

major comments (2)

[§4] §4 (per-quantile decomposition): the claim that more capable models shift the upper tail upward rests on the decomposition, yet the manuscript supplies no quantitative metrics, error bars, or details on quantile estimation (empirical vs. parametric) and how contamination was ruled out. These omissions make it impossible to assess the statistical reliability of the reported tail shifts.
[§3] §3 (dataset construction): the synthetic SIR epidemics, matched linear controls, and four real-world series are presented as representative of superlinear-growth problems with regime-change tail risk, but the paper does not address potential selection artifacts from post-hoc window choice or series filtering, nor does it test additional growth exponents or change-point statistics to support generalization.

minor comments (2)

[Abstract] Abstract and §5: the statement that the pattern 'replicates' across datasets would be strengthened by reporting effect sizes or summary statistics rather than qualitative descriptions alone.
[§2] Notation in §2: define 'distributional forecast' and the exact prompting procedure used to elicit quantiles from the models to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the supporting analyses.

Point-by-point responses

Referee: [§4] §4 (per-quantile decomposition): the claim that more capable models shift the upper tail upward rests on the decomposition, yet the manuscript supplies no quantitative metrics, error bars, or details on quantile estimation (empirical vs. parametric) and how contamination was ruled out. These omissions make it impossible to assess the statistical reliability of the reported tail shifts.

Authors: We agree that the original submission omitted key quantitative details and statistical support for the per-quantile results. In the revised manuscript we now report the mean upward shift in the 90th percentile (with bootstrap standard errors) across model families and runs. Quantile estimation is performed empirically from 1000 Monte Carlo samples drawn from each model's predictive distribution; we have added explicit description of this procedure and a comparison to a parametric log-normal fit for robustness. Contamination is ruled out because FBSim is generated from a fully synthetic process with no overlap to any public training corpora; we have expanded the methods section to document the generation pipeline and data provenance checks.
revision: yes

Referee: [§3] §3 (dataset construction): the synthetic SIR epidemics, matched linear controls, and four real-world series are presented as representative of superlinear-growth problems with regime-change tail risk, but the paper does not address potential selection artifacts from post-hoc window choice or series filtering, nor does it test additional growth exponents or change-point statistics to support generalization.

Authors: We acknowledge the risk of post-hoc selection. The real-world windows were chosen from publicly documented intervals of superlinear growth and known regime-change events (e.g., initial COVID-19 exponential phase, 2008 housing bubble); these criteria and the exact date ranges are now listed in a new appendix table. To address generalization we have added synthetic experiments with quadratic and cubic growth exponents, applied two standard change-point detectors (CUSUM and PELT) to confirm regime shifts, and included a sensitivity analysis showing that the inverse-scaling pattern is robust to modest shifts in window boundaries. These additions are reported in the revised §3 and supplementary figures.
revision: yes
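As a rough illustration of the change-point check mentioned in the response above, the sketch below implements a basic one-sided CUSUM that flags a downward shift in a growth-rate series. The simulated rebuttal does not say which CUSUM variant or settings were used, so the baseline window, slack, threshold, and data here are all hypothetical.

```python
def cusum_downshift(values, baseline_n, slack, threshold):
    """One-sided CUSUM for a drop in mean.

    The reference mean comes from the first `baseline_n` points. The statistic
    accumulates shortfalls below that mean (minus `slack`) and resets at zero.
    Returns the first index where it exceeds `threshold`, or None if it never does.
    """
    mu0 = sum(values[:baseline_n]) / baseline_n
    s = 0.0
    for t, v in enumerate(values):
        s = max(0.0, s + (mu0 - v) - slack)
        if s > threshold:
            return t
    return None


# Hypothetical daily log-growth rates: steady growth, then decline from index 20.
growth = [0.18] * 20 + [-0.05] * 10

alarm = cusum_downshift(growth, baseline_n=10, slack=0.02, threshold=0.5)
no_alarm = cusum_downshift([0.18] * 30, baseline_n=10, slack=0.02, threshold=0.5)

print("alarm index:", alarm)
print("alarm on steady series:", no_alarm)
```

The alarm fires shortly after the shift rather than exactly on it, since the statistic needs a few observations to accumulate past the threshold; a larger slack or threshold trades detection delay for fewer false alarms.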

Circularity Check

0 steps flagged

No circularity: empirical comparisons on released benchmarks and public datasets

Full rationale

The paper presents direct empirical results from evaluating LLMs on ForecastBench-Sim (a released synthetic benchmark), matched SIR/linear simulations, and four real-world time series (COVID-19, measles, housing, hyperinflation). Claims rest on observed performance differences, per-quantile decompositions, and within-family scaling studies rather than any derivation that reduces by construction to fitted parameters, self-definitions, or load-bearing self-citations. The central inverse-scaling pattern is measured against external outputs and public data; no equations or steps equate predictions to inputs by definition. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper adds an empirical observation and a new benchmark rather than new mathematical axioms or invented entities; the main unverified premise is the representativeness of the selected tasks.

axioms (1)

Domain assumption: The selected synthetic and real-world time series exhibit superlinear growth and tail risk of regime change.
Invoked to justify the choice of FBSim, SIR epidemics, and the listed real datasets as test cases.

pith-pipeline@v0.9.0 · 5752 in / 1195 out tokens · 40098 ms · 2026-05-22T05:19:21.068438+00:00 · methodology
