pith. machine review for the scientific record.

arxiv: 2605.09712 · v1 · submitted 2026-05-10 · 💰 econ.EM · q-fin.PM · stat.ML

Recognition: 2 Lean theorem links

Quantifying the Risk-Return Tradeoff in Forecasting

Philippe Goulet Coulombe

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💰 econ.EM · q-fin.PM · stat.ML
keywords forecast evaluation · risk-adjusted performance · macroeconomic forecasting · professional forecasters · machine learning · edge ratio · sharpe ratio · forecast reliability

The pith

Treating forecast loss differentials as returns shows professional forecasters are hard to beat on risk-adjusted measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes forecast evaluation by converting the loss gap between any model and a benchmark into a return series, then applying finance-style risk metrics that measure not just average accuracy but reliability and downside exposure. When this is done for U.S. macroeconomic targets, many machine learning and econometric models improve on raw error relative to the Survey of Professional Forecasters, yet the professionals keep the better risk profile: fewer large failures and higher Edge Ratios, the paper's measure of unique informativeness. This matters because practical decisions often penalize occasional large misses more than they reward small average gains. The same mapping also permits direct comparisons across dozens of targets, horizons, and samples, including density forecasts and the M4 competition.
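
Stated minimally, with notation of our own choosing (the abstract leaves the mapping informal): for a loss function $L$, benchmark forecast $\hat{y}_{b,t}$, and model forecast $\hat{y}_{m,t}$,

    d_t = L(y_t, \hat{y}_{b,t}) - L(y_t, \hat{y}_{m,t}), \qquad \mathrm{SR}_m = \bar{d} / \hat{\sigma}_d,

so $d_t > 0$ whenever the model beats the benchmark that period. Note that the Sharpe ratio of the $d_t$ series is, up to a $\sqrt{T}$ scaling and the choice of variance estimator, the familiar Diebold-Mariano statistic.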

Core claim

By mapping forecast loss differentials to a return series, the author shows that finance-style risk-adjusted performance measures indicate forecasters in the Survey of Professional Forecasters rarely produce catastrophic errors and frequently post high Edge Ratios, even as selected machine learning models achieve competitive risk profiles on particular variables.

What carries the argument

The conversion of forecast loss differentials relative to a benchmark into a return series, which then supports calculation of the Sharpe ratio, Sortino ratio, Omega ratio, drawdown statistics, and the Edge Ratio that quantifies unique predictive value outside the current frontier.
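
To make the machinery concrete, here is a minimal sketch assuming squared-error loss and the plain textbook definitions of the ratios; the paper's exact conventions are not given in the abstract, and the Edge Ratio is omitted because its frontier construction is paper-specific:

    import numpy as np

    def loss_to_returns(e_bench, e_model):
        # Period-t "return": benchmark loss minus model loss under squared
        # error, so positive values mean the model beat the benchmark.
        return np.asarray(e_bench) ** 2 - np.asarray(e_model) ** 2

    def sharpe(d):
        return d.mean() / d.std(ddof=1)

    def sortino(d, target=0.0):
        downside = np.minimum(d - target, 0.0)   # only below-target periods
        return (d.mean() - target) / np.sqrt((downside ** 2).mean())

    def omega(d, threshold=0.0):
        gains = np.clip(d - threshold, 0.0, None).sum()
        losses = np.clip(threshold - d, 0.0, None).sum()
        return gains / losses

    def max_drawdown(d):
        wealth = np.cumsum(d)                    # cumulative edge vs. benchmark
        peak = np.maximum.accumulate(wealth)
        return (peak - wealth).max()

Ranking models by sharpe(loss_to_returns(e_bench, e_model)) rather than by mean squared error is, in essence, the paper's central move.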

If this is right

  • Professional forecasters maintain high Edge Ratios that plausibly reflect the value of contextual judgment.
  • Selected machine learning methods deliver attractive risk profiles for specific macroeconomic targets.
  • Beating professional forecasters on average accuracy does not automatically translate to superiority on risk-adjusted bases.
  • The framework supports unified meta-analyses across targets, horizons, samples, density forecasts, and competitions such as M4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss-to-return mapping could be used to compare forecasting approaches in domains outside macroeconomics, such as finance or energy demand.
  • If high Edge Ratios stem from human judgment, hybrid systems that blend model output with professional oversight might raise both accuracy and risk-adjusted scores.
  • Extending the metrics to explicitly model the time-series dependence in forecast errors could alter which methods appear safest.
  • Large-scale application across many datasets might identify whether certain model classes minimize downside risk more consistently than others.

Load-bearing premise

Forecast loss differentials relative to a benchmark can be treated directly as a return series to which standard financial risk-adjusted performance measures apply without further modification for the statistical properties of forecasting errors.

What would settle it

Whether recalculating the Sharpe and Edge Ratios on a fresh macroeconomic sample, after adjusting the loss differentials for serial correlation and heteroskedasticity, preserves or reverses the ranking that favors professional forecasters.
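
Concretely, the adjustment could take the form of a Diebold-Mariano-style statistic with a Newey-West (HAC) variance replacing the naive one. A minimal sketch, with the Bartlett truncation max_lag as an illustrative choice:

    import numpy as np

    def nw_tstat(d, max_lag=4):
        # t-statistic for mean(d) using a Newey-West (HAC) long-run variance,
        # robust to serial correlation and heteroskedasticity in the
        # loss-differential series d.
        d = np.asarray(d, dtype=float)
        T = d.size
        u = d - d.mean()
        lrv = (u @ u) / T                        # lag-0 autocovariance
        for lag in range(1, max_lag + 1):
            w = 1.0 - lag / (max_lag + 1.0)      # Bartlett kernel weight
            lrv += 2.0 * w * (u[lag:] @ u[:-lag]) / T
        return d.mean() / np.sqrt(lrv / T)

Re-ranking models on nw_tstat(d) instead of the raw Sharpe ratio of d, on a sample the paper never touched, is the kind of check that would settle the question either way.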

original abstract

Average forecast accuracy is not the same as forecast reliability. I treat forecast loss differentials relative to a benchmark as a return series. I then evaluate these returns using risk-adjusted performance measures from finance, including the Sharpe ratio, Sortino ratio, Omega ratio, and drawdown-based metrics. I also introduce the Edge Ratio capturing a model's propensity to deliver uniquely informative predictions relative to the forecasting frontier. I apply this framework to U.S. macroeconomic forecasting, comparing econometric benchmarks, machine learning models, a foundation model (TabPFN), and the Survey of Professional Forecasters. While it is often feasible to beat professional forecasters in terms of average accuracy, it is much harder to beat them on a risk-adjusted basis. They rarely exhibit catastrophic failures and often achieve high Edge Ratios, plausibly reflecting the value of contextual judgment. Nonetheless, selected machine learning methods deliver attractive risk profiles for specific targets. The framework naturally extends to meta-analyses across targets, horizons, and samples, illustrated with a density forecast evaluation and the M4 competition.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reframes forecast evaluation by treating loss differentials relative to a benchmark as a return series and applies standard financial risk-adjusted metrics (Sharpe, Sortino, Omega ratios, drawdowns) plus a new Edge Ratio that measures propensity for uniquely informative predictions relative to a forecasting frontier. Applied to U.S. macro targets, it compares econometric benchmarks, ML models, TabPFN, and the Survey of Professional Forecasters (SPF), finding that average-accuracy gains by ML are common but risk-adjusted outperformance is rarer; SPF exhibits fewer catastrophic failures and higher Edge Ratios, plausibly due to contextual judgment. The framework is illustrated on density forecasts and the M4 competition.

Significance. If the risk metrics can be validly adapted to forecast-error series, the work supplies a useful new evaluation lens that privileges reliability and tail-risk avoidance over raw MSE/MAE, helping explain why professional forecasters persist despite point-accuracy shortfalls. Explicit credit is due for the parameter-light extension to meta-analyses across horizons/targets/samples and for the reproducible-style application to M4 and density forecasts; these features make the contribution more than a one-off empirical exercise.

major comments (3)
  1. [Methodology section defining the return series and risk measures] The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).
  2. [Section introducing the Edge Ratio] The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.
  3. [Empirical results section (SPF comparison)] Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.
minor comments (2)
  1. [Abstract] The abstract states that the framework 'naturally extends' to meta-analyses but does not illustrate the aggregation rule (e.g., how Edge Ratios are pooled across targets); a short clarifying sentence would improve readability. One hypothetical pooling rule is sketched after these comments.
  2. [Notation and definitions] Notation for the loss-differential series should be introduced once and used consistently; currently the mapping from forecast error to 'return' is described only informally.
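
On the aggregation question, nothing in the abstract pins down a rule, so the following is purely a hypothetical illustration of what pooling loss-differential "returns" across targets could look like: standardize each target's series so no single target dominates, stack them, and score the pooled series.

    import numpy as np

    def pooled_sharpe(diff_series):
        # diff_series: list of loss-differential arrays, one per target/horizon.
        # Hypothetical rule, not the paper's: standardize each series, stack,
        # and compute one Sharpe ratio for the pooled collection of "edges".
        z = [np.asarray(d) / np.asarray(d).std(ddof=1) for d in diff_series]
        pooled = np.concatenate(z)
        return pooled.mean() / pooled.std(ddof=1)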

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, clarifying our approach and indicating revisions that will strengthen the manuscript's methodological rigor and empirical transparency.

point-by-point responses
  1. Referee: The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).

    Authors: We acknowledge that macro forecast errors frequently exhibit serial correlation, heteroskedasticity, and overlapping origins, which can influence the interpretation of standard risk-adjusted metrics. In the revised manuscript, we will incorporate HAC standard errors for the Sharpe, Sortino, and Omega ratios, discuss the implications of overlapping forecast origins, and add a robustness section applying pre-whitening to the loss differential series (see the sketch after these responses). While these metrics are used primarily for model ranking and comparison rather than formal hypothesis testing, the relative orderings remain informative; the additions will better align the analysis with the data properties. revision: yes

  2. Referee: The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.

    Authors: The referee rightly flags the risk of circularity in the Edge Ratio. We will revise the methodology section to provide the exact construction details of the forecasting frontier and introduce an out-of-sample safeguard: the frontier will be estimated on a training subsample, with the Edge Ratio then computed on a held-out evaluation period. This change will allow us to substantiate the SPF results without circularity while preserving the measure's comparative value. revision: yes

  3. Referee: Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.

    Authors: We agree that explicit verification is needed for the SPF superiority claims. The revised manuscript will include a new table or figure directly comparing SPF and ML models on Edge Ratio and maximum drawdown. We will also add text and footnotes confirming that loss series are pre-whitened where appropriate and that HAC standard errors are used for rankings and comparisons. These updates will provide the requested transparency and reinforce the reliability conclusion. revision: yes
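
For the pre-whitening promised in response 1, one common convention is a mean-preserving AR(1) filter of the kind used to unsmooth serially correlated return series; a minimal sketch, not the authors' stated procedure:

    import numpy as np

    def prewhiten_ar1(d):
        # Fit d_t = c + rho * d_{t-1} + eps_t by OLS, then apply the
        # mean-preserving filter d*_t = (d_t - rho * d_{t-1}) / (1 - rho).
        d = np.asarray(d, dtype=float)
        x, y = d[:-1], d[1:]
        rho = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        return (y - rho * x) / (1.0 - rho)       # assumes |rho| < 1

Recomputing the risk ratios on prewhiten_ar1(d) rather than d is then a one-line robustness check.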

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an original reframing of forecast loss differentials as a return series and applies standard financial risk metrics plus a newly defined Edge Ratio relative to a forecasting frontier. All central claims follow from direct empirical application to U.S. macro data and model comparisons (SPF, ML, benchmarks). No load-bearing step reduces by construction to its own inputs: there are no self-definitional equations, no fitted parameters relabeled as out-of-sample predictions, and no uniqueness theorems or ansatzes imported solely via self-citation chains. The framework remains falsifiable through the reported performance rankings and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on treating loss differentials as returns and introducing the Edge Ratio; no free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption: Forecast loss differentials relative to a benchmark can be treated as a return series.
    This is the foundational step stated in the abstract for applying financial performance measures.
invented entities (1)
  • Edge Ratio (no independent evidence)
    purpose: Capturing a model's propensity to deliver uniquely informative predictions relative to the forecasting frontier.
    Newly introduced in the paper as an additional evaluation metric.

pith-pipeline@v0.9.0 · 5473 in / 1409 out tokens · 44882 ms · 2026-05-12T03:46:09.009553+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. Alam, M. J., Boyle, S., Li, H., and Sekhposyan, T. (2025). ChatMacro: Evaluating inflation forecasts of generative AI. Working Paper 2025-13, Federal Reserve Bank of San Francisco.
  2. Ang, A., Bekaert, G., and Wei, M. (2007). Do macro variables, asset markets, or surveys forecast inflation better? Journal of Monetary Economics, 54(4):1163--1212.
  3. Bell, D. E. (1982). Regret in decision making under uncertainty. Operations Research, 30(5):961--981.
  4. Breiman, L. (2001). Random forests. Machine Learning, 45(1):5--32.
  5. Bybee, L. (2023). Surveying generative AI's economic expectations. arXiv preprint arXiv:2305.02823.
  6. Carriero, A., Pettenuzzo, D., and Shekhar, S. (2024). Macroeconomic forecasting with large language models. arXiv preprint arXiv:2407.00890.
  7. Chekhlov, A., Uryasev, S., and Zabarankin, M. (2005). Drawdown measure in portfolio optimization. International Journal of Theoretical and Applied Finance, 8(1):13--58.
  8. Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266--298.
  9. Christoffersen, P. F. and Diebold, F. X. (1997). Optimal prediction under asymmetric loss. Econometric Theory, 13(6):808--817.
  10. Clark, T. E., Huber, F., Koop, G., Marcellino, M., and Pfarrhofer, M. (2023). Tail forecasting with multivariate Bayesian additive regression trees. International Economic Review, 64(3):979--1022.
  11. Cong, L. W., Tang, K., Wang, J., and Zhang, Y. (2021). AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI. SSRN Working Paper. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3554486.
  12. Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253--263.
  13. Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society: Series B, 78(3):505--562.
  14. Engelberg, J., Manski, C. F., and Williams, J. (2009). Comparing the point predictions and subjective probability distributions of professional forecasters. Journal of Business & Economic Statistics, 27(1):30--41.
  15. Faust, J. and Wright, J. H. (2013). Forecasting inflation. In Handbook of Economic Forecasting, volume 2A, pages 2--56. Elsevier.
  16. Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6):1545--1578.
  17. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746--762.
  18. Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359--378.
  19. Goulet Coulombe, P. (2025a). A neural Phillips curve and a deep output gap. Journal of Business & Economic Statistics, 43(3):669--683.
  20. Goulet Coulombe, P. (2025b). To bag is to prune. Studies in Nonlinear Dynamics & Econometrics, 29(6):669--697.
  21. Goulet Coulombe, P. (2026). LGB+: A macroeconomic forecasting road test. Working paper.
  22. Goulet Coulombe, P., Frenette, M., and Klieber, K. (2026). From reactive to proactive volatility modeling with hemisphere neural networks. Journal of Applied Econometrics, forthcoming.
  23. Goulet Coulombe, P., Göbel, M., and Klieber, K. (2025). Dual interpretation of machine learning forecasts. Working paper.
  24. Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2022). How is machine learning useful for macroeconomic forecasting? Journal of Applied Econometrics, 37(5):920--964.
  25. Granger, C. W. J. (1999). Outline of forecast theory using generalized cost functions. Spanish Economic Review, 1(2):161--173.
  26. Granger, C. W. J. and Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19(7):537--560.
  27. Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171--188.
  28. Hansen, P. R. (2005). A test for superior predictive ability. Journal of Business & Economic Statistics, 23(4):365--380.
  29. Hansen, P. R., Lunde, A., and Nason, J. M. (2011). The model confidence set. Econometrica, 79(2):453--497.
  30. Harvey, D., Leybourne, S., and Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2):281--291.
  31. Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). TabPFN: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848.
  32. Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408--423.
  33. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, pages 3146--3154.
  34. Keating, C. and Shadwick, W. F. (2002). A universal performance measure. Journal of Performance Measurement, 6(3):59--84.
  35. Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster's dilemma: Extreme events and forecast evaluation. Statistical Science, 32(1):106--127.
  36. Loomes, G. and Sugden, R. (1982). Regret theory: An alternative theory of rational choice under uncertainty. The Economic Journal, 92(368):805--824.
  37. Magdon-Ismail, M. and Atiya, A. F. (2004). Maximum drawdown. Risk, 17(10):99--102.
  38. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2018). The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802--808.
  39. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54--74.
  40. McCracken, M. W. and Ng, S. (2020). FRED-QD: A quarterly database for macroeconomic research. Working Paper 2020-005, Federal Reserve Bank of St. Louis.
  41. Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á., and Zilberman, E. (2021). Forecasting inflation in a data-rich environment: The benefits of machine learning methods. Journal of Business & Economic Statistics, 39(1):98--119.
  42. Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4):819--847.
  43. Patton, A. J. and Timmermann, A. (2007). Properties of optimal forecasts under asymmetric loss and nonlinearity. Journal of Econometrics, 140(2):884--918.
  44. Rossi, B. (2021). Forecasting in the presence of instabilities: How we know whether models predict well and how to improve them. Journal of Economic Literature, 59(4):1135--1190.
  45. Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181--1191.
  46. Savage, L. J. (1951). The theory of statistical decision. Journal of the American Statistical Association, 46(253):55--67.
  47. Sharpe, W. F. (1966). Mutual fund performance. Journal of Business, 39(1):119--138.
  48. Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21(1):49--58.
  49. Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75--85.
  50. Sortino, F. A. and Price, L. N. (1994). Performance measurement in a downside risk framework. Journal of Investing, 3(3):59--64.
  51. Sortino, F. A. and van der Meer, R. (1991). Downside risk. Journal of Portfolio Management, 17(4):27--31.
  52. Stark, T. (2010). Realistic evaluation of real-time forecasts in the Survey of Professional Forecasters. Federal Reserve Bank of Philadelphia Research Rap Special Report, pages 1--20.
  53. Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167--1179.
  54. West, K. D. (1996). Asymptotic inference about predictive ability. Econometrica, 64(5):1067--1084.
  55. West, K. D. (2006). Forecast evaluation. In Elliott, G., Granger, C. W. J., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 1, chapter 3, pages 99--134. Elsevier.
  56. White, H. (2000). A reality check for data snooping. Econometrica, 68(5):1097--1126.