Quantifying the Risk-Return Tradeoff in Forecasting
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3
The pith
Treating forecast loss differentials as returns shows professional forecasters are hard to beat on risk-adjusted measures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By mapping forecast loss differentials to a return series, the author shows that risk-adjusted performance measures from finance indicate that forecasters in the Survey of Professional Forecasters rarely produce catastrophic errors and frequently post high Edge Ratios, even as selected machine learning models achieve competitive risk profiles on particular variables.
What carries the argument
The conversion of forecast loss differentials relative to a benchmark into a return series, which then supports calculation of the Sharpe ratio, Sortino ratio, Omega ratio, drawdown statistics, and the Edge Ratio that quantifies unique predictive value outside the current frontier.
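The mapping and metrics described above can be sketched in a few lines. The paper's exact conventions (sign of the differential, any scaling or annualization, the precise drawdown definition) are not reproduced on this page, so the formulas below are common textbook versions, not the author's implementation.

```python
import numpy as np

def loss_differential_returns(bench_losses, model_losses):
    """Treat per-period loss differentials d_t = L(benchmark) - L(model)
    as a 'return' series: positive when the model beats the benchmark."""
    return np.asarray(bench_losses, dtype=float) - np.asarray(model_losses, dtype=float)

def sharpe(d):
    # Mean outperformance per unit of total variability.
    return d.mean() / d.std(ddof=1)

def sortino(d, target=0.0):
    # Penalizes only downside deviations below the target.
    downside = np.minimum(d - target, 0.0)
    return (d.mean() - target) / np.sqrt((downside ** 2).mean())

def omega(d, threshold=0.0):
    # Ratio of average gains above the threshold to average losses below it.
    gains = np.maximum(d - threshold, 0.0).mean()
    losses = np.maximum(threshold - d, 0.0).mean()
    return gains / losses

def max_drawdown(d):
    # Largest peak-to-trough drop of the cumulative 'wealth' path of d.
    cum = np.cumsum(d)
    peak = np.maximum.accumulate(cum)
    return (peak - cum).max()
```

Positive differentials mean the model beat the benchmark that period, so a high Sharpe ratio here rewards consistent, low-variance outperformance rather than occasional large wins.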
If this is right
- Professional forecasters maintain high Edge Ratios that plausibly reflect the value of contextual judgment.
- Selected machine learning methods deliver attractive risk profiles for specific macroeconomic targets.
- Beating professional forecasters on average accuracy does not automatically translate to superiority on risk-adjusted bases.
- The framework supports unified meta-analyses across targets, horizons, samples, density forecasts, and competitions such as M4.
Where Pith is reading between the lines
- The same loss-to-return mapping could be used to compare forecasting approaches in domains outside macroeconomics, such as finance or energy demand.
- If high Edge Ratios stem from human judgment, hybrid systems that blend model output with professional oversight might raise both accuracy and risk-adjusted scores.
- Extending the metrics to explicitly model the time-series dependence in forecast errors could alter which methods appear safest.
- Large-scale application across many datasets might identify whether certain model classes minimize downside risk more consistently than others.
Load-bearing premise
Forecast loss differentials relative to a benchmark can be treated directly as a return series to which standard financial risk-adjusted performance measures apply without further modification for the statistical properties of forecasting errors.
What would settle it
Recalculating the Sharpe and Edge Ratios on a fresh macroeconomic sample after adjusting the loss differentials for serial correlation and heteroskedasticity would reverse the ranking that favors professional forecasters.
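The adjustment described above amounts to replacing the naive variance in a Sharpe-style statistic with a HAC (Newey-West) long-run variance, as in Diebold-Mariano tests. The bandwidth rule of thumb and the function names below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def hac_variance(d, lags=None):
    """Newey-West (Bartlett-kernel) long-run variance of d, a standard
    adjustment when loss differentials are serially correlated."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    if lags is None:
        lags = int(np.floor(4 * (n / 100) ** (2 / 9)))  # common rule of thumb
    dc = d - d.mean()
    v = (dc @ dc) / n  # lag-0 autocovariance
    for k in range(1, lags + 1):
        w = 1.0 - k / (lags + 1)  # Bartlett weight
        v += 2.0 * w * (dc[:-k] @ dc[k:]) / n
    return v

def hac_sharpe_tstat(d):
    """t-statistic for mean(d) > 0 with HAC variance (Diebold-Mariano style).
    With positive serial correlation, this shrinks relative to the naive Sharpe."""
    d = np.asarray(d, dtype=float)
    return d.mean() / np.sqrt(hac_variance(d) / len(d))
```

With `lags=0` this collapses back to the unadjusted variance; positive autocorrelation in the loss differentials inflates the long-run variance and deflates the statistic, which is exactly the channel that could reorder the rankings.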
Original abstract
Average forecast accuracy is not the same as forecast reliability. I treat forecast loss differentials relative to a benchmark as a return series. I then evaluate these returns using risk-adjusted performance measures from finance, including the Sharpe ratio, Sortino ratio, Omega ratio, and drawdown-based metrics. I also introduce the Edge Ratio capturing a model's propensity to deliver uniquely informative predictions relative to the forecasting frontier. I apply this framework to U.S. macroeconomic forecasting, comparing econometric benchmarks, machine learning models, a foundation model (TabPFN), and the Survey of Professional Forecasters. While it is often feasible to beat professional forecasters in terms of average accuracy, it is much harder to beat them on a risk-adjusted basis. They rarely exhibit catastrophic failures and often achieve high Edge Ratios, plausibly reflecting the value of contextual judgment. Nonetheless, selected machine learning methods deliver attractive risk profiles for specific targets. The framework naturally extends to meta-analyses across targets, horizons, and samples, illustrated with a density forecast evaluation and the M4 competition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reframes forecast evaluation by treating loss differentials relative to a benchmark as a return series and applies standard financial risk-adjusted metrics (Sharpe, Sortino, Omega ratios, drawdowns) plus a new Edge Ratio that measures propensity for uniquely informative predictions relative to a forecasting frontier. Applied to U.S. macro targets, it compares econometric benchmarks, ML models, TabPFN, and the Survey of Professional Forecasters (SPF), finding that average-accuracy gains by ML are common but risk-adjusted outperformance is rarer; SPF exhibits fewer catastrophic failures and higher Edge Ratios, plausibly due to contextual judgment. The framework is illustrated on density forecasts and the M4 competition.
Significance. If the risk metrics can be validly adapted to forecast-error series, the work supplies a useful new evaluation lens that privileges reliability and tail-risk avoidance over raw MSE/MAE, helping explain why professional forecasters persist despite point-accuracy shortfalls. Explicit credit is due for the parameter-light extension to meta-analyses across horizons/targets/samples and for the reproducible-style application to M4 and density forecasts; these features make the contribution more than a one-off empirical exercise.
Major comments (3)
- [Methodology section defining the return series and risk measures] The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).
- [Section introducing the Edge Ratio] The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.
- [Empirical results section (SPF comparison)] Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.
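The frontier circularity flagged in the second major comment can be made concrete. The paper's actual Edge Ratio definition is not reproduced on this page, so the version below, the share of held-out periods in which a model strictly beats the best frontier rival, with frontier membership chosen on a disjoint training window, is a hypothetical reconstruction of the out-of-sample safeguard a revision might adopt.

```python
import numpy as np

def edge_ratio_oos(model_losses, rival_losses, split=0.5):
    """Hypothetical Edge Ratio sketch with an out-of-sample frontier.
    Frontier membership is selected on the first part of the sample;
    the ratio is computed only on the held-out remainder, so the
    frontier is never fit on the data it is evaluated against."""
    model = np.asarray(model_losses, dtype=float)
    rivals = np.asarray(rival_losses, dtype=float)  # shape: (n_rivals, T)
    cut = int(len(model) * split)
    # Simple frontier proxy: keep rivals within 10% of the best in-sample mean loss.
    in_sample_means = rivals[:, :cut].mean(axis=1)
    keep = in_sample_means <= in_sample_means.min() * 1.1
    # Per-period best surviving rival, evaluated out of sample.
    frontier = rivals[keep][:, cut:].min(axis=0)
    return float((model[cut:] < frontier).mean())
```

The 10% tolerance and the 50/50 split are arbitrary illustration choices; the point is only that estimating membership and computing the ratio on disjoint windows removes the self-referential step.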
Minor comments (2)
- [Abstract] The abstract states that the framework 'naturally extends' to meta-analyses but does not illustrate the aggregation rule (e.g., how Edge Ratios are pooled across targets); a short clarifying sentence would improve readability.
- [Notation and definitions] Notation for the loss-differential series should be introduced once and used consistently; currently the mapping from forecast error to 'return' is described only informally.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, clarifying our approach and indicating revisions that will strengthen the manuscript's methodological rigor and empirical transparency.
Point-by-point responses
- Referee: The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).
  Authors: We acknowledge that macro forecast errors frequently exhibit serial correlation, heteroskedasticity, and overlapping origins, which can influence the interpretation of standard risk-adjusted metrics. In the revised manuscript, we will incorporate HAC standard errors for the Sharpe, Sortino, and Omega ratios, discuss the implications of overlapping forecast origins, and add a robustness section applying pre-whitening to the loss differential series. While these metrics are used primarily for model ranking and comparison rather than formal hypothesis testing, the relative orderings remain informative; the additions will better align the analysis with the data properties. Revision: yes.
- Referee: The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.
  Authors: The referee rightly flags the risk of circularity in the Edge Ratio. We will revise the methodology section to provide the exact construction details of the forecasting frontier and introduce an out-of-sample safeguard: the frontier will be estimated on a training subsample, with the Edge Ratio then computed on a held-out evaluation period. This change will allow us to substantiate the SPF results without circularity while preserving the measure's comparative value. Revision: yes.
- Referee: Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.
  Authors: We agree that explicit verification is needed for the SPF superiority claims. The revised manuscript will include a new table or figure directly comparing SPF and ML models on Edge Ratio and maximum drawdown. We will also add text and footnotes confirming that loss series are pre-whitened where appropriate and that HAC standard errors are used for rankings and comparisons. These updates will provide the requested transparency and reinforce the reliability conclusion. Revision: yes.
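The pre-whitening promised in the rebuttal can be sketched as a simple AR(1) filter applied to the loss-differential series before the risk metrics are recomputed. The AR order and the least-squares estimator here are assumptions; the revision's exact procedure is not specified on this page.

```python
import numpy as np

def prewhiten_ar1(d):
    """Remove first-order serial correlation from a loss-differential series
    by filtering with an estimated AR(1) coefficient: e_t = d_t - rho * d_{t-1}.
    A minimal sketch; a revision might instead use a richer AR(p) filter
    or rely solely on HAC variance estimates."""
    d = np.asarray(d, dtype=float)
    dc = d - d.mean()
    # Least-squares estimate of the lag-1 autoregressive coefficient.
    rho = (dc[:-1] @ dc[1:]) / (dc[:-1] @ dc[:-1])
    return d[1:] - rho * d[:-1], rho
```

After filtering, the residual series has (approximately) uncorrelated increments, so Sharpe-style ratios computed on it come closer to satisfying the iid assumption the referee questioned.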
Circularity Check
No significant circularity detected
Full rationale
The paper introduces an original reframing of forecast loss differentials as a return series and applies standard financial risk metrics plus a newly defined Edge Ratio relative to a forecasting frontier. All central claims follow from direct empirical application to U.S. macro data and model comparisons (SPF, ML, benchmarks). No load-bearing step reduces by construction to its own inputs: there are no self-definitional equations, no fitted parameters relabeled as out-of-sample predictions, and no uniqueness theorems or ansatzes imported solely via self-citation chains. The framework remains falsifiable through the reported performance rankings and is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: forecast loss differentials relative to a benchmark can be treated as a return series.
Invented entities (1)
- Edge Ratio (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: "I treat forecast loss differentials relative to a benchmark as a return series. I then evaluate these returns using risk-adjusted performance measures from finance, including the Sharpe ratio, Sortino ratio, Omega ratio, and drawdown-based metrics. I also introduce the Edge Ratio..."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear: "The Omega ratio captures the full distribution of gains versus losses... Ω = Average Upside / Average Downside"