pith. machine review for the scientific record.

arxiv: 2605.09712 · v1 · submitted 2026-05-10 · 💰 econ.EM · q-fin.PM · stat.ML

Recognition: 2 Lean theorem links

Quantifying the Risk-Return Tradeoff in Forecasting

Philippe Goulet Coulombe

Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3

classification 💰 econ.EM · q-fin.PM · stat.ML
keywords forecast evaluation · risk-adjusted performance · macroeconomic forecasting · professional forecasters · machine learning · edge ratio · sharpe ratio · forecast reliability

The pith

Treating forecast loss differentials as returns shows professional forecasters are hard to beat on risk-adjusted measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes forecast evaluation by converting the loss gap between any model and a benchmark into a return series, then applying finance-style risk metrics that measure not just average accuracy but reliability and downside exposure. When this is done for U.S. macroeconomic targets, many machine learning and econometric models improve on raw error relative to the Survey of Professional Forecasters, yet the professionals keep the better risk profile: fewer large failures and higher Edge Ratios, the paper's measure of unique informativeness. This matters because practical decisions often penalize occasional large misses more than they reward small average gains. The same mapping also permits direct comparisons across dozens of targets, horizons, and samples, including density forecasts and the M4 competition.
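
Stated minimally, with notation of our own choosing (the abstract leaves the mapping informal): for a loss function $L$, benchmark forecast $\hat{y}_{b,t}$, and model forecast $\hat{y}_{m,t}$,

    d_t = L(y_t, \hat{y}_{b,t}) - L(y_t, \hat{y}_{m,t}), \qquad \mathrm{SR}_m = \bar{d} / \hat{\sigma}_d,

so $d_t > 0$ whenever the model beats the benchmark that period. Note that the Sharpe ratio of the $d_t$ series is, up to a $\sqrt{T}$ scaling and the choice of variance estimator, the familiar Diebold-Mariano statistic.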

Core claim

By mapping forecast loss differentials to a return series, the author shows that finance-style risk-adjusted performance measures indicate forecasters in the Survey of Professional Forecasters rarely produce catastrophic errors and frequently post high Edge Ratios, even as selected machine learning models achieve competitive risk profiles on particular variables.

What carries the argument

The conversion of forecast loss differentials relative to a benchmark into a return series, which then supports calculation of the Sharpe ratio, Sortino ratio, Omega ratio, drawdown statistics, and the Edge Ratio that quantifies unique predictive value outside the current frontier.
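
To make the machinery concrete, here is a minimal sketch assuming squared-error loss and the plain textbook definitions of the ratios; the paper's exact conventions are not given in the abstract, and the Edge Ratio is omitted because its frontier construction is paper-specific:

    import numpy as np

    def loss_to_returns(e_bench, e_model):
        # Period-t "return": benchmark loss minus model loss under squared
        # error, so positive values mean the model beat the benchmark.
        return np.asarray(e_bench) ** 2 - np.asarray(e_model) ** 2

    def sharpe(d):
        return d.mean() / d.std(ddof=1)

    def sortino(d, target=0.0):
        downside = np.minimum(d - target, 0.0)   # only below-target periods
        return (d.mean() - target) / np.sqrt((downside ** 2).mean())

    def omega(d, threshold=0.0):
        gains = np.clip(d - threshold, 0.0, None).sum()
        losses = np.clip(threshold - d, 0.0, None).sum()
        return gains / losses

    def max_drawdown(d):
        wealth = np.cumsum(d)                    # cumulative edge vs. benchmark
        peak = np.maximum.accumulate(wealth)
        return (peak - wealth).max()

Ranking models by sharpe(loss_to_returns(e_bench, e_model)) rather than by mean squared error is, in essence, the paper's central move.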

If this is right

  • Professional forecasters maintain high Edge Ratios that plausibly reflect the value of contextual judgment.
  • Selected machine learning methods deliver attractive risk profiles for specific macroeconomic targets.
  • Beating professional forecasters on average accuracy does not automatically translate to superiority on risk-adjusted bases.
  • The framework supports unified meta-analyses across targets, horizons, samples, density forecasts, and competitions such as M4.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loss-to-return mapping could be used to compare forecasting approaches in domains outside macroeconomics, such as finance or energy demand.
  • If high Edge Ratios stem from human judgment, hybrid systems that blend model output with professional oversight might raise both accuracy and risk-adjusted scores.
  • Extending the metrics to explicitly model the time-series dependence in forecast errors could alter which methods appear safest.
  • Large-scale application across many datasets might identify whether certain model classes minimize downside risk more consistently than others.

Load-bearing premise

Forecast loss differentials relative to a benchmark can be treated directly as a return series to which standard financial risk-adjusted performance measures apply without further modification for the statistical properties of forecasting errors.

What would settle it

Whether recalculating the Sharpe and Edge Ratios on a fresh macroeconomic sample, after adjusting the loss differentials for serial correlation and heteroskedasticity, preserves or reverses the ranking that favors professional forecasters.
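
Concretely, the adjustment could take the form of a Diebold-Mariano-style statistic with a Newey-West (HAC) variance replacing the naive one. A minimal sketch, with the Bartlett truncation max_lag as an illustrative choice:

    import numpy as np

    def nw_tstat(d, max_lag=4):
        # t-statistic for mean(d) using a Newey-West (HAC) long-run variance,
        # robust to serial correlation and heteroskedasticity in the
        # loss-differential series d.
        d = np.asarray(d, dtype=float)
        T = d.size
        u = d - d.mean()
        lrv = (u @ u) / T                        # lag-0 autocovariance
        for lag in range(1, max_lag + 1):
            w = 1.0 - lag / (max_lag + 1.0)      # Bartlett kernel weight
            lrv += 2.0 * w * (u[lag:] @ u[:-lag]) / T
        return d.mean() / np.sqrt(lrv / T)

Re-ranking models on nw_tstat(d) instead of the raw Sharpe ratio of d, on a sample the paper never touched, is the kind of check that would settle the question either way.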

original abstract

Average forecast accuracy is not the same as forecast reliability. I treat forecast loss differentials relative to a benchmark as a return series. I then evaluate these returns using risk-adjusted performance measures from finance, including the Sharpe ratio, Sortino ratio, Omega ratio, and drawdown-based metrics. I also introduce the Edge Ratio capturing a model's propensity to deliver uniquely informative predictions relative to the forecasting frontier. I apply this framework to U.S. macroeconomic forecasting, comparing econometric benchmarks, machine learning models, a foundation model (TabPFN), and the Survey of Professional Forecasters. While it is often feasible to beat professional forecasters in terms of average accuracy, it is much harder to beat them on a risk-adjusted basis. They rarely exhibit catastrophic failures and often achieve high Edge Ratios, plausibly reflecting the value of contextual judgment. Nonetheless, selected machine learning methods deliver attractive risk profiles for specific targets. The framework naturally extends to meta-analyses across targets, horizons, and samples, illustrated with a density forecast evaluation and the M4 competition.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reframes forecast evaluation by treating loss differentials relative to a benchmark as a return series and applies standard financial risk-adjusted metrics (Sharpe, Sortino, Omega ratios, drawdowns) plus a new Edge Ratio that measures propensity for uniquely informative predictions relative to a forecasting frontier. Applied to U.S. macro targets, it compares econometric benchmarks, ML models, TabPFN, and the Survey of Professional Forecasters (SPF), finding that average-accuracy gains by ML are common but risk-adjusted outperformance is rarer; SPF exhibits fewer catastrophic failures and higher Edge Ratios, plausibly due to contextual judgment. The framework is illustrated on density forecasts and the M4 competition.

Significance. If the risk metrics can be validly adapted to forecast-error series, the work supplies a useful new evaluation lens that privileges reliability and tail-risk avoidance over raw MSE/MAE, helping explain why professional forecasters persist despite point-accuracy shortfalls. Explicit credit is due for the parameter-light extension to meta-analyses across horizons/targets/samples and for the reproducible-style application to M4 and density forecasts; these features make the contribution more than a one-off empirical exercise.

major comments (3)
  1. [Methodology section defining the return series and risk measures] The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).
  2. [Section introducing the Edge Ratio] The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.
  3. [Empirical results section (SPF comparison)] Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.
minor comments (2)
  1. [Abstract] The abstract states that the framework 'naturally extends' to meta-analyses but does not illustrate the aggregation rule (e.g., how Edge Ratios are pooled across targets); a short clarifying sentence would improve readability. One hypothetical pooling rule is sketched after these comments.
  2. [Notation and definitions] Notation for the loss-differential series should be introduced once and used consistently; currently the mapping from forecast error to 'return' is described only informally.
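
On the aggregation question, nothing in the abstract pins down a rule, so the following is purely a hypothetical illustration of what pooling loss-differential "returns" across targets could look like: standardize each target's series so no single target dominates, stack them, and score the pooled series.

    import numpy as np

    def pooled_sharpe(diff_series):
        # diff_series: list of loss-differential arrays, one per target/horizon.
        # Hypothetical rule, not the paper's: standardize each series, stack,
        # and compute one Sharpe ratio for the pooled collection of "edges".
        z = [np.asarray(d) / np.asarray(d).std(ddof=1) for d in diff_series]
        pooled = np.concatenate(z)
        return pooled.mean() / pooled.std(ddof=1)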

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, clarifying our approach and indicating revisions that will strengthen the manuscript's methodological rigor and empirical transparency.

point-by-point responses
  1. Referee: The central claim—that SPF is harder to beat on risk-adjusted metrics—rests on direct application of Sharpe/Sortino/Omega and drawdown statistics to loss differentials. No modification for serial correlation, heteroskedasticity, or overlapping forecast origins is described, yet macro forecast errors routinely violate the iid-increments assumption required for these ratios to retain their usual interpretation (see the skeptic note on persistent shocks).

    Authors: We acknowledge that macro forecast errors frequently exhibit serial correlation, heteroskedasticity, and overlapping origins, which can influence the interpretation of standard risk-adjusted metrics. In the revised manuscript, we will incorporate HAC standard errors for the Sharpe, Sortino, and Omega ratios, discuss the implications of overlapping forecast origins, and add a robustness section applying pre-whitening to the loss differential series (see the sketch after these responses). While these metrics are used primarily for model ranking and comparison rather than formal hypothesis testing, the relative orderings remain informative; the additions will better align the analysis with the data properties. revision: yes

  2. Referee: The Edge Ratio is defined relative to a 'forecasting frontier' whose construction is data-dependent; if the frontier is itself estimated from the same sample used to compute the ratio, the measure risks circularity and the claim that SPF 'often achieve high Edge Ratios' cannot be assessed without the precise definition and any out-of-sample safeguards.

    Authors: The referee rightly flags the risk of circularity in the Edge Ratio. We will revise the methodology section to provide the exact construction details of the forecasting frontier and introduce an out-of-sample safeguard: the frontier will be estimated on a training subsample, with the Edge Ratio then computed on a held-out evaluation period. This change will allow us to substantiate the SPF results without circularity while preserving the measure's comparative value. revision: yes

  3. Referee: Table or figure reporting SPF vs. ML rankings on Edge Ratio and maximum drawdown: the reported superiority of SPF on these metrics is load-bearing for the 'rarely exhibit catastrophic failures' conclusion, yet the abstract supplies no verification that the loss series were pre-whitened or that HAC standard errors were used when ranking models.

    Authors: We agree that explicit verification is needed for the SPF superiority claims. The revised manuscript will include a new table or figure directly comparing SPF and ML models on Edge Ratio and maximum drawdown. We will also add text and footnotes confirming that loss series are pre-whitened where appropriate and that HAC standard errors are used for rankings and comparisons. These updates will provide the requested transparency and reinforce the reliability conclusion. revision: yes
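
For the pre-whitening promised in response 1, one common convention is a mean-preserving AR(1) filter of the kind used to unsmooth serially correlated return series; a minimal sketch, not the authors' stated procedure:

    import numpy as np

    def prewhiten_ar1(d):
        # Fit d_t = c + rho * d_{t-1} + eps_t by OLS, then apply the
        # mean-preserving filter d*_t = (d_t - rho * d_{t-1}) / (1 - rho).
        d = np.asarray(d, dtype=float)
        x, y = d[:-1], d[1:]
        rho = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
        return (y - rho * x) / (1.0 - rho)       # assumes |rho| < 1

Recomputing the risk ratios on prewhiten_ar1(d) rather than d is then a one-line robustness check.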

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an original reframing of forecast loss differentials as a return series and applies standard financial risk metrics plus a newly defined Edge Ratio relative to a forecasting frontier. All central claims follow from direct empirical application to U.S. macro data and model comparisons (SPF, ML, benchmarks). No load-bearing step reduces by construction to its own inputs: there are no self-definitional equations, no fitted parameters relabeled as out-of-sample predictions, and no uniqueness theorems or ansatzes imported solely via self-citation chains. The framework remains falsifiable through the reported performance rankings and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on treating loss differentials as returns and introducing the Edge Ratio; no free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption: Forecast loss differentials relative to a benchmark can be treated as a return series.
    This is the foundational step stated in the abstract for applying financial performance measures.
invented entities (1)
  • Edge Ratio (no independent evidence)
    purpose: Capturing a model's propensity to deliver uniquely informative predictions relative to the forecasting frontier.
    Newly introduced in the paper as an additional evaluation metric.

pith-pipeline@v0.9.0 · 5473 in / 1409 out tokens · 44882 ms · 2026-05-12T03:46:09.009553+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages

  1. Alam, M. J., Boyle, S., Li, H., and Sekhposyan, T. (2025). ChatMacro: Evaluating inflation forecasts of generative AI. Working Paper 2025-13, Federal Reserve Bank of San Francisco.
  2. Ang, A., Bekaert, G., and Wei, M. (2007). Do macro variables, asset markets, or surveys forecast inflation better? Journal of Monetary Economics, 54(4):1163--1212.
  3. Bell, D. E. (1982). Regret in decision making under uncertainty. Operations Research, 30(5):961--981.
  4. Breiman, L. (2001). Random forests. Machine Learning, 45(1):5--32.
  5. Bybee, L. (2023). Surveying generative AI's economic expectations. arXiv preprint arXiv:2305.02823.
  6. Carriero, A., Pettenuzzo, D., and Shekhar, S. (2024). Macroeconomic forecasting with large language models. arXiv preprint arXiv:2407.00890.
  7. Chekhlov, A., Uryasev, S., and Zabarankin, M. (2005). Drawdown measure in portfolio optimization. International Journal of Theoretical and Applied Finance, 8(1):13--58.
  8. Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266--298.
  9. Christoffersen, P. F. and Diebold, F. X. (1997). Optimal prediction under asymmetric loss. Econometric Theory, 13(6):808--817.
  10. Clark, T. E., Huber, F., Koop, G., Marcellino, M., and Pfarrhofer, M. (2023). Tail forecasting with multivariate Bayesian additive regression trees. International Economic Review, 64(3):979--1022.
  11. Cong, L. W., Tang, K., Wang, J., and Zhang, Y. (2021). AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI. SSRN Working Paper. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3554486.
  12. Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253--263.
  13. Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society: Series B, 78(3):505--562.
  14. Engelberg, J., Manski, C. F., and Williams, J. (2009). Comparing the point predictions and subjective probability distributions of professional forecasters. Journal of Business & Economic Statistics, 27(1):30--41.
  15. Faust, J. and Wright, J. H. (2013). Forecasting inflation. In Handbook of Economic Forecasting, volume 2A, pages 2--56. Elsevier.
  16. Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6):1545--1578.
  17. Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746--762.
  18. Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359--378.
  19. Goulet Coulombe, P. (2025a). A neural Phillips curve and a deep output gap. Journal of Business & Economic Statistics, 43(3):669--683.
  20. Goulet Coulombe, P. (2025b). To bag is to prune. Studies in Nonlinear Dynamics & Econometrics, 29(6):669--697.
  21. Goulet Coulombe, P. (2026). LGB+: A macroeconomic forecasting road test. Working paper.
  22. Goulet Coulombe, P., Frenette, M., and Klieber, K. (2026). From reactive to proactive volatility modeling with hemisphere neural networks. Journal of Applied Econometrics, forthcoming.
  23. Goulet Coulombe, P., Göbel, M., and Klieber, K. (2025). Dual interpretation of machine learning forecasts. Working paper.
  24. Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2022). How is machine learning useful for macroeconomic forecasting? Journal of Applied Econometrics, 37(5):920--964.
  25. Granger, C. W. J. (1999). Outline of forecast theory using generalized cost functions. Spanish Economic Review, 1(2):161--173.
  26. Granger, C. W. J. and Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19(7):537--560.
  27. Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171--188.
  28. Hansen, P. R. (2005). A test for superior predictive ability. Journal of Business & Economic Statistics, 23(4):365--380.
  29. Hansen, P. R., Lunde, A., and Nason, J. M. (2011). The model confidence set. Econometrica, 79(2):453--497.
  30. Harvey, D., Leybourne, S., and Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2):281--291.
  31. Hollmann, N., Müller, S., Eggensperger, K., and Hutter, F. (2022). TabPFN: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848.
  32. Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408--423.
  33. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, pages 3146--3154.
  34. Keating, C. and Shadwick, W. F. (2002). A universal performance measure. Journal of Performance Measurement, 6(3):59--84.
  35. Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster's dilemma: Extreme events and forecast evaluation. Statistical Science, 32(1):106--127.
  36. Loomes, G. and Sugden, R. (1982). Regret theory: An alternative theory of rational choice under uncertainty. The Economic Journal, 92(368):805--824.
  37. Magdon-Ismail, M. and Atiya, A. F. (2004). Maximum drawdown. Risk, 17(10):99--102.
  38. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2018). The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802--808.
  39. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2020). The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54--74.
  40. McCracken, M. W. and Ng, S. (2020). FRED-QD: A quarterly database for macroeconomic research. Working Paper 2020-005, Federal Reserve Bank of St. Louis.
  41. Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á., and Zilberman, E. (2021). Forecasting inflation in a data-rich environment: The benefits of machine learning methods. Journal of Business & Economic Statistics, 39(1):98--119.
  42. Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55(4):819--847.
  43. Patton, A. J. and Timmermann, A. (2007). Properties of optimal forecasts under asymmetric loss and nonlinearity. Journal of Econometrics, 140(2):884--918.
  44. Rossi, B. (2021). Forecasting in the presence of instabilities: How we know whether models predict well and how to improve them. Journal of Economic Literature, 59(4):1135--1190.
  45. Salinas, D., Flunkert, V., Gasthaus, J., and Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181--1191.
  46. Savage, L. J. (1951). The theory of statistical decision. Journal of the American Statistical Association, 46(253):55--67.
  47. Sharpe, W. F. (1966). Mutual fund performance. Journal of Business, 39(1):119--138.
  48. Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21(1):49--58.
  49. Smyl, S. (2020). A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75--85.
  50. Sortino, F. A. and Price, L. N. (1994). Performance measurement in a downside risk framework. Journal of Investing, 3(3):59--64.
  51. Sortino, F. A. and van der Meer, R. (1991). Downside risk. Journal of Portfolio Management, 17(4):27--31.
  52. Stark, T. (2010). Realistic evaluation of real-time forecasts in the Survey of Professional Forecasters. Federal Reserve Bank of Philadelphia Research Rap Special Report, pages 1--20.
  53. Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167--1179.
  54. West, K. D. (1996). Asymptotic inference about predictive ability. Econometrica, 64(5):1067--1084.
  55. West, K. D. (2006). Forecast evaluation. In Elliott, G., Granger, C. W. J., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 1, chapter 3, pages 99--134. Elsevier.
  56. White, H. (2000). A reality check for data snooping. Econometrica, 68(5):1097--1126.