
arxiv: 2606.22719 · v1 · pith:UU77QKL5new · submitted 2026-06-21 · 💱 q-fin.ST · cs.AI

Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking

Pith reviewed 2026-06-26 09:12 UTC · model grok-4.3

classification 💱 q-fin.ST cs.AI
keywords LLM forecasting · information leakage · equity factor ranking · macro nowcasts · Spearman rank IC · retrieval-augmented · inflation nowcast · kNN baseline

The pith

A retrieval-augmented LLM using only decision-time macro inputs reaches a median monthly Spearman rank IC of +0.154 for equity factor ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a retrieval-augmented LLM for ranking equity style factors while enforcing strict decision-time information constraints to avoid leakage. At each month-end the system sees only lag-shifted macro variables, macro-event summaries, and an archived CPI nowcast. It retrieves similar historical macro states, uses a critic LLM to derive a tactical rule, and has an actor LLM assign scores to seven factors. This yields a median monthly Spearman rank information coefficient of +0.154. A non-LLM kNN model using the same inputs recovers a comparable median IC, suggesting that the real-time nowcast and analog retrieval explain much of the median signal.

Core claim

The full pipeline obtains a median monthly Spearman rank IC of +0.154, with positive means across three non-overlapping contiguous 12-month subwindows; the mean IC remains statistically underpowered, with a bootstrap 95% confidence interval that includes zero. Non-LLM baselines under the same decision-time constraint demonstrate that a kNN macro-analog model recovers a comparable median IC, indicating that real-time inflation information and macro-similar retrieval explain much of the median signal. The LLM pipeline retains higher mean IC and a stronger long-short allocation sanity check, suggesting that any marginal benefit is concentrated in the extreme rankings that drive long-short portfolio formation.
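The headline statistic is a rank correlation computed once per month and then summarized across months. The following is a minimal sketch of that calculation, assuming the IC is Spearman's rho between the seven factor scores and the factors' realized next-month returns; the tie handling, the zero-variance convention, and all numbers are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, median


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_ic(scores, realized_returns):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rs, rr = average_ranks(scores), average_ranks(realized_returns)
    ms, mr = mean(rs), mean(rr)
    cov = sum((a - ms) * (b - mr) for a, b in zip(rs, rr))
    var_s = sum((a - ms) ** 2 for a in rs)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_s == 0 or var_r == 0:
        return 0.0  # convention assumed here when one vector is constant
    return cov / (var_s * var_r) ** 0.5


# Made-up scores and next-month returns for seven factors (illustration only).
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.4]
returns = [0.021, -0.004, 0.010, 0.002, 0.015, -0.001, 0.006]
ic_perfect = spearman_ic(scores, returns)
ic_reversed = spearman_ic(scores, [-r for r in returns])

# Across months, the summary statistics are simply median(...) and mean(...)
# of the list of monthly IC values.
monthly_summary = (median([ic_perfect, ic_reversed]), mean([ic_perfect, ic_reversed]))
```

With only seven factors per month, each monthly IC takes values on a coarse grid, which is one reason the median and mean of the monthly series can tell different stories.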

What carries the argument

A macro-analog retrieval module that selects historical states, a critic LLM that compresses them into one tactical rule, and an actor LLM that maps the current state and recent rules into scores for seven U.S. equity style factors.
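The retrieval step, and the kNN baseline that reuses it, can be sketched as a nearest-neighbour lookup over past macro states. This is a rough illustration under stated assumptions: Euclidean distance, `k=3`, and scoring by average next-month factor return are all hypothetical choices, and the function names are invented here. The only detail taken from the page is that analogs must lie at least 12 months before the decision month.

```python
import math


def months_between(earlier, later):
    """Whole months from (year, month) `earlier` to (year, month) `later`."""
    return (later[0] - earlier[0]) * 12 + (later[1] - earlier[1])


def retrieve_analogs(history, current_month, current_state, k=3, embargo=12):
    """Return the k historical months whose state is closest to current_state.

    history maps (year, month) -> state vector (list of floats).
    Only months at least `embargo` months before current_month are eligible.
    """
    eligible = [
        (math.dist(state, current_state), month)
        for month, state in history.items()
        if months_between(month, current_month) >= embargo
    ]
    eligible.sort()
    return [month for _, month in eligible[:k]]


def knn_factor_scores(analogs, next_month_returns):
    """Score each factor by its mean return in the month following each analog."""
    n_factors = len(next_month_returns[analogs[0]])
    return [
        sum(next_month_returns[m][f] for m in analogs) / len(analogs)
        for f in range(n_factors)
    ]


# Made-up two-dimensional macro states and two factors (the paper uses seven).
history = {
    (2020, 1): [0.0, 0.0],
    (2021, 6): [1.0, 1.0],
    (2022, 3): [0.9, 1.1],
    (2023, 1): [1.0, 1.0],  # identical state, but inside the 12-month embargo
}
analogs = retrieve_analogs(history, (2023, 4), [1.0, 1.0], k=2)
next_month_returns = {(2021, 6): [0.02, -0.01], (2022, 3): [0.00, 0.03]}
baseline_scores = knn_factor_scores(analogs, next_month_returns)
```

The embargo matters for leakage: the return that follows an analog month must itself be fully realized before the decision date, which a 12-month gap guarantees for one-month-ahead returns.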

If this is right

  • Real-time inflation nowcasts and macro-analog retrieval largely account for the observed median ranking performance.
  • The LLM pipeline shows advantages in average IC and long-short allocation effectiveness compared to the kNN baseline.
  • The mean IC across months is statistically indistinguishable from zero based on bootstrap confidence intervals.
  • The mean IC is positive in each of three non-overlapping contiguous 12-month subwindows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same leakage-controlled setup to other forecasting tasks like individual stock selection could test the generality of the findings.
  • Replacing the kNN baseline with additional simple statistical models might further isolate any contribution from the LLM critic and actor stages.
  • The results imply that benchmarks for LLM financial applications must document decision-time observability to avoid overstating model capabilities.

Load-bearing premise

The Cleveland Fed's archived daily CPI nowcast for the unreleased current-month inflation is strictly observable at month-end decision time with no residual leakage from later revisions or publication timing.
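One way to make this premise checkable is an explicit as-of lookup that refuses to return any nowcast vintage published after the decision date. The sketch below assumes the archive can be represented as `(publication_date, value)` pairs; that layout, the function names, and the numbers are hypothetical and do not describe the Cleveland Fed's actual file format.

```python
from datetime import date


def nowcast_as_of(vintages, decision_date):
    """Return (publication_date, value) for the latest vintage published on or
    before decision_date; raise if no vintage was observable yet."""
    usable = [(d, v) for d, v in vintages if d <= decision_date]
    if not usable:
        raise ValueError("no nowcast vintage observable at decision time")
    return max(usable, key=lambda pair: pair[0])


def audit_no_lookahead(selected, decision_dates):
    """True if every selected vintage was published by its decision date."""
    return all(pub <= dec for (pub, _), dec in zip(selected, decision_dates))


# Made-up vintages of a current-month CPI nowcast (illustration only).
vintages = [
    (date(2023, 4, 26), 0.41),
    (date(2023, 4, 28), 0.43),
    (date(2023, 5, 2), 0.39),  # published after month-end, must not be used
]
decision = date(2023, 4, 28)
selected = nowcast_as_of(vintages, decision)
audit_ok = audit_no_lookahead([selected], [decision])
```

Logging the selected publication date for each month-end yields exactly the kind of per-observation table the referee report asks for.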

What would settle it

A replication showing a median IC of zero or below when the pipeline is rerun with the CPI nowcast replaced by its lagged value or omitted entirely would indicate that the reported signal depends on that specific real-time input.

Figures

Figures reproduced from arXiv: 2606.22719 by Mao Guan, Qian Chen.

Figure 1: Leakage-controlled monthly decision pipeline. At each month-end t, the system constructs only information observable at t: lag-shifted FRED variables, archived Cleveland Fed CPI nowcasts, and recent macro-event summaries. Historical analogs are retrieved from months at least 12 months before t. A critic compresses the analog evidence into one rule; the actor maps the current state and recent rules into sev…
Figure 2: Monthly rank IC for the two main analog-based methods. Points show raw monthly IC; thick lines show 3-month rolling averages for readability. Summary statistics in …
Figure 3: LLM ablation summary. Points show mean and median monthly rank IC for the four ablation rows. The largest observed median-IC change occurs when the Cleveland Fed nowcast is added.
Figure 4: Cumulative net return of the long-top-2 / short-bottom-2 sanity check (monthly rebalanced, 5 bps per unit weight change), following the methodology of …
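The sanity check behind Figure 4 can be approximated as follows. This is a sketch under assumptions the caption does not settle: equal weights of ±0.5 on each of the two long and two short factors, ties broken by factor order, cost charged on the sum of absolute weight changes, and returns compounded monthly. Only the top-2/bottom-2 rule, monthly rebalancing, and the 5 bps cost come from the caption.

```python
def long_short_weights(scores, n_side=2):
    """+1/n_side on the n_side highest-scored factors, -1/n_side on the lowest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    weights = [0.0] * len(scores)
    for i in order[-n_side:]:
        weights[i] = 1.0 / n_side
    for i in order[:n_side]:
        weights[i] = -1.0 / n_side
    return weights


def net_returns(monthly_scores, monthly_returns, cost_bps=5.0):
    """Monthly long-short return minus a cost per unit of weight change."""
    prev = [0.0] * len(monthly_scores[0])  # assume the book starts flat
    out = []
    for scores, rets in zip(monthly_scores, monthly_returns):
        w = long_short_weights(scores)
        gross = sum(wi * ri for wi, ri in zip(w, rets))
        turnover = sum(abs(wi - pi) for wi, pi in zip(w, prev))
        out.append(gross - turnover * cost_bps / 1e4)
        prev = w
    return out


def cumulative_return(returns):
    """Compound a list of simple returns into one cumulative return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Made-up scores and returns for seven factors over two months.
scores = [[7, 6, 5, 4, 3, 2, 1], [7, 6, 5, 4, 3, 2, 1]]
rets = [[0.02, 0.01, 0.0, 0.0, 0.0, -0.01, -0.02], [0.0] * 7]
net = net_returns(scores, rets)
cumulative = cumulative_return(net)
```

In this toy run the first month pays the full cost of opening the book, and the second month costs nothing because the ranking, and hence the weights, did not change.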
Original abstract

Forecasting benchmarks for retrieval-augmented LLMs routinely confound model capability with information leakage: features labeled with a target's timestamp are often not observable at the system's decision time. We study leakage-controlled equity factor ranking with a retrieval-augmented 7B open-source LLM forecaster. At each month-end from 2023-04 to 2026-03, the forecaster observes only decision-time information: lag-shifted FRED macro variables, recent macro-event summaries, and the Cleveland Fed's archived daily CPI nowcast for unreleased current-month inflation. A macro-analog retrieval module selects historical states, a critic LLM compresses them into one tactical rule, and an actor LLM maps the current state and recent rules into scores for seven U.S. equity style factors. The full pipeline obtains a median monthly Spearman rank IC of +0.154, with positive means across three non-overlapping contiguous 12-month subwindows; the mean IC remains statistically underpowered, with a bootstrap 95% confidence interval that includes zero. Non-LLM baselines under the same decision-time constraint demonstrate that a kNN macro-analog model recovers a comparable median IC, indicating that real-time inflation information and macro-similar retrieval explain much of the median signal. The LLM pipeline retains higher mean IC and a stronger long-short allocation sanity check, suggesting that any marginal benefit is concentrated in the extreme rankings that drive long-short portfolio formation. A descriptive audit of the 36 critic rules and per-month case studies appears in the appendix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates a retrieval-augmented 7B LLM pipeline for monthly U.S. equity factor ranking under a strict decision-time information constraint. At each month-end (2023-04 to 2026-03), the system uses only lag-shifted FRED macro variables, macro-event summaries, and the Cleveland Fed's archived daily CPI nowcast for the unreleased current-month inflation. A macro-analog retrieval module, critic LLM, and actor LLM produce scores for seven style factors. The pipeline reports a median monthly Spearman rank IC of +0.154 (with positive mean IC in each of three contiguous 12-month subwindows), while the mean IC is statistically underpowered (bootstrap 95% CI includes zero). A kNN baseline under identical constraints recovers a comparable median IC, while the LLM shows higher mean IC and stronger long-short allocation performance. The authors conclude that real-time inflation information and macro-analog retrieval explain much of the median signal, with any LLM marginal benefit concentrated in extreme rankings.

Significance. If the leakage-control premise holds, the work supplies a concrete, reproducible benchmark that isolates the contribution of real-time macro data versus model architecture in LLM forecasting for finance. The direct kNN comparison, sub-period consistency checks, and long-short sanity test are strengths; the result that non-LLM retrieval recovers most of the median IC is a falsifiable, policy-relevant finding for the design of future LLM benchmarks.

major comments (2)
  1. [Data section] Data section (description of Cleveland Fed nowcast series): the central empirical claim (median IC +0.154 and attribution to real-time inflation) rests on every nowcast observation being strictly observable at month-end with zero residual leakage from later revisions or publication timestamps. The manuscript must supply an explicit verification—e.g., a table or code snippet showing the exact archive date for each of the 36 month-end observations and confirmation that no post-month-end revisions entered the backtest. Absent this, the leakage-control premise is unestablished and both LLM and kNN results are potentially contaminated.
  2. [Results] Results, paragraph on mean IC and bootstrap CI: the reported median IC of +0.154 is the headline number, yet the mean IC is underpowered (bootstrap 95% CI includes zero). The paper should state the exact bootstrap procedure (number of resamples, block length if time-series aware, and whether the CI is for the mean or median) and report the corresponding CI for the median to allow readers to assess whether the median result itself is statistically distinguishable from zero.
minor comments (2)
  1. [Abstract] Abstract and §3: the phrase 'the Cleveland Fed's archived daily CPI nowcast for unreleased current-month inflation' should be accompanied by a one-sentence clarification of the precise cutoff (e.g., 'values published by 23:59 ET on the last trading day of the prior month').
  2. [Appendix] Appendix (36 critic rules): the descriptive audit is useful but would benefit from a short table mapping each rule to the month it was generated and the subsequent realized factor ranking to facilitate replication.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing rigorous verification of leakage control and statistical transparency. We address each major comment below.

  1. Referee: [Data section] Data section (description of Cleveland Fed nowcast series): the central empirical claim (median IC +0.154 and attribution to real-time inflation) rests on every nowcast observation being strictly observable at month-end with zero residual leakage from later revisions or publication timestamps. The manuscript must supply an explicit verification—e.g., a table or code snippet showing the exact archive date for each of the 36 month-end observations and confirmation that no post-month-end revisions entered the backtest. Absent this, the leakage-control premise is unestablished and both LLM and kNN results are potentially contaminated.

    Authors: We agree that explicit verification of nowcast observability at decision time is required to substantiate the leakage-control premise. In the revised manuscript we will add a table to the Data section listing the archive date for each of the 36 Cleveland Fed nowcast observations together with confirmation that no post-month-end revisions entered the backtest. revision: yes

  2. Referee: [Results] Results, paragraph on mean IC and bootstrap CI: the reported median IC of +0.154 is the headline number, yet the mean IC is underpowered (bootstrap 95% CI includes zero). The paper should state the exact bootstrap procedure (number of resamples, block length if time-series aware, and whether the CI is for the mean or median) and report the corresponding CI for the median to allow readers to assess whether the median result itself is statistically distinguishable from zero.

    Authors: We will expand the Results section to fully specify the bootstrap procedure (number of resamples, block length for time-series dependence, and that the CI applies to the mean) and will additionally report the bootstrap 95% CI for the median IC. revision: yes
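The procedure the authors commit to specifying is not given on this page. As one possible reading, a dependence-aware bootstrap for both the mean and the median of the monthly IC series could look like the sketch below. The stationary bootstrap, the mean block length of 3, the 2000 resamples, the percentile interval, and the function names are all assumptions made for illustration, not the paper's stated method.

```python
import random
from statistics import mean, median


def stationary_bootstrap_sample(series, mean_block, rng):
    """One resample: blocks start at random positions and have geometrically
    distributed lengths with the given mean; indices wrap around the end."""
    n = len(series)
    idx = rng.randrange(n)
    sample = []
    for _ in range(n):
        sample.append(series[idx])
        if rng.random() < 1.0 / mean_block:
            idx = rng.randrange(n)  # start a new block
        else:
            idx = (idx + 1) % n     # continue the current block
    return sample


def bootstrap_ci(series, stat, n_resamples=2000, mean_block=3, alpha=0.05, seed=0):
    """Percentile confidence interval for stat(series)."""
    rng = random.Random(seed)
    draws = sorted(
        stat(stationary_bootstrap_sample(series, mean_block, rng))
        for _ in range(n_resamples)
    )
    lo = draws[int((alpha / 2) * n_resamples)]
    hi = draws[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up monthly IC series (illustration only, not the paper's data).
ics = [0.25, -0.10, 0.40, 0.15, -0.30, 0.20, 0.05, 0.35, -0.15, 0.10, 0.30, -0.05]
ci_mean = bootstrap_ci(ics, mean)
ci_median = bootstrap_ci(ics, median)
```

Reporting both intervals side by side would let a reader judge whether the median IC is distinguishable from zero, which is the question the referee raises about the headline number.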

Circularity Check

0 steps flagged

No circularity: empirical IC is measured against future returns


The paper's central result is a descriptive empirical statistic (median Spearman rank IC of +0.154) computed from backtested rankings against subsequent realized returns under an explicitly leakage-controlled information set. No derivation chain exists that reduces this measurement to a fitted parameter, self-definition, or self-citation by construction. The kNN baseline comparison and LLM pipeline are both evaluated on the same external target (future returns), with no ansatz, uniqueness theorem, or renaming of known results invoked as load-bearing. The leakage assumption is a validity concern, not a circularity mechanism. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical benchmarking study with no mathematical derivation or parameter fitting described in the abstract; no free parameters, axioms, or invented entities are required to support the reported performance numbers.


