arxiv: 2604.21433 · v1 · submitted 2026-04-23 · 💱 q-fin.GN

Recognition: unknown

ChatGPT as a Time Capsule: The Limits of Price Discovery

Alejandro Lopez-Lira, Sebastian Lehner

Pith reviewed 2026-05-08 12:55 UTC · model grok-4.3

classification 💱 q-fin.GN

keywords LLMprice discoveryequity returnsanalyst revisionstextual informationmarket efficiencyfrozen checkpointsoutlook scores

0 comments

The pith

Frozen LLM checkpoints extract outlook scores from past public text that predict future analyst revisions, target prices, and stock returns after standard valuation controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats twelve OpenAI model snapshots from 2021-2025 as fixed time capsules containing only the public textual record available up to each cutoff date. From these, it derives sector-neutral outlook scores for about 7,000 U.S. equities and shows these scores correlate with subsequent analyst revisions, target-price changes, and one-month cross-sectional returns. The associations survive Fama-MacBeth regressions and pooled panels that include model fixed effects plus direct controls for market-implied valuations and common factors, with a t-statistic of 6.02. Predictability rises with longer horizons and is stronger among firms that already have high analyst coverage, pointing to aggregation costs for dispersed qualitative information rather than investor inattention as the key friction.

Core claim

Frozen large language model checkpoints serve as compressed representations of public textual information at specific past dates; when prompted to produce sector-neutral outlook scores for equities, these scores are positively associated with future analyst revisions, target-price changes, and one-month cross-sectional returns in regressions that control for contemporaneous market-implied valuations and standard factors.

What carries the argument

The LLM outlook score, obtained by prompting each frozen checkpoint on firm-related public text to produce a forward-looking assessment that is then normalized within sectors.

If this is right

The association with returns strengthens at longer horizons even after an intermediate dip.
The signal is larger for high-analyst-coverage firms, consistent with the bottleneck being the cost of synthesizing many documents.
LLM checkpoints can serve as objective, time-stamped summaries of the public textual record for testing information aggregation.
Standard valuation measures leave measurable qualitative text information unpriced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Markets may systematically underweight qualitative signals that require reading across many scattered sources.
Repeating the exercise with non-OpenAI models or older checkpoints could test whether the result depends on specific training data or architecture.
If aggregation costs are the limit, improvements in LLM summarization might gradually reduce the observed predictability.

Load-bearing premise

The outlook scores capture genuine incremental qualitative information from the public record that markets have not yet fully incorporated, rather than arising from prompting choices or data-construction decisions.

What would settle it

Finding that the predictive coefficients become statistically insignificant when the same analysis is repeated with model snapshots whose cutoffs fall after the forecast return window.

Figures

Figures reproduced from arXiv: 2604.21433 by Alejandro Lopez-Lira, Sebastian Lehner.

**Figure 1.** Figure 1: Outlook-score coefficient γ across return horizons. Panel A shows the singlemodel specification (GPT-4.1); Panel B shows the pooled panel with model fixed effects. Dark bars indicate significance at the 5% level. Despite an intermediate-horizon dip in Panel A, the overall trend is positive and significant (Spearman rank correlations reported in the insets). 5.4.2 Cross-Model Comparison view at source ↗

**Figure 2.** Figure 2: Outlook-score coefficient γ at τ = 1 month for each of the twelve model checkpoints. Bars are colour-coded by knowledge cutoff date. Black outlines indicate significance at the 5% level. Dashed vertical lines separate cutoff clusters. 31 view at source ↗

read the original abstract

Frozen large language model (LLM) checkpoints extract information from pre-cutoff public text that is associated with future fundamentals and equity returns beyond standard contemporaneous valuation measures. Because each frozen checkpoint has a fixed knowledge cutoff, it can be interpreted as a compressed representation of publicly available textual information at a given point in time. We treat twelve OpenAI snapshots spanning 2021-2025 as time-stamped summaries of the public textual record and extract a sector-neutral LLM outlook score for roughly 7,000 U.S. equities per cross-section. The outlook score is positively associated with analyst revisions, target-price changes and one-month cross-sectional returns in both Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02), after direct controls for market-implied valuation and standard factors. Predictability broadly increases with the return horizon, despite a non-monotonic intermediate dip, and, in the pooled panel, is stronger for firms with high analyst coverage, consistent with the view that the bottleneck is not investor inattention but the cost of aggregating dispersed qualitative information across many documents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Using dated LLM checkpoints as time capsules yields predictive outlook scores, but prompting robustness is unaddressed.

read the letter

The paper's main contribution is showing that you can pull equity outlook scores from successive frozen ChatGPT checkpoints and those scores forecast analyst revisions and cross-sectional returns beyond valuation controls. The t-stat of 6 in the regressions is the headline number. What the authors do well is set up the time capsule idea cleanly. By using models with known cutoffs from 2021-2025 on 7000 stocks, they create a panel where the score reflects only info available up to that point. The increase in predictability over longer horizons and the stronger effect for high analyst coverage stocks both line up with the story about information aggregation costs. The soft spot is the extraction process itself. The abstract gives no information on the prompt wording, how the numerical score is derived from the model's response, or any checks that the result survives prompt changes or different aggregation methods. If the association disappears under a slightly different question, then the result may not reflect limits to price discovery so much as a quirk in how this particular model was queried. The stress test concern about artifacts holds until those tests appear. This work is aimed at people studying market efficiency with alternative data or AI tools. A reader interested in quant strategies or textual analysis in finance would get something out of it, provided the methods section fills in the gaps. I would send it to peer review. The framing is original enough and the basic result is sharp, but it needs the robustness material to stand up.

Referee Report

2 major / 2 minor

Summary. The manuscript treats twelve frozen OpenAI LLM checkpoints (2021-2025) as time-stamped compressions of public textual information. For each cross-section it extracts a sector-neutral outlook score across roughly 7,000 U.S. equities and shows that these scores are positively associated with subsequent analyst revisions, target-price changes, and one-month cross-sectional returns. The associations survive Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02) after direct controls for market-implied valuation and standard factors. Predictability rises with horizon (with a non-monotonic dip) and is stronger among high-analyst-coverage firms, which the authors interpret as evidence that aggregation costs, rather than inattention, limit price discovery.

Significance. If the central empirical result is robust, the paper supplies a novel, temporally structured test of whether markets fully incorporate dispersed qualitative public information. The use of multiple frozen checkpoints provides a clean way to hold the information set fixed at different dates while varying the aggregator; the finding that predictability strengthens with analyst coverage supplies direct support for an aggregation-cost channel. These elements, together with the large cross-section and explicit valuation controls, make the work a useful contribution to the literature on information processing and limits to arbitrage.

major comments (2)

[Section 3 (Methodology)] The description of the LLM prompting procedure (exact template, output parsing rule, temperature, and sector-neutralization method) is insufficiently detailed. Because the outlook score is the sole novel regressor, any systematic bias introduced by the chosen prompt framing or scale anchoring could generate the reported t = 6.02 coefficient independently of the underlying text content. The manuscript should report invariance checks under prompt rephrasing and alternative aggregation rules.
[Section 4 (Results)] Table 2 and the associated Fama-MacBeth specifications report a t-statistic of 6.02 but do not list the full set of controls, the precise construction of the market-implied valuation measure, or any adjustment for multiple testing across horizons and sub-samples. Without these elements it is impossible to verify that the incremental predictive power is not an artifact of omitted variables or data-construction choices.

minor comments (2)

[Abstract and Section 4] The abstract states that predictability 'broadly increases with the return horizon' yet notes a 'non-monotonic intermediate dip'; a figure or table showing the full term structure of coefficients would clarify this pattern.
[Section 5 (Conclusion)] The paper would benefit from a short discussion of whether the results are sensitive to the choice of OpenAI checkpoints versus other LLM families, even if only as a robustness footnote.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to improve the transparency and robustness of our methodology and results presentation. We address each major comment in turn and will incorporate the suggested clarifications and checks into the revised manuscript.

read point-by-point responses

Referee: [Section 3 (Methodology)] The description of the LLM prompting procedure (exact template, output parsing rule, temperature, and sector-neutralization method) is insufficiently detailed. Because the outlook score is the sole novel regressor, any systematic bias introduced by the chosen prompt framing or scale anchoring could generate the reported t = 6.02 coefficient independently of the underlying text content. The manuscript should report invariance checks under prompt rephrasing and alternative aggregation rules.

Authors: We agree that greater detail on the prompting procedure is warranted to ensure reproducibility and to rule out prompt-specific artifacts. In the revised manuscript we will include the exact prompt template, the deterministic temperature setting of 0, the full output parsing rules, and a precise description of the sector-neutralization algorithm. We have already performed limited invariance tests with rephrased prompts and alternative aggregation (e.g., mean versus median outlook scores); these checks will be reported in a new appendix and confirm that the main coefficient remains positive and statistically significant. revision: yes
Referee: [Section 4 (Results)] Table 2 and the associated Fama-MacBeth specifications report a t-statistic of 6.02 but do not list the full set of controls, the precise construction of the market-implied valuation measure, or any adjustment for multiple testing across horizons and sub-samples. Without these elements it is impossible to verify that the incremental predictive power is not an artifact of omitted variables or data-construction choices.

Authors: We will expand the reporting as requested. The revised Table 2 will explicitly enumerate every control variable. We will add a dedicated paragraph in Section 3 describing the exact construction of the market-implied valuation measure (the residual from a cross-sectional regression of log market capitalization on log book equity, earnings, and other standard fundamentals). For multiple testing, the headline t = 6.02 is obtained from the single pooled panel specification with model fixed effects; we will note this and supply Bonferroni-adjusted p-values for the horizon-specific and sub-sample tests in the revised tables. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical test of LLM-extracted scores via standard regressions

full rationale

The paper extracts sector-neutral outlook scores by applying fixed prompting to frozen pre-cutoff LLM checkpoints on public text, then estimates associations with future analyst revisions, target prices, and returns using Fama-MacBeth and fixed-effects regressions that include explicit controls for market valuation and standard factors. No equation or step defines the score in terms of the target returns or revisions, nor renames a fitted parameter as a prediction; the regressions test rather than construct the claimed incremental association. The procedure is independent of the outcome data by design, with no self-citation load-bearing the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5488 in / 1167 out tokens · 49853 ms · 2026-05-08T12:55:52.545994+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 4 canonical work pages

[1]

S., Moskowitz, T

Asness, C. S., Moskowitz, T. J., and Pedersen, L. H. (2013). Value and momentum everywhere. The Journal of Finance , 68(3):929--985

2013
[2]

and Brown, P

Ball, R. and Brown, P. (1968). An empirical evaluation of accounting income numbers. Journal of Accounting Research , 6(2):159--178

1968
[3]

Barberis, N., Shleifer, A., and Vishny, R. (1998). A model of investor sentiment. Journal of Financial Economics , 49(3):307--343

1998
[4]

Bernard, V. L. and Thomas, J. K. (1989). Post-earnings-announcement drift: delayed price response or risk premium?. Journal of Accounting Research , 27:1--36

1989
[5]

C., Gelbach, J

Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-Based Improvements for Inference with Clustered Errors. The Review of Economics and Statistics , 90(3):414--427

2008
[6]

Cao, S., Jiang, W., Wang, J., and Yang, B. (2024). From man vs. machine to man+ machine: The art and AI of stock analyses. Journal of Financial Economics , 160:103910

2024
[7]

Da, Z., Engelberg, J., and Gao, P. (2011). In search of attention. The Journal of Finance , 66(5):1461--1499

2011
[8]

and Pollet, J

DellaVigna, S. and Pollet, J. M. (2009). Investor inattention and Friday earnings announcements. The Journal of Finance , 64(2):709--749

2009
[9]

Didisheim, A., Fraschini, M., and Somoza, L. (2025). AI's predictable memory in financial analysis. Economics Letters :112602

2025
[10]

Driscoll, J. C. and Kraay, A. C. (1998). Consistent covariance matrix estimation with spatially dependent panel data. The Review of Economics and Statistics , 80(4):549--560

1998
[11]

Easton, P. D. (2004). PE ratios, PEG ratios, and estimating the implied expected rate of return on equity capital. The Accounting Review , 79(1):73--95

2004
[12]

Fama, E. F. (1970). Efficient capital markets. The Journal of Finance , 25(2):383--417

1970
[13]

Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests. Journal of Political Economy , 81(3):607--636

1973
[14]

Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. The Journal of Finance , 47(2):427--465

1992
[15]

R., Lee, C

Gebhardt, W. R., Lee, C. M., and Swaminathan, B. (2001). Toward an implied cost of capital. Journal of Accounting Research , 39(1):135--176

2001
[16]

Grossman, S. J. and Stiglitz, J. E. (1980). On the impossibility of informationally efficient markets. The American Economic Review , 70(3):393--408

1980
[17]

and Teoh, S

Hirshleifer, D. and Teoh, S. H. (2003). Limited attention, information disclosure, and financial reporting. Journal of Accounting and Economics , 36(1-3):337--386

2003
[18]

S., and Teoh, S

Hirshleifer, D., Lim, S. S., and Teoh, S. H. (2009). Driven to distraction: Extraneous events and underreaction to earnings news. The Journal of Finance , 64(5):2289--2325

2009
[19]

S., and Teoh, S

Hirshleifer, D., Lim, S. S., and Teoh, S. H. (2011). Limited investor attention and stock market misreactions to accounting information. The Review of Asset Pricing Studies , 1(1):35--73

2011
[20]

and Stein, J

Hong, H. and Stein, J. C. (1999). A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of Finance , 54(6):2143--2184

1999
[21]

Hong, H., Lim, T., and Stein, J. C. (2000). Bad news travels slowly: Size, analyst coverage, and the profitability of momentum strategies. The Journal of Finance , 55(1):265--295

2000
[22]

and Titman, S

Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance , 48(1):65--91

1993
[23]

Kim, A., Muhn, M., and Nikolaev, V. (2024). Financial statement analysis with large language models. arXiv preprint arXiv:2407.17866

work page arXiv 2024
[24]

Lazaridou, A., Kuncoro, A., Gribovskaya, E., Agrawal, D., Liska, A., Terzi, T., Gimenez, M., de Masson d'Autume, C., Kocisky, T., Ruder, S., and et al. (2021). Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems , 34:29348--29363

2021
[25]

Lehner, S. (2024). The Future of Accounting: AI-Driven Tone Manipulation in SEC Filings. Available at SSRN 4984337

2024
[26]

Liu, J., Nissim, D., and Thomas, J. (2002). Equity valuation using multiples. Journal of Accounting Research , 40(1):135--172

2002
[27]

Lopez-Lira and Y

Lopez-Lira, A. and Tang, Y. (2023). Can chatgpt forecast stock price movements? return predictability and large language models. arXiv preprint arXiv:2304.07619

work page arXiv 2023
[28]

Lopez-Lira, A., Tang, Y., and Zhu, M. (2025). The Memorization Problem: Can We Trust LLMs' Economic Forecasts?. arXiv preprint arXiv:2504.14765

work page arXiv 2025
[29]

and McDonald, B

Loughran, T. and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. The Journal of Finance , 66(1):35--65

2011
[30]

Luu, K., Khashabi, D., Gururangan, S., Mandyam, K., and Smith, N. A. (2021). Time waits for no one! analysis and challenges of temporal misalignment. arXiv preprint arXiv:2111.07408

work page arXiv 2021
[31]

McLean, R. D. and Pontiff, J. (2016). Does academic research destroy stock return predictability?. The Journal of Finance , 71(1):5--32

2016
[32]

Miller, B. P. (2010). The effects of reporting complexity on small and large investor trading. The Accounting Review , 85(6):2107--2143

2010
[33]

Novy-Marx, R. (2013). The other side of value: The gross profitability premium. Journal of Financial Economics , 108(1):1--28

2013
[34]

Rusticus, T. O. (2019). Market inefficiency and implied cost of capital models. Available at SSRN 1933713

2019
[35]

Schwert, G. W. (2003). Anomalies and market efficiency. Handbook of the Economics of Finance , 1:939--974

2003
[36]

Shiller, R. J. (2020). Narrative economics: How stories go viral and drive major economic events . Princeton University Press

2020
[37]

and Vishny, R

Shleifer, A. and Vishny, R. W. (1997). The limits of arbitrage. The Journal of Finance , 52(1):35--55

1997
[38]

Sims, C. A. (2003). Implications of rational inattention. Journal of Monetary Economics , 50(3):665--690

2003
[39]

Sloan, R. G. (1996). Do stock prices fully reflect information in accruals and cash flows about future earnings?. The Accounting Review :289--315

1996
[40]

Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance , 62(3):1139--1168

2007
[41]

C., Saar-Tsechansky, M., and Macskassy, S

Tetlock, P. C., Saar-Tsechansky, M., and Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. The Journal of Finance , 63(3):1437--1467

2008
[42]

H., Han, X., and Lopez-Lira, A

Van Binsbergen, J. H., Han, X., and Lopez-Lira, A. (2023). Man versus machine learning: The term structure of earnings expectations and conditional biases. The Review of Financial Studies , 36(6):2361--2396

2023
[43]

V., Zhou, D., and et al

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., and et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems , 35:24824--24837

2022