The Oracle's Fingerprint: Correlated AI Forecasting Errors and the Limits of Bias Transmission
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
Three independently developed LLMs display highly correlated errors when forecasting binary events, yet human crowd forecasts show no added bias traceable to them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-4o, Claude, and Gemini exhibit a mean pairwise forecasting-error correlation of r = 0.77 on 568 resolved binary prediction questions despite independent development. Human community forecasts moved in LLM-predicted directions after November 2022, yet this movement is fully explained by rational updating toward ground truth. Human forecasting biases already matched the LLM pattern strongly before ChatGPT (r = 0.87), and the resemblance weakened afterward (r = -0.28), suggesting that AI systems amplify biases humans already hold rather than transmitting new ones.
What carries the argument
Pairwise error correlation on resolved binary forecasting questions, measured across LLMs and tracked in human community predictions via a within-question pre/post-ChatGPT comparison.
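The core measurement can be sketched in a few lines. This is a minimal illustration on synthetic data (a shared bias component plus model-specific noise standing in for the paper's 568 questions); the signed-error definition and the generating model are assumptions, not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # synthetic stand-in for the paper's 568 resolved questions

outcome = rng.integers(0, 2, n).astype(float)   # resolved 0/1 outcomes
shared_bias = rng.normal(0, 0.15, n)            # distortion common to all models

# Hypothetical forecasts: each model sees the same shared bias plus its own noise.
forecasts = {
    m: np.clip(0.2 + 0.6 * outcome + shared_bias + rng.normal(0, 0.05, n),
               0.01, 0.99)
    for m in ("model_a", "model_b", "model_c")
}

# Signed forecasting error: predicted probability minus realized outcome.
errors = {m: f - outcome for m, f in forecasts.items()}

# Mean pairwise Pearson correlation of errors across the three models.
names = list(errors)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
rs = [float(np.corrcoef(errors[a], errors[b])[0, 1]) for a, b in pairs]
mean_r = float(np.mean(rs))
print(f"mean pairwise error correlation: {mean_r:.2f}")
```

Because the shared bias dominates the idiosyncratic noise here, the pairwise error correlations come out high, which is the signature the paper reports across real models.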
If this is right
- Consulting multiple LLMs may not increase forecast diversity as much as expected because their errors align.
- Human forecasters appear to extract useful information from LLMs without adopting the models' specific error patterns.
- The pre-existing match between human and LLM biases suggests AIs largely reproduce patterns already present in training data.
- Epistemic monoculture among AIs could still limit long-run collective intelligence even without direct bias transmission to humans.
Where Pith is reading between the lines
- Persistent similarity in training data sources may sustain or increase these error correlations as models are updated.
- Targeted interventions to diversify model outputs on forecasting tasks could be tested by measuring whether they reduce cross-model error correlations.
- Prediction platforms that restrict LLM use might show different error evolution if the current rational-updating pattern holds.
- The results invite checking whether the same correlation pattern appears in other domains beyond binary geopolitical or economic forecasts.
Load-bearing premise
The observed error correlations among LLMs reflect genuine shared biases in their internal reasoning rather than overlapping training data or leakage of the tested questions.
What would settle it
A finding that error correlations drop sharply when restricting analysis to questions with no plausible public discussion or data before the models' training cutoffs.
Original abstract
When large language models (LLMs) are consulted as forecasting tools, the independence of individual errors -- the foundation of collective intelligence -- may collapse. We test three conditions necessary for this "epistemic monoculture" to emerge. In Study 1, we show that GPT-4o, Claude, and Gemini exhibit highly correlated forecasting errors on 568 resolved binary prediction questions (mean pairwise error correlation r = 0.77, p < 0.001; r = 0.78 excluding likely-leaked questions), despite being developed independently by different organizations. In Study 2, we test whether this correlated bias has propagated into human crowd forecasts, using a within-question design that tracks community prediction shifts across the ChatGPT launch boundary (November 2022). We find that community forecasts move in the direction predicted by LLMs (r = 0.20, p = 0.007), but this shift is fully explained by rational updating toward ground truth. In Study 3, we examine whether the category-level pattern of human forecasting errors increasingly resembles the LLM bias fingerprint. We find the opposite: pre-ChatGPT human biases already strongly resembled the LLM pattern (r = 0.87), while post-ChatGPT the resemblance weakened (r = -0.28). Together, these findings reveal an epistemic monoculture that is built but not yet activated: three nominally independent AI systems share the same failure modes, amplifying precisely the biases humans already hold.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that three major LLMs (GPT-4o, Claude, and Gemini) exhibit highly correlated forecasting errors on 568 resolved binary prediction questions (mean pairwise r = 0.77, p < 0.001; r = 0.78 after excluding likely-leaked questions), indicating an epistemic monoculture despite independent development. Using a within-question pre/post-ChatGPT launch (November 2022) design on human crowd forecasts, it reports that community predictions shift toward LLM directions (r = 0.20, p = 0.007) but this is fully explained by rational updating to ground truth; additionally, human error patterns already resembled the LLM bias fingerprint pre-ChatGPT (r = 0.87) and weakened afterward (r = -0.28), suggesting the monoculture is built but not yet activated.
Significance. If the results hold after clarification of methods, the paper offers a timely empirical contribution to AI forecasting and collective intelligence by documenting shared LLM failure modes on resolved questions with ground truth, while showing no clear transmission of those biases to human forecasters. The pre/post design, leakage exclusion, and direct comparison of error patterns provide a falsifiable test, with no circular derivations or fitted parameters that reduce to assumptions. This strengthens understanding of the risks of AI monocultures and the resilience of human biases.
major comments (3)
- [Study 1] Study 1 / Methods: The procedure for identifying and excluding 'likely-leaked questions' is not described in detail (e.g., exact string match, semantic similarity, or temporal cutoff relative to training data). Because the headline correlation remains high after exclusion (r = 0.78), this detail is load-bearing for distinguishing shared training data overlap from independent biases.
- [Study 2] Study 2: The within-question design tracks community forecast shifts across the November 2022 boundary, but the manuscript does not report explicit controls for contemporaneous events, information sources, or selection effects in the crowd platform data. This leaves open whether the observed r = 0.20 shift can be attributed specifically to LLM exposure rather than general information availability.
- [Study 3] Study 3: The central claim that human biases already matched the LLM pattern pre-ChatGPT (r = 0.87) and diverged afterward (r = -0.28) relies on category-level error patterns, yet the number of categories, exact error metric (signed vs. absolute), and statistical tests for the change are not fully specified. These details are needed to evaluate whether the weakening supports the 'not yet activated' interpretation.
minor comments (3)
- [Abstract] Abstract and Methods: The source and sampling frame for the 568 resolved binary questions (e.g., platform, time window, resolution criteria) should be stated explicitly to allow assessment of generalizability and potential selection bias.
- [Methods] Throughout: Clarify the precise definition of 'forecasting error' used for all correlations (e.g., absolute deviation from ground truth probability or signed error) and report whether results are robust to alternative definitions.
- [Figures] Figures: Ensure all panels include sample sizes, confidence intervals, and clear pre/post labels so that the r = 0.20 and r = 0.87 values can be directly interpreted without returning to the text.
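The robustness check requested in the second minor comment can be sketched directly: compute the cross-model error correlation under both a signed and an absolute error definition and compare. The data below are synthetic and hypothetical; only the two definitions are taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, n).astype(float)   # resolved 0/1 outcomes
shared = rng.normal(0, 0.1, n)            # bias component common to both models

def forecast():
    # Hypothetical model: informative about y, plus shared and private noise.
    return np.clip(0.5 + 0.3 * (y - 0.5) + shared + rng.normal(0, 0.05, n),
                   0.01, 0.99)

p1, p2 = forecast(), forecast()

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

r_signed = corr(p1 - y, p2 - y)               # signed error: probability minus outcome
r_abs = corr(abs(p1 - y), abs(p2 - y))        # absolute error
print(f"signed r = {r_signed:.2f}, absolute r = {r_abs:.2f}")
```

The two definitions generally give different numbers, which is why the referee asks for the choice to be stated and for robustness to be reported.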
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review, which recognizes the potential contribution of our work while identifying areas for methodological clarification. We address each major comment below and indicate the revisions planned for the manuscript.
read point-by-point responses
Referee: [Study 1] Study 1 / Methods: The procedure for identifying and excluding 'likely-leaked questions' is not described in detail (e.g., exact string match, semantic similarity, or temporal cutoff relative to training data). Because the headline correlation remains high after exclusion (r = 0.78), this detail is load-bearing for distinguishing shared training data overlap from independent biases.
Authors: We agree that the exclusion procedure requires fuller description to allow readers to evaluate its impact. In the revised manuscript we will expand the Study 1 Methods section to specify the exact criteria employed, including temporal cutoffs tied to each model's known training-data release dates and the semantic-similarity threshold used for content-based matching. We will also report the number of questions excluded under the primary rule and present a sensitivity table showing the pairwise correlations under alternative exclusion thresholds. These additions will directly address the concern that the reported r = 0.78 may be driven by data overlap rather than independent biases. revision: yes
Referee: [Study 2] Study 2: The within-question design tracks community forecast shifts across the November 2022 boundary, but the manuscript does not report explicit controls for contemporaneous events, information sources, or selection effects in the crowd platform data. This leaves open whether the observed r = 0.20 shift can be attributed specifically to LLM exposure rather than general information availability.
Authors: The within-question pre/post design already holds question identity fixed, thereby controlling for many selection and topic-specific effects. The modest shift (r = 0.20) is statistically fully explained by movement toward ground truth, which is the pattern expected under rational updating rather than LLM-specific transmission. Nevertheless, we acknowledge that contemporaneous events or other information sources could contribute. In the revised manuscript we will add an explicit limitations paragraph in the Study 2 Discussion that enumerates these potential confounds, notes the public nature of the crowd platform data, and explains why the observed alignment with ground truth (rather than with LLM predictions per se) makes direct LLM influence the less parsimonious account. We cannot, however, introduce new variables for every possible external event without additional data sources. revision: partial
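The decomposition the authors describe, asking whether the community shift loads on the gap toward ground truth rather than the gap toward the LLM forecast, can be sketched as an OLS regression on synthetic data. The variable names follow the Δi = β0 + β1·Ri + β2·Di + εi form quoted elsewhere on this page; the mapping of R to the ground-truth gap and D to the LLM gap is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
pre = rng.uniform(0.2, 0.8, n)                           # pre-boundary community forecast
truth = rng.integers(0, 2, n).astype(float)              # resolved outcome
llm = np.clip(pre + rng.normal(0, 0.2, n), 0.01, 0.99)   # hypothetical LLM forecast

R = truth - pre   # gap toward ground truth
D = llm - pre     # gap toward the LLM forecast

# Simulate pure rational updating: the post-boundary shift loads only on R.
delta = 0.3 * R + rng.normal(0, 0.05, n)

# OLS of delta on an intercept, R, and D.
X = np.column_stack([np.ones(n), R, D])
(b0, b1, b2), *_ = np.linalg.lstsq(X, delta, rcond=None)
print(f"b1 (ground truth) = {b1:.2f}, b2 (LLM) = {b2:.2f}")
```

Under simulated rational updating, b1 recovers the updating rate while b2 stays near zero, the pattern the authors argue makes direct LLM influence the less parsimonious account.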
Referee: [Study 3] Study 3: The central claim that human biases already matched the LLM pattern pre-ChatGPT (r = 0.87) and diverged afterward (r = -0.28) relies on category-level error patterns, yet the number of categories, exact error metric (signed vs. absolute), and statistical tests for the change are not fully specified. These details are needed to evaluate whether the weakening supports the 'not yet activated' interpretation.
Authors: We agree that these analytic choices must be stated explicitly. The revised manuscript will specify the number of topic-derived categories used, confirm that the error metric is the signed forecast error (predicted probability minus realized outcome), and describe the statistical procedure employed to test the pre/post difference in correlation with the LLM fingerprint (a z-test for dependent correlations). These clarifications will allow readers to assess whether the observed weakening (from r = 0.87 to r = -0.28) supports the interpretation that the monoculture is built but not yet activated in human forecasts. revision: yes
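The named test can be sketched with Steiger's (1980) pooled-r Z for comparing two dependent correlations that share a variable (here, the LLM fingerprint correlated with pre- and post-ChatGPT human error patterns over the same categories). Whether the authors use exactly this variant is an assumption, and the pre/post correlation of 0.10 and the 20 categories below are illustrative placeholders.

```python
import math

def steiger_z(r12, r13, r23, n):
    """Steiger's (1980) Z for comparing dependent correlations r12 and r13
    that share variable 1, given r23 between variables 2 and 3 and the
    number of observations n (here: forecast categories)."""
    z12, z13 = math.atanh(r12), math.atanh(r13)
    rm2 = ((r12 + r13) / 2) ** 2  # squared pooled correlation
    # Covariance term between the two Fisher-z-transformed correlations.
    psi = r23 * (1 - 2 * rm2) - 0.5 * rm2 * (1 - 2 * rm2 - r23 ** 2)
    c = psi / (1 - rm2) ** 2
    return (z12 - z13) * math.sqrt((n - 3) / (2 - 2 * c))

# Illustrative numbers: pre-ChatGPT r = 0.87 vs post-ChatGPT r = -0.28,
# with an assumed pre/post error-pattern correlation of 0.10 and 20 categories.
z = steiger_z(0.87, -0.28, 0.10, 20)
print(f"Steiger Z = {z:.2f}")
```

Even at a modest category count, a swing from 0.87 to -0.28 yields a large Z, so the test's power hinges mainly on how many categories the authors actually use.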
Circularity Check
No circularity: empirical correlations computed directly from data
Full rationale
The paper reports direct statistical computations (pairwise error correlations r=0.77 on 568 resolved questions, pre/post human shifts r=0.20 and category resemblance r=0.87/-0.28) from observed LLM outputs, ground truth, and crowd forecasts. These are not derived via equations or parameters that loop back to the inputs by construction. No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear as load-bearing steps in the abstract or described studies. The within-question design and leakage exclusion are methodological, not tautological. The central claims rest on falsifiable empirical patterns rather than self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Forecasting errors can be meaningfully quantified as deviations from resolved binary outcomes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "mean pairwise error correlation r = 0.77 ... regression Δi = β0 + β1·Ri + β2·Di + εi ... pre-ChatGPT fingerprint correlation r = 0.87"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Study 1 inter-model error correlation ... epistemic monoculture"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bommasani, R., Hudson, D. A., Adeli, E., et al. (2022). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- [2] Cinelli, M., De Francisci Morales, G., Galeazzi, A., Quattrociocchi, W., & Starnini, M. (2021). The echo chamber effect on social media. Proceedings of the National Academy of Sciences, 118(9), e2023301118.
- [3] Galton, F. (1907). Vox populi. Nature, 75(1949), 450–451.
- [4]
- [5] Haldane, A. G. & May, R. M. (2011). Systemic risk in banking ecosystems. Nature, 469(7330), 351–355.
- [6]
- [7] Kleinberg, J. & Raghavan, M. (2021). Algorithmic monoculture and social welfare. Proceedings of the National Academy of Sciences, 118(22), e2018340118.
- [8] Krogh, A. & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems (Vol. 7). MIT Press.
- [9] López-Lira, A. & Tang, Y. (2023). Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619.
- [10] Lorenz, J., Rauhut, H., Schweitzer, F., & Helbing, D. (2011). How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences, 108(22), 9020–9025.
- [11]
- [12] Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). The wisdom of select crowds. Journal of Personality and Social Psychology, 107(2), 276–299.
- [13] Metaculus (2023). Metaculus track record. https://www.metaculus.com/questions/track-record/
- [14] Page, S. E. (2007). The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press.
- [15]
- [16] Simmons, J. P., Nelson, L. D., Galak, J., & Frederick, S. (2011). Intuitive biases in choice versus estimation: Implications for the wisdom of crowds. Journal of Consumer Research, 38(1), 1–15.
- [17] Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
- [18] Toyokawa, W., Whalen, A., & Laland, K. N. (2019). Social learning strategies regulate the wisdom and madness of interactive crowds. Nature Human Behaviour, 3(2), 183–193.
- [19] Zhu, Y., Chen, H., Fan, J., et al. (2000). Genetic diversity and disease control in rice. Nature, 406(6797), 718–722.