pith. machine review for the scientific record.

arxiv: 2605.00844 · v1 · submitted 2026-04-07 · 💻 cs.CY · cs.AI

Recognition: 2 theorem links · Lean Theorem

The Oracle's Fingerprint: Correlated AI Forecasting Errors and the Limits of Bias Transmission

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords large language models · forecasting errors · error correlation · epistemic monoculture · human-AI interaction · bias transmission · prediction accuracy · crowd forecasts

The pith

Three independently developed LLMs display highly correlated errors when forecasting binary events, yet human crowd predictions show no added bias from them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models consulted for forecasts erode the error independence that supports collective intelligence. It establishes that GPT-4o, Claude, and Gemini produce forecasting mistakes with mean pairwise correlations of r = 0.77 across 568 resolved questions. Human community forecasts shifted toward LLM directions after the ChatGPT launch, but these shifts align fully with rational updating to revealed ground truth. Human error patterns already matched the LLM pattern closely before widespread LLM access and became less similar afterward. The work concludes that AI systems share failure modes that match and could amplify biases humans already possess.

Core claim

GPT-4o, Claude, and Gemini exhibit mean pairwise forecasting error correlations of r = 0.77 on 568 binary prediction questions despite independent development. Human community forecasts moved in LLM-predicted directions after November 2022, yet this movement is fully explained by rational updating to ground truth. Human forecasting biases already matched the LLM pattern strongly before ChatGPT (r = 0.87) and showed weaker resemblance afterward (r = -0.28), revealing that AI systems amplify biases humans already hold.
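
Read operationally, "fully explained by rational updating" means the association between the community's forecast shift and the LLM-implied direction should vanish once movement toward the resolved outcome is controlled for. A minimal sketch of that check follows; the simulated data and variable names are illustrative, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical per-question quantities (illustrative, not the paper's data):
# the community's pre-to-post forecast shift, the shift implied by the LLM
# forecast, and the shift toward the resolved (ground-truth) outcome.
truth_shift = rng.normal(size=n)
llm_shift = 0.9 * truth_shift + rng.normal(scale=0.5, size=n)
community_shift = 0.8 * truth_shift + rng.normal(scale=0.3, size=n)

def ols_coefs(y, *predictors):
    """Least-squares coefficients with an intercept column."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

raw = ols_coefs(community_shift, llm_shift)                      # raw association
controlled = ols_coefs(community_shift, llm_shift, truth_shift)  # + ground-truth control

print(f"LLM coefficient, raw:                 {raw[1]: .3f}")
print(f"LLM coefficient, controlling truth:   {controlled[1]: .3f}")
```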

What carries the argument

Pairwise error correlation on resolved binary forecasting questions, measured across LLMs and tracked in human community predictions via a within-question pre/post-ChatGPT comparison.
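
As a concrete rendering of that machinery, a minimal sketch under stated assumptions (probability forecasts for "yes", signed errors against the 0/1 resolution, and placeholder numbers rather than the paper's 568 questions) computes the mean pairwise error correlation across models:

```python
from itertools import combinations
import numpy as np

# Placeholder probability forecasts for "yes" on resolved binary questions.
forecasts = {
    "model_a": np.array([0.80, 0.30, 0.60, 0.90, 0.20, 0.55]),
    "model_b": np.array([0.70, 0.40, 0.70, 0.80, 0.30, 0.60]),
    "model_c": np.array([0.90, 0.20, 0.50, 0.95, 0.10, 0.45]),
}
outcomes = np.array([1, 0, 1, 1, 0, 0])  # resolved ground truth (0 = no, 1 = yes)

# Signed error: forecast probability minus realized outcome.
errors = {model: probs - outcomes for model, probs in forecasts.items()}

# Pearson correlation of errors for every pair of models, then the mean.
pairwise = [
    np.corrcoef(errors[a], errors[b])[0, 1] for a, b in combinations(errors, 2)
]
print(f"mean pairwise error correlation: r = {np.mean(pairwise):.2f}")
```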

If this is right

  • Consulting multiple LLMs may not increase forecast diversity as much as expected because their errors align.
  • Human forecasters appear to extract useful information from LLMs without adopting the models' specific error patterns.
  • The pre-existing match between human and LLM biases suggests AIs largely reproduce patterns already present in training data.
  • Epistemic monoculture among AIs could still limit long-run collective intelligence even without direct bias transmission to humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Persistent similarity in training data sources may sustain or increase these error correlations as models are updated.
  • Targeted interventions to diversify model outputs on forecasting tasks could be tested by measuring whether they reduce cross-model error correlations.
  • Prediction platforms that restrict LLM use might show different error evolution if the current rational-updating pattern holds.
  • The results invite checking whether the same correlation pattern appears in other domains beyond binary geopolitical or economic forecasts.

Load-bearing premise

The observed error correlations among LLMs reflect genuine shared biases in their internal reasoning rather than overlapping training data or leakage of the tested questions.

What would settle it

A finding that error correlations drop sharply when restricting analysis to questions with no plausible public discussion or data before the models' training cutoffs.
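
Such a restriction could be operationalized along these lines; the cutoff date, the open-date filter, and the toy records below are assumptions for illustration, not the paper's leakage procedure.

```python
from datetime import date
import numpy as np

# Toy records: (question open date, model A signed error, model B signed error).
records = [
    (date(2022, 6, 1), -0.30, -0.25),
    (date(2023, 1, 5), 0.12, 0.10),
    (date(2023, 8, 20), -0.10, -0.05),
    (date(2023, 11, 2), 0.22, 0.18),
    (date(2024, 2, 9), 0.05, 0.15),
    (date(2024, 6, 30), -0.20, -0.12),
]

# Assumed latest training cutoff across the compared models; the paper's
# actual cutoffs (and its semantic leakage screen) may differ.
TRAINING_CUTOFF = date(2023, 4, 1)

# Keep only questions opened after the cutoff, then recompute the correlation.
restricted = [(ea, eb) for opened, ea, eb in records if opened > TRAINING_CUTOFF]
err_a, err_b = (np.array(col) for col in zip(*restricted))

r_restricted = np.corrcoef(err_a, err_b)[0, 1]
print(f"error correlation on post-cutoff questions only: r = {r_restricted:.2f}")
```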

Figures

Figures reproduced from arXiv: 2605.00844 by Theodor Spiro.

Figure 2. Calibration curves for three LLMs and the Metaculus community prediction.
Figure 3. Bias fingerprint comparison across five question categories.
original abstract

When large language models (LLMs) are consulted as forecasting tools, the independence of individual errors -- the foundation of collective intelligence -- may collapse. We test three conditions necessary for this "epistemic monoculture" to emerge. In Study 1, we show that GPT-4o, Claude, and Gemini exhibit highly correlated forecasting errors on 568 resolved binary prediction questions (mean pairwise error correlation r = 0.77, p < 0.001; r = 0.78 excluding likely-leaked questions), despite being developed independently by different organizations. In Study 2, we test whether this correlated bias has propagated into human crowd forecasts, using a within-question design that tracks community prediction shifts across the ChatGPT launch boundary (November 2022). We find that community forecasts move in the direction predicted by LLMs (r = 0.20, p = 0.007), but this shift is fully explained by rational updating toward ground truth. In Study 3, we examine whether the category-level pattern of human forecasting errors increasingly resembles the LLM bias fingerprint. We find the opposite: pre-ChatGPT human biases already strongly resembled the LLM pattern (r = 0.87), while post-ChatGPT the resemblance weakened (r = -0.28). Together, these findings reveal an epistemic monoculture that is built but not yet activated: three nominally independent AI systems share the same failure modes, amplifying precisely the biases humans already hold.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that three major LLMs (GPT-4o, Claude, and Gemini) exhibit highly correlated forecasting errors on 568 resolved binary prediction questions (mean pairwise r = 0.77, p < 0.001; r = 0.78 after excluding likely-leaked questions), indicating an epistemic monoculture despite independent development. Using a within-question pre/post-ChatGPT launch (November 2022) design on human crowd forecasts, it reports that community predictions shift toward LLM directions (r = 0.20, p = 0.007) but this is fully explained by rational updating to ground truth; additionally, human error patterns already resembled the LLM bias fingerprint pre-ChatGPT (r = 0.87) and weakened afterward (r = -0.28), suggesting the monoculture is built but not yet activated.

Significance. If the results hold after clarification of methods, the paper offers a timely empirical contribution to AI forecasting and collective intelligence by documenting shared LLM failure modes on resolved questions with ground truth, while showing no clear transmission of those biases to human forecasters. The pre/post design, leakage exclusion, and direct comparison of error patterns provide a falsifiable test without circular derivations or fitted parameters reducing to assumptions. This strengthens understanding of risks from AI monocultures and the resilience of human biases.

major comments (3)
  1. [Study 1] Study 1 / Methods: The procedure for identifying and excluding 'likely-leaked questions' is not described in detail (e.g., exact string match, semantic similarity, or temporal cutoff relative to training data). Because the headline correlation remains high after exclusion (r = 0.78), this detail is load-bearing for distinguishing shared training data overlap from independent biases.
  2. [Study 2] Study 2: The within-question design tracks community forecast shifts across the November 2022 boundary, but the manuscript does not report explicit controls for contemporaneous events, information sources, or selection effects in the crowd platform data. This leaves open whether the observed r = 0.20 shift can be attributed specifically to LLM exposure rather than general information availability.
  3. [Study 3] Study 3: The central claim that human biases already matched the LLM pattern pre-ChatGPT (r = 0.87) and diverged afterward (r = -0.28) relies on category-level error patterns, yet the number of categories, exact error metric (signed vs. absolute), and statistical tests for the change are not fully specified. These details are needed to evaluate whether the weakening supports the 'not yet activated' interpretation.
minor comments (3)
  1. [Abstract] Abstract and Methods: The source and sampling frame for the 568 resolved binary questions (e.g., platform, time window, resolution criteria) should be stated explicitly to allow assessment of generalizability and potential selection bias.
  2. [Methods] Throughout: Clarify the precise definition of 'forecasting error' used for all correlations (e.g., absolute deviation from the resolved outcome or signed error) and report whether results are robust to alternative definitions; a sketch of such a robustness check follows this list.
  3. [Figures] Figures: Ensure all panels include sample sizes, confidence intervals, and clear pre/post labels so that the r = 0.20 and r = 0.87 values can be directly interpreted without returning to the text.
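
On minor point 2, one way such a robustness check could look, with invented numbers and the resolved outcome coded 0/1:

```python
import numpy as np

# Invented probability forecasts and 0/1 resolutions (not the paper's data).
p_a = np.array([0.80, 0.30, 0.60, 0.90, 0.20, 0.70])
p_b = np.array([0.70, 0.40, 0.70, 0.80, 0.30, 0.60])
y   = np.array([1,    0,    1,    1,    0,    1   ])

def error_corr(pa, pb, outcomes, signed=True):
    """Cross-model error correlation under a signed or absolute error definition."""
    ea, eb = pa - outcomes, pb - outcomes
    if not signed:
        ea, eb = np.abs(ea), np.abs(eb)
    return np.corrcoef(ea, eb)[0, 1]

print("signed-error correlation:  ", round(error_corr(p_a, p_b, y), 2))
print("absolute-error correlation:", round(error_corr(p_a, p_b, y, signed=False), 2))
```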

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review, which recognizes the potential contribution of our work while identifying areas for methodological clarification. We address each major comment below and indicate the revisions planned for the manuscript.

point-by-point responses
  1. Referee: [Study 1] Study 1 / Methods: The procedure for identifying and excluding 'likely-leaked questions' is not described in detail (e.g., exact string match, semantic similarity, or temporal cutoff relative to training data). Because the headline correlation remains high after exclusion (r = 0.78), this detail is load-bearing for distinguishing shared training data overlap from independent biases.

    Authors: We agree that the exclusion procedure requires fuller description to allow readers to evaluate its impact. In the revised manuscript we will expand the Study 1 Methods section to specify the exact criteria employed, including temporal cutoffs tied to each model's known training-data release dates and the semantic-similarity threshold used for content-based matching. We will also report the number of questions excluded under the primary rule and present a sensitivity table showing the pairwise correlations under alternative exclusion thresholds. These additions will directly address the concern that the reported r = 0.78 may be driven by data overlap rather than independent biases. revision: yes

  2. Referee: [Study 2] Study 2: The within-question design tracks community forecast shifts across the November 2022 boundary, but the manuscript does not report explicit controls for contemporaneous events, information sources, or selection effects in the crowd platform data. This leaves open whether the observed r = 0.20 shift can be attributed specifically to LLM exposure rather than general information availability.

    Authors: The within-question pre/post design already holds question identity fixed, thereby controlling for many selection and topic-specific effects. The modest shift (r = 0.20) is statistically fully explained by movement toward ground truth, which is the pattern expected under rational updating rather than LLM-specific transmission. Nevertheless, we acknowledge that contemporaneous events or other information sources could contribute. In the revised manuscript we will add an explicit limitations paragraph in the Study 2 Discussion that enumerates these potential confounds, notes the public nature of the crowd platform data, and explains why the observed alignment with ground truth (rather than with LLM predictions per se) makes direct LLM influence the less parsimonious account. We cannot, however, introduce new variables for every possible external event without additional data sources. revision: partial

  3. Referee: [Study 3] Study 3: The central claim that human biases already matched the LLM pattern pre-ChatGPT (r = 0.87) and diverged afterward (r = -0.28) relies on category-level error patterns, yet the number of categories, exact error metric (signed vs. absolute), and statistical tests for the change are not fully specified. These details are needed to evaluate whether the weakening supports the 'not yet activated' interpretation.

    Authors: We agree that these analytic choices must be stated explicitly. The revised manuscript will specify the number of topic-derived categories used, confirm that the error metric is the signed forecast error (predicted probability minus realized outcome), and describe the statistical procedure employed to test the pre/post difference in correlation with the LLM fingerprint (a z-test for dependent correlations). These clarifications will allow readers to assess whether the observed weakening (from r = 0.87 to r = -0.28) supports the interpretation that the monoculture is built but not yet activated in human forecasts. revision: yes
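
To make the fingerprint comparison concrete, a minimal sketch is given below; the category labels, signed-error values, and five-category split are placeholders rather than the paper's data, and the dependent-correlation z-test mentioned above is not implemented here.

```python
import numpy as np
import pandas as pd

# Hypothetical signed errors (forecast probability minus outcome) by category.
df = pd.DataFrame({
    "category": ["geopolitics", "economics", "science", "sports", "technology"] * 2,
    "period":   ["pre"] * 5 + ["post"] * 5,
    "human_err": [0.06, -0.04, 0.02, -0.08, 0.05, 0.01, 0.03, -0.05, 0.02, -0.01],
    "llm_err":   [0.07, -0.05, 0.03, -0.09, 0.06, 0.07, -0.05, 0.03, -0.09, 0.06],
})

# Bias "fingerprint": mean signed error per category, separately for each period.
fingerprints = df.groupby(["period", "category"]).mean()

def fingerprint_r(period: str) -> float:
    """Correlation between the human and LLM category-level fingerprints."""
    fp = fingerprints.loc[period]
    return np.corrcoef(fp["human_err"], fp["llm_err"])[0, 1]

print("pre-ChatGPT resemblance:  r =", round(fingerprint_r("pre"), 2))
print("post-ChatGPT resemblance: r =", round(fingerprint_r("post"), 2))
```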

Circularity Check

0 steps flagged

No circularity: empirical correlations computed directly from data

full rationale

The paper reports direct statistical computations (pairwise error correlations r=0.77 on 568 resolved questions, pre/post human shifts r=0.20 and category resemblance r=0.87/-0.28) from observed LLM outputs, ground truth, and crowd forecasts. These are not derived via equations or parameters that loop back to the inputs by construction. No self-citations, ansatzes, uniqueness theorems, or renamings of known results appear as load-bearing steps in the abstract or described studies. The within-question design and leakage exclusion are methodological, not tautological. The central claims rest on falsifiable empirical patterns rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on standard statistical assumptions for correlation analysis without introducing new free parameters, axioms beyond basic probability, or invented entities.

axioms (1)
  • domain assumption: Forecasting errors can be meaningfully quantified as deviations from resolved binary outcomes.
    This underpins the calculation of error correlations across models and humans.

pith-pipeline@v0.9.0 · 5563 in / 1283 out tokens · 43060 ms · 2026-05-10T18:07:43.157716+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D. A., Adeli, E., et al. (2022). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258

  2. [2]

    Cinelli, M., De Francisci Morales, G., Galeazzi, A., Quattrociocchi, W., & Starnini, M. (2021). The echo chamber effect on social media. Proceedings of the National Academy of Sciences, 118(9), e2023301118

  3. [3]

    Galton, F. (1907). Vox populi. Nature, 75(1949), 450--451

  4. [4]

    Halawi, D., Durmus, E., Falk, L., & Steinhardt, J. (2024). Approaching human-level forecasting with language models. arXiv preprint arXiv:2402.18563

  5. [5]

    Haldane, A. G. & May, R. M. (2011). Systemic risk in banking ecosystems. Nature, 469(7330), 351--355

  6. [6]

    Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? arXiv preprint arXiv:2301.07543

  7. [7]

    Kleinberg, J. & Raghavan, M. (2021). Algorithmic monoculture and social welfare. Proceedings of the National Academy of Sciences, 118(22), e2018340118

  8. [8]

    Krogh, A. & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems (Vol. 7). MIT Press

  9. [9]

    López-Lira, A. & Tang, Y. (2023). Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619

  10. [10]

    Lorenz, J., Rauhut, H., Schweitzer, F., & Helbing, D. (2011). How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences, 108(22), 9020--9025

  11. [11]

    Luo, H., Cai, T., Zhang, Y., et al. (2024). Large language models for scientific research: A survey. arXiv preprint arXiv:2404.13672

  12. [12]

    Mannes, A. E., Soll, J. B., & Larrick, R. P. (2014). The wisdom of select crowds. Journal of Personality and Social Psychology, 107(2), 276--299

  13. [13]

    Metaculus track record

    Metaculus (2023). Metaculus track record. https://www.metaculus.com/questions/track-record/

  14. [14]

    Page, S. E. (2007). The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press

  15. [15]

    Schoenegger, P., Park, P., Karger, E., & Tetlock, P. E. (2024). AI-augmented predictions: LLM assistants improve human forecasting accuracy. arXiv preprint arXiv:2402.07862

  16. [16]

    Simmons, J. P., Nelson, L. D., Galak, J., & Frederick, S. (2011). Intuitive biases in choice versus estimation: Implications for the wisdom of crowds. Journal of Consumer Research, 38(1), 1--15

  17. [17]

    Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday

  18. [18]

    Toyokawa, W., Whalen, A., & Laland, K. N. (2019). Social learning strategies regulate the wisdom and madness of interactive crowds. Nature Human Behaviour, 3(2), 183--193

  19. [19]

    Zhu, Y., Chen, H., Fan, J., et al. (2000). Genetic diversity and disease control in rice. Nature, 406(6797), 718--722