
arxiv: 2605.17510 · v1 · pith:BQNWHE4Snew · submitted 2026-05-17 · 🌊 nlin.AO · cs.MA

Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance

Pith reviewed 2026-05-19 22:26 UTC · model grok-4.3

classification 🌊 nlin.AO cs.MA
keywords LLM societies · collective adaptation · self-amending games · Nomic · scale dependence · emergent governance · group decision making · model size

The pith

Collective rule adaptation in LLM societies peaks at intermediate model sizes

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates group decision-making in artificial societies of large language models where the rules themselves can be collectively amended. Running the Nomic game at multiple model scales in two LLM families, it finds that collective adaptation does not improve steadily with model size. A narrow mid-scale range supports sustained rule adoption, diverse amendments, and balanced consensus, while smaller models barely alter the rules, larger ones converge on restrictive voting patterns, and mixed-size groups end up blocked by vetoes. The pattern persists under temperature perturbations and under a switch from unanimity to majority voting, indicating that model scale has a non-monotonic effect on how well these groups govern themselves.

Core claim

The central finding is that collective adaptation in self-amending LLM societies does not improve monotonically with model size. Both LLM families display a narrow mid-scale regime that supports sustained rule adoption, diverse amendments, and balanced consensus. Smaller models tend to stay rule-inert, larger models often converge on restrictive voting patterns, and heterogeneous mixed-size groups collapse into veto-driven gridlock. These cross-scale contrasts hold under temperature perturbations and under a change from unanimity to majority voting. Hidden-state divergence alone does not account for the behavioral outcomes, and linear probes indicate that decodability of vote-predictive signals is necessary but not sufficient for adaptive play.

What carries the argument

The Nomic game as a self-amending testbed where agents propose and vote on rule modifications during ongoing play to study emergent governance.
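
As a rough illustration of why the voting rule matters in such a testbed, the sketch below tallies a single amendment vote under unanimity and under simple majority. The function name `proposal_passes`, the rule labels, and the four-agent vote vector are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def proposal_passes(votes: List[bool], rule: str = "unanimity") -> bool:
    """Return True if a proposed amendment is adopted under the given voting rule.

    Hypothetical helper: `rule` is either "unanimity" (every agent must approve)
    or "majority" (strictly more than half must approve).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    yes = sum(votes)
    if rule == "unanimity":
        return yes == len(votes)
    if rule == "majority":
        return yes * 2 > len(votes)
    raise ValueError(f"unknown rule: {rule}")


# Illustrative vote vector: three agents approve, one dissents.
votes = [True, True, True, False]
print(proposal_passes(votes, "unanimity"))  # False: a single dissent acts as a veto
print(proposal_passes(votes, "majority"))   # True: three of four is a strict majority
```

Under unanimity a single dissenting agent blocks the amendment, which is the mechanism behind veto-driven gridlock in this toy setting.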

If this is right

  • Mid-scale regimes support sustained rule adoption and diverse amendments.
  • Larger models converge on restrictive voting patterns limiting further changes.
  • Mixed-size groups lead to veto-driven gridlock preventing adaptation.
  • The non-monotonic scale effect persists across different voting rules and temperature settings.
  • Decodability of latent vote signals is required but insufficient alone for good collective behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This non-monotonicity could inform the design of multi-agent AI systems by favoring uniform mid-sized models over scaling up or mixing sizes.
  • Similar patterns might appear in other collective tasks like negotiation or resource management when rules evolve.
  • Testing with additional model families or real-world inspired scenarios could validate if the mid-scale optimum is general.
  • The results suggest that raw increases in model capability may hinder rather than help group-level adaptability in some contexts.

Load-bearing premise

The implementation of the Nomic game and the specific prompting methods used here represent general collective adaptation behaviors in LLM societies without special artifacts from the chosen rules or protocols.

What would settle it

Running the same experiments with a different self-amending game setup or alternative agent interaction protocols and checking if the mid-scale advantage in rule adoption and consensus still appears.

Original abstract

We study group decision-making in artificial societies where the rules of play are themselves subject to collective amendment. Using the self-amending game Nomic, we compare multiple scales across two LLM families and find that collective adaptation does not improve monotonically with model size. Instead, both families exhibit a narrow mid-scale regime that supports sustained rule adoption, diverse amendments, and balanced consensus. Smaller models tend to remain rule-inert, whereas larger models often converge on restrictive voting patterns, and heterogeneous mixed-size groups collapse into veto-driven gridlock. These cross-scale contrasts persist under temperature perturbations and under a shift from unanimity to majority voting, although latent-state structure varies by family and scale. Hidden-state divergence alone does not explain collective performance: high representational divergence can coincide with poor behavioural outcomes. Linear probes reveal regime-selective coupling between latent vote-predictive signals and collective behaviour, but decodability is necessary rather than sufficient for adaptive play. Overall, the recurring regularity is non-monotonicity, not the particular scale at which the optimum appears. Self-amending games therefore provide a controlled testbed for studying collective adaptation in artificial societies beyond raw model scale.
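
One way to make the "non-monotonic, wherever the optimum falls" reading operational is to check whether a behavioural metric, ordered by model size, peaks strictly inside the range. The sketch below does this under stated assumptions: the helper names and the example scores are hypothetical placeholders, not values reported in the paper.

```python
from typing import Optional, Sequence


def is_monotonic(scores: Sequence[float]) -> bool:
    """True if the sequence is entirely non-decreasing or entirely non-increasing."""
    pairs = list(zip(scores, scores[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


def interior_peak_index(scores: Sequence[float]) -> Optional[int]:
    """Index of the maximum if it lies strictly inside the range and
    exceeds both endpoints; otherwise None."""
    if len(scores) < 3:
        return None
    i = max(range(len(scores)), key=scores.__getitem__)
    if 0 < i < len(scores) - 1 and scores[i] > scores[0] and scores[i] > scores[-1]:
        return i
    return None


# Placeholder metric values ordered from smallest to largest model (not paper data).
placeholder_scores = [0.1, 0.6, 0.2]
print(is_monotonic(placeholder_scores))         # False
print(interior_peak_index(placeholder_scores))  # 1
```

The check only reports that an interior optimum exists and where it sits in the ordering; it says nothing about whether the peak is statistically distinguishable from its neighbours.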

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an empirical study of collective adaptation in LLM societies using the self-amending game Nomic across two LLM families and multiple scales. It reports a non-monotonic relationship with model size: a narrow mid-scale regime supports sustained rule adoption, diverse amendments, and balanced consensus, while smaller models remain rule-inert, larger models converge on restrictive voting patterns, and heterogeneous mixed-size groups collapse into veto-driven gridlock. These cross-scale contrasts persist under temperature perturbations and a shift from unanimity to majority voting. Additional analyses examine latent-state divergence, linear probes for vote-predictive signals, and regime-selective coupling between representations and behavior. The recurring regularity highlighted is non-monotonicity rather than any specific optimal scale.

Significance. If the non-monotonicity is shown to be robust to quantitative controls and generalizes beyond the Nomic template, the work would offer a valuable controlled testbed for scale-dependent emergent governance in artificial societies. It challenges monotonic assumptions about model scale improving collective outcomes and provides cross-family evidence with some robustness checks. The absence of statistical details and limited perturbation scope currently constrain its immediate impact, but the setup could inform broader studies of self-amending AI systems if strengthened.

major comments (2)
  1. [Methods and Robustness Checks] The central claim of a narrow mid-scale optimum for rule adoption, diverse amendments, and balanced consensus requires that this regularity is a property of scale-dependent collective adaptation rather than an artifact of Nomic's proposal/voting mechanics, fixed initial rule set, or multi-agent prompting and state-update protocol. The reported persistence under temperature changes and unanimity-to-majority shift remains inside the same game template and interaction structure; no tests with alternative self-governance scenarios (e.g., open-ended constitutional drafting or different amendment thresholds) are described. This is load-bearing for generalizing the non-monotonicity beyond the specific implementation.
  2. [Results] The abstract and results report consistent patterns across scales and conditions but provide no quantitative details, error bars, statistical tests, or controls. This leaves open whether data selection, run count, or prompting choices affect the central non-monotonic claim and the characterizations of smaller/larger/mixed regimes.
minor comments (3)
  1. [Experimental Setup] Clarify the exact model sizes and parameter counts defining the 'mid-scale' regime for each family, and ensure consistent terminology for scales throughout.
  2. [Figures and Tables] Add error bars, run counts, and significance markers to all figures and tables reporting behavioral outcomes and probe accuracies.
  3. [Analysis of Latent States] The discussion of latent-state divergence and linear probes would benefit from explicit comparison to baseline decodability thresholds or null models.
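
The baseline comparison asked for in minor comment 3 could start from something as simple as a majority-class reference. The sketch below computes that baseline and the margin by which a probe exceeds it; the function names, the vote labels, and the probe accuracy are illustrative assumptions rather than anything taken from the manuscript.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return max(Counter(labels).values()) / len(labels)


def probe_lift(probe_accuracy: float, labels: Sequence[Hashable]) -> float:
    """Probe accuracy minus the majority-class baseline on the same labels."""
    return probe_accuracy - majority_baseline_accuracy(labels)


# Placeholder vote labels and probe accuracy (hypothetical, not paper data).
labels = ["yes", "yes", "yes", "no"]
print(majority_baseline_accuracy(labels))  # 0.75
print(round(probe_lift(0.80, labels), 3))  # 0.05
```

When votes are heavily skewed toward one outcome, a probe accuracy that looks high can sit only slightly above this baseline, which is why the comparison is worth reporting alongside raw accuracy.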

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, providing clarifications on scope and committing to specific improvements in the revised manuscript to enhance statistical transparency and delimit generalization claims.

point-by-point responses
  1. Referee: [Methods and Robustness Checks] The central claim of a narrow mid-scale optimum for rule adoption, diverse amendments, and balanced consensus requires that this regularity is a property of scale-dependent collective adaptation rather than an artifact of Nomic's proposal/voting mechanics, fixed initial rule set, or multi-agent prompting and state-update protocol. The reported persistence under temperature changes and unanimity-to-majority shift remains inside the same game template and interaction structure; no tests with alternative self-governance scenarios (e.g., open-ended constitutional drafting or different amendment thresholds) are described. This is load-bearing for generalizing the non-monotonicity beyond the specific implementation.

    Authors: We agree that broader generalization of the non-monotonicity would require validation across alternative self-governance templates. Our study deliberately employs Nomic as a canonical, fully specified self-amending game that permits precise tracking of rule evolution, proposal success, and consensus dynamics under controlled conditions. The reported checks (temperature sweeps and unanimity-to-majority transition) already vary key interaction parameters while holding the core amendment protocol fixed, showing that the scale-dependent regimes are not artifacts of those particular settings. To address the referee's concern directly, we will revise the discussion section to explicitly state the scope of our claims as applying to the Nomic testbed and to outline concrete directions for extending the protocol to other amendment frameworks in future work. This revision will make the load-bearing nature of the specific implementation transparent without overstating generality.

    revision: partial

  2. Referee: [Results] The abstract and results report consistent patterns across scales and conditions but provide no quantitative details, error bars, statistical tests, or controls. This leaves open whether data selection, run count, or prompting choices affect the central non-monotonic claim and the characterizations of smaller/larger/mixed regimes.

    Authors: We acknowledge that the current presentation would benefit from explicit quantitative support. Although the manuscript reports patterns observed consistently across independent simulation runs, we agree that the absence of error bars, run counts, and formal statistical tests leaves room for questions about robustness to implementation details. In the revised manuscript we will expand the results section to report the exact number of runs per scale and condition, include error bars or standard deviations on metrics such as amendment diversity, rule-adoption rate, and consensus balance, and add statistical comparisons (e.g., non-parametric tests across scales) to quantify the non-monotonic trends and regime distinctions. These additions will directly mitigate concerns about data selection or prompting sensitivity.

    revision: yes
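
A non-parametric comparison of the kind the response mentions could, for instance, be a permutation test on the difference in mean rule-adoption rate between two scales. The sketch below is one such test under assumed inputs; the function name, the resample count, and the per-run values are hypothetical and do not come from the manuscript.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(a: Sequence[float], b: Sequence[float],
                     n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the absolute difference in means.

    Returns a p-value estimate with the usual +1 correction so it is never zero.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of runs across the two groups
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Placeholder per-run adoption rates for two model scales (not paper data).
mid_scale_runs = [0.55, 0.60, 0.48, 0.62, 0.58]
large_scale_runs = [0.20, 0.31, 0.25, 0.18, 0.27]
print(permutation_test(mid_scale_runs, large_scale_runs))
```

A permutation test makes no distributional assumption about the metric, which suits small numbers of simulation runs; with more than two scales, pairwise tests would also need a multiple-comparison correction.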

Circularity Check

0 steps flagged

Empirical comparison study with no circular derivations or self-referential reductions

full rationale

The paper reports experimental outcomes from running LLM agents in the Nomic self-amending game across model scales and families. No equations, parameter fits, or derivations are present that would reduce the reported non-monotonic mid-scale regime or other behavioral patterns to inputs by construction. Claims rest on observed simulation results (rule adoption rates, amendment diversity, consensus patterns) that remain independently falsifiable through replication with different prompts or game variants. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing support for the central regularity. The work is therefore self-contained as an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all claims rest on the unstated assumption that the Nomic setup and chosen scales are representative of broader collective adaptation.

pith-pipeline@v0.9.0 · 5748 in / 1043 out tokens · 29017 ms · 2026-05-19T22:26:00.282231+00:00 · methodology
