Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
Pith reviewed 2026-05-09 20:08 UTC · model grok-4.3
The pith
LLMs maintain more accurate internal beliefs about hidden game states than they report verbally, yet they convert those internal beliefs into actions less effectively than when the same beliefs are spelled out in the prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In incomplete-information games, LLMs encode internal beliefs about latent game states that are substantially more accurate than their verbal reports, yet these beliefs degrade with multi-hop reasoning, exhibit primacy and recency biases, and drift away from Bayesian coherence over extended interactions. The implicit conversion of internal beliefs into actions is weaker than that of beliefs externalized in the prompt, yet neither belief-conditioning approach consistently achieves higher game payoffs.
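The Bayesian-coherence claim is directly measurable: at each turn, compare the probe-decoded belief distribution to the exact posterior a Bayesian observer would hold given the same observations. A minimal sketch of that drift measurement, assuming per-turn probed belief distributions and known observation likelihoods (illustrative inputs, not the paper's code):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete belief distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def bayesian_posterior(prior, likelihoods):
    """Exact Bayesian update: posterior is proportional to prior * likelihood."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihoods, dtype=float)
    return post / post.sum()

def coherence_drift(probed_beliefs, prior, likelihood_per_turn):
    """Per-turn KL between the probe-decoded belief and the exact posterior.

    probed_beliefs: list of arrays over latent states, one per game turn.
    likelihood_per_turn: list of arrays P(observation_t | state) per turn.
    """
    drift, posterior = [], np.asarray(prior, dtype=float)
    for belief, lik in zip(probed_beliefs, likelihood_per_turn):
        posterior = bayesian_posterior(posterior, lik)
        drift.append(kl_divergence(belief, posterior))
    return drift
```

A drift curve that trends upward over turns is the signature the core claim describes.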
What carries the argument
Probing of internal belief states, compared against verbal reports and chosen actions, in LLMs playing incomplete-information games; the comparison exposes the observation-belief and belief-action gaps.
Load-bearing premise
The chosen probing methods reliably extract the model's true internal belief states, making the comparison with its verbal reports and chosen actions meaningful.
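For concreteness, the kind of probe this premise refers to can be as simple as a linear classifier trained on cached hidden activations. A minimal sketch, assuming activations, ground-truth latent states, and verbal predictions extracted from game logs (the array names and the probe choice are illustrative, not the authors' exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_turns, d_model) hidden activations cached at a chosen layer while the
#    model plays; y: (n_turns,) ground-truth latent state (e.g., opponent's card);
# verbal_preds: the state the model names when asked directly. All three arrays
# are assumed inputs here, standing in for the authors' game logs.
def observation_belief_gap(X, y, verbal_preds, seed=0):
    X_tr, X_te, y_tr, y_te, _, v_te = train_test_split(
        X, y, verbal_preds, test_size=0.3, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probe_acc = probe.score(X_te, y_te)        # accuracy of internal beliefs
    verbal_acc = float(np.mean(v_te == y_te))  # accuracy of verbal reports
    return probe_acc, verbal_acc               # gap = probe_acc - verbal_acc
```

The premise holds only if probe_acc reflects a genuine internal representation rather than an artifact of probe capacity, which is exactly what the referee's causal-validation comment below presses on.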
What would settle it
New experiments in which verbal reports match the accuracy of probed internal beliefs, and in which actions conditioned only on implicit beliefs perform at least as well as actions conditioned on prompted beliefs, would falsify the claimed gaps.
Original abstract
Large language models (LLMs) are increasingly tasked with strategic decision-making under incomplete information, such as in negotiation and policymaking. While LLMs can excel at many such tasks, they also fail in ways that are poorly understood. We shed light on these failures by uncovering two fundamental gaps in the internal mechanisms underlying the decision-making of LLMs in incomplete-information games, supported by experiments with open-weight models Llama 3.1, Qwen3, and gpt-oss. First, an observation-belief gap: LLMs encode internal beliefs about latent game states that are substantially more accurate than their own verbal reports, yet these beliefs are brittle. In particular, the belief accuracy degrades with multi-hop reasoning, exhibits primacy and recency biases, and drifts away from Bayesian coherence over extended interactions. Second, a belief-action gap: The implicit conversion of internal beliefs into actions is weaker than that of the beliefs externalized in the prompt, yet neither belief-conditioning consistently achieves higher game payoffs. These results show how analyzing LLMs' internal processes can expose systematic vulnerabilities that warrant caution before deploying LLMs in strategic domains without robust guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit two fundamental gaps in strategic decision-making under incomplete information: an observation-belief gap, where internal beliefs extracted via probes are substantially more accurate than verbal reports but brittle (degrading under multi-hop reasoning, showing primacy/recency biases, and drifting from Bayesian coherence), and a belief-action gap, where implicit conversion of beliefs to actions is weaker than when beliefs are externalized in prompts, though neither consistently improves game payoffs. These are demonstrated through experiments on Llama 3.1, Qwen3, and gpt-oss in incomplete-information games such as negotiation.
Significance. If the results hold after addressing methodological concerns, the work is significant for mechanistic interpretability and AI safety. It offers concrete evidence of internal LLM limitations in game-theoretic settings, extending probing techniques to strategic domains and highlighting risks for deployment in negotiation or policymaking without safeguards. The empirical focus on open-weight models supports reproducibility.
major comments (2)
- [§4.2] Belief Probing Methods: The observation-belief gap rests on linear or activation-based probes recovering latent game-state representations. No causal validation is provided (e.g., activation editing or patching experiments showing that altering probed states changes downstream actions or payoffs). Without this, higher probe accuracy versus verbal reports, and the reported brittleness, may reflect correlated surface features rather than causal internal mechanisms, weakening the claim of a fundamental gap. (A minimal patching sketch follows these comments.)
- [§5.3] Belief-Action Experiments: The belief-action gap compares implicit conversion to externalized beliefs in prompts, but the externalization baseline is not shown to be informationally equivalent. The finding that neither yields higher payoffs consistently suggests the gaps may not be load-bearing for performance failures; additional controls (e.g., payoff breakdowns by game length or information depth) are needed to establish that the gaps explain strategic struggles.
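On the first major comment: the requested causal validation need not be elaborate. A minimal activation-steering sketch, assuming a Llama-style Hugging Face checkpoint and a probe-derived direction belief_vec (both hypothetical here; the paper reports no such experiment):

```python
import torch

def patch_and_compare(model, tokenizer, prompt, layer, belief_vec, alpha=4.0):
    # Steer the residual stream along a probe-derived belief direction at one
    # layer, then compare next-token (action) logits with and without the edit.
    # If probed beliefs causally drive actions, the chosen action should shift.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] += alpha * belief_vec.to(hidden.dtype)  # edit last token
        return output

    with torch.no_grad():
        base = model(**inputs).logits[0, -1]
        # `model.model.layers` matches Llama-style checkpoints; other model
        # families expose decoder blocks under different attribute names.
        handle = model.model.layers[layer].register_forward_hook(hook)
        patched = model(**inputs).logits[0, -1]
        handle.remove()
    return base.argmax().item(), patched.argmax().item()
```

Systematic shifts in the patched action that track the injected belief would upgrade the correlational probing evidence to causal evidence.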
minor comments (2)
- [Abstract] Abstract and §3: 'gpt-oss' should be fully specified with citation or repository link for reproducibility; clarify whether it refers to a specific open-source GPT variant.
- [Results] Figure 4 (or equivalent results table): Error bars or statistical significance tests for belief accuracy differences across models and game lengths would strengthen the brittleness claims (a paired-bootstrap sketch follows below).
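The requested uncertainty estimates are cheap to supply with a paired bootstrap over game turns; a minimal sketch, with the boolean correctness arrays assumed to come from the paper's logs:

```python
import numpy as np

def bootstrap_accuracy_gap(probe_correct, verbal_correct, n_boot=10_000, seed=0):
    """Paired bootstrap CI for (probe accuracy - verbal accuracy).

    probe_correct / verbal_correct: boolean arrays over the same game turns.
    Returns the point estimate and a 95% percentile interval; an interval
    excluding 0 supports a genuine observation-belief gap.
    """
    rng = np.random.default_rng(seed)
    probe_correct = np.asarray(probe_correct, dtype=float)
    verbal_correct = np.asarray(verbal_correct, dtype=float)
    n = len(probe_correct)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample turns with replacement, paired
        diffs[b] = probe_correct[idx].mean() - verbal_correct[idx].mean()
    point = probe_correct.mean() - verbal_correct.mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return point, (lo, hi)
```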
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree and the revisions we will implement to strengthen the causal interpretation and experimental controls.
Point-by-point responses
- Referee: [§4.2] Belief Probing Methods: The observation-belief gap rests on linear or activation-based probes recovering latent game-state representations. No causal validation is provided (e.g., activation editing or patching experiments showing that altering probed states changes downstream actions or payoffs). Without this, higher probe accuracy versus verbal reports and reported brittleness may reflect correlated surface features rather than causal internal mechanisms, weakening the claim of a fundamental gap.
Authors: We agree that causal interventions such as activation patching would offer stronger evidence that the probed representations directly influence downstream actions. Our current results rely on the substantial accuracy advantage of probes over verbal reports together with the specific, reproducible brittleness patterns (multi-hop degradation, primacy/recency effects, and Bayesian drift). These patterns are difficult to attribute solely to surface correlations, yet we acknowledge the correlational nature of the evidence. In the revised manuscript we will add an explicit limitations subsection that states the absence of causal validation and outlines how activation-editing experiments could be conducted in follow-up work. We will also tighten the language around 'internal beliefs' to reflect the probe-based inference. revision: partial
- Referee: [§5.3] Belief-Action Experiments: The belief-action gap compares implicit conversion to externalized beliefs in prompts, but the externalization baseline is not shown to be informationally equivalent. The finding that neither yields higher payoffs consistently suggests the gaps may not be load-bearing for performance failures; additional controls (e.g., payoff breakdowns by game length or information depth) are needed to establish that the gaps explain strategic struggles.
Authors: We will add a verification step showing that the externalized belief prompts contain the same state information recovered by the probes (measured by probe accuracy on the prompted text). The consistent failure of both implicit and explicit belief conditioning to raise payoffs is, in our view, direct support for a belief-action gap rather than evidence against it; it indicates that accurate beliefs alone are insufficient for strategic action selection. To strengthen the link to performance, we will include the requested controls: payoff breakdowns stratified by game length and information depth. These analyses will be added to §5.3 and the appendix. revision: yes
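The promised stratified controls amount to a small grouping computation over the game logs. A minimal sketch, assuming a per-game table with hypothetical columns condition, payoff, and n_turns (not the authors' actual schema):

```python
import numpy as np
import pandas as pd

# games: one row per game; condition in {"implicit", "externalized"},
# payoff: realized game payoff, n_turns: game length.
def payoff_by_length(games: pd.DataFrame, bins=(0, 5, 10, 20)):
    """Stratify mean payoff by game length for each belief-conditioning mode."""
    games = games.copy()
    games["length_bin"] = pd.cut(games["n_turns"], bins=list(bins) + [np.inf])
    return (games.groupby(["condition", "length_bin"], observed=True)["payoff"]
                 .agg(["mean", "sem", "count"]))
```

If the belief-action gap drives the payoff failures, the externalized condition's advantage should appear, or vanish, systematically across the length strata rather than at random.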
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential constructions
Full rationale
The paper contains no equations, parameter fits, uniqueness theorems, or ansatzes. Its central claims rest on direct experimental comparisons of probe accuracies, verbal reports, and action payoffs across three model families in specific games. These measurements are independent of any self-citation chain or definitional loop; the observation-belief and belief-action gaps are reported outcomes of the probes and game logs rather than quantities defined in terms of themselves. No load-bearing step reduces to its own inputs by construction.