Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions
Pith reviewed 2026-05-09 20:08 UTC · model grok-4.3
The pith
LLMs maintain more accurate internal beliefs about hidden game states than they report verbally, yet they convert those internal beliefs into actions less effectively than when the same beliefs are spelled out in the prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In incomplete-information games, LLMs encode internal beliefs about latent game states that are substantially more accurate than their verbal reports, yet these beliefs degrade with multi-hop reasoning, exhibit primacy and recency biases, and drift away from Bayesian coherence over extended interactions. The implicit conversion of internal beliefs into actions is weaker than that of beliefs externalized in the prompt, yet neither belief-conditioning approach consistently achieves higher game payoffs.
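The Bayesian-coherence claim is directly measurable: at each turn, compare the probe-decoded belief distribution to the exact posterior a Bayesian observer would hold given the same observations. A minimal sketch of that drift measurement, assuming per-turn probed belief distributions and known observation likelihoods (illustrative inputs, not the paper's code):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete belief distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def bayesian_posterior(prior, likelihoods):
    """Exact Bayesian update: posterior is proportional to prior * likelihood."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihoods, dtype=float)
    return post / post.sum()

def coherence_drift(probed_beliefs, prior, likelihood_per_turn):
    """Per-turn KL between the probe-decoded belief and the exact posterior.

    probed_beliefs: list of arrays over latent states, one per game turn.
    likelihood_per_turn: list of arrays P(observation_t | state) per turn.
    """
    drift, posterior = [], np.asarray(prior, dtype=float)
    for belief, lik in zip(probed_beliefs, likelihood_per_turn):
        posterior = bayesian_posterior(posterior, lik)
        drift.append(kl_divergence(belief, posterior))
    return drift
```

A drift curve that trends upward over turns is the signature the core claim describes.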
What carries the argument
Probing of internal belief states, compared against verbal reports and chosen actions, in LLMs playing incomplete-information games; the comparison exposes the observation-belief and belief-action gaps.
Load-bearing premise
The chosen probing methods reliably extract the model's true internal belief states, making the comparison with its verbal reports and chosen actions meaningful.
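For concreteness, the kind of probe this premise refers to can be as simple as a linear classifier trained on cached hidden activations. A minimal sketch, assuming activations, ground-truth latent states, and verbal predictions extracted from game logs (the array names and the probe choice are illustrative, not the authors' exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_turns, d_model) hidden activations cached at a chosen layer while the
#    model plays; y: (n_turns,) ground-truth latent state (e.g., opponent's card);
# verbal_preds: the state the model names when asked directly. All three arrays
# are assumed inputs here, standing in for the authors' game logs.
def observation_belief_gap(X, y, verbal_preds, seed=0):
    X_tr, X_te, y_tr, y_te, _, v_te = train_test_split(
        X, y, verbal_preds, test_size=0.3, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probe_acc = probe.score(X_te, y_te)        # accuracy of internal beliefs
    verbal_acc = float(np.mean(v_te == y_te))  # accuracy of verbal reports
    return probe_acc, verbal_acc               # gap = probe_acc - verbal_acc
```

The premise holds only if probe_acc reflects a genuine internal representation rather than an artifact of probe capacity, which is exactly what the referee's causal-validation comment below presses on.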
What would settle it
New experiments in which verbal reports match the accuracy of probed internal beliefs, and in which actions conditioned only on implicit beliefs perform at least as well as actions conditioned on prompted beliefs, would falsify the claimed gaps.
Original abstract
Large language models (LLMs) are increasingly tasked with strategic decision-making under incomplete information, such as in negotiation and policymaking. While LLMs can excel at many such tasks, they also fail in ways that are poorly understood. We shed light on these failures by uncovering two fundamental gaps in the internal mechanisms underlying the decision-making of LLMs in incomplete-information games, supported by experiments with open-weight models Llama 3.1, Qwen3, and gpt-oss. First, an observation-belief gap: LLMs encode internal beliefs about latent game states that are substantially more accurate than their own verbal reports, yet these beliefs are brittle. In particular, the belief accuracy degrades with multi-hop reasoning, exhibits primacy and recency biases, and drifts away from Bayesian coherence over extended interactions. Second, a belief-action gap: The implicit conversion of internal beliefs into actions is weaker than that of the beliefs externalized in the prompt, yet neither belief-conditioning consistently achieves higher game payoffs. These results show how analyzing LLMs' internal processes can expose systematic vulnerabilities that warrant caution before deploying LLMs in strategic domains without robust guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit two fundamental gaps in strategic decision-making under incomplete information: an observation-belief gap, where internal beliefs extracted via probes are substantially more accurate than verbal reports but brittle (degrading under multi-hop reasoning, showing primacy/recency biases, and drifting from Bayesian coherence), and a belief-action gap, where implicit conversion of beliefs to actions is weaker than when beliefs are externalized in prompts, though neither consistently improves game payoffs. These are demonstrated through experiments on Llama 3.1, Qwen3, and gpt-oss in incomplete-information games such as negotiation.
Significance. If the results hold after addressing methodological concerns, the work is significant for mechanistic interpretability and AI safety. It offers concrete evidence of internal LLM limitations in game-theoretic settings, extending probing techniques to strategic domains and highlighting risks for deployment in negotiation or policymaking without safeguards. The empirical focus on open-weight models supports reproducibility.
major comments (2)
- [§4.2] Belief Probing Methods: The observation-belief gap rests on linear or activation-based probes recovering latent game-state representations. No causal validation is provided (e.g., activation editing or patching experiments showing that altering probed states changes downstream actions or payoffs). Without this, higher probe accuracy versus verbal reports, and the reported brittleness, may reflect correlated surface features rather than causal internal mechanisms, weakening the claim of a fundamental gap. (A minimal patching sketch follows these comments.)
- [§5.3] Belief-Action Experiments: The belief-action gap compares implicit conversion to externalized beliefs in prompts, but the externalization baseline is not shown to be informationally equivalent. The finding that neither yields higher payoffs consistently suggests the gaps may not be load-bearing for performance failures; additional controls (e.g., payoff breakdowns by game length or information depth) are needed to establish that the gaps explain strategic struggles.
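On the first major comment: the requested causal validation need not be elaborate. A minimal activation-steering sketch, assuming a Llama-style Hugging Face checkpoint and a probe-derived direction belief_vec (both hypothetical here; the paper reports no such experiment):

```python
import torch

def patch_and_compare(model, tokenizer, prompt, layer, belief_vec, alpha=4.0):
    # Steer the residual stream along a probe-derived belief direction at one
    # layer, then compare next-token (action) logits with and without the edit.
    # If probed beliefs causally drive actions, the chosen action should shift.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] += alpha * belief_vec.to(hidden.dtype)  # edit last token
        return output

    with torch.no_grad():
        base = model(**inputs).logits[0, -1]
        # `model.model.layers` matches Llama-style checkpoints; other model
        # families expose decoder blocks under different attribute names.
        handle = model.model.layers[layer].register_forward_hook(hook)
        patched = model(**inputs).logits[0, -1]
        handle.remove()
    return base.argmax().item(), patched.argmax().item()
```

Systematic shifts in the patched action that track the injected belief would upgrade the correlational probing evidence to causal evidence.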
minor comments (2)
- [Abstract] Abstract and §3: 'gpt-oss' should be fully specified with citation or repository link for reproducibility; clarify whether it refers to a specific open-source GPT variant.
- [Results] Figure 4 (or equivalent results table): Error bars or statistical significance tests for belief accuracy differences across models and game lengths would strengthen the brittleness claims (a paired-bootstrap sketch follows below).
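The requested uncertainty estimates are cheap to supply with a paired bootstrap over game turns; a minimal sketch, with the boolean correctness arrays assumed to come from the paper's logs:

```python
import numpy as np

def bootstrap_accuracy_gap(probe_correct, verbal_correct, n_boot=10_000, seed=0):
    """Paired bootstrap CI for (probe accuracy - verbal accuracy).

    probe_correct / verbal_correct: boolean arrays over the same game turns.
    Returns the point estimate and a 95% percentile interval; an interval
    excluding 0 supports a genuine observation-belief gap.
    """
    rng = np.random.default_rng(seed)
    probe_correct = np.asarray(probe_correct, dtype=float)
    verbal_correct = np.asarray(verbal_correct, dtype=float)
    n = len(probe_correct)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample turns with replacement, paired
        diffs[b] = probe_correct[idx].mean() - verbal_correct[idx].mean()
    point = probe_correct.mean() - verbal_correct.mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return point, (lo, hi)
```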
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree and the revisions we will implement to strengthen the causal interpretation and experimental controls.
Point-by-point responses
- Referee: [§4.2] Belief Probing Methods: The observation-belief gap rests on linear or activation-based probes recovering latent game-state representations. No causal validation is provided (e.g., activation editing or patching experiments showing that altering probed states changes downstream actions or payoffs). Without this, higher probe accuracy versus verbal reports and reported brittleness may reflect correlated surface features rather than causal internal mechanisms, weakening the claim of a fundamental gap.
Authors: We agree that causal interventions such as activation patching would offer stronger evidence that the probed representations directly influence downstream actions. Our current results rely on the substantial accuracy advantage of probes over verbal reports together with the specific, reproducible brittleness patterns (multi-hop degradation, primacy/recency effects, and Bayesian drift). These patterns are difficult to attribute solely to surface correlations, yet we acknowledge the correlational nature of the evidence. In the revised manuscript we will add an explicit limitations subsection that states the absence of causal validation and outlines how activation-editing experiments could be conducted in follow-up work. We will also tighten the language around 'internal beliefs' to reflect the probe-based inference. revision: partial
- Referee: [§5.3] Belief-Action Experiments: The belief-action gap compares implicit conversion to externalized beliefs in prompts, but the externalization baseline is not shown to be informationally equivalent. The finding that neither yields higher payoffs consistently suggests the gaps may not be load-bearing for performance failures; additional controls (e.g., payoff breakdowns by game length or information depth) are needed to establish that the gaps explain strategic struggles.
Authors: We will add a verification step showing that the externalized belief prompts contain the same state information recovered by the probes (measured by probe accuracy on the prompted text). The consistent failure of both implicit and explicit belief conditioning to raise payoffs is, in our view, direct support for a belief-action gap rather than evidence against it; it indicates that accurate beliefs alone are insufficient for strategic action selection. To strengthen the link to performance, we will include the requested controls: payoff breakdowns stratified by game length and information depth. These analyses will be added to §5.3 and the appendix. revision: yes
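The promised stratified controls amount to a small grouping computation over the game logs. A minimal sketch, assuming a per-game table with hypothetical columns condition, payoff, and n_turns (not the authors' actual schema):

```python
import numpy as np
import pandas as pd

# games: one row per game; condition in {"implicit", "externalized"},
# payoff: realized game payoff, n_turns: game length.
def payoff_by_length(games: pd.DataFrame, bins=(0, 5, 10, 20)):
    """Stratify mean payoff by game length for each belief-conditioning mode."""
    games = games.copy()
    games["length_bin"] = pd.cut(games["n_turns"], bins=list(bins) + [np.inf])
    return (games.groupby(["condition", "length_bin"], observed=True)["payoff"]
                 .agg(["mean", "sem", "count"]))
```

If the belief-action gap drives the payoff failures, the externalized condition's advantage should appear, or vanish, systematically across the length strata rather than at random.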
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential constructions
Full rationale
The paper contains no equations, parameter fits, uniqueness theorems, or ansatzes. Its central claims rest on direct experimental comparisons of probe accuracies, verbal reports, and action payoffs across three model families in specific games. These measurements are independent of any self-citation chain or definitional loop; the observation-belief and belief-action gaps are reported outcomes of the probes and game logs rather than quantities defined in terms of themselves. No load-bearing step reduces to its own inputs by construction.