
arxiv: 2605.30219 · v1 · pith:UHE23H3Knew · submitted 2026-05-28 · 💻 cs.AI · cs.CL · cs.LG

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

Pith reviewed 2026-06-29 07:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG
keywords contextual belief management · belief tracking · large language models · reinforcement learning · belief state rewards · BeliefTrack benchmark · LLM failure modes

The pith

Reinforcement learning with belief-state rewards reduces contextual belief management failures in language models by 70.9% on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines contextual belief management as the task of maintaining a belief state aligned with formal evidence while filtering out task-irrelevant noise across long interactions. It introduces BeliefTrack, a closed-world benchmark with rule discovery and circuit diagnosis tasks that uses a finite belief space and symbolic verifiers to score models exactly at each turn. Vanilla models exhibit high rates of failed stays, failed updates, and failed isolations, and adding explicit belief-tracking prompts yields only modest improvement. Reinforcement learning that supplies rewards based on the current belief state cuts average failure rates by 70.9 percent, while representation-level steering on the same models cuts failures by 46.1 percent on two tasks.

Core claim

Contextual belief management requires language models to decide when to update, preserve, or ignore information based on formal evidence. In the closed-world BeliefTrack benchmark, reinforcement learning using rewards derived from the belief state reduces the average failure rate by 70.9 percent across models, whereas explicit belief-tracking prompts provide only limited improvement. Further probing reveals that models possess latent belief-state dynamics, and that representation-level steering of those dynamics achieves a 46.1 percent reduction in failures on two tasks.
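
A minimal sketch of what a belief-state reward could look like, assuming the belief state is a set of candidate hypotheses and a verifier supplies the oracle set at every turn. The Jaccard form, the function names, and the per-episode mean are illustrative assumptions; the reward actually used in the paper is not specified in the text reviewed here.

```python
from typing import List, Set


def belief_state_reward(predicted: Set[str], oracle: Set[str]) -> float:
    """Jaccard overlap between predicted and oracle belief states (assumed reward shape)."""
    if not predicted and not oracle:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & oracle) / len(predicted | oracle)


def episode_return(predictions: List[Set[str]], oracles: List[Set[str]]) -> float:
    """Mean turn-level reward over one interaction (hypothetical aggregation)."""
    if len(predictions) != len(oracles):
        raise ValueError("one prediction is required per turn")
    if not predictions:
        return 0.0
    rewards = [belief_state_reward(p, o) for p, o in zip(predictions, oracles)]
    return sum(rewards) / len(rewards)
```

A graded overlap gives partial credit for nearly correct belief states; an exact-match reward would be the stricter alternative and corresponds more directly to the exact turn-level scoring the benchmark describes.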

What carries the argument

BeliefTrack benchmark, which supplies a finite belief space and symbolic verifiers for exact turn-level measurement of belief update, stay, and isolation in rule discovery and circuit diagnosis.
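
To make "finite belief space plus symbolic verifier" concrete, here is a toy rule-discovery sketch. The four candidate rules, the integer probes, and all names are hypothetical and are not taken from BeliefTrack; the point is only that, in a closed world, the set of hypotheses consistent with the evidence can be computed exactly and compared with the model's prediction at each turn.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical finite belief space: candidate rules over integers.
HYPOTHESES: Dict[str, Callable[[int], bool]] = {
    "even": lambda x: x % 2 == 0,
    "positive": lambda x: x > 0,
    "multiple_of_4": lambda x: x % 4 == 0,
    "greater_than_5": lambda x: x > 5,
}

Evidence = Tuple[int, bool]  # (probe value, oracle YES/NO label)


def oracle_belief_state(evidence: List[Evidence]) -> FrozenSet[str]:
    """Return every hypothesis consistent with all formal evidence so far."""
    return frozenset(
        name
        for name, rule in HYPOTHESES.items()
        if all(rule(x) == label for x, label in evidence)
    )


def score_turn(predicted: FrozenSet[str], evidence: List[Evidence]) -> bool:
    """Exact turn-level check: the predicted set must equal the oracle set."""
    return predicted == oracle_belief_state(evidence)


history: List[Evidence] = [(8, True), (2, True), (-2, True)]
final_state = oracle_belief_state(history)  # only "even" survives all three probes
```

Because the space is enumerable, scoring needs no judge model: the verifier is a set comparison.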

If this is right

  • Language models can learn to handle accumulating information more reliably through reinforcement that targets belief-state accuracy.
  • Three distinct failure modes—failed stay, failed update, and failed isolation—can be diagnosed separately and reduced through the same training signal.
  • Internal representations of belief states exist in the models and can be read out or adjusted at the representation level without retraining the full policy.
  • Explicit prompting alone does not produce robust belief management even when the task is described clearly.
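
The three failure modes can be separated with a small turn-level classifier. The operational definitions below are assumptions made for illustration (the text reviewed here names the modes without defining them exactly): a noise turn that leads to a wrong state counts as Failed Isolation, a formal-evidence turn that changes the oracle state but is not tracked counts as Failed Update, and a formal-evidence turn that leaves the oracle state unchanged but is followed by a wrong state counts as Failed Stay.

```python
from typing import FrozenSet, Optional


def classify_turn(
    prev_oracle: FrozenSet[str],
    new_oracle: FrozenSet[str],
    predicted: FrozenSet[str],
    is_noise: bool,
) -> Optional[str]:
    """Label one turn with an assumed failure taxonomy; None means the turn is correct."""
    if predicted == new_oracle:
        return None
    if is_noise:
        # Task-irrelevant input should leave the belief state untouched.
        return "failed_isolation"
    if new_oracle != prev_oracle:
        # Formal evidence required a revision that the model did not make correctly.
        return "failed_update"
    # Formal evidence required no revision, yet the model's state is wrong.
    return "failed_stay"
```

Under these definitions every incorrect turn receives exactly one label, so per-mode failure rates sum to the overall failure rate.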

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If approximate verifiers can be learned or replaced by self-consistency checks, the same reward approach may transfer to open-world agent settings.
  • The existence of steerable latent belief dynamics suggests that current models already track approximate states but do not use them effectively without explicit training.
  • The framework connects naturally to memory mechanisms in multi-turn agents where consistency over long horizons is required.
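
For readers unfamiliar with representation-level steering, a generic difference-of-means sketch follows. This is a common steering recipe, not necessarily the one used in the paper; the activation vectors are plain Python lists, and every name and the scaling factor `alpha` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(success_acts: List[Vector], failure_acts: List[Vector]) -> Vector:
    """Direction from the mean 'failure' activation to the mean 'success' activation."""
    mu_success = mean_vector(success_acts)
    mu_failure = mean_vector(failure_acts)
    return [s - f for s, f in zip(mu_success, mu_failure)]


def steer(hidden: Vector, direction: Vector, alpha: float = 1.0) -> Vector:
    """Shift one hidden state along the steering direction by a factor alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


direction = steering_vector(
    success_acts=[[1.0, 0.0], [3.0, 0.0]],
    failure_acts=[[0.0, 2.0], [0.0, 4.0]],
)
shifted = steer([0.0, 3.0], direction)  # moves the mean failure state onto the mean success state
```

In a real model the vectors would be hidden activations collected at a chosen layer, and the shift would be applied during the forward pass; neither step is shown here.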

Load-bearing premise

The closed-world setting with a finite belief space and symbolic verifiers in BeliefTrack accurately captures the belief-management demands that arise in open-ended, noisy real-world LLM interactions.

What would settle it

Running the same trained models on open-ended tasks that lack symbolic verifiers or a known finite belief space and checking whether the 70.9 percent failure reduction persists.
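
When repeating the measurement in a new setting, it matters how the "average" reduction is computed. The sketch below, using made-up failure rates, contrasts the mean of per-model relative reductions with the relative reduction of the mean failure rate; the function names are hypothetical, and which of the two the paper reports is not stated in the text reviewed here.

```python
from typing import List, Tuple


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the baseline failure rate that was removed."""
    if before <= 0:
        raise ValueError("baseline failure rate must be positive")
    return (before - after) / before


def summarize(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (mean of per-model reductions, reduction of the mean rates)."""
    per_model = [relative_reduction(b, a) for b, a in pairs]
    mean_before = sum(b for b, _ in pairs) / len(pairs)
    mean_after = sum(a for _, a in pairs) / len(pairs)
    return sum(per_model) / len(per_model), relative_reduction(mean_before, mean_after)


# Made-up (before, after) failure rates for two models.
macro, pooled = summarize([(0.5, 0.1), (0.2, 0.1)])
```

With these illustrative numbers the two summaries differ (about 0.65 versus about 0.71), so a replication should report which one it uses.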

Figures

Figures reproduced from arXiv: 2605.30219 by Chiyu Wu, Haoming Xu, Jin Shang, Mengru Wang, Shumin Deng, Weihong Xu, Yu Gong, Yunzhi Yao, Zongrui Li.

Figure 1: Overview of Contextual Belief Management (CBM). CBM requires models to maintain a predicted belief state over a belief space, update it only when warranted by formal evidence, and filter task-irrelevant context or noise. The pilot Rule Discovery study reveals substantial belief-management errors in frontier models.
Figure 2: Comparison between Contextual Belief Management and Theory of Mind.
Figure 3: BeliefTrack framework. Given a finite belief space, the model must output a predicted belief state, i.e., …
Figure 4: Effects of temporal stress and task-irrelevant context on CBM.
Figure 5: Mechanistic probing and steering of CBM failures. (a) Probing compares Vanilla and RL by tracking the …
Figure 6: RL training dynamics across checkpoints.
Figure 7: Prompt Templates A
Figure 8: Prompt Templates B
Figure 9: Belief-State Drift and Backtracking Failure. As conversational depth increases or explicit corrections are introduced, models struggle to maintain the active set of candidate hypotheses. This manifests as either misremembering previous formal evidence, such as treating a past NO label as YES, or abandoning candidate hypotheses that remain supported by the accumulated formal evidence and recalculating from …
Figure 10: Contextual Hijacking. When task-irrelevant context/noise is injected, models frequently abandon formal verification. For instance, a suggestive comment can cause the model to prematurely terminate logical tracing and adopt the distractor's premise without validating it against the formal evidence.
Figure 11: Latent-Output Gap. A frequent failure occurs when the Vanilla model ranks oracle-supported candidate hypotheses highly in its intermediate reasoning, but still outputs an incorrect or contradictory final predicted belief state. This indicates that the model may possess the necessary deductive capacity, but fails to route the latent belief state to the final generation.
Original abstract

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. (Footnote: Code is coming soon at https://github.com/zjunlp/CBM.)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Contextual Belief Management (CBM) as the challenge of maintaining belief states aligned with accumulating evidence while ignoring task-irrelevant noise in long-horizon LLM interactions. It presents BeliefTrack, a closed-world benchmark with Rule Discovery and Circuit Diagnosis tasks that uses a finite belief space and symbolic verifiers to enable exact turn-level evaluation of three failure modes (Failed Stay, Failed Update, Failed Isolation). Vanilla LLMs show high failure rates, explicit prompts yield limited gains, reinforcement learning with belief-state rewards reduces failures by 70.9% on average, and representation-level steering reduces failures by 46.1% across two tasks.

Significance. If the results hold, the work supplies a controlled benchmark that permits precise diagnosis of belief-management failures and demonstrates that RL and steering can produce large quantitative improvements within that setting. The closed-world design with symbolic verification is a methodological strength that enables reproducible, turn-level metrics unavailable in open text. These elements could support future work on reliable long-horizon reasoning, provided the methods transfer beyond the oracle-assisted benchmark.

major comments (2)
  1. [§2 and RL experiments] §2 (BeliefTrack) and the RL training description: the 70.9% failure-rate reduction is obtained by optimizing against belief-state rewards supplied by the benchmark's symbolic verifiers; this closed-world oracle is unavailable in open-ended interactions, so the central empirical claim does not yet establish that the same procedure will produce comparable CBM behavior when belief states must be inferred from noisy text without symbolic verification.
  2. [Results] Results section reporting the 70.9% and 46.1% figures: the abstract and summary provide no information on the LLMs tested (sizes, families), number of runs, statistical tests, or variance; without these details the magnitude of the reported deltas cannot be assessed for robustness.
minor comments (2)
  1. [Abstract] The footnote states that code is coming soon; releasing the BeliefTrack implementation and RL/steering code at submission time would strengthen reproducibility claims.
  2. [§2] The three failure modes are named but their precise operational definitions (e.g., exact conditions for Failed Isolation) should be stated with a short example from each task.
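
Major comment 2 asks for variance and statistical tests. One way such a check could be sketched is a percentile bootstrap over per-episode failure indicators, shown below with made-up data. The function name, the resample count, and the 0/1 encoding are assumptions for illustration, not the authors' evaluation protocol.

```python
import random
from typing import List, Tuple


def bootstrap_reduction_ci(
    before: List[int],
    after: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile CI for the relative failure-rate reduction.

    `before` and `after` hold 0/1 failure indicators, one per evaluated episode.
    """
    rng = random.Random(seed)

    def reduction(b: List[int], a: List[int]) -> float:
        rate_b = sum(b) / len(b)
        rate_a = sum(a) / len(a)
        return (rate_b - rate_a) / rate_b if rate_b > 0 else 0.0

    point = reduction(before, after)
    samples = []
    for _ in range(n_resamples):
        # Resample episodes with replacement, independently for each condition.
        b = [rng.choice(before) for _ in before]
        a = [rng.choice(after) for _ in after]
        samples.append(reduction(b, a))
    samples.sort()
    lo = samples[int((alpha / 2) * n_resamples)]
    hi = samples[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Made-up data: 40/100 failures before training, 12/100 after.
point, lo, hi = bootstrap_reduction_ci([1] * 40 + [0] * 60, [1] * 12 + [0] * 88)
```

Reporting the interval alongside the point estimate would let readers judge whether a headline reduction is robust to episode sampling; variance across training seeds would need a separate analysis.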

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major point below, clarifying the scope of our claims while committing to revisions where appropriate.

  1. Referee: [§2 and RL experiments] §2 (BeliefTrack) and the RL training description: the 70.9% failure-rate reduction is obtained by optimizing against belief-state rewards supplied by the benchmark's symbolic verifiers; this closed-world oracle is unavailable in open-ended interactions, so the central empirical claim does not yet establish that the same procedure will produce comparable CBM behavior when belief states must be inferred from noisy text without symbolic verification.

    Authors: BeliefTrack is explicitly constructed as a closed-world benchmark with finite belief spaces and symbolic verifiers to permit exact, turn-level diagnosis of the three failure modes. The RL results show that, when accurate belief-state rewards are available, optimization yields large reductions in failures. This serves as a controlled demonstration that reward signals aligned with belief states can address CBM issues. We agree that the current experiments do not establish transfer to open-ended settings where rewards must be inferred without oracles. We will add explicit language in the introduction and a new limitations subsection stating the scope of the claims, noting the oracle dependency, and outlining future directions for reward modeling from noisy text. revision: partial

  2. Referee: [Results] Results section reporting the 70.9% and 46.1% figures: the abstract and summary provide no information on the LLMs tested (sizes, families), number of runs, statistical tests, or variance; without these details the magnitude of the reported deltas cannot be assessed for robustness.

    Authors: We acknowledge that the abstract and high-level summary omit these experimental details. The full manuscript evaluates multiple LLMs and reports averaged results, but we will revise the abstract to name the model families and sizes, add a dedicated experimental setup subsection specifying the number of runs, variance measures, and statistical tests (e.g., significance testing of the reported improvements), and include a summary table of these metrics to allow assessment of robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on new benchmark with no definitional or fitted reductions

The paper introduces the BeliefTrack benchmark to quantify Contextual Belief Management failures and reports measured improvements (70.9% failure reduction via RL with belief-state rewards; 46.1% via representation steering). These are direct experimental outcomes on the closed-world tasks, not quantities obtained by fitting parameters to the evaluation data and then relabeling the fit as a prediction. No equations, self-definitional loops, or load-bearing self-citations appear in the provided text; the central claims rest on observable performance deltas rather than any derivation that reduces to its own inputs by construction. The benchmark design enables the measurements but does not force the reported gains mathematically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is framed as empirical benchmarking rather than theoretical derivation.

pith-pipeline@v0.9.1-grok · 5733 in / 1156 out tokens · 26105 ms · 2026-06-29T07:14:33.662337+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

    cs.AI 2026-06 unverdicted novelty 6.0

    MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.

Reference graph

Works this paper leans on

53 extracted references · 37 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  1. [1]

    Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, and Ming Jin. 2026. https://arxiv.org/abs/2605.17830 Remembering more, risking more: Longitudinal safety risks in memory-equipped llm agents . Preprint, arXiv:2605.17830

  2. [2]

    Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, and Difan Zou. 2026a. https://arxiv.org/abs/2603.04783 Breaking contextual inertia: Reinforcement learning with single-turn anchors for stable multi-turn interaction. Preprint, arXiv:2603.04783

  3. [3]

    Yining Chen, Jihao Zhao, Bo Tang, Haofen Wang, Yue Zhang, Fei Huang, Feiyu Xiong, and Zhiyu Li. 2026b. https://arxiv.org/abs/2605.09530 Memprivacy: Privacy-preserving personalized memory management for edge-cloud agents. Preprint, arXiv:2605.09530

  4. [4]

    Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, and Minlie Huang. 2024. https://doi.org/10.18653/v1/2024.acl-long.847 ToMBench: Benchmarking theory of mind in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ...

  5. [5]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  6. [6]

    Logan Cross, Violet Xiang, Agam Bhatia, Daniel LK Yamins, and Nick Haber. 2025. https://openreview.net/forum?id=otW0TJOUYF Hypothetical minds: Scaffolding theory of mind for multi-agent tasks with large language models . In The Thirteenth International Conference on Learning Representations

  7. [7]

    DeepSeek-AI, Aixin Liu, et al. 2025. https://arxiv.org/abs/2512.02556 Deepseek-v3.2: Pushing the frontier of open large language models . Preprint, arXiv:2512.02556

  8. [8]

    Jie Deng, Shining Liang, Jun Li, Hongzhi Li, and Yutao Xie. 2026. https://arxiv.org/abs/2602.01472 Conpress: Learning efficient reasoning from multi-question contextual pressure . Preprint, arXiv:2602.01472

  9. [9]

    Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. 2026. https://arxiv....

  10. [10]

    Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, and Zibin Zheng. 2025. https://arxiv.org/abs/2507.00699 A hierarchical and evolvable benchmark for fine-grained code instruction following with multi-turn feedback . Preprint, arXiv:2507.00699

  11. [11]

    Lovisa Hagström, Youna Kim, Haeun Yu, Sang goo Lee, Richard Johansson, Hyunsoo Cho, and Isabelle Augenstein. 2026. https://arxiv.org/abs/2505.16518 Cub: Benchmarking context utilisation techniques for language models . Preprint, arXiv:2505.16518

  12. [12]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ Measuring massive multitask language understanding . In International Conference on Learning Representations

  13. [13]

    Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, and Jacob Andreas. 2026. https://arxiv.org/abs/2602.24287 Do llms benefit from their own words? Preprint, arXiv:2602.24287

  14. [14]

    Zhuoran Jin, Pengfei Cao, Hongbang Yuan, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, and Jun Zhao. 2024. https://doi.org/10.18653/v1/2024.findings-acl.70 Cutting off the head ends the conflict: A mechanism for interpreting and mitigating knowledge conflicts in language models . In Findings of the Association for Computational Linguistics: ...

  15. [15]

    Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.890 FANToM: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397--14413, Singapore. Assoc...

  16. [16]

    Evgenii Kortukov, Alexander Rubinstein, Elisa Nguyen, and Seong Joon Oh. 2024. https://openreview.net/forum?id=xm8zYRfrqE Studying large language model behaviors under context-memory conflicts with real documents . In First Conference on Language Modeling

  17. [17]

    Michal Kosinski. 2024. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45):e2405460121

  18. [18]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  19. [19]

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. 2026. https://openreview.net/forum?id=VKGTGGcwl6 LLMs get lost in multi-turn conversation. In The Fourteenth International Conference on Learning Representations

  20. [20]

    Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. 2026. https://arxiv.org/abs/2603.28052 Meta-harness: End-to-end optimization of model harnesses . Preprint, arXiv:2603.28052

  21. [21]

    Grace Liu, Yuxiao Qu, Jeff Schneider, Aarti Singh, and Aviral Kumar. 2025. https://arxiv.org/abs/2510.08517 Cart: Teaching llm agents to know when they know enough . Preprint, arXiv:2510.08517

  22. [22]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173

  23. [23]

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.565 Entity-based knowledge conflicts in question answering . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052--7063, Online and Punta Cana, Dominican Republic....

  24. [24]

    Praveen Kumar Myakala, Manan Agrawal, and Rahul Manche. 2026. https://arxiv.org/abs/2603.23848 Beliefshift: Benchmarking temporal belief consistency and opinion drift in llm agents . Preprint, arXiv:2603.23848

  25. [25]

    Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, and Atticus Geiger. 2026. https://openreview.net/forum?id=6gO6KTRMpG Language models use lookbacks to track beliefs . In The Fourteenth International Conference on Learning Representations

  26. [26]

    Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  27. [27]

    Ashutosh Raj. 2026. https://doi.org/10.5281/ZENODO.19356182 Llm psychosis: A theoretical and diagnostic framework for reality-boundary failures in large language models

  28. [28]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://arxiv.org/abs/2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models . Preprint, arXiv:2402.03300

  29. [29]

    Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. 2024. Clever hans or neural theory of mind? stress testing social reasoning in large language models. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers),...

  30. [30]

    Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, and Tianmin Shu. 2025. https://doi.org/10.1609/aaai.v39i2.32142 Muma-tom: multi-modal multi-agent theory of mind . In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fi...

  31. [31]

    Aaditya Singh, Adam Fry, et al. 2026. https://arxiv.org/abs/2601.03267 Openai gpt-5 system card . Preprint, arXiv:2601.03267

  32. [32]

    James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, and Cristina Becchio. 2024a. https://doi.org/10.1038/s41562-024-01882-z Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285--1295

  33. [33]

    James WA Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, et al. 2024b. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285--1295

  34. [34]

    Qwen Team. 2026. https://qwen.ai/blog?id=qwen3.5 Qwen3.5: Accelerating productivity with native multimodal agents

  35. [35]

    Tomer Ullman. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. In arXiv preprint arXiv:2302.08399

  36. [36]

    Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. 2024. https://openreview.net/forum?id=ptvV5HGTNN Resolving knowledge conflicts in large language models . In First Conference on Language Modeling

  37. [37]

    Bryan Wilie, Samuel Cahyawijaya, Etsuko Ishii, Junxian He, and Pascale Fung. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.586 Belief revision: The adaptability of large language models reasoning . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10480--10496, Miami, Florida, USA. Association for Computa...

  38. [38]

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2024. https://doi.org/10.48550/arXiv.2410.10813 Longmemeval: Benchmarking chat assistants on long-term interactive memory . CoRR, abs/2410.10813

  39. [39]

    Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. 2024. https://openreview.net/forum?id=auKAUJZMO6 Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts . In The Twelfth International Conference on Learning Representations

  40. [40]

    Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, and Yulan He. 2024a. Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8593--8623

  41. [41]

    Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, and Ningyu Zhang. 2026. https://arxiv.org/abs/2601.05905 Illusions of confidence? diagnosing llm truthfulness via neighborhood consistency . Preprint, arXiv:2601.05905

  42. [42]

    Rongwu Xu, Brian Lin, Shujian Yang, Tianqi Zhang, Weiyan Shi, Tianwei Zhang, Zhixuan Fang, Wei Xu, and Han Qiu. 2024b. https://doi.org/10.18653/v1/2024.acl-long.858 The earth is flat because...: Investigating LLMs' belief towards misinformation via persuasive conversation. In Proceedings of the 62nd Annual Meeting of the Association for Computational ...

  43. [44]

    Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. 2024d. https://doi.org/10.18653/v1/2024.emnlp-main.486 Knowledge conflicts for LLMs: A survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8541--8565, Miami, Florida, USA. Association for Computational Linguistics

  44. [45]

    Noam Steinmetz Yalon, Ariel Goldstein, Liad Mudrik, and Mor Geva. 2026. https://arxiv.org/abs/2602.02467 Indications of belief-guided agency and meta-cognitive monitoring in large language models . Preprint, arXiv:2602.02467

  45. [46]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: agent-computer interfaces enable automated software engineering. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA. Curran Associates Inc

  46. [47]

    Yunzhi Yao, Jiaxin Qin, Ningyu Zhang, Haoming Xu, Yuqi Zhu, Zeping Yu, Mengru Wang, Yuqi Tang, Jia-Chen Gu, Shumin Deng, Nanyun Peng, and Huajun Chen. 2025. https://doi.org/10.36227/techrxiv.176240454.46531513/v1 Rethinking knowledge editing in reasoning era . TechRxiv, 2025(1106)

  47. [48]

    Gal Yona, Mor Geva, and Yossi Matias. 2026. https://arxiv.org/abs/2605.01428 Hallucinations undermine trust; metacognition is a way forward . Preprint, arXiv:2605.01428

  48. [49]

    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, and Manling Li. 2026. Theory of space: Can foundation models construct spatial beliefs through active exploration? arXiv preprint arXiv:2602.07055

  49. [50]

    Yiran Zhang, Mo Wang, Xiaoyang Li, Kaixuan Ren, Chencheng Zhu, and Usman Naseem. 2025. https://doi.org/10.18653/v1/2025.findings-emnlp.1084 TurnBench-MS: A benchmark for evaluating multi-turn, multi-step reasoning in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19892--19924, Suzhou, China. As...

  50. [51]

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2024. https://arxiv.org/abs/2408.05517 Swift: a scalable lightweight infrastructure for fine-tuning. Preprint, arXiv:2408.05517

  51. [52]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. https://webarena.dev Webarena: A realistic web environment for building autonomous agents . arXiv preprint arXiv:2307.13854
