Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Ji-An Li; Kayla Lee; Marcelo G. Mattar; Saner Cakir; Sinan Akcali; Sixing Chen

arxiv: 2605.06840 · v5 · pith:N7ZGSYXBnew · submitted 2026-05-07 · 💻 cs.AI


Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM planning · chain-of-thought · search trees · myopic planning · four-in-a-row · reasoning traces · causal intervention · board games


The pith

LLMs expand deep nodes in their reasoning traces, but their move choices are best explained by a myopic model that ignores those nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method for extracting discrete search trees from the free-form chain-of-thought traces that LLMs produce while playing four-in-a-row. Fitting computational models to these trees reveals that LLMs generate traces containing deep nodes yet their actual move decisions align best with a model that attends only to immediate, shallow outcomes. Performance correlates with search breadth rather than depth, and selective pruning of CoT paragraphs shows that removing shallow nodes disrupts decisions more than removing deep ones. This pattern stands in contrast to human play, where deeper search drives better performance.

Core claim

Although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search.

What carries the argument

Extraction of search trees from free-form CoT text in four-in-a-row, followed by model fitting to quantify depth, breadth, and causal influence on move selection.
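The quantities named here can be made concrete with a minimal sketch. It assumes an extracted tree is stored as nested nodes, takes breadth to be the number of root candidate moves and depth to be the maximum ply (the definitions given in the Figure 2 caption), and uses hypothetical names (`Node`, `breadth`, `max_ply`, `tree_size`) and made-up move labels. It is an illustration, not the paper's extraction code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of an extracted search tree (hypothetical structure)."""
    move: Optional[str]  # None for the root, i.e. the current board position
    children: List["Node"] = field(default_factory=list)


def breadth(root: Node) -> int:
    """Number of candidate moves considered at the root."""
    return len(root.children)


def max_ply(node: Node) -> int:
    """Length of the longest simulated move sequence below this node."""
    if not node.children:
        return 0
    return 1 + max(max_ply(child) for child in node.children)


def tree_size(node: Node) -> int:
    """Number of nodes below this node (the root itself is not counted)."""
    return sum(1 + tree_size(child) for child in node.children)


# Illustrative tree: three root candidates, one explored three plies deep.
# The move labels are invented for the example.
tree = Node(None, [
    Node("e2", [Node("e3", [Node("e4")])]),
    Node("d1", [Node("d2")]),
    Node("c3"),
])

print(breadth(tree), max_ply(tree), tree_size(tree))  # prints: 3 3 6
```

A breadth-to-depth ratio of the kind plotted in Figure 2C would then be `breadth(tree) / max_ply(tree)` for trees with at least one expanded move.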

If this is right

LLM performance in this domain is predicted by search breadth rather than depth.
Move selection remains driven by shallow nodes even when deep nodes appear in the generated trace.
Human and LLM planning dissociate because human expertise improves with deeper lookahead while LLM decisions do not.
Selective pruning of reasoning paragraphs can causally test which parts of the trace influence decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Methods that force deep nodes to influence output probabilities could close the gap with human-like planning.
The same extraction approach could be tested in other board games or multi-step reasoning tasks to check whether myopic selection is general.
If the parsing step systematically under-weights deep nodes, the dissociation finding would need re-examination with alternative tree-recovery techniques.

Load-bearing premise

The method of parsing free-form CoT text into discrete search trees faithfully recovers the model's internal deliberation without introducing parsing artifacts or selection biases that favor shallow nodes.

What would settle it

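An experiment in which pruning deep nodes from the traces changes move selection while pruning shallow nodes does not would falsify the myopic claim. The opposite pattern, in which pruning shallow nodes disrupts decisions more than pruning deep ones, is what the paper reports and is what supports the claim.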
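A pruning test of this kind can be sketched as follows. The sketch assumes each paragraph of a trace already carries one of the four labels described in the Figure 4 caption (preamble, branch, final decision, meta) plus the deepest search depth it mentions. The `Paragraph` class, the `prune` function, and the depth threshold of 2 are hypothetical choices made for illustration; the paper's actual editing strategies may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    label: str      # "preamble", "branch", "final_decision", or "meta"
    text: str
    depth: int = 0  # deepest ply mentioned; 0 for non-branch paragraphs


def prune(paragraphs: List[Paragraph], drop: str, threshold: int = 2) -> List[Paragraph]:
    """Return the trace with deep or shallow branch paragraphs removed.

    drop="deep" removes branch paragraphs with depth >= threshold,
    drop="shallow" removes branch paragraphs with depth < threshold.
    Final-decision paragraphs are always removed so that the model has to
    choose a move again from the edited trace (an assumption of this sketch).
    """
    if drop not in ("deep", "shallow"):
        raise ValueError("drop must be 'deep' or 'shallow'")
    kept = []
    for p in paragraphs:
        if p.label == "final_decision":
            continue
        if p.label == "branch":
            is_deep = p.depth >= threshold
            if (drop == "deep") == is_deep:
                continue
        kept.append(p)
    return kept


trace = [
    Paragraph("preamble", "Let me look at the board."),
    Paragraph("branch", "If I play e2, I get three in a row.", depth=1),
    Paragraph("branch", "If I play d1, they block, then I play d2.", depth=3),
    Paragraph("final_decision", "I will play e2."),
]

without_deep = prune(trace, drop="deep")
without_shallow = prune(trace, drop="shallow")
```

Feeding `without_deep` and `without_shallow` back to the model and comparing how often the chosen move changes gives the contrast described above.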

Figures

Figures reproduced from arXiv: 2605.06840 by Ji-An Li, Kayla Lee, Marcelo G. Mattar, Saner Cakir, Sinan Akcali, Sixing Chen.

**Figure 1.** Game setup and search tree extraction. (A) An example board position in the four-in-a-row game. Two players (black and white) alternate placing pieces on a 4 × 9 board, and the first player who achieves four-in-a-row wins the game. (B) Task prompt. The system prompt describes the rules of four-in-a-row, the board representation (FEN notation), and move submission format. The user message provides the curren…

**Figure 2.** Planning effort and game performance across models. (A) Winning rate as a function of search tree size. (B) Search breadth (number of root candidate moves considered) as a function of depth (max ply, i.e., the maximum number of alternating moves simulated ahead) across models. (C) Winning rate as a function of breadth-depth ratio. Dashed lines connect models in a model family. Asterisks denote significance…

**Figure 3.** Predicting moves from extracted search trees with cognitive modeling. (A) Features used in the heuristic value function. Features include connected two-in-a-row (blue), unconnected two-in-a-row (orange), three-in-a-row (purple), a four-in-a-row feature (not shown in the figure), and a central tendency feature. Features with identical colors are constrained to have identical weights. (B) Schematics of compu…

**Figure 4.** Causal intervention on reasoning traces. (A) An LLM judge (Claude Opus 4.7) labels each paragraph of the reasoning trace as preamble, branch, final decision, or meta. Branch paragraphs are associated with a specific candidate move. The judge additionally annotates all moves mentioned within each paragraph, together with their search depths. We then prune the trace according to these labels and feed the pru…

Original abstract

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a method for extracting discrete search trees from LLM chain-of-thought traces in four-in-a-row, fits computational search models to the resulting trees, and reports that LLM move selection is best explained by a myopic model that ignores deep nodes (despite their expansion in traces). A causal paragraph-pruning intervention is used to argue that shallow nodes drive decisions, in contrast to human planning where depth matters more; performance is linked to breadth rather than depth.

Significance. If the extraction procedure is shown to be faithful, the dissociation between expanded deep search and myopic decision-making would be a substantive contribution to understanding LLM planning, offering a concrete contrast with human expertise and a framework extensible to other domains.

major comments (2)

[§3] §3 (Extraction Method): No quantitative validation of the CoT-to-tree parser is reported (e.g., inter-annotator agreement, accuracy on held-out forced-search traces, or sensitivity to shallow-node heuristics). Because all model fits, breadth/depth comparisons, and the myopic-model conclusion rest on the extracted trees, absence of these checks leaves open the possibility of systematic parsing bias favoring shallow nodes.
[§5] §5 (Model Comparison and Causal Intervention): The abstract states that move choices are 'best explained by a myopic model' and that pruning supports shallow-node dominance, yet supplies no effect sizes, cross-validation details, or controls for multiple comparisons across the family of search models. These omissions make it impossible to assess whether the myopic advantage is robust or an artifact of the unvalidated trees.

minor comments (2)

[Figure 2] Figure 2 and Table 1: Axis labels and tree-depth definitions are not fully aligned with the extraction heuristics described in §3, making it difficult to reproduce the reported breadth vs. depth results.
The four-in-a-row domain is introduced without explicit comparison to prior game-tree extraction work in chess or Go; a brief related-work paragraph would clarify novelty.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below.


Referee: [§3] §3 (Extraction Method): No quantitative validation of the CoT-to-tree parser is reported (e.g., inter-annotator agreement, accuracy on held-out forced-search traces, or sensitivity to shallow-node heuristics). Because all model fits, breadth/depth comparisons, and the myopic-model conclusion rest on the extracted trees, absence of these checks leaves open the possibility of systematic parsing bias favoring shallow nodes.

Authors: The referee correctly identifies that we did not report quantitative validation metrics for the parser. To address this, the revised manuscript will include inter-annotator agreement scores on a random sample of 100 traces, parser accuracy evaluated on held-out traces where search was forced to specific depths, and sensitivity analyses showing how varying the heuristics for identifying shallow nodes affects the extracted tree structures and subsequent model fits. These additions will help confirm that the myopic conclusion is not an artifact of parsing bias. revision: yes
Referee: [§5] §5 (Model Comparison and Causal Intervention): The abstract states that move choices are 'best explained by a myopic model' and that pruning supports shallow-node dominance, yet supplies no effect sizes, cross-validation details, or controls for multiple comparisons across the family of search models. These omissions make it impossible to assess whether the myopic advantage is robust or an artifact of the unvalidated trees.

Authors: We agree that additional statistical details are needed. In the revision, we will report effect sizes such as the difference in log-likelihood between the myopic model and alternatives, describe the cross-validation setup (including number of folds and how models were selected), and apply corrections for multiple comparisons across the model family. These will be presented alongside the existing results to demonstrate the robustness of the myopic model's advantage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical extraction, model fitting, and held-out testing are independent of inputs


The paper extracts search trees from new LLM CoT traces in four-in-a-row, fits computational models to those trees, compares fit quality across myopic vs. deep-search models, and validates via held-out move choices plus paragraph-pruning interventions. No equation, parameter fit, or self-citation reduces the central claim (myopic model best explains choices despite deep nodes in traces) to the extraction method or data by construction. The derivation chain remains self-contained against external benchmarks.
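The fit-quality comparison described here can be illustrated with a short sketch. It assumes each model assigns a value to every candidate move and that choices follow a softmax policy over those values, then reports the per-choice difference in negative log-likelihood, Δ = (NLL_myopic − NLL_full) / N, so that Δ < 0 favors the myopic model. The function names, the fixed inverse temperature `beta`, and the toy values are assumptions made for illustration. In the paper the model parameters are fitted rather than fixed.

```python
import math
from typing import List, Sequence


def softmax_nll(values: Sequence[float], chosen: int, beta: float = 1.0) -> float:
    """Negative log-probability of the chosen move under a softmax policy."""
    scaled = [beta * v for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return log_z - scaled[chosen]


def delta_per_choice(
    myopic_values: List[Sequence[float]],
    full_values: List[Sequence[float]],
    choices: List[int],
) -> float:
    """(NLL_myopic - NLL_full) / N over N observed move choices."""
    nll_myopic = sum(softmax_nll(v, c) for v, c in zip(myopic_values, choices))
    nll_full = sum(softmax_nll(v, c) for v, c in zip(full_values, choices))
    return (nll_myopic - nll_full) / len(choices)


# Toy data: two positions with two candidate moves each (values are invented).
# The myopic values favor the moves that were actually chosen; the full-tree
# values are uninformative, so the myopic model should fit better.
myopic_values = [[2.0, 0.0], [0.0, 2.0]]
full_values = [[0.0, 0.0], [0.0, 0.0]]
choices = [0, 1]

delta = delta_per_choice(myopic_values, full_values, choices)
print(delta < 0)  # prints: True
```

Reporting Δ together with its variability across held-out folds is one way to supply the effect sizes the referee report asks for.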

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the untested validity of mapping free-form text to search trees and on the assumption that the fitted myopic model is the appropriate comparator; both are introduced by the paper without external benchmarks.

free parameters (1)

parameters of the computational search models
Fitted to extracted trees to predict which move the LLM selects; their values determine the myopic vs. deep comparison.

axioms (1)

domain assumption: Chain-of-thought traces contain explicit, parsable deliberation that can be mapped onto discrete game-tree nodes without systematic distortion.
Invoked at the start of the extraction method; if false, all downstream tree quantification and model comparisons collapse.



[1] [1]

Measuring Progress on Scalable Oversight for Large Language Models

Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization.SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995

work page 1995

[3] [3]

Rational use of cognitive resources in human planning.Nature human behaviour, 6(8):1112–1125, 2022

Frederick Callaway, Bas Van Opheusden, Sayan Gul, Priyam Das, Paul M Krueger, Thomas L Griffiths, and Falk Lieder. Rational use of cognitive resources in human planning.Nature human behaviour, 6(8):1112–1125, 2022

work page 2022

[4] [4]

Rational decisions in multi-step environments with few rollouts.PsyArXiv, 2025

Sixing Chen, Kristopher T Jensen, and Marcelo G Mattar. Rational decisions in multi-step environments with few rollouts.PsyArXiv, 2025

work page 2025

[5] [5]

Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

work page arXiv 2025

[6] [6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4198–4205, 2020

work page 2020

[8] [8]

A recurrent network model of planning explains hippocampal replay and human behavior.Nature neuroscience, 27(7): 1340–1348, 2024

Kristopher T Jensen, Guillaume Hennequin, and Marcelo G Mattar. A recurrent network model of planning explains hippocampal replay and human behavior.Nature neuroscience, 27(7): 1340–1348, 2024

work page 2024

[9] [9]

What makes a good reasoning chain? uncovering structural patterns in long chain-of- thought reasoning

Gangwei Jiang, Yahui Liu, Zhaoyi Li, Wei Bi, Fuzheng Zhang, Linqi Song, Ying Wei, and Defu Lian. What makes a good reasoning chain? uncovering structural patterns in long chain-of- thought reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6501–6525, 2025

work page 2025

[10] [10]

The dependence of effective planning horizon on model accuracy

Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. InProceedings of the 2015 international conference on autonomous agents and multiagent systems, pages 1181–1189, 2015

work page 2015

[11] [11]

Position: Llms can’t plan, but can help planning in llm-modulo frameworks

Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: Llms can’t plan, but can help planning in llm-modulo frameworks. InForty-first International Conference on Machine Learning, 2024

work page 2024

[12] [12]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Looking deeper into the algorithms underlying human planning.Trends in Cognitive Sciences, 2025

Ionatan Kuperwajs, Evan M Russek, Marcelo G Mattar, Wei Ji Ma, and Thomas L Griffiths. Looking deeper into the algorithms underlying human planning.Trends in Cognitive Sciences, 2025

work page 2025

[14] [14]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Human planning in stochastic environments.PsyArXiv

Jordan Lei, Jeroen Olieslagers, Nastaran Arfaei, D Xinlei Lin, and Wei Ji Ma. Human planning in stochastic environments.PsyArXiv. https://osf. io/bh56p_v1, 2025. 10

[16]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From System 1 to System 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025

[17]

ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, and Yuan Yao. ChessArena: A chess testbed for evaluating strategic reasoning capabilities of large language models. arXiv preprint arXiv:2509.24239, 2025

[18]

Lost in the middle: How language models use long contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

[19]

Planning in the brain

Marcelo G Mattar and Máté Lengyel. Planning in the brain. Neuron, 110(6):914–934, 2022

[20]

Premise-augmented reasoning chains improve error identification in math reasoning with LLMs

Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, and Dilek Hakkani-Tür. Premise-augmented reasoning chains improve error identification in math reasoning with LLMs. arXiv preprint arXiv:2502.02362, 2025

[21]

OpenAI o1 system card

OpenAI. OpenAI o1 system card, 2024. https://openai.com

[22]

Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning

Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, 2024

[23]

Disentangling generalization and memorization in large language models using chess

Leonard S Pleiss, Maximilian Schiffer, and Robert K von Weizsäcker. Trapped in the past? Disentangling fluid and crystallized intelligence of large language models using chess. arXiv preprint arXiv:2601.16823, 2026

[24]

Mastering Atari, Go, chess and shogi by planning with a learned model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

[25]

Mastering board games by external and internal planning with language models

John Schultz, Jakub Adamek, Matej Jusup, Marc Lanctot, Michael Kaisers, Sarah Perrin, Daniel Hennes, Jeremy Shar, Cannada Lewis, Anian Ruoss, et al. Mastering board games by external and internal planning with language models. arXiv preprint arXiv:2412.12119, 2024

[26]

LLMs can plan only if we tell them

Bilgehan Sel, Ruoxi Jia, and Ming Jin. LLMs can plan only if we tell them. arXiv preprint arXiv:2501.13545, 2025

[27]

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

[28]

A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018

[29]

Generalized planning in PDDL domains with pretrained large language models

Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B Tenenbaum, Leslie Kaelbling, and Michael Katz. Generalized planning in PDDL domains with pretrained large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20256–20264, 2024

[30]

Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting

Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023

[31]

A systematic evaluation of the planning and scheduling abilities of the reasoning model o1

Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025

[32]

Expertise increases planning depth in human gameplay

Bas Van Opheusden, Ionatan Kuperwajs, Gianni Galbiati, Zahy Bnaya, Yunqi Li, and Wei Ji Ma. Expertise increases planning depth in human gameplay. Nature, 618(7967):1000–1005, 2023

[33]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 2022

[34]

Learning to combat compounding-error in model-based reinforcement learning

Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin Müller. Learning to combat compounding-error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206, 2019

[35]

Complete chess games enable LLM become a chess master

Yinqi Zhang, Xintian Han, Haolong Li, Kedi Chen, and Shaohui Lin. Complete chess games enable LLM become a chess master. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 1–7, 2025

[36]

Natural Plan: Benchmarking LLMs on natural language planning

Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. Natural Plan: Benchmarking LLMs on natural language planning. arXiv preprint arXiv:2406.04520, 2024

A Code and data availability

Code is available at this anonymous repository. Raw game logs (1.1 GB compressed) are...


paragraph 2 text ... ... </trace>

Label every paragraph with its type, branch_root, and mentions. Return a JSON array with one object per paragraph.

C.4.2 Trace editing

We applied four editing strategies to isolate which parts of a reasoning branch causally drive move selection. Across all strategies, FINAL_DECISION paragraphs are always removed so the...


Simulate from the full-tree model (condition 1). Using the model’s fitted full-tree parameters, we sampled synthetic move choices from the full-tree softmax policy. We then fit both the full-tree model and the myopic model to these synthetic choices. If the fitting procedure is valid, the full-tree model should win (∆ > 0, where ∆ = (NLL_myopic − NLL_full)/N is the per-choice difference in negative log-likelihood).


Simulate from the myopic model (condition 2). Using the model’s fitted myopic parameters, we sampled synthetic move choices from the myopic softmax policy. We then fit both models to these synthetic choices. The myopic model should win (∆ < 0). Models with both ∆ > 0 in condition 1 and ∆ < 0 in condition 2 are counted as successfully recovered. Model recovery succeeded in 12 ...
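The two-condition recovery check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's fitting code: each model is reduced to a softmax over per-move values with a single inverse temperature `beta`, the values are random placeholders (`myopic_vals`, `full_vals`), and fitting is a coarse grid search. All function and variable names are hypothetical.

```python
import math
import random

def softmax(values, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    m = max(beta * v for v in values)
    exps = [math.exp(beta * v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def nll(choices, value_sets, beta):
    """Total negative log-likelihood of the choices under a softmax policy."""
    return -sum(math.log(softmax(vals, beta)[c])
                for c, vals in zip(choices, value_sets))

def fit_beta(choices, value_sets, grid=None):
    """Grid-search beta (assumed stand-in for the real optimizer)."""
    grid = grid or [0.25 * k for k in range(41)]
    return min(((b, nll(choices, value_sets, b)) for b in grid),
               key=lambda t: t[1])

def delta(choices, myopic_vals, full_vals):
    """Delta = (NLL_myopic - NLL_full) / N; positive favors the full-tree model."""
    _, nll_myopic = fit_beta(choices, myopic_vals)
    _, nll_full = fit_beta(choices, full_vals)
    return (nll_myopic - nll_full) / len(choices)

def simulate(value_sets, beta, rng):
    """Sample one synthetic move per position from the softmax policy."""
    return [rng.choices(range(len(v)), weights=softmax(v, beta))[0]
            for v in value_sets]

rng = random.Random(0)
n_positions, n_moves = 400, 5
# Placeholder move values: hypothetical, drawn independently for each model.
myopic_vals = [[rng.gauss(0, 1) for _ in range(n_moves)] for _ in range(n_positions)]
full_vals = [[rng.gauss(0, 1) for _ in range(n_moves)] for _ in range(n_positions)]

# Condition 1: simulate from the full-tree model, then fit both models.
delta_full_sim = delta(simulate(full_vals, 3.0, rng), myopic_vals, full_vals)
# Condition 2: simulate from the myopic model, then fit both models.
delta_myopic_sim = delta(simulate(myopic_vals, 3.0, rng), myopic_vals, full_vals)

recovered = delta_full_sim > 0 and delta_myopic_sim < 0
print(delta_full_sim, delta_myopic_sim, recovered)
```

Because the placeholder values for the two models are independent here, recovery is easy. With real traces the shallow and full-tree values are likely correlated, so the margin in ∆ would be smaller.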
