Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3
The pith
LLMs expand deep nodes in their reasoning traces but select moves using only a myopic model that ignores those nodes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search.
What carries the argument
Extraction of search trees from free-form CoT text in four-in-a-row, followed by model fitting to quantify depth, breadth, and causal influence on move selection.
If this is right
- LLM performance in this domain is predicted by search breadth rather than depth.
- Move selection remains driven by shallow nodes even when deep nodes appear in the generated trace.
- Human and LLM planning dissociate because human expertise improves with deeper lookahead while LLM decisions do not.
- Selective pruning of reasoning paragraphs can causally test which parts of the trace influence decisions.
Where Pith is reading between the lines
- Methods that force deep nodes to influence output probabilities could close the gap with human-like planning.
- The same extraction approach could be tested in other board games or multi-step reasoning tasks to check whether myopic selection is general.
- If the parsing step systematically under-weights deep nodes, the dissociation finding would need re-examination with alternative tree-recovery techniques.
Load-bearing premise
The method of parsing free-form CoT text into discrete search trees faithfully recovers the model's internal deliberation without introducing parsing artifacts or selection biases that favor shallow nodes.
What would settle it
An experiment in which pruning deep nodes from the traces changes move selection while pruning shallow nodes does not would falsify the myopic claim; the observed pattern would support it.
Figures
read the original abstract
Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a method for extracting discrete search trees from LLM chain-of-thought traces in four-in-a-row, fits computational search models to the resulting trees, and reports that LLM move selection is best explained by a myopic model that ignores deep nodes (despite their expansion in traces). A causal paragraph-pruning intervention is used to argue that shallow nodes drive decisions, in contrast to human planning where depth matters more; performance is linked to breadth rather than depth.
Significance. If the extraction procedure is shown to be faithful, the dissociation between expanded deep search and myopic decision-making would be a substantive contribution to understanding LLM planning, offering a concrete contrast with human expertise and a framework extensible to other domains.
major comments (2)
- [§3] §3 (Extraction Method): No quantitative validation of the CoT-to-tree parser is reported (e.g., inter-annotator agreement, accuracy on held-out forced-search traces, or sensitivity to shallow-node heuristics). Because all model fits, breadth/depth comparisons, and the myopic-model conclusion rest on the extracted trees, absence of these checks leaves open the possibility of systematic parsing bias favoring shallow nodes.
- [§5] §5 (Model Comparison and Causal Intervention): The abstract states that move choices are 'best explained by a myopic model' and that pruning supports shallow-node dominance, yet supplies no effect sizes, cross-validation details, or controls for multiple comparisons across the family of search models. These omissions make it impossible to assess whether the myopic advantage is robust or an artifact of the unvalidated trees.
minor comments (2)
- [Figure 2] Figure 2 and Table 1: Axis labels and tree-depth definitions are not fully aligned with the extraction heuristics described in §3, making it difficult to reproduce the reported breadth vs. depth results.
- The four-in-a-row domain is introduced without explicit comparison to prior game-tree extraction work in chess or Go; a brief related-work paragraph would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comments point by point below.
read point-by-point responses
-
Referee: [§3] §3 (Extraction Method): No quantitative validation of the CoT-to-tree parser is reported (e.g., inter-annotator agreement, accuracy on held-out forced-search traces, or sensitivity to shallow-node heuristics). Because all model fits, breadth/depth comparisons, and the myopic-model conclusion rest on the extracted trees, absence of these checks leaves open the possibility of systematic parsing bias favoring shallow nodes.
Authors: The referee correctly identifies that we did not report quantitative validation metrics for the parser. To address this, the revised manuscript will include inter-annotator agreement scores on a random sample of 100 traces, parser accuracy evaluated on held-out traces where search was forced to specific depths, and sensitivity analyses showing how varying the heuristics for identifying shallow nodes affects the extracted tree structures and subsequent model fits. These additions will help confirm that the myopic conclusion is not an artifact of parsing bias. revision: yes
-
Referee: [§5] §5 (Model Comparison and Causal Intervention): The abstract states that move choices are 'best explained by a myopic model' and that pruning supports shallow-node dominance, yet supplies no effect sizes, cross-validation details, or controls for multiple comparisons across the family of search models. These omissions make it impossible to assess whether the myopic advantage is robust or an artifact of the unvalidated trees.
Authors: We agree that additional statistical details are needed. In the revision, we will report effect sizes such as the difference in log-likelihood between the myopic model and alternatives, describe the cross-validation setup (including number of folds and how models were selected), and apply corrections for multiple comparisons across the model family. These will be presented alongside the existing results to demonstrate the robustness of the myopic model's advantage. revision: yes
Circularity Check
No circularity: empirical extraction, model fitting, and held-out testing are independent of inputs
full rationale
The paper extracts search trees from new LLM CoT traces in four-in-a-row, fits computational models to those trees, compares fit quality across myopic vs. deep-search models, and validates via held-out move choices plus paragraph-pruning interventions. No equation, parameter fit, or self-citation reduces the central claim (myopic model best explains choices despite deep nodes in traces) to the extraction method or data by construction. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of the computational search models
axioms (1)
- domain assumption Chain-of-thought traces contain explicit, parsable deliberation that can be mapped onto discrete game-tree nodes without systematic distortion.
Reference graph
Works this paper leans on
-
[1]
Measuring Progress on Scalable Oversight for Large Language Models
Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models.arXiv preprint arXiv:2211.03540, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization.SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995
work page 1995
-
[3]
Rational use of cognitive resources in human planning.Nature human behaviour, 6(8):1112–1125, 2022
Frederick Callaway, Bas Van Opheusden, Sayan Gul, Priyam Das, Paul M Krueger, Thomas L Griffiths, and Falk Lieder. Rational use of cognitive resources in human planning.Nature human behaviour, 6(8):1112–1125, 2022
work page 2022
-
[4]
Rational decisions in multi-step environments with few rollouts.PsyArXiv, 2025
Sixing Chen, Kristopher T Jensen, and Marcelo G Mattar. Rational decisions in multi-step environments with few rollouts.PsyArXiv, 2025
work page 2025
-
[5]
Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025
Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4198–4205, 2020
work page 2020
-
[8]
Kristopher T Jensen, Guillaume Hennequin, and Marcelo G Mattar. A recurrent network model of planning explains hippocampal replay and human behavior.Nature neuroscience, 27(7): 1340–1348, 2024
work page 2024
-
[9]
Gangwei Jiang, Yahui Liu, Zhaoyi Li, Wei Bi, Fuzheng Zhang, Linqi Song, Ying Wei, and Defu Lian. What makes a good reasoning chain? uncovering structural patterns in long chain-of- thought reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6501–6525, 2025
work page 2025
-
[10]
The dependence of effective planning horizon on model accuracy
Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. InProceedings of the 2015 international conference on autonomous agents and multiagent systems, pages 1181–1189, 2015
work page 2015
-
[11]
Position: Llms can’t plan, but can help planning in llm-modulo frameworks
Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: Llms can’t plan, but can help planning in llm-modulo frameworks. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[12]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Looking deeper into the algorithms underlying human planning.Trends in Cognitive Sciences, 2025
Ionatan Kuperwajs, Evan M Russek, Marcelo G Mattar, Wei Ji Ma, and Thomas L Griffiths. Looking deeper into the algorithms underlying human planning.Trends in Cognitive Sciences, 2025
work page 2025
-
[14]
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning.arXiv preprint arXiv:2307.13702, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Human planning in stochastic environments.PsyArXiv
Jordan Lei, Jeroen Olieslagers, Nastaran Arfaei, D Xinlei Lin, and Wei Ji Ma. Human planning in stochastic environments.PsyArXiv. https://osf. io/bh56p_v1, 2025. 10
work page 2025
-
[16]
From System 1 to System 2: A Survey of Reasoning Large Language Models
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, and Yuan Yao. Chessarena: A chess testbed for evaluating strategic reasoning capabilities of large language models.arXiv preprint arXiv:2509.24239, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024
work page 2024
-
[19]
Planning in the brain.Neuron, 110(6):914–934, 2022
Marcelo G Mattar and Máté Lengyel. Planning in the brain.Neuron, 110(6):914–934, 2022
work page 2022
-
[20]
Sagnik Mukherjee, Abhinav Chinta, Takyoung Kim, Tarun Anoop Sharma, and Dilek Hakkani- Tür. Premise-augmented reasoning chains improve error identification in math reasoning with llms.arXiv preprint arXiv:2502.02362, 2025
-
[21]
Openai o1 system card, 2024.https://openai.com
OpenAI. Openai o1 system card, 2024.https://openai.com
work page 2024
-
[22]
Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning
Debjit Paul, Robert West, Antoine Bosselut, and Boi Faltings. Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15012–15032, 2024
work page 2024
-
[23]
Disentangling generalization and memorization in large language models using chess
Leonard S Pleiss, Maximilian Schiffer, and Robert K von Weizsäcker. Trapped in the past? disentangling fluid and crystallized intelligence of large language models using chess.arXiv preprint arXiv:2601.16823, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Si- mon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020
work page 2020
-
[25]
John Schultz, Jakub Adamek, Matej Jusup, Marc Lanctot, Michael Kaisers, Sarah Perrin, Daniel Hennes, Jeremy Shar, Cannada Lewis, Anian Ruoss, et al. Mastering board games by external and internal planning with language models.arXiv preprint arXiv:2412.12119, 2024
-
[26]
Llms can plan only if we tell them.arXiv preprint arXiv:2501.13545, 2025
Bilgehan Sel, Ruoxi Jia, and Ming Jin. Llms can plan only if we tell them.arXiv preprint arXiv:2501.13545, 2025
-
[27]
Mas- tering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
work page 2016
-
[28]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play.Science, 362(6419):1140–1144, 2018
work page 2018
-
[29]
Generalized planning in pddl domains with pretrained large language models
Tom Silver, Soham Dan, Kavitha Srinivas, Joshua B Tenenbaum, Leslie Kaelbling, and Michael Katz. Generalized planning in pddl domains with pretrained large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 20256–20264, 2024
work page 2024
-
[30]
Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting.Advances in Neural Information Processing Systems, 36:74952–74965, 2023
work page 2023
-
[31]
Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A system- atic evaluation of the planning and scheduling abilities of the reasoning model o1.Transactions on Machine Learning Research, 2025
work page 2025
-
[32]
Expertise increases planning depth in human gameplay.Nature, 618(7967):1000–1005, 2023
Bas Van Opheusden, Ionatan Kuperwajs, Gianni Galbiati, Zahy Bnaya, Yunqi Li, and Wei Ji Ma. Expertise increases planning depth in human gameplay.Nature, 618(7967):1000–1005, 2023. 11
work page 2023
-
[33]
Jason Wei, Xuezhi Wang, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 2022
work page 2022
-
[34]
Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin Müller. Learning to combat compounding-error in model-based reinforcement learning.arXiv preprint arXiv:1912.11206, 2019
-
[35]
Complete chess games enable llm become a chess master
Yinqi Zhang, Xintian Han, Haolong Li, Kedi Chen, and Shaohui Lin. Complete chess games enable llm become a chess master. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 1–7, 2025
work page 2025
-
[36]
Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. Natural plan: Benchmarking llms on natural language planning.arXiv preprint arXiv:2406.04520, 2024. 12 A Code and data availability Code is available at this anonymous repository. Raw game logs (1.1 GB compressed) are...
-
[37]
paragraph 2 text ... ... </trace> 19 Label every paragraph with its type, branch_root, and mentions. Return a JSON array with one object per paragraph. C.4.2 Trace editing We applied four editing strategies to isolate which parts of a reasoning branch causally drive move selection. Across all strategies, FINAL_DECISION paragraphs are always removed so the...
-
[38]
We then fit both the full-tree model and the myopic model to these synthetic choices
Simulate from the full-tree model.Using the model’s fitted full-tree parameters, we sampled synthetic move choices from the full-tree softmax policy. We then fit both the full-tree model and the myopic model to these synthetic choices. If the fitting procedure is valid, the full-tree model should win (∆>0, where∆ = (NLL myopic −NLL full)/N)
-
[39]
We then fit both models to these synthetic choices
Simulate from the myopic model.Using the model’s fitted myopic parameters, we sampled synthetic move choices from the myopic softmax policy. We then fit both models to these synthetic choices. The myopic model should win (∆<0). Models with both ∆>0 in condition 1 and ∆<0 in condition 2 are counted as successfully recovered. Model recovery succeeded in 12 ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.