pith. machine review for the scientific record.

arxiv: 2305.10601 · v2 · submitted 2023-05-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 Lean theorem links

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Dian Yu, Izhak Shafran, Jeffrey Zhao, Karthik Narasimhan, Shunyu Yao, Thomas L. Griffiths, Yuan Cao

Pith reviewed 2026-05-11 15:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords large language models · reasoning · planning · search · chain of thought · problem solving · tree search · prompting

The pith

Tree of Thoughts lets language models explore multiple reasoning paths with self-evaluation and backtracking instead of proceeding left to right.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tree of Thoughts to overcome the limits of standard chain-of-thought prompting, where models generate tokens sequentially without revisiting choices. By treating short coherent segments of text as thoughts and arranging them in a tree, the method lets the model generate several candidate paths, score their promise, and choose whether to continue, branch, or backtrack. The experiments introduce three tasks that require planning or search and show large gains, such as raising GPT-4 success on the Game of 24 from 4 percent to 74 percent. If the approach holds, language models become capable of more global, strategic reasoning on problems where early decisions matter and dead ends are common.

Core claim

Tree of Thoughts generalizes chain-of-thought prompting by decomposing a problem into a tree of intermediate thoughts, each a coherent unit of text, and by equipping the language model with the ability to self-evaluate thoughts, explore multiple branches, perform lookahead, and backtrack when a path appears unpromising, thereby enabling deliberate decision making rather than token-level left-to-right generation.

What carries the argument

Tree of Thoughts, a search framework that organizes language-model generations into a tree of evaluable thoughts and applies algorithms to traverse, prune, and select paths.
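Read as pseudocode, the framework's breadth-first variant reduces to a short loop: expand each kept state into candidate thoughts, score them, and prune to a fixed beam. A minimal sketch, where `propose` and `evaluate` are hypothetical stand-ins for language-model calls rather than the paper's actual prompts:

```python
from typing import Callable, List

def tot_bfs(
    root: str,
    propose: Callable[[str], List[str]],   # stand-in for an LM call proposing next thoughts
    evaluate: Callable[[str], float],      # stand-in for an LM self-evaluation call
    breadth: int = 2,                      # beam width: states kept per level
    depth: int = 3,                        # number of thought steps
) -> str:
    """Breadth-first Tree-of-Thoughts search: expand, score, prune."""
    frontier = [root]
    for _ in range(depth):
        # Expand every kept state into candidate continuations (thoughts).
        candidates = [s + "\n" + t for s in frontier for t in propose(s)]
        # Self-evaluate candidates and keep only the most promising ones.
        frontier = sorted(candidates, key=evaluate, reverse=True)[:breadth]
    return frontier[0]
```

Backtracking falls out of the structure: a state whose children all score poorly simply drops out of the beam, so search continues from other branches kept at earlier levels; the paper's DFS variant abandons low-valued subtrees explicitly.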

If this is right

  • Language models reach markedly higher success rates on tasks that require non-trivial planning or search.
  • Self-evaluation of thoughts lets models decide when to explore further or abandon a line of reasoning.
  • The same framework improves performance on creative writing and mini crosswords in addition to arithmetic puzzles.
  • The method works without further training, using only prompting and standard search procedures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If thought evaluation proves reliable, the tree structure could be paired with external verifiers or tools to reduce reliance on the model's internal judgments.
  • The approach suggests a path toward applying language models to longer-horizon agent tasks where backtracking and lookahead are essential.
  • Models might eventually be trained or fine-tuned to produce higher-quality thoughts rather than optimizing only for next-token likelihood.

Load-bearing premise

The language model must generate coherent thoughts and judge their quality accurately enough to steer search, without frequent systematic errors or excessive wasted computation.
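This premise can be made tangible: the paper's value prompts elicit categorical judgments (for Game of 24, labels like "sure", "likely", "impossible") that must be mapped to numbers a search procedure can compare. A minimal sketch; the numeric weights and the last-word parsing heuristic are illustrative assumptions, not the paper's exact implementation:

```python
# Illustrative mapping from categorical self-evaluations to scores.
VALUE_MAP = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def parse_value(judgement: str) -> float:
    """Score the last recognized label in a model's evaluation reply."""
    words = judgement.strip().lower().split()
    return VALUE_MAP.get(words[-1], 0.0) if words else 0.0

def aggregate_value(judgements: list) -> float:
    """Sum scores across several sampled evaluations to average out noise."""
    return sum(parse_value(j) for j in judgements)
```

If parsing fails or the model's judgments are miscalibrated, the search is steered by noise, which is exactly the failure mode the premise rules out.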

What would settle it

The claim would be undercut by reproducing the Game of 24 experiments with the same GPT-4 model and prompts and observing no substantial increase in solved puzzles for Tree of Thoughts over chain-of-thought.

read the original abstract

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces Tree of Thoughts (ToT), a prompting framework that generalizes Chain-of-Thought by structuring LLM inference as a tree search over coherent intermediate 'thoughts.' The model generates multiple thoughts per node, self-evaluates their promise, and uses search algorithms (BFS or DFS) with lookahead and backtracking to solve tasks requiring planning. Experiments demonstrate large gains on Game of 24 (74% success with GPT-4 vs. 4% CoT), Creative Writing, and Mini Crosswords, supported by ablations on search strategies and human evaluation.

Significance. If the results hold, the work is significant for advancing LLM reasoning beyond linear token generation toward deliberate, search-based problem solving. The large, consistent improvements across three distinct planning-heavy tasks, with comparisons to strong baselines, ablations, and reproducible code, provide a practical and extensible method that addresses a clear limitation in current LLM inference.

minor comments (2)
  1. [Abstract] The abstract and introduction describe the tasks as 'novel'; clarifying whether this refers to the tasks themselves or their use as LLM benchmarks would improve precision.
  2. [Experiments] Additional discussion of the increased inference cost (number of LM calls) due to tree search, relative to CoT, would help readers assess practical trade-offs.
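The cost concern in the second comment can be made concrete with a back-of-envelope model. Under the simplifying assumption of one generation call per expanded state and one evaluation call per candidate thought (real batching and sampling settings vary), the call count grows linearly with depth and multiplicatively with breadth:

```python
def cot_calls() -> int:
    # Chain-of-thought: a single sampled completion per problem.
    return 1

def tot_bfs_calls(depth: int, breadth: int, k: int) -> int:
    """Rough upper bound on LM calls for BFS-style Tree of Thoughts.

    Per level: one generation call per kept state (breadth of them),
    plus one evaluation call per candidate thought (breadth * k of them).
    """
    return depth * (breadth + breadth * k)
```

With, say, depth 3, breadth 5, and 5 thoughts per node, that is 90 calls against chain-of-thought's single call, which is the trade-off the referee asks the paper to quantify.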

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the paper. We are glad that the referee finds the Tree of Thoughts framework significant for advancing LLM reasoning and appreciates the experimental results, ablations, and reproducibility.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces an algorithmic prompting framework (Tree of Thoughts) that generalizes Chain-of-Thought by enabling tree search with self-evaluation and backtracking. It contains no mathematical derivation, first-principles predictions, or fitted parameters whose outputs reduce to the inputs by construction. All reported results (e.g., Game of 24 success rates) are direct empirical measurements on held-out task instances, with ablations and human evaluations providing independent validation. The central claim rests on the method's description plus external performance deltas rather than any self-referential loop.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper is an empirical methods contribution; its central claim rests on the practical effectiveness of LLM-generated thoughts and evaluations rather than on new axioms or fitted parameters.

free parameters (1)
  • search hyperparameters (breadth, depth, number of thoughts per node)
    These control the tree search and are chosen per task but are not fitted to the final performance numbers in a way that circularly defines the claim.
axioms (1)
  • domain assumption Language models can generate coherent intermediate thoughts and produce useful self-evaluations of their promise.
    This assumption underpins the entire framework and is tested empirically through the reported task results.

pith-pipeline@v0.9.0 · 5554 in / 1287 out tokens · 68767 ms · 2026-05-11T15:31:38.045363+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 46 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  2. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  3. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  4. SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    SkCC compiles LLM skills via SkIR to achieve portability across agent frameworks, reduce adaptation effort from O(m×n) to O(m+n), and enforce security with reported gains in task success rates and token efficiency.

  5. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  6. Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

    cs.CL 2026-04 unverdicted novelty 7.0

    Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

  7. TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

    cs.CV 2026-04 accept novelty 7.0

    TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

  8. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  9. RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

    cs.SE 2026-04 unverdicted novelty 7.0

    RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

  10. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

    cs.CR 2026-04 unverdicted novelty 7.0

    AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...

  11. Navigating the Conceptual Multiverse

    cs.HC 2026-04 unverdicted novelty 7.0

    The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choic...

  12. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  13. Feedback-Driven Execution for LLM-Based Binary Analysis

    cs.CR 2026-04 unverdicted novelty 7.0

    FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...

  14. BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

    cs.AI 2026-04 unverdicted novelty 7.0

    BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.

  15. AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

    cs.IR 2026-04 unverdicted novelty 7.0

    A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

  16. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  17. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  18. Measuring Faithfulness in Chain-of-Thought Reasoning

    cs.AI 2023-07 conditional novelty 7.0

    Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

  19. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  20. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    cs.AI 2023-04 accept novelty 7.0

    LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.

  21. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  22. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  23. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  24. Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

    cs.CR 2026-05 unverdicted novelty 6.0

    PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.

  25. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  26. SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    SkCC compiles LLM agent skills through a strongly-typed IR and static security checks, cutting adaptation complexity from O(m×n) to O(m+n) and raising pass rates by 12-13 points on tested platforms.

  27. Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

    cs.AI 2026-04 unverdicted novelty 6.0

    Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.

  28. Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.

  29. Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code

    cs.SE 2026-04 unverdicted novelty 6.0

    Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.

  30. DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 6.0

    DCD introduces a domain-oriented hierarchical decomposition and staged routing workflow that restricts retrieval and generation scopes progressively to improve robustness and factual accuracy in RAG on complex, multi-...

  31. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  32. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    cs.LG 2024-07 unverdicted novelty 6.0

    Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

  33. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  34. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  35. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  36. State Representation and Termination for Recursive Reasoning Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping ...

  37. LLM Reasoning Is Latent, Not the Chain of Thought

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

  38. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  39. Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

    cs.AI 2026-04 unverdicted novelty 5.0

    Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.

  40. From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

    cs.CL 2026-04 unverdicted novelty 5.0

    Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and r...

  41. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    cs.CL 2023-12 unverdicted novelty 5.0

    Llama Guard is an instruction-tuned Llama2-7b model that performs multi-class safety classification on prompts and responses, matching or exceeding existing moderation tools on benchmarks while supporting taxonomy cus...

  42. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  43. Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

    cs.CL 2026-04 unverdicted novelty 4.0

    Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.

  44. Understanding the planning of LLM agents: A survey

    cs.AI 2024-02 accept novelty 4.0

    A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

  45. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  46. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    cs.AI 2024-02 unverdicted novelty 3.0

    A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 44 Pith papers · 8 internal anchors

  1. [1]

    Brown, B

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  2. [2]

    Browne, E

    C. Browne, E. J. Powley, D. Whitehouse, S. M. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. P. Liebana, S. Samothrakis, and S. Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4:1–43, 2012

  3. [3]

    Campbell, A

    M. Campbell, A. J. Hoane Jr, and F.-h. Hsu. Deep blue. Artificial intelligence, 134(1-2):57–83, 2002

  4. [4]

X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching large language models to self-debug, 2023

  5. [5]

    PaLM: Scaling Language Modeling with Pathways

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  6. [6]

Faithful reasoning using large language models

    A. Creswell and M. Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

  7. [7]

N. D. Daw, Y. Niv, and P. Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704–1711, 2005

  8. [8]

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. Pal: Program-aided language models, 2023

  9. [9]

S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023

  10. [10]

P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

  11. [11]

    doi: 10.1109/TSSC.1968.300136

  12. [12]

    P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

  13. [13]

    Huang, P

    W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022

  14. [14]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022

  15. [15]

J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras, and Y. Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022

  16. [16]

    Kahneman

    D. Kahneman. Thinking, fast and slow. Macmillan, 2011

  17. [17]

    Kahneman, S

    D. Kahneman, S. Frederick, et al. Representativeness revisited: Attribute substitution in intuitive judgment. Heuristics and biases: The psychology of intuitive judgment, 49(49-81):74, 2002

  18. [18]

    G. Kim, P. Baldi, and S. McAleer. Language models can solve computer tasks, 2023

  19. [19]

B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. Llm+p: Empowering large language models with optimal planning proficiency, 2023

  20. [20]

X. Lu, S. Welleck, P. West, L. Jiang, J. Kasai, D. Khashabi, R. L. Bras, L. Qin, Y. Yu, R. Zellers, N. A. Smith, and Y. Choi. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In North American Chapter of the Association for Computational Linguistics, 2021

  21. [21]

    Madaan, N

A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, and P. Clark. Self-refine: Iterative refinement with self-feedback, 2023

  22. [22]

    Newell, J

    A. Newell, J. C. Shaw, and H. A. Simon. Report on a general problem solving program. In IFIP congress, volume 256, page 64. Pittsburgh, PA, 1959

  23. [23]

    Newell, H

    A. Newell, H. A. Simon, et al. Human problem solving. Prentice-Hall, 1972

  24. [24]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

  25. [25]

    D. Paul, M. Ismayilzada, M. Peyrard, B. Borges, A. Bosselut, R. West, and B. Faltings. Refiner: Reasoning feedback on intermediate representations, 2023

  26. [26]

    Radford, K

    A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018

  27. [27]

    Radford, J

    A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  28. [28]

    Schlag, S

    I. Schlag, S. Sukhbaatar, A. Celikyilmaz, W. tau Yih, J. Weston, J. Schmidhuber, and X. Li. Large language model programs, 2023

  29. [29]

    Shinn, B

    N. Shinn, B. Labash, and A. Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection, 2023

  30. [30]

    Silver, J

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017

  31. [31]

    S. A. Sloman. The empirical case for two systems of reasoning. Psychological bulletin, 119(1): 3, 1996

  32. [32]

    K. E. Stanovich. Who is rational? Studies of individual differences in reasoning. Psychology Press, 1999

  33. [33]

    LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  34. [34]

    Verma, J

S. Verma, J. Fu, S. Yang, and S. Levine. Chai: A chatbot ai for task-oriented dialogue with offline reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4471–4491, 2022

  35. [35]

    Wallace, N

    E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, and D. Klein. Automated crossword solving. arXiv preprint arXiv:2205.09665, 2022

  36. [36]

L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P. Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, 2023

  37. [37]

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, and D. Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  38. [38]

Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents, 2023

  39. [39]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  40. [40]

Y. Xie, K. Kawaguchi, Y. Zhao, X. Zhao, M.-Y. Kan, J. He, and Q. Xie. Decomposition enhances reasoning via self-evaluation guided decoding, 2023

  41. [41]

S. Yang, O. Nachum, Y. Du, J. Wei, P. Abbeel, and D. Schuurmans. Foundation models for decision making: Problems, methods, and opportunities, 2023

  42. [42]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  43. [43]

    Zhang, Z

S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan. Planning with large language models for code generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Lr8cOOtYbfL

  44. [44]

D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022

  45. [45]

Solving math word problem via cooperative reasoning induced language models

X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan, J. Zhang, and Y. Yang. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022