Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
hub
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
47 Pith papers cite this work. Polarity classification is still indexing.
abstract
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoug
co-cited works
roles
background 2polarities
background 2representative citing papers
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
SkCC compiles LLM skills via SkIR to achieve portability across agent frameworks, reduce adaptation effort from O(m×n) to O(m+n), and enforce security with reported gains in task success rates and token efficiency.
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.
RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
citing papers explorer
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.