MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.
super hub Canonical reference
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Canonical reference. 72% of citing Pith papers cite this work as background.
abstract
Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoug
authors
co-cited works
representative citing papers
GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.
A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
SkCC introduces a typed intermediate representation and compiler pipeline to make LLM agent skills portable across frameworks and enforce security constraints before deployment.
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
citing papers explorer
-
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Llama Guard is an instruction-tuned Llama2-7b model that performs multi-class safety classification on prompts and responses, matching or exceeding existing moderation tools on benchmarks while supporting taxonomy customization.
-
Traceable Fault Diagnosis for Battery Energy Storage Systems via Retrieval-Augmented Multi-Agent O&M Assistant
A retrieval-augmented multi-agent system for traceable fault diagnosis in battery energy storage systems, with BESS-specific routing, schema-constrained DB access, hybrid retrieval, and preliminary internal evaluation.
-
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
LLMs operate via learned probabilistic input-output mappings with only derived intentionality, so their outputs are not owned as commitments and sampling does not constitute choice or moral agency.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs
Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.
-
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
Systematic review of 145 papers on LLM-based log analysis, providing a unified taxonomy, common design patterns, evaluation practices, and challenges for deployment under drift and limited labels.
-
Toward a Safe Internet of Agents
The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, multi-agent systems, and interoperable ecosystems.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
A survey that defines agent skills as reusable procedural artifacts and reviews methods, resources, and applications across their representation, acquisition, retrieval, and evolution stages.
-
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
-
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey
A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.
-
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
-
LLM Multi-Agent Systems: Challenges and Open Problems
The paper identifies inadequately addressed challenges in optimizing task allocation, fostering robust reasoning through debates, managing layered context, enhancing memory, and applying multi-agent systems to blockchain.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation
- Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space