super hub Canonical reference

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Dian Yu, Izhak Shafran, Jeffrey Zhao, Shunyu Yao, Thomas L. Griffiths, Yuan Cao · 2023 · cs.CL · arXiv 2305.10601

Canonical reference. 72% of citing Pith papers cite this work as background.

116 Pith papers citing it

Background 72% of classified citations

open full Pith review browse 116 citing papers more from Dian Yu arXiv PDF

abstract

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoughts) that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%. Code repo with all prompts: https://github.com/princeton-nlp/tree-of-thought-llm.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 baseline 3 method 3 dataset 1

citation-polarity summary

background 18 baseline 3 use method 3 use dataset 1

claims ledger

abstract Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, Tree of Thoughts (ToT), which generalizes over the popular Chain of Thought approach to prompting language models, and enables exploration over coherent units of text (thoug

authors

Dian Yu Izhak Shafran Jeffrey Zhao Shunyu Yao Thomas L. Griffiths Yuan Cao

co-cited works

representative citing papers

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.

Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

physics.soc-ph · 2026-06-20 · unverdicted · novelty 7.0

A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

cs.CR · 2026-06-02 · unverdicted · novelty 7.0

An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.

Test-Time Hinting for Black-Box Vision-Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

cs.CR · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SkCC introduces a typed intermediate representation and compiler pipeline to make LLM agent skills portable across frameworks and enforce security constraints before deployment.

From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.

Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

cs.SE · 2026-04-24 · unverdicted · novelty 7.0

RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Navigating the Conceptual Multiverse

cs.HC · 2026-04-20 · unverdicted · novelty 7.0

The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.

SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

Feedback-Driven Execution for LLM-Based Binary Analysis

cs.CR · 2026-04-16 · unverdicted · novelty 7.0

FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precision and broader coverage than prior methods.

BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design

cs.AI · 2026-04-14 · unverdicted · novelty 7.0

BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.

AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

cs.IR · 2026-04-14 · unverdicted · novelty 7.0

A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

cs.LO · 2026-04-07 · unverdicted · novelty 7.0

ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

Robust Reasoning Benchmark

cs.LG · 2026-03-26 · unverdicted · novelty 7.0 · 2 refs

The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.

LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning

cs.AI · 2026-03-06 · unverdicted · novelty 7.0

LEAD lets LLMs solve checkers jumping puzzles up to size 13 by using lookahead to recover from irreversible errors on hard steps that break extreme decomposition.

citing papers explorer

Showing 50 of 116 citing papers.

STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media cs.CL · 2026-05-24 · unverdicted · none · ref 32 · internal anchor
Stream mines streaming media to create and release StreamDial, a dataset of 87,498 structured task-oriented dialogue sessions across automotive, restaurant, and hotel domains using persona construction, Conversational Blueprints, and RAG.
Beyond Fixed Budgets: Characterizing the Inelasticity and Limitations of Tree-of-Thought Reasoning Strategies cs.AI · 2026-05-19 · unverdicted · none · ref 34 · internal anchor
DPTS shows cold-start bottlenecks at low budgets while SSDP exhibits frontier depletion, indicating fixed ToT strategies are inelastic across compute levels.
TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications? cs.AI · 2026-05-18 · conditional · none · ref 62 · internal anchor
TeleCom-Bench reveals LLMs reach 90% on telecom intent and entity tasks but drop to 30% on solution generation and root cause analysis in live network scenarios.
Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models cs.LG · 2026-05-15 · unverdicted · none · ref 17 · 2 links · internal anchor
DMoA is a differentiable multi-agent framework for LLMs that uses recurrent context-aware routing and predictive entropy for test-time adaptation, claiming SOTA results on 9 benchmarks with efficiency and robustness.
TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale cs.LG · 2026-05-14 · unverdicted · none · ref 55 · internal anchor
TFGN is an architectural overlay for transformers enabling task-free, replay-free continual pre-training across heterogeneous domains at LLM scale with near-zero backward transfer and high gradient orthogonality.
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning cs.CL · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
RTLC prompting lifts Claude 3.7 Sonnet pairwise accuracy on 350 hard JudgeBench items from 64.6% to 78.6% via a Research-Teach-Critique scaffold that beats self-consistency.
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents cs.AI · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World cs.AI · 2026-05-11 · unverdicted · none · ref 47 · internal anchor
A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 54 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI · 2026-05-07 · unverdicted · none · ref 103 · 3 links · internal anchor
RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.
Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models cs.CR · 2026-05-07 · unverdicted · none · ref 48 · internal anchor
PopQuiz Attack infers LLM training data membership by turning examples into quiz questions and measuring answer accuracy, reaching 0.873 average ROC-AUC across six models and outperforming prior methods by 20.6%.
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 59 · internal anchor
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures cs.CL · 2026-04-28 · unverdicted · none · ref 16 · internal anchor
Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.
Thinking with Reasoning Skills: Fewer Tokens, More Accuracy cs.AI · 2026-04-23 · unverdicted · none · ref 10 · internal anchor
Distilling and retrieving reusable reasoning skills lets LLMs solve coding and math problems with fewer tokens and higher accuracy.
Process Supervision via Verbal Critique Improves Reasoning in Large Language Models cs.CL · 2026-04-23 · unverdicted · none · ref 13 · internal anchor
Verbal Process Supervision uses structured critiques from stronger models in an iterative loop to improve LLM reasoning, reaching 94.9% on GPQA Diamond and large gains on AIME 2025.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code cs.SE · 2026-04-13 · unverdicted · none · ref 37 · internal anchor
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 27 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management cs.AI · 2025-12-11 · conditional · none · ref 56 · internal anchor
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL · 2025-10-09 · unverdicted · none · ref 24 · internal anchor
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems cs.AI · 2025-10-07 · unverdicted · none · ref 27 · internal anchor
ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework cs.CV · 2025-09-27 · unverdicted · none · ref 46 · internal anchor
DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.
DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow cs.HC · 2025-09-16 · unverdicted · none · ref 51 · internal anchor
DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module increases user comfort and reliance on AI agents for coordination tasks over time.
Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction cs.AI · 2025-09-15 · unverdicted · none · ref 17 · internal anchor
A pruning technique called Reasoning-Aware Compression (RAC) jointly reconstructs input and chain-of-thought activations to preserve reasoning performance better than standard methods when compressing models like DeepSeek-R1.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 147 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
Compressed Chain of Thought: Efficient Reasoning Through Dense Representations cs.CL · 2024-12-17 · unverdicted · none · ref 20 · internal anchor
CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 79 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 66 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Defending Against Indirect Prompt Injection Attacks With Spotlighting cs.CR · 2024-03-20 · unverdicted · none · ref 17 · internal anchor
Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 156 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models cs.CL · 2024-02-05 · unverdicted · none · ref 55 · internal anchor
DeepSeekMath 7B reaches 51.7% on MATH via continued pretraining on curated web math data and Group Relative Policy Optimization.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 56 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 98 · internal anchor
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces cs.CL · 2026-06-05 · unverdicted · none · ref 99 · internal anchor
Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation cs.CL · 2026-05-29 · unverdicted · none · ref 39 · internal anchor
EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.
Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3 cs.AI · 2026-05-25 · unverdicted · none · ref 19 · internal anchor
Public ARC-AGI-3 games are solvable by trivial heuristics, so they fail to test intelligent exploration; the private set is the real test, and an explore-first AERA agent solves 4/25 while baselines solve none.
A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits cs.LG · 2026-05-19 · unverdicted · none · ref 32 · internal anchor
Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.
Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection cs.LG · 2026-05-18 · unverdicted · none · ref 11 · internal anchor
Adapts SPRT as a compute governor for multi-agent LLM debates using Beta-modeled consensus scores from an LLM judge, yielding 3.7x call reduction on GSM8K at -2pp accuracy versus fixed rounds.
Runtime-Structured Task Decomposition for Agentic Coding Systems cs.SE · 2026-05-14 · unverdicted · none · ref 18 · internal anchor
Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
Complexity Horizons of Compressed Models in Analog Circuit Analysis cs.AI · 2026-05-04 · unverdicted · none · ref 16 · internal anchor
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
State Representation and Termination for Recursive Reasoning Systems cs.AI · 2026-05-02 · unverdicted · none · ref 13 · internal anchor
Recursive reasoning systems can represent their state via an epistemic state graph and terminate when the linearized order-gap is non-degenerate near the fixed point, providing a local condition for when the stopping rule is informative.
LLM Reasoning Is Latent, Not the Chain of Thought cs.AI · 2026-04-17 · unverdicted · none · ref 3 · internal anchor
LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.
A pragmatic approach to regulating AI agents cs.CY · 2026-04-16 · unverdicted · none · ref 27 · internal anchor
AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis cs.AI · 2026-04-12 · unverdicted · none · ref 32 · internal anchor
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection cs.CL · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and revealing an alignment tax.
BRIDGE: Building Representations In Domain Guided Program Synthesis cs.LG · 2025-11-26 · unverdicted · none · ref 35 · internal anchor
BRIDGE improves Lean executable correctness up to 1.5x and sample efficiency roughly 2x over direct prompting by using domain-guided intermediate representations across code, specs, and proofs.
SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance cs.AI · 2025-10-09 · unverdicted · none · ref 27 · internal anchor
SHE is a new RL framework using stepwise hybrid examination rewards to improve reasoning quality and accuracy in large-scale e-commerce query-product relevance prediction.
On the Diagram of Thought cs.CL · 2024-09-16 · unverdicted · none · ref 7 · internal anchor
Diagram of Thought (DoT) is a controller-light framework in which an LLM builds typed reasoning diagrams validated online and interpreted as diagrams in a slice topos whose synthesis is a finite limit.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations cs.CL · 2023-12-07 · unverdicted · none · ref 10 · internal anchor
Llama Guard is an instruction-tuned Llama2-7b model that performs multi-class safety classification on prompts and responses, matching or exceeding existing moderation tools on benchmarks while supporting taxonomy customization.
What makes a harness a harness: necessary and sufficient conditions for an agent harness cs.SE · 2026-06-08 · unverdicted · none · ref 61 · internal anchor
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs cs.CL · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
Dual-Track CoT lets small language models perform reliable multi-step reasoning with the same or fewer tokens via budget tracking and rejection of redundant steps.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer