Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
hub Canonical reference
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Augmented Language Models (ALMs) blend the reasoning capabilities of Large Language Models (LLMs) with tools that allow for knowledge retrieval and action execution. Existing ALM systems trigger LLM thought processes while pulling observations from these tools in an interleaved fashion. Specifically, an LLM reasons to call an external tool, gets halted to fetch the tool's response, and then decides the next action based on all preceding response tokens. Such a paradigm, though straightforward and easy to implement, often leads to huge computation complexity from redundant prompts and repeated execution. This study addresses such challenges for the first time, proposing a modular paradigm ReWOO (Reasoning WithOut Observation) that detaches the reasoning process from external observations, thus significantly reducing token consumption. Comprehensive evaluations across six public NLP benchmarks and a curated dataset reveal consistent performance enhancements with our proposed methodology. Notably, ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark. Furthermore, ReWOO demonstrates robustness under tool-failure scenarios. Beyond prompt efficiency, decoupling parametric modules from non-parametric tool calls enables instruction fine-tuning to offload LLMs into smaller language models, thus substantially reducing model parameters. Our illustrative work offloads reasoning ability from 175B GPT3.5 into 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data, with evaluations of 13 frontier models revealing tool-use and composition failures
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.
GDP-RAG targets only information deltas in multi-hop RAG through preliminary grounding, gap-conditioned prompts, and skeletal trajectories, reaching 60.63% accuracy at 0.51 cost-of-pass on HotpotQA, 2WikiMultiHopQA, and MuSiQue.
The Novelty-Aware Research Agent layers query analysis, ReAct retrieval, ranking, schema-guided extraction, three-pass comparison, and answer generation on RAG to produce structured comparison artifacts that standard RAG cannot.
RadioMaster is a multi-agent system that autonomously generates radio signals from user input using domain knowledge retrieval, collaborative I/Q sample creation, and closed-loop verification, outperforming baselines on the new RadioBench benchmark.
RePoT recovers from PoT failures via deterministic verified replay and checkpoint repair, yielding +3 to +11pp gains on planning benchmarks and showing checkpoint state as the key recovery signal over error-only feedback.
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tradeoffs in open-world affordance grounding.
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
Complete cyclic subtask graphs offer a lens to measure when multi-agent revisitation aids recovery and exploration versus when it increases costs or is dominated by other bottlenecks in LLM agent workflows.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
AMREC couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection to achieve stronger recovery of molecular identity from invalid ChEBI-20 LLM drafts than prior repair strategies.
MapAgent augments vectorized lane mapping with a bounded verification-driven agent loop that diagnoses specification violations and applies minimal edits, reaching over 95% automation in Baidu Maps production across 360+ cities.
AsyncFC decouples LLM decoding from function execution via symbolic futures, enabling overlap and parallelism to reduce end-to-end latency on function-calling benchmarks while preserving accuracy.
citing papers explorer
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
Only Ask What You Don't Know: Grounded Delta Planning for Efficient Multi-step RAG
GDP-RAG targets only information deltas in multi-hop RAG through preliminary grounding, gap-conditioned prompts, and skeletal trajectories, reaching 60.63% accuracy at 0.51 cost-of-pass on HotpotQA, 2WikiMultiHopQA, and MuSiQue.
-
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
-
Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs
AsyncFC decouples LLM decoding from function execution via symbolic futures, enabling overlap and parallelism to reduce end-to-end latency on function-calling benchmarks while preserving accuracy.
-
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.