Title resolution pending

Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, et al. · 2026 · arXiv 2602.11348

4 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

ToolBench-X is a new benchmark for tool-using LLM agents featuring five structured recoverable hazards across domains and workflows, with experiments showing a substantial reliability gap driven by poor hazard diagnosis rather than tool volume or compute.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

cs.AI · 2026-06-05 · unverdicted · novelty 4.0

The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.

citing papers explorer

Showing 4 of 4 citing papers.

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
cs.CL · 2026-06-24 · unverdicted · none · ref 1
ToolBench-X is a new benchmark for tool-using LLM agents featuring five structured recoverable hazards across domains and workflows, with experiments showing a substantial reliability gap driven by poor hazard diagnosis rather than tool volume or compute.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
cs.AI · 2026-05-26 · unverdicted · none · ref 56
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
cs.AI · 2026-05-26 · unverdicted · none · ref 35
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
cs.AI · 2026-06-05 · unverdicted · none · ref 56
The paper proposes a unified MDP-based research agenda for addressing sim-to-real gaps in foundation model agents and advocates adopting classical solutions such as domain randomization.
