ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
The Twelfth International Conference on Learning Representations , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
Mango raises web agent success rates to 63.6% on WebVoyager and 52.5% on WebWalkerQA by bandit-based starting-point selection and memory, beating baselines by 7.3% and 26.8%.
AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.
citing papers explorer
-
ABRA: Agent Benchmark for Radiology Applications
ABRA shows radiology agents excel at tool execution (89%+) but struggle with outcomes (0-25%), with oracle perception raising outcomes to 69-100%, identifying perception as the primary bottleneck.
-
DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.
-
HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
HINT-SD improves long-horizon LLM agent training by using hindsight to target self-distillation on failure-relevant action spans, delivering up to 18.8% higher performance and 2.26x lower time per step than dense per-turn feedback.
-
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
-
Computer Use at the Edge of the Statistical Precipice
A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved statistical aggregation.
-
Mango: Multi-Agent Web Navigation via Global-View Optimization
Mango raises web agent success rates to 63.6% on WebVoyager and 52.5% on WebWalkerQA by bandit-based starting-point selection and memory, beating baselines by 7.3% and 26.8%.
-
Agent Workflow Memory
AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.