Autonomous evaluation and refinement of digital agents

· 2024 · arXiv 2404.06474

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

cs.AI · 2026-01-22 · conditional · novelty 7.0

DeepVerifier enables self-evolving deep research agents via rubric-guided verification at test time, delivering 8-11% accuracy gains on GAIA and XBench-DeepSearch subsets.

ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

ChainWorld builds 347 chains from atomic OSWorld tasks and benchmarks four agents under single-turn and multi-turn protocols, reporting a maximum 31% completion rate with distinct failure profiles.

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

OPD-Evolver uses on-policy self-distillation in fast interaction and slow attribution loops to build agents with holistic memory competence, outperforming prior systems by up to 11.5% and allowing a 9B model to compete with much larger ones.

Benchmark Everything Everywhere All at Once

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.

Evaluating Multi-turn Human-AI Interaction

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

cs.AI · 2025-12-22 · unverdicted · novelty 6.0

EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.

Agent Workflow Memory

cs.CL · 2024-09-11 · unverdicted · novelty 6.0

AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

cs.CL · 2025-03-12 · unverdicted · novelty 5.0

Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

cs.AI · 2025-01-27 · unverdicted · novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.

citing papers explorer

Showing 11 of 11 citing papers.

GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis · cs.AI · 2026-04-06 · unverdicted · none · ref 12
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification · cs.AI · 2026-01-22 · conditional · none · ref 14
ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks · cs.AI · 2026-06-19 · unverdicted · none · ref 14
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation · cs.CL · 2026-06-16 · unverdicted · none · ref 116
Benchmark Everything Everywhere All at Once · cs.AI · 2026-06-04 · unverdicted · none · ref 27
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations · cs.CL · 2026-05-21 · unverdicted · none · ref 25
Evaluating Multi-turn Human-AI Interaction · cs.HC · 2026-05-18 · unverdicted · none · ref 17
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration · cs.AI · 2025-12-22 · unverdicted · none · ref 14
Agent Workflow Memory · cs.CL · 2024-09-11 · unverdicted · none · ref 51
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks · cs.CL · 2025-03-12 · unverdicted · none · ref 31
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions · cs.AI · 2025-01-27 · unverdicted · none · ref 119
