CUJBench is the first benchmark for cross-modal LLM-agent failure diagnosis, reporting 19.7% accuracy and identifying evidence attribution as the core bottleneck across six models.
arXiv preprint arXiv:2506.14477
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 3
citation-polarity summary
polarities: background 3
representative citing papers
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
InquireMobile applies two-stage reinforcement fine-tuning and pre-action reasoning to VLM mobile agents, raising inquiry success rate by 46.8% on the introduced InquireBench benchmark.
citing papers explorer
- Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
- OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
- GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models