RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Mobile-agent-e: Self-evolving mobile assistant for complex tasks
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4representative citing papers
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out-of-distribution use.
LDMDroid applies LLMs in a state-aware process to trigger data manipulation functions and uses visual cues to detect errors, finding 17 bugs across 24 Android apps with 14 developer confirmations.
citing papers explorer
-
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
How Mobile World Model Guides GUI Agents?
Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out-of-distribution use.
-
LDMDroid: Leveraging LLMs for Detecting Data Manipulation Errors in Android Apps
LDMDroid applies LLMs in a state-aware process to trigger data manipulation functions and uses visual cues to detect errors, finding 17 bugs across 24 Android apps with 14 developer confirmations.