BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
hub Canonical reference
Ui-venus technical report: Building high-performance ui agents with rft
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
WebTestPilot symbolizes GUI elements to infer contextual oracles for end-to-end web testing from natural language specs, reporting 99% task completion and 96% precision/recall on a new bug-injected benchmark.
Introduces Active Task Driving Memory (ATMem) and STR-GRPO to move GUI agents from passive record storage to actively maintained task states, tested on a new mobile benchmark with progress and scope-aware metrics.
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
SceneCritic is a symbolic, ontology-grounded evaluator for floor-plan layouts that identifies specific semantic, orientation, and geometric violations and aligns better with human judgments than VLM-based scorers.
Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserving normal performance.
StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting 3.2% relative gain in online RL success and 1.8% in judgment accuracy on AndroidWorld and OGRBench.
MIRAGE compresses explicit chain-of-thought into latent vectors and adds a generative world model to predict future interface states, matching explicit reasoning performance with 3-5x fewer tokens on Android benchmarks.
The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.
GUI-C² pairs a difficulty-scoring data pipeline with an area-gated coarse-to-fine RL mechanism to improve GUI grounding accuracy and training stability.
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.
citing papers explorer
-
GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots
GUICrafter uses curriculum learning on unannotated GUI screenshots for visual grounding followed by RL calibration on limited labels to match or exceed prior GUI agents with far less annotation.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
RISK: A Framework for GUI Agents in E-commerce Risk Management
RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
-
StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
StainFlow proposes global entity stain tracking and local stain evidence linking modules to improve process rewards for GUI agents, reporting 3.2% relative gain in online RL success and 1.8% in judgment accuracy on AndroidWorld and OGRBench.
-
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
MIRAGE compresses explicit chain-of-thought into latent vectors and adds a generative world model to predict future interface states, matching explicit reasoning performance with 3-5x fewer tokens on Android benchmarks.
-
Xiaomi-GUI-0 Technical Report
Xiaomi-GUI-0 reports 72.0% success on RealMobile and 78.9% on AndroidWorld via real-device closed-loop training with multi-source data and three-stage RL pipeline.