hub

arXiv preprint arXiv:2410.05243 , year=

URL https://arxiv · 2024 · arXiv 2410.05243

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

cs.AI · 2026-05-01 · accept · novelty 7.0 · 2 refs

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

Benchmarking and Improving GUI Agents in High-Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.

MMSkills: Towards Multimodal Skills for General Visual Agents

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

cs.CV · 2026-05-04 · unverdicted · novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

cs.CL · 2024-10-30 · unverdicted · novelty 6.0

OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

Perceptual Flow Network for Visually Grounded Reasoning

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

cs.AI · 2026-04-30 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

cs.AI · 2026-04-15 · unverdicted · novelty 5.0

LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi-agent orchestration.

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

cs.CV · 2026-04-14 · unverdicted · novelty 5.0

Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.

ZAYA1-VL-8B Technical Report

cs.CV · 2026-05-08 · unverdicted · novelty 4.0

ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

citing papers explorer

Showing 13 of 13 citing papers.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 10
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI · 2026-05-01 · accept · none · ref 7 · 2 links
GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Benchmarking and Improving GUI Agents in High-Dynamic Environments cs.CV · 2026-04-28 · unverdicted · none · ref 8 · 2 links
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.
MMSkills: Towards Multimodal Skills for General Visual Agents cs.AI · 2026-05-13 · unverdicted · none · ref 9
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding cs.CV · 2026-05-04 · unverdicted · none · ref 7
AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding cs.CV · 2026-04-15 · unverdicted · none · ref 6
UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 143
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Perceptual Flow Network for Visually Grounded Reasoning cs.CV · 2026-05-04 · unverdicted · none · ref 8
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants cs.AI · 2026-04-30 · unverdicted · none · ref 29
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a future roadmap.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 65
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration cs.AI · 2026-04-15 · unverdicted · none · ref 1
LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi-agent orchestration.
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback cs.CV · 2026-04-14 · unverdicted · none · ref 5
Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 156
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.

arXiv preprint arXiv:2410.05243 , year=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer