arXiv: 2501.12326 · v1 · submitted 2025-01-21 · cs.AI · cs.CL · cs.CV · cs.HC

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Chaolin Jin, Chen Li, Chuang Li, Feng Lin, Guang Shi, Haifeng Liu, Haihua Yang, Haoli Chen, Haoming Wang, Jiahao Li, Jiale Yang, Jingyu Li, Junda Zhang, Junjie Fang, Kai Cai, Kuanye Li, Longxiang Liu, Minchao Wang, Qianli Ma, Shihao Liang, Shijue Huang, Shizuo Tian, Tao Peng, Wanjun Zhong, Woyu Lin, Xiaojun Xiao, Xiao Zhou, Xin Liu, Xu Jiang, Yaowei Zheng, Yining Ye, Yujia Qin, Yu Miao, Yunxin Li, Zhaojian Li

Pith reviewed 2026-05-11 05:37 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.CV · cs.HC

keywords GUI agents · end-to-end models · screenshot perception · unified action space · system-2 reasoning · reflective training · OSWorld benchmark · AndroidWorld benchmark

The pith

UI-TARS is an end-to-end screenshot model that directly outputs keyboard and mouse actions and outperforms wrapped commercial agents on GUI benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UI-TARS as a model that receives only GUI screenshots and produces human-like interactions without depending on external large models or hand-crafted workflows. It supports this with four components that together enable perception of UI elements, precise action selection across platforms, deliberate multi-step reasoning, and ongoing self-improvement from new data. A sympathetic reader would care because the approach removes the need for expert prompting and API dependencies while still delivering higher task success rates. If correct, it shows that native training on interaction traces can close the gap with more engineered systems and support continuous adaptation in virtual environments.

Core claim

UI-TARS is a native GUI agent model that solely perceives screenshots as input and performs human-like interactions such as keyboard and mouse operations. Unlike frameworks that wrap commercial models with expert prompts and workflows, it is trained end-to-end and achieves superior results on benchmarks for perception, grounding, and task execution. The model incorporates enhanced perception from large-scale GUI screenshot data, unified action modeling that standardizes interactions across platforms, system-2 reasoning that applies patterns including task decomposition and reflection, and iterative training that automatically collects, filters, and refines interaction traces on virtual to

What carries the argument

The end-to-end UI-TARS architecture that combines enhanced perception, unified action modeling, system-2 reasoning patterns, and iterative reflective online trace collection to map screenshots directly to actions.

If this is right

UI-TARS records state-of-the-art scores across more than ten GUI agent benchmarks that test perception, grounding, and full task execution.
On the OSWorld benchmark it reaches 24.6 with 50 steps and 22.7 with 15 steps, exceeding Claude's 22.0 and 14.9 under the same conditions.
On the AndroidWorld benchmark it attains 46.6, exceeding GPT-4o's 34.5.
The iterative training loop lets the model learn from its own mistakes and handle new situations with minimal additional human input.
The paper supplies an analysis of the historical path of GUI agents to guide later work in the area.
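
To make the size of the reported gaps concrete, the short sketch below recomputes the absolute and relative margins from the scores quoted above. The numbers are taken from this page; the dictionary keys and the function name `margins` are illustrative choices, not anything defined by the paper.

```python
# Scores quoted in the list above, stored as (UI-TARS, baseline) per setting.
REPORTED = {
    "OSWorld, 50 steps (vs Claude)": (24.6, 22.0),
    "OSWorld, 15 steps (vs Claude)": (22.7, 14.9),
    "AndroidWorld (vs GPT-4o)": (46.6, 34.5),
}


def margins(scores):
    """Return {setting: (absolute_gap, relative_gap_percent)}, rounded to 1 decimal."""
    out = {}
    for setting, (ours, baseline) in scores.items():
        gap = ours - baseline
        out[setting] = (round(gap, 1), round(100.0 * gap / baseline, 1))
    return out


if __name__ == "__main__":
    for setting, (abs_gap, rel_gap) in margins(REPORTED).items():
        print(f"{setting}: +{abs_gap} points ({rel_gap}% relative)")
```

The smallest margin is the 50-step OSWorld setting (2.6 points), which is where run-to-run variance would matter most when judging the comparison.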

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Native models of this kind could lower the engineering cost of building reliable GUI automation by removing reliance on prompt engineering and external APIs.
The reflective trace collection method may transfer to other agent settings such as web navigation or mobile apps beyond virtual machines.
Embedding system-2 reasoning inside a single model rather than an external workflow suggests a route toward more compact agent designs for operating-system tasks.
If the self-improvement loop scales, future versions might close performance gaps on longer-horizon workflows that current commercial agents still struggle with.

Load-bearing premise

Automatically collected and reflectively filtered interaction traces on virtual machines supply high-quality unbiased training signals that support continuous improvement without introducing systematic errors or distribution shifts.

What would settle it

Evaluating UI-TARS on a held-out set of GUI tasks in OSWorld or AndroidWorld where its success rate falls below that of Claude or GPT-4o under identical step limits.

Original abstract

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UI-TARS shows one end-to-end model can top prompt-heavy GUI agent frameworks on public benchmarks, but the reflective trace loop is described at too high a level to judge its reliability.

The core claim is straightforward: train a single model on screenshots and actions, add explicit reasoning steps, and let it collect and refine its own traces on virtual machines. This beats the usual setups that wrap GPT-4o or Claude with custom prompts and workflows. On OSWorld it reaches 24.6 with 50 steps and 22.7 with 15 steps against Claude's 22.0 and 14.9; on AndroidWorld it hits 46.6 versus GPT-4o's 34.5. Those are the numbers that matter for anyone tracking GUI agent progress.

Referee Report

2 major / 2 minor

Summary. This paper introduces UI-TARS, a native end-to-end GUI agent model that takes only screenshots as input and performs human-like interactions (keyboard/mouse operations). It claims to outperform sophisticated agent frameworks built around commercial models like GPT-4o with expert prompts and workflows. The central results are SOTA performance across 10+ GUI agent benchmarks for perception, grounding, and task execution, with specific numbers on OSWorld (24.6 at 50 steps, 22.7 at 15 steps, outperforming Claude) and AndroidWorld (46.6, surpassing GPT-4o). Key components include enhanced perception from large-scale GUI screenshot datasets, unified action modeling across platforms, system-2 reasoning (task decomposition, reflection, milestone recognition), and iterative training via automatically collected and reflectively filtered traces on virtual machines to enable continuous self-improvement with minimal human intervention.

Significance. If the performance claims hold under rigorous validation, the work would represent a meaningful advance by demonstrating that native models can exceed complex LLM-based agent frameworks while incorporating self-reflective iterative training to address data scarcity. The concrete benchmark comparisons to named strong baselines (Claude, GPT-4o) and the evolutionary analysis of GUI agents provide useful reference points for the field.

major comments (2)

[Iterative Training with Reflective Online Traces] The section describing Iterative Training with Reflective Online Traces provides only a high-level overview of automatic collection, filtering, and reflective refinement on virtual machines but omits concrete filtering criteria, acceptance thresholds, error-type breakdowns, or before/after performance deltas on held-out tasks. This is load-bearing for the central SOTA claims on OSWorld and AndroidWorld, which are attributed to the quality of these training signals.
[Experimental Results (OSWorld and AndroidWorld)] The reported benchmark results (OSWorld: 24.6/22.7; AndroidWorld: 46.6) include no statistical significance tests, standard deviations across runs, exact evaluation protocols, data splits, or controls for prompt sensitivity in the comparison systems (Claude, GPT-4o). These omissions directly affect the reliability of the superiority claims.

minor comments (2)

[Abstract] The abstract states SOTA on '10+ GUI agent benchmarks' without listing them or pointing to a summary table; adding an explicit enumeration or reference would improve clarity.
[Unified Action Modeling] The description of unified action modeling would benefit from a concrete example or formal notation showing how actions are standardized across platforms and how this enables precise grounding.
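
As an illustration of the kind of concrete example the last minor comment asks for, here is a minimal sketch of one way a platform-independent action could be represented and then resolved to device pixels. It is an assumption for illustration only: the action names in `SHARED_ACTIONS`, the `Action` class, the use of normalized coordinates, and the `to_pixels` helper are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical shared vocabulary; the paper's actual action set may differ.
SHARED_ACTIONS = {"click", "type", "scroll"}


@dataclass(frozen=True)
class Action:
    kind: str                                     # one of SHARED_ACTIONS
    point: Optional[Tuple[float, float]] = None   # assumed normalized (x, y) in [0, 1]
    text: Optional[str] = None                    # payload for "type" actions


def to_pixels(action: Action, width: int, height: int) -> Tuple[int, int]:
    """Resolve an action's normalized point to pixel coordinates on one screen."""
    if action.kind not in SHARED_ACTIONS:
        raise ValueError(f"unknown action kind: {action.kind}")
    if action.point is None:
        raise ValueError("action has no target point")
    x, y = action.point
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("point must be normalized to [0, 1]")
    # Clamp so that x == 1.0 maps to the last valid pixel, not one past it.
    return (min(int(x * width), width - 1), min(int(y * height), height - 1))


if __name__ == "__main__":
    tap = Action(kind="click", point=(0.5, 0.5))
    print(to_pixels(tap, 1920, 1080))   # desktop-sized screen
    print(to_pixels(tap, 1080, 2400))   # phone-sized screen
```

The point of the sketch is only that a single action record can be executed on screens of different sizes; how the paper actually encodes coordinates and platform-specific actions is not stated on this page.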

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their positive assessment of the significance of our work and for providing detailed comments that will help improve the manuscript. We respond to each major comment below.

Referee: [Iterative Training with Reflective Online Traces] The section describing Iterative Training with Reflective Online Traces provides only a high-level overview of automatic collection, filtering, and reflective refinement on virtual machines but omits concrete filtering criteria, acceptance thresholds, error-type breakdowns, or before/after performance deltas on held-out tasks. This is load-bearing for the central SOTA claims on OSWorld and AndroidWorld, which are attributed to the quality of these training signals.

Authors: We agree that the current description of Iterative Training with Reflective Online Traces is high-level and that additional concrete details are needed to fully substantiate its role in achieving the reported SOTA results. In the revised manuscript we will expand this section to specify the filtering criteria (including reflection-based acceptance rules and success thresholds), exact acceptance thresholds applied, breakdowns of error types in the collected traces, and quantitative before/after performance deltas measured on held-out tasks. These additions will be placed in the main text or a new appendix to strengthen the link between the training process and benchmark gains.

Revision: yes

Referee: [Experimental Results (OSWorld and AndroidWorld)] The reported benchmark results (OSWorld: 24.6/22.7; AndroidWorld: 46.6) include no statistical significance tests, standard deviations across runs, exact evaluation protocols, data splits, or controls for prompt sensitivity in the comparison systems (Claude, GPT-4o). These omissions directly affect the reliability of the superiority claims.

Authors: We acknowledge the importance of statistical rigor and reproducibility in benchmark reporting. In the revised manuscript we will augment the experimental results section with statistical significance tests (such as bootstrap or paired comparisons) against the named baselines, report standard deviations from repeated evaluation runs where available, provide the precise evaluation protocols and data splits employed, and include analysis addressing prompt sensitivity in the Claude and GPT-4o baselines. These changes will be incorporated to increase confidence in the superiority claims.

Revision: yes
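
The rebuttal mentions bootstrap or paired comparisons without saying how they would be run. The sketch below shows one common form, a paired bootstrap over per-task outcomes, under the assumption that both systems are scored on the same task list with a 0/1 success per task. The function name, the default of 2000 resamples, and the example data are hypothetical; no per-task results from the paper are used.

```python
import random


def paired_bootstrap(a, b, n_resamples=2000, seed=0):
    """Paired bootstrap over tasks.

    a, b: per-task success indicators (0 or 1) for two systems on the SAME tasks.
    Returns (observed mean difference a - b,
             fraction of resamples in which a does not come out ahead of b).
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty result lists of equal length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / n
    not_ahead = 0
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each task's pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_ahead += 1
    return observed, not_ahead / n_resamples


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    system_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    system_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
    diff, frac = paired_bootstrap(system_a, system_b)
    print(f"mean difference: {diff:.2f}, resamples where A is not ahead: {frac:.3f}")
```

Resampling tasks rather than runs keeps the pairing intact, which matters when both systems tend to fail on the same hard tasks. Variance across repeated runs of the same system would need to be reported separately.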

Circularity Check

0 steps flagged

No significant circularity detected

The paper's claims rest on empirical benchmark scores (OSWorld 24.6/22.7, AndroidWorld 46.6) obtained via direct evaluation on public external test sets. These metrics are not derived from internal training quantities, fitted parameters, or self-generated traces by construction; the iterative reflective trace collection is a data-generation procedure whose outputs are assessed on held-out benchmarks rather than being tautologically renamed as predictions. No equations, self-definitional loops, uniqueness theorems, or ansatz smuggling appear in the described derivation chain. The central results therefore remain independent of the training-loop inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of large-scale screenshot pretraining, a unified action representation, and self-generated reflective traces; these are supported by standard supervised and imitation-learning assumptions rather than new axioms.

free parameters (2)

Scale and composition of GUI screenshot dataset
Large-scale dataset size and filtering criteria are chosen to enable context-aware perception but are not derived from first principles.
Action trace collection and reflection filtering thresholds
Parameters governing which traces are retained and how reflection is applied are tuned during iterative training.
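
To show why the filtering thresholds count as free parameters, the following sketch filters a handful of traces under different settings. Everything in it is assumed for illustration: the `Trace` fields, the `quality` score, and the default values `min_quality=0.8` and `max_steps=50` are hypothetical, and the paper's reflective refinement step is not modeled at all.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    task_id: str
    steps: int          # number of actions taken in the episode
    succeeded: bool     # assumed to come from some task-completion check
    quality: float      # hypothetical scorer output in [0, 1]


def filter_traces(traces: List[Trace], min_quality: float = 0.8,
                  max_steps: int = 50) -> List[Trace]:
    """Keep traces that succeeded, score at least min_quality, and fit the step budget."""
    return [
        t for t in traces
        if t.succeeded and t.quality >= min_quality and t.steps <= max_steps
    ]


if __name__ == "__main__":
    # Made-up traces for illustration only.
    collected = [
        Trace("a", 12, True, 0.90),
        Trace("b", 60, True, 0.95),
        Trace("c", 10, False, 0.99),
        Trace("d", 8, True, 0.50),
    ]
    for threshold in (0.8, 0.4):
        kept = filter_traces(collected, min_quality=threshold)
        print(threshold, [t.task_id for t in kept])
```

Lowering the quality threshold from 0.8 to 0.4 changes which traces enter the next training round, which is why the referee's request for the actual criteria and thresholds bears directly on the reported results.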

axioms (2)

Domain assumption: Screenshots alone contain sufficient information for precise UI element grounding and task-relevant captioning
Invoked in the enhanced perception component and required for the end-to-end screenshot-only design.
Domain assumption: Human-like interaction traces collected on virtual machines are representative of real-world GUI tasks
Required for the iterative training loop to produce generalizable improvements.

Forward citations

Cited by 55 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
cs.AI 2026-05 unverdicted novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
cs.CV 2026-05 conditional novelty 7.0

GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 7.0

An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
FlowEval: Reference-based Evaluation of Generated User Interfaces
cs.MA 2026-05 unverdicted novelty 7.0

FlowEval evaluates generated UIs by measuring how closely their navigation flows match real websites via reference-based similarity metrics and shows strong correlation with human expert judgments.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
cs.CL 2026-04 unverdicted novelty 7.0

uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL 2026-04 unverdicted novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models
cs.LG 2026-04 conditional novelty 7.0

GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI 2026-04 unverdicted novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
From Exploration to Specification: LLM-Based Property Generation for Mobile App Testing
cs.SE 2026-04 unverdicted novelty 7.0

PropGen automates property generation for Android app testing via LLM synthesis from guided exploration and feedback refinement, yielding 912 valid properties and 25 previously unknown bugs across 12 apps.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV 2026-04 unverdicted novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
cs.CV 2025-04 unverdicted novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
MMSkills: Towards Multimodal Skills for General Visual Agents
cs.AI 2026-05 unverdicted novelty 6.0

MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 6.0

ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0

Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents
cs.CL 2026-05 unverdicted novelty 6.0

Phone-use agents avoid harm more often through inability to act than through deliberate safe choices, so benchmarks must separate unsafe judgment from capability failure.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
BAMI: Training-Free Bias Mitigation in GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents
cs.HC 2026-05 unverdicted novelty 6.0

Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
cs.CR 2026-04 unverdicted novelty 6.0

SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
cs.CV 2026-04 unverdicted novelty 6.0

Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV 2026-04 unverdicted novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
cs.CV 2026-04 unverdicted novelty 6.0

Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
cs.AI 2026-04 unverdicted novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
cs.CR 2026-04 unverdicted novelty 6.0

Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
cs.AI 2026-04 unverdicted novelty 6.0

IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
cs.CV 2026-05 conditional novelty 5.0

The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos f...
Perceptual Flow Network for Visually Grounded Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
HalluClear: Diagnosing, Evaluating and Mitigating Hallucinations in GUI Agents
cs.AI 2026-04 unverdicted novelty 5.0

HalluClear supplies a taxonomy, calibrated evaluation, and lightweight post-training mitigation that reduces hallucinations in GUI agents using only 9K samples.
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
cs.AI 2026-03 unverdicted novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
A Pattern Language for Resilient Visual Agents
cs.AI 2026-04 unverdicted novelty 4.0

Proposes four architectural patterns—Hybrid Affordance Integration, Adaptive Visual Anchoring, Visual Hierarchy Synthesis, and Semantic Scene Graph—to balance non-determinism and latency of foundation models with ente...
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
cs.MA 2026-02 unverdicted novelty 4.0

The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
cs.CV 2026-05 unverdicted novelty 3.0

X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.