{"total":25,"items":[{"citing_arxiv_id":"2606.01414","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Skills Should Go Beyond Text: The Case for Visual Skills","primary_cat":"cs.CV","submitted_at":"2026-05-31T19:22:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper proposes that reusable agent skills should incorporate visual elements alongside text, introduces three forms of visual skills and an automatic conversion system, and reports better performance on GUI and visual-centric tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13527","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T13:40:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07358","ref_index":80,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications","primary_cat":"cs.IR","submitted_at":"2026-05-08T07:10:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"Task-Derived CREATOR [14], ToolMakers [13], Cradle [74], CodeAct [51], SkillWeaver [57], SayCan [28], ReAct [4], DEPS [29], RAP [32], MetaGPT [6], Self-Discover [36], LDB [50], SWE-agent [52], Alita [63] Corpus-Derived AppAgent [75], AutoGuide [76], HuggingGPT [5], ToolLLM [22], WebArena [77], TPTU [59], ToolCoder [53], DS-Agent [49], Corpus2Skill 
[78], AgentDistill [79] Retrieval & Selection (§V) Skill Retrieval Dense Embedding Voyager [12], SAGE [80], AutoSkill [81], MemSkill [82], ExpeL [23], ReasoningBank [83], DS-Agent [49] Sparse & Keyword SAGE [80], SkillWeaver [84], AutoSkill [81], Memento-Skills [85], SkillNet [64] Generative Retrieval ToolGen [86], ToolLLM [22] Structure-Aware Hierarchical SkillRL [87], AgentSkillOS [58], TOOL-PLANNER [88], SkillNet [64], GraphSkill [62], MemGPT [34], G-Memory [70], Corpus2Skill [78]"},{"citing_arxiv_id":"2605.05765","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-07T06:58:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Describes X-OmniClaw, a multimodal mobile agent architecture using Omni Perception, Memory, and Action modules with behavior cloning for Android task execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26622","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory","primary_cat":"cs.CL","submitted_at":"2026-04-29T12:49:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24441","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding 
Benchmark","primary_cat":"cs.CV","submitted_at":"2026-04-27T13:06:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18860","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents","primary_cat":"cs.CR","submitted_at":"2026-04-20T21:36:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Desktop GUI agents face TOCTOU attacks from UI state changes during the ~6.5s observation-to-action gap, with a three-layer pre-execution verification defense achieving 100% interception on two attack types but failing on DOM injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04838","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Less Detail, Better Answers: Degradation-Driven Prompting for VQA","primary_cat":"cs.CV","submitted_at":"2026-04-06T16:41:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20867","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SoK: Agentic Skills -- Beyond Tool Use in LLM 
Agents","primary_cat":"cs.CR","submitted_at":"2026-02-24T13:11:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.18842","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents","primary_cat":"cs.CR","submitted_at":"2026-01-26T11:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\"GPT-4V(ision) is a Generalist Web Agent, if Grounded\". In: Proceedings of the 41st International Conference on Machine Learning. Ed. by Ruslan Salakhutdinov et al. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024, pp. 61349-61385. [39] Chi Zhang et al. \"AppAgent: Multimodal Agents as Smartphone Users\". In: arXiv preprint arXiv:2312.13771 (2023). [40] Yanda Li et al. \"AppAgent v2: Advanced Agent for Flexible Mobile Interactions\". In: arXiv preprint arXiv:2408.11824 (2024). [41] Peter Shaw et al. \"From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces\". In: Advances in Neural Information Processing Systems. Ed. by A. Oh et al. Vol. 36. 
Curran Associates, Inc., 2023, pp."},{"citing_arxiv_id":"2512.10371","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management","primary_cat":"cs.AI","submitted_at":"2025-12-11T07:37:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.06721","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild","primary_cat":"cs.AI","submitted_at":"2025-12-07T08:21:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ProAgent uses on-demand tiered perception and context-aware LLM reasoning to deliver proactive assistance on AR glasses, achieving up to 27.7% higher prediction accuracy and 20.5% lower false detections than baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.03364","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DroidRetriever: A Transparent and Steerable Automation System for Collaborative Mobile Information Seeking","primary_cat":"cs.HC","submitted_at":"2025-05-06T09:37:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DroidRetriever is a transparent steerable mobile automation system that decomposes information-seeking tasks with multi-LLM agents, 
navigates apps, synthesizes reports with screenshots, and provides a dashboard for real-time user intervention and privacy pauses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.14239","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners","primary_cat":"cs.AI","submitted_at":"2025-04-19T09:25:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21620","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2025-03-27T15:39:30+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.14075","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language 
Models","primary_cat":"cs.CV","submitted_at":"2025-03-18T09:52:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09572","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks","primary_cat":"cs.CL","submitted_at":"2025-03-12T17:40:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16150","ref_index":184,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To address this bottleneck, we recommend research into the direction of introducing a self-supervised fine-tuning stage between general 
pre-training and resource-intensive environment learning. This intermediate stage would align general-purpose foundation models more closely to computer use contexts - analogous to the role of RLHF in aligning LLMs with human preferences [184] or GRPO in improving reasoning [128]. Such an alignment stage would equip models with domain-specific inductive biases, enabling faster and more robust adaptation during subsequent environment learning phases [111]. Our analysis also identifies planning as a major limitation in current ACU architectures. LLMs exhibit limited long-horizon planning capabilities [148], and the dynamics of the environment are often unknown, which hinders"},{"citing_arxiv_id":"2412.04454","ref_index":117,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","primary_cat":"cs.CL","submitted_at":"2024-12-05T18:58:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.18279","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model-Brained GUI Agents: A Survey","primary_cat":"cs.AI","submitted_at":"2024-11-27T12:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This new paradigm 
enables users to control general software systems with conversational commands [16]. By reducing the cognitive load of multi-step GUI operations, LLM-powered agents make complex systems accessible to non-technical users and streamline workflows across diverse domains. Notable examples include SeeAct [17] for web navigation, AppAgent [18] for mobile interactions, and UFO [19] for Windows OS applications. These agents resemble a \"virtual assistant\" [20] akin to J.A.R.V.I.S. from Iron Man, an intuitive, adaptive system capable of understanding user goals and autonomously performing actions across applications. The futuristic concept of an AI-powered operating system that executes cross-application tasks with fluidity and precision is"},{"citing_arxiv_id":"2410.23218","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OS-ATLAS: A Foundation Action Model for Generalist GUI Agents","primary_cat":"cs.CL","submitted_at":"2024-10-30T17:10:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.16158","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception","primary_cat":"cs.CL","submitted_at":"2024-01-29T13:46:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing 
strong results on the introduced Mobile-Eval benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.10935","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","primary_cat":"cs.HC","submitted_at":"2024-01-17T08:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.11432","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Large Language Model based Autonomous Agents","primary_cat":"cs.AI","submitted_at":"2023-08-22T13:30:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the relevant memories are extracted to complete the current task. In this process, the improved agent capability comes from the specially designed memory accumulation and utilization mechanisms. Voyager [38] introduces a skill library, where executable codes for specific skills are refined through interactions with the environment, enabling efficient task execution over time. 
In AppAgent [95], the agent is designed to interact with apps in a manner akin to human users, learning through both autonomous exploration and observation of human demonstrations. Throughout this process, it constructs a knowledge base, which serves as a reference for performing intricate tasks across various applications on a mobile phone. In MemPrompt [96], the users are requested"},{"citing_arxiv_id":"2306.13549","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(4) Extension to more realms and usage scenarios. Some studies transfer the strong capabilities of MLLMs to other domains such as medical image understanding [35], [36], [37] and document parsing [38], [39], [40]. Moreover, multimodal agents are developed to assist in real-world interaction, e.g. embodied agents [41], [42] and GUI agents [43], [44], [45]. An MLLM timeline is illustrated in Fig. 1. In view of such rapid progress and the promising results arXiv:2306.13549v4 [cs.CV] 29 Nov 2024 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2 Flamingo VIMA PaLM-E LLaMA-Adapter BLIP-2 HuggingGPT MM-REACT InstructBLIP MultiModal-GPT VisionLLM DetGPT LLaVA MiniGPT-4 mPLUG-Owl VideoChat"}],"limit":50,"offset":0}