MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Congxiao Liu; Gao Wu; Guangyi Liu; Liang Guo; Liang Liu; Mading Li; Mengyan Wang; Pengxiang Zhao; Qi Zhang; Yong Liu

arxiv: 2606.19926 · v1 · pith:IEAJIHSDnew · submitted 2026-06-18 · 💻 cs.HC

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Guangyi Liu , Gao Wu , Congxiao Liu , Pengxiang Zhao , Liang Liu , Mading Li , Qi Zhang , Mengyan Wang

show 2 more authors

Liang Guo Yong Liu

This is my paper

Pith reviewed 2026-06-26 15:59 UTC · model grok-4.3

classification 💻 cs.HC

keywords mobile GUI agentslong-horizon taskscontext managementConActMLLM-based agentssupervised fine-tuningMemGUI-BenchMobileWorld benchmark

0 comments

The pith

MemGUI-Agent treats context management as first-class actions to enable reliable long-horizon mobile GUI performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that passive history accumulation in ReAct-style agents causes prompt explosion and loss of key facts in long mobile tasks spanning apps. By making context management proactive actions decided by the same model, it keeps three structured fields compact while retaining critical information. This is supported by creating a dataset of nearly 3,000 annotated trajectories and training an 8B model that leads open 8B results on their benchmark while working on a different one. A sympathetic reader would care because it offers a way to scale GUI agents beyond short tasks without external memory systems.

Core claim

MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. Training an 8B model on the 2,956-trajectory MemGUI-3K dataset produces MemGUI-8B-SFT that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark.

What carries the argument

Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions, maintaining folded action history, folded UI state, and recent step record.

If this is right

The same policy learns to decide when and how to fold context, preserving critical cross-app facts.
Supervised training on annotated trajectories makes proactive management learnable across model scales.
The resulting 8B agent sets the best open-data performance on MemGUI-Bench.
It generalizes to out-of-distribution benchmarks like MobileWorld.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might reduce the need for separate memory architectures in agent systems.
Similar context-as-action ideas could apply to web or desktop agents facing similar horizon limits.
If models learn context actions well, it could improve reliability on tasks with many app transitions.
The three-field structure could be adapted for other structured memory needs.

Load-bearing premise

That the model will learn to emit useful context management actions rather than unhelpful or noisy ones that fail to preserve critical facts.

What would settle it

If the 8B model trained on MemGUI-3K does not achieve the highest open-data 8B score on MemGUI-Bench or fails to retain facts on long sequences, the claim would not hold.

read the original abstract

MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ConAct turns context management into policy actions and ships an annotated 3K-trajectory dataset, but the abstract gives no ablations or training details to show the model actually learns useful folds.

read the letter

The paper's central move is to treat context management as first-class actions the same policy emits, instead of letting ReAct-style history pile up. They structure it into three fields—folded action history, folded UI state, and recent step record—and release MemGUI-3K with full annotations on 2,956 trajectories. An 8B model fine-tuned on it is claimed to lead open 8B results on MemGUI-Bench and transfer to MobileWorld.

The dataset and the explicit framing are the concrete contributions. Long-horizon mobile tasks do suffer from context dilution across apps, and making the model decide when and how to compress is a direct response to that.

The soft spot is the missing evidence on whether the learned policy actually emits useful actions. The abstract reports benchmark wins but supplies no ablations on action quality, no statistics on how often folds occur or what they retain, and no description of annotation quality or inter-annotator agreement. If the 8B model mostly outputs empty or noisy context actions after SFT, the claimed edge over passive baselines collapses. The stress-test concern lands because nothing in the SFT objective described guarantees the right behavior.

This is for groups already building or evaluating mobile GUI agents. The dataset could be reused even if the method needs tightening. It deserves peer review because the problem is practical and the release is real, but referees will need to see the training dynamics and controls before the performance numbers can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MemGUI-Agent, an MLLM-based mobile GUI agent that frames context management as first-class actions (ConAct) emitted by the same policy as UI actions. ConAct maintains three structured fields (folded action history, folded UI state, recent step record) to avoid passive accumulation and prompt explosion in long-horizon tasks. The authors construct MemGUI-3K, a dataset of 2,956 fully annotated trajectories, perform SFT on an 8B model to obtain MemGUI-8B-SFT, and claim this yields the best open-data 8B performance on MemGUI-Bench while generalizing to the out-of-distribution MobileWorld benchmark.

Significance. If the results hold, the work provides a concrete mechanism for making context management proactive and learnable within the policy itself, which could improve reliability of long-horizon agents across GUI and related domains. The public release of code, data, and trained models is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.
[ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.

minor comments (1)

[Abstract] Abstract: The phrase 'best open-data 8B performance' is undefined; the manuscript should clarify what 'open-data' baselines are considered and how they are selected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects for improving the clarity and verifiability of our empirical claims. We address each point below and commit to revisions that will strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.

Authors: The abstract is designed to provide a high-level overview of the contributions and key results within the typical length constraints. Full details on the experimental setup, including the MemGUI-Bench and MobileWorld benchmarks, dataset statistics for MemGUI-3K (2,956 trajectories), SFT training on the 8B model, and comparisons to baselines are presented in the Experiments section. We will revise the abstract to briefly mention the evaluation on MemGUI-Bench with open-data 8B models and generalization to MobileWorld, while noting that detailed results and ablations appear in the main text. Error bars and full ablations are not standard in abstracts but are included in the paper body. revision: partial
Referee: [ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.

Authors: We agree that verifying the policy's use of ConAct is important for attributing performance gains. The current manuscript focuses on the design of ConAct and the construction of the annotated dataset but does not include a post-hoc analysis of the trained model's ConAct emissions. In the revision, we will add such an analysis, including quantitative measures of ConAct emission frequency, fidelity to the ground-truth annotations in MemGUI-3K, impact on context length compared to ReAct-style accumulation, and qualitative examples. This will be incorporated into the experimental results section to directly address the load-bearing assumption. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical construction with external benchmarks

full rationale

The paper describes an empirical pipeline: define ConAct as context actions, annotate a new 2,956-trajectory dataset (MemGUI-3K) with those actions, perform SFT on an 8B model, and report performance on MemGUI-Bench plus out-of-distribution MobileWorld. No equations, parameter fits, or derivations appear. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (SFT on annotated trajectories yields a policy that emits useful ConAct actions) is tested against separate benchmarks rather than reducing to the training data by construction. This is a standard supervised-learning result whose validity is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or implementation details, so no free parameters, axioms, or invented entities beyond the high-level concepts named in the text can be identified.

pith-pipeline@v0.9.1-grok · 5803 in / 1152 out tokens · 21870 ms · 2026-06-26T15:59:56.540442+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 10 linked inside Pith

[1]

Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

Pith/arXiv arXiv 2025
[2]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[3]

OpenMobile: Building open mobile agents with task and trajectory synthesis.arXiv preprint arXiv:2604.15093, 2026

Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, et al. OpenMobile: Building open mobile agents with task and trajectory synthesis.arXiv preprint arXiv:2604.15093, 2026

Pith/arXiv arXiv 2026
[4]

Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Pith/arXiv arXiv 2025
[5]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025
[6]

Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026

Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026

2026
[7]

Ui-venus technical report: Building high-performance ui agents with rft

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025

arXiv 2025
[8]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

2024
[9]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, et al. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025

arXiv 2025
[10]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020
[11]

Llm-powered gui agents in phone automation: Surveying progress and prospects

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, et al. Llm-powered gui agents in phone automation: Surveying progress and prospects. arXiv preprint arXiv:2504.19838, 2025

arXiv 2025
[12]

Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075, 2026

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, et al. Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075, 2026

arXiv 2026
[13]

Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

arXiv 2025
[14]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

2025
[15]

Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

Pith/arXiv arXiv 2025
[16]

Androidworld: A dynamic benchmarking environment for autonomous agents

Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. InInternational Conference on Learning Representations, volume 2025, pages 406–441, 2025

2025
[17]

ClawGUI: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026

Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. ClawGUI: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026. 11

Pith/arXiv arXiv 2026
[18]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024

2024
[19]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

arXiv 2025
[20]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

arXiv 2026
[21]

A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026

2026
[22]

Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

Pith/arXiv arXiv 2025
[23]

Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

arXiv 2025
[24]

Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Pith/arXiv arXiv 2025
[25]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

2025
[26]

G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026

Guibin Zhang, Muxin Fu, Kun Wang, Frank Wan, Miao Yu, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026

2026
[27]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

2024
[28]

Swift: a scalable lightweight infrastructure for fine-tuning

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025

2025
[29]

GPT-4V(ision) is a generalist web agent, if grounded

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. InInternational Conference on Machine Learning, 2024

2024
[30]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

arXiv 2025
[31]

reasonable

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025. 12 Appendix organization.We place background discussion first, then benchmark and dataset details, follow...

Pith/arXiv arXiv 2025
[32]

Thinking: a <thinking>...</thinking> block explaining the next move (no multi-step reasoning)
[33]

name": <function-name>,

Tool call: a <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.,→ 26
[34]

failure"`immediately.,→ - If task is successfully completed, use`action=terminate`with`status=

Conclusion: a short <conclusion>...</conclusion> block describing what to do in the UI. Rules: - Output exactly in the order: <thinking>,<tool_call>,<conclusion>. - Be brief: one sentence for <thinking>, one for <conclusion>. - Do not output anything else outside those three parts. - **Task Feasibility**: If you determine the task is INFEASIBLE (e.g., req...
[35]

**Folded UI State**: Explicitly stored critical information extracted from UI
[36]

**Folded Action History**: Compressed records of past actions
[37]

type": "function

**Recent Step Record**: Full details of your most recent step (to be folded this turn) Under CONACT, these three fields form the structured context state, and the model may emit both UI actions and context actions (history folding or UI memory operations).,→ # Tools You may call ONE function per step. <tools> [ { "type": "function", "function": { "name": ...
[38]

**Thinking**:`<thinking>...</thinking>`- Your reasoning for next action AND folding decision
[39]

range": [start_step, current_step],

**Folding Directive**:`<folding>...</folding>`- JSON object specifying how to compress history: ```json {"range": [start_step, current_step], "summary": "Compressed description"} ``` - **Step-level Distillation** (start_step == current_step): Distill only the latest step into a compact record Example:`{"range": [5, 5], "summary": "[Step 5] Opened Settings...
[40]

**Tool Call**:`<tool_call>...</tool_call>`- Your action (UI or memory operation)
[41]

Include exact text, numbers, prices, names, counts visible

**UI Observation**:`<ui_observation>...</ui_observation>`- **DETAILED** screen description. Include exact text, numbers, prices, names, counts visible. Quote task-relevant info verbatim.,→ 28
[42]

**Action Intent**:`<action_intent>...</action_intent>`- What you INTEND to do next. ### Rules: - Output exactly in order: <thinking>, <folding>, <tool_call>, <ui_observation>, <action_intent> - First step (step 1): Skip <folding> as there's no history to fold - ALWAYS include <folding> from step 2 onwards - In <folding>, "range" must include the current s...
[43]

Output <thinking> with your reasoning
[44]

Skip <folding> for the first step

{"Skip <folding> for the first step" if self.current_step == 1 else "Output <folding> to compress your previous step(s)"}
[45]

Output <tool_call> with your action
[46]

Output <ui_observation> with **DETAILED** screen description (include ALL task-relevant info: exact text, numbers, prices, names, counts visible on screen),→
[47]

Output <action_intent> describing your planned action 29 Figure 14Representative process-hallucination failure. The agent deviates from the required workflow or falsely assumes that a necessary intermediate operation has been completed, causing progress loss even when the task remains feasible. 30 Figure 15Representative output-hallucination failure. The ...

[1] [1]

Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

Pith/arXiv arXiv 2025

[2] [2]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[3] [3]

OpenMobile: Building open mobile agents with task and trajectory synthesis.arXiv preprint arXiv:2604.15093, 2026

Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, et al. OpenMobile: Building open mobile agents with task and trajectory synthesis.arXiv preprint arXiv:2604.15093, 2026

Pith/arXiv arXiv 2026

[4] [4]

Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

Pith/arXiv arXiv 2025

[5] [5]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

Pith/arXiv arXiv 2025

[6] [6]

Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026

Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026

2026

[7] [7]

Ui-venus technical report: Building high-performance ui agents with rft

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025

arXiv 2025

[8] [8]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

2024

[9] [9]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, et al. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025

arXiv 2025

[10] [10]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020

[11] [11]

Llm-powered gui agents in phone automation: Surveying progress and prospects

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, et al. Llm-powered gui agents in phone automation: Surveying progress and prospects. arXiv preprint arXiv:2504.19838, 2025

arXiv 2025

[12] [12]

Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075, 2026

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, et al. Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075, 2026

arXiv 2026

[13] [13]

Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

arXiv 2025

[14] [14]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025

2025

[15] [15]

Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

Pith/arXiv arXiv 2025

[16] [16]

Androidworld: A dynamic benchmarking environment for autonomous agents

Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. InInternational Conference on Learning Representations, volume 2025, pages 406–441, 2025

2025

[17] [17]

ClawGUI: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026

Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. ClawGUI: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026. 11

Pith/arXiv arXiv 2026

[18] [18]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration

Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024

2024

[19] [19]

Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

arXiv 2025

[20] [20]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

arXiv 2026

[21] [21]

A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026

2026

[22] [22]

Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025

Pith/arXiv arXiv 2025

[23] [23]

Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025

arXiv 2025

[24] [24]

Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

Pith/arXiv arXiv 2025

[25] [25]

Appagent: Multimodal agents as smartphone users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

2025

[26] [26]

G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026

Guibin Zhang, Muxin Fu, Kun Wang, Frank Wan, Miao Yu, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026

2026

[27] [27]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

2024

[28] [28]

Swift: a scalable lightweight infrastructure for fine-tuning

Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025

2025

[29] [29]

GPT-4V(ision) is a generalist web agent, if grounded

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. InInternational Conference on Machine Learning, 2024

2024

[30] [30]

Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025

arXiv 2025

[31] [31]

reasonable

Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025. 12 Appendix organization.We place background discussion first, then benchmark and dataset details, follow...

Pith/arXiv arXiv 2025

[32] [32]

Thinking: a <thinking>...</thinking> block explaining the next move (no multi-step reasoning)

[33] [33]

name": <function-name>,

Tool call: a <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.,→ 26

[34] [34]

failure"`immediately.,→ - If task is successfully completed, use`action=terminate`with`status=

Conclusion: a short <conclusion>...</conclusion> block describing what to do in the UI. Rules: - Output exactly in the order: <thinking>,<tool_call>,<conclusion>. - Be brief: one sentence for <thinking>, one for <conclusion>. - Do not output anything else outside those three parts. - **Task Feasibility**: If you determine the task is INFEASIBLE (e.g., req...

[35] [35]

**Folded UI State**: Explicitly stored critical information extracted from UI

[36] [36]

**Folded Action History**: Compressed records of past actions

[37] [37]

type": "function

**Recent Step Record**: Full details of your most recent step (to be folded this turn) Under CONACT, these three fields form the structured context state, and the model may emit both UI actions and context actions (history folding or UI memory operations).,→ # Tools You may call ONE function per step. <tools> [ { "type": "function", "function": { "name": ...

[38] [38]

**Thinking**:`<thinking>...</thinking>`- Your reasoning for next action AND folding decision

[39] [39]

range": [start_step, current_step],

**Folding Directive**:`<folding>...</folding>`- JSON object specifying how to compress history: ```json {"range": [start_step, current_step], "summary": "Compressed description"} ``` - **Step-level Distillation** (start_step == current_step): Distill only the latest step into a compact record Example:`{"range": [5, 5], "summary": "[Step 5] Opened Settings...

[40] [40]

**Tool Call**:`<tool_call>...</tool_call>`- Your action (UI or memory operation)

[41] [41]

Include exact text, numbers, prices, names, counts visible

**UI Observation**:`<ui_observation>...</ui_observation>`- **DETAILED** screen description. Include exact text, numbers, prices, names, counts visible. Quote task-relevant info verbatim.,→ 28

[42] [42]

**Action Intent**:`<action_intent>...</action_intent>`- What you INTEND to do next. ### Rules: - Output exactly in order: <thinking>, <folding>, <tool_call>, <ui_observation>, <action_intent> - First step (step 1): Skip <folding> as there's no history to fold - ALWAYS include <folding> from step 2 onwards - In <folding>, "range" must include the current s...

[43] [43]

Output <thinking> with your reasoning

[44] [44]

Skip <folding> for the first step

{"Skip <folding> for the first step" if self.current_step == 1 else "Output <folding> to compress your previous step(s)"}

[45] [45]

Output <tool_call> with your action

[46] [46]

Output <ui_observation> with **DETAILED** screen description (include ALL task-relevant info: exact text, numbers, prices, names, counts visible on screen),→

[47] [47]

Output <action_intent> describing your planned action 29 Figure 14Representative process-hallucination failure. The agent deviates from the required workflow or falsely assumes that a necessary intermediate operation has been completed, causing progress loss even when the task remains feasible. 30 Figure 15Representative output-hallucination failure. The ...