MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
Pith reviewed 2026-06-26 15:59 UTC · model grok-4.3
The pith
MemGUI-Agent treats context management as first-class actions to enable reliable long-horizon mobile GUI performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. Training an 8B model on the 2,956-trajectory MemGUI-3K dataset produces MemGUI-8B-SFT that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark.
What carries the argument
Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions, maintaining folded action history, folded UI state, and recent step record.
If this is right
- The same policy learns to decide when and how to fold context, preserving critical cross-app facts.
- Supervised training on annotated trajectories makes proactive management learnable across model scales.
- The resulting 8B agent sets the best open-data performance on MemGUI-Bench.
- It generalizes to out-of-distribution benchmarks like MobileWorld.
Where Pith is reading between the lines
- This method might reduce the need for separate memory architectures in agent systems.
- Similar context-as-action ideas could apply to web or desktop agents facing similar horizon limits.
- If models learn context actions well, it could improve reliability on tasks with many app transitions.
- The three-field structure could be adapted for other structured memory needs.
Load-bearing premise
That the model will learn to emit useful context management actions rather than unhelpful or noisy ones that fail to preserve critical facts.
What would settle it
If the 8B model trained on MemGUI-3K does not achieve the highest open-data 8B score on MemGUI-Bench or fails to retain facts on long sequences, the claim would not hold.
read the original abstract
MLLM-based mobile GUI agents have made substantial progress on short-horizon tasks, yet remain unreliable on long-horizon tasks that require retaining intermediate facts across many steps and app transitions. We attribute this limitation to ReAct-style prompting, which passively accumulates per-step records, leading to prompt explosion and dilution of critical cross-app facts. To address this, we introduce MemGUI-Agent, an end-to-end long-horizon mobile GUI agent with proactive context management. MemGUI-Agent is built on Context-as-Action (ConAct), which casts context management as first-class actions emitted by the same policy that selects UI actions. Instead of passively appending history, ConAct maintains three structured context fields: folded action history, folded UI state, and recent step record, preserving critical UI facts while keeping context compact. To make proactive context management learnable across model scales, we construct MemGUI-3K, a 2,956-trajectory dataset with full ConAct annotations for supervised training and offline analysis. Training an 8B model on MemGUI-3K produces MemGUI-8B-SFT, an 8B MemGUI-Agent that achieves the best open-data 8B performance on MemGUI-Bench and generalizes to the out-of-distribution MobileWorld benchmark. Code, data, and trained models will be released at https://memgui-agent.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MemGUI-Agent, an MLLM-based mobile GUI agent that frames context management as first-class actions (ConAct) emitted by the same policy as UI actions. ConAct maintains three structured fields (folded action history, folded UI state, recent step record) to avoid passive accumulation and prompt explosion in long-horizon tasks. The authors construct MemGUI-3K, a dataset of 2,956 fully annotated trajectories, perform SFT on an 8B model to obtain MemGUI-8B-SFT, and claim this yields the best open-data 8B performance on MemGUI-Bench while generalizing to the out-of-distribution MobileWorld benchmark.
Significance. If the results hold, the work provides a concrete mechanism for making context management proactive and learnable within the policy itself, which could improve reliability of long-horizon agents across GUI and related domains. The public release of code, data, and trained models is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.
- [ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.
minor comments (1)
- [Abstract] Abstract: The phrase 'best open-data 8B performance' is undefined; the manuscript should clarify what 'open-data' baselines are considered and how they are selected.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which highlight important aspects for improving the clarity and verifiability of our empirical claims. We address each point below and commit to revisions that will strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline performance claims for MemGUI-8B-SFT (best open-data 8B result on MemGUI-Bench and generalization to MobileWorld) are stated without any experimental details, error bars, dataset statistics, ablation studies, or baseline comparisons, rendering the central empirical result unevaluable from the provided text.
Authors: The abstract is designed to provide a high-level overview of the contributions and key results within the typical length constraints. Full details on the experimental setup, including the MemGUI-Bench and MobileWorld benchmarks, dataset statistics for MemGUI-3K (2,956 trajectories), SFT training on the 8B model, and comparisons to baselines are presented in the Experiments section. We will revise the abstract to briefly mention the evaluation on MemGUI-Bench with open-data 8B models and generalization to MobileWorld, while noting that detailed results and ablations appear in the main text. Error bars and full ablations are not standard in abstracts but are included in the paper body. revision: partial
-
Referee: [ConAct / SFT sections] Section on ConAct and SFT training (likely §3–4): The performance attribution to ConAct requires that SFT on MemGUI-3K induces the 8B policy to emit useful, fact-preserving context actions at appropriate times rather than defaulting to passive accumulation or noisy/empty folds. No analysis of emitted ConAct actions (frequency, fidelity to annotations, effect on context length, or comparison to ReAct baselines) is described, leaving this load-bearing assumption unverified.
Authors: We agree that verifying the policy's use of ConAct is important for attributing performance gains. The current manuscript focuses on the design of ConAct and the construction of the annotated dataset but does not include a post-hoc analysis of the trained model's ConAct emissions. In the revision, we will add such an analysis, including quantitative measures of ConAct emission frequency, fidelity to the ground-truth annotations in MemGUI-3K, impact on context length compared to ReAct-style accumulation, and qualitative examples. This will be incorporated into the experimental results section to directly address the load-bearing assumption. revision: yes
Circularity Check
No circularity detected; empirical construction with external benchmarks
full rationale
The paper describes an empirical pipeline: define ConAct as context actions, annotate a new 2,956-trajectory dataset (MemGUI-3K) with those actions, perform SFT on an 8B model, and report performance on MemGUI-Bench plus out-of-distribution MobileWorld. No equations, parameter fits, or derivations appear. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (SFT on annotated trajectories yields a policy that emits useful ConAct actions) is tested against separate benchmarks rather than reducing to the training data by construction. This is a standard supervised-learning result whose validity is externally falsifiable.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025
Pith/arXiv arXiv 2025
-
[2]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[3]
Kanzhi Cheng, Zehao Li, Zheng Ma, Nuo Chen, Jialin Cao, Qiushi Sun, Zichen Ding, Fangzhi Xu, Hang Yan, Jiajun Chen, et al. OpenMobile: Building open mobile agents with task and trajectory synthesis.arXiv preprint arXiv:2604.15093, 2026
Pith/arXiv arXiv 2026
-
[4]
Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025
Pith/arXiv arXiv 2025
-
[5]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
Pith/arXiv arXiv 2025
-
[6]
Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026
Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv e-prints, pages arXiv–2602, 2026
2026
-
[7]
Ui-venus technical report: Building high-performance ui agents with rft
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025
arXiv 2025
-
[8]
Cogagent: A visual language model for gui agents
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024
2024
-
[9]
Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, et al. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025
arXiv 2025
-
[10]
Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020
2020
-
[11]
Llm-powered gui agents in phone automation: Surveying progress and prospects
Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, et al. Llm-powered gui agents in phone automation: Surveying progress and prospects. arXiv preprint arXiv:2504.19838, 2025
arXiv 2025
-
[12]
Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, et al. Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075, 2026
arXiv 2026
-
[13]
Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025
arXiv 2025
-
[14]
Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices
Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404–22414, 2025
2025
-
[15]
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025
Pith/arXiv arXiv 2025
-
[16]
Androidworld: A dynamic benchmarking environment for autonomous agents
Chris Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. InInternational Conference on Learning Representations, volume 2025, pages 406–441, 2025
2025
-
[17]
Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. ClawGUI: A unified framework for training, evaluating, and deploying gui agents.arXiv preprint arXiv:2604.11784, 2026. 11
Pith/arXiv arXiv 2026
-
[18]
Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration
Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024
2024
-
[19]
Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025
arXiv 2025
-
[20]
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026
arXiv 2026
-
[21]
A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026
Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.Advances in Neural Information Processing Systems, 38:17577–17604, 2026
2026
-
[22]
Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025
Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation.arXiv preprint arXiv:2508.15144, 2025
Pith/arXiv arXiv 2025
-
[23]
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management.arXiv preprint arXiv:2510.24699, 2025
arXiv 2025
-
[24]
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025
Pith/arXiv arXiv 2025
-
[25]
Appagent: Multimodal agents as smartphone users
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025
2025
-
[26]
G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026
Guibin Zhang, Muxin Fu, Kun Wang, Frank Wan, Miao Yu, and Shuicheng Yan. G-memory: Tracing hierarchical memory for multi-agent systems.Advances in Neural Information Processing Systems, 38:12988–13018, 2026
2026
-
[27]
Expel: Llm agents are experiential learners
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024
2024
-
[28]
Swift: a scalable lightweight infrastructure for fine-tuning
Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 29733–29735, 2025
2025
-
[29]
GPT-4V(ision) is a generalist web agent, if grounded
Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. InInternational Conference on Machine Learning, 2024
2024
-
[30]
Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents.arXiv preprint arXiv:2512.22047, 2025
arXiv 2025
-
[31]
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025. 12 Appendix organization.We place background discussion first, then benchmark and dataset details, follow...
Pith/arXiv arXiv 2025
-
[32]
Thinking: a <thinking>...</thinking> block explaining the next move (no multi-step reasoning)
-
[33]
name": <function-name>,
Tool call: a <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.,→ 26
-
[34]
failure"`immediately.,→ - If task is successfully completed, use`action=terminate`with`status=
Conclusion: a short <conclusion>...</conclusion> block describing what to do in the UI. Rules: - Output exactly in the order: <thinking>,<tool_call>,<conclusion>. - Be brief: one sentence for <thinking>, one for <conclusion>. - Do not output anything else outside those three parts. - **Task Feasibility**: If you determine the task is INFEASIBLE (e.g., req...
-
[35]
**Folded UI State**: Explicitly stored critical information extracted from UI
-
[36]
**Folded Action History**: Compressed records of past actions
-
[37]
type": "function
**Recent Step Record**: Full details of your most recent step (to be folded this turn) Under CONACT, these three fields form the structured context state, and the model may emit both UI actions and context actions (history folding or UI memory operations).,→ # Tools You may call ONE function per step. <tools> [ { "type": "function", "function": { "name": ...
-
[38]
**Thinking**:`<thinking>...</thinking>`- Your reasoning for next action AND folding decision
-
[39]
range": [start_step, current_step],
**Folding Directive**:`<folding>...</folding>`- JSON object specifying how to compress history: ```json {"range": [start_step, current_step], "summary": "Compressed description"} ``` - **Step-level Distillation** (start_step == current_step): Distill only the latest step into a compact record Example:`{"range": [5, 5], "summary": "[Step 5] Opened Settings...
-
[40]
**Tool Call**:`<tool_call>...</tool_call>`- Your action (UI or memory operation)
-
[41]
Include exact text, numbers, prices, names, counts visible
**UI Observation**:`<ui_observation>...</ui_observation>`- **DETAILED** screen description. Include exact text, numbers, prices, names, counts visible. Quote task-relevant info verbatim.,→ 28
-
[42]
**Action Intent**:`<action_intent>...</action_intent>`- What you INTEND to do next. ### Rules: - Output exactly in order: <thinking>, <folding>, <tool_call>, <ui_observation>, <action_intent> - First step (step 1): Skip <folding> as there's no history to fold - ALWAYS include <folding> from step 2 onwards - In <folding>, "range" must include the current s...
-
[43]
Output <thinking> with your reasoning
-
[44]
Skip <folding> for the first step
{"Skip <folding> for the first step" if self.current_step == 1 else "Output <folding> to compress your previous step(s)"}
-
[45]
Output <tool_call> with your action
-
[46]
Output <ui_observation> with **DETAILED** screen description (include ALL task-relevant info: exact text, numbers, prices, names, counts visible on screen),→
-
[47]
Output <action_intent> describing your planned action 29 Figure 14Representative process-hallucination failure. The agent deviates from the required workflow or falsely assumes that a necessary intermediate operation has been completed, causing progress loss even when the task remains feasible. 30 Figure 15Representative output-hallucination failure. The ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.