FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems
Pith reviewed 2026-05-10 09:50 UTC · model grok-4.3
The pith
FedGUI is the first comprehensive benchmark for federated GUI agents across mobile, web, and desktop; its experiments show that cross-platform collaboration improves performance over single-platform federated learning and identify platform and OS as the dominant heterogeneity factors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedGUI provides the first comprehensive benchmark for federated GUI agents, consisting of six curated datasets that enable systematic study of cross-platform, cross-device, cross-OS, and cross-source heterogeneity. Experiments demonstrate that cross-platform collaboration improves performance relative to single-platform federated learning and that platform and OS variations are the most influential heterogeneity factors.
What carries the argument
The FedGUI benchmark and its six curated datasets that isolate and compare the four heterogeneity types during federated training of GUI agents.
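As a rough illustration of what isolating a heterogeneity axis means in practice, trajectories can be grouped into one federated client per axis value. This is a minimal sketch, not FedGUI's actual pipeline; the field names, platform values, and records below are hypothetical.

```python
# Minimal sketch of heterogeneity-aware client partitioning, assuming each
# trajectory record carries "platform" and "os" tags. Field names and values
# are hypothetical, not taken from the FedGUI release.
from collections import defaultdict

def partition_clients(trajectories, axis):
    """Group trajectories into one federated client per value of `axis`."""
    clients = defaultdict(list)
    for traj in trajectories:
        clients[traj[axis]].append(traj)
    return dict(clients)

trajectories = [
    {"platform": "mobile",  "os": "android", "task": "open settings"},
    {"platform": "mobile",  "os": "ios",     "task": "send message"},
    {"platform": "web",     "os": "linux",   "task": "fill form"},
    {"platform": "desktop", "os": "windows", "task": "rename file"},
]

by_platform = partition_clients(trajectories, "platform")  # 3 clients
by_os = partition_clients(trajectories, "os")              # 4 clients
```

Partitioning the same pool along different axes is what lets a benchmark attribute performance differences to one heterogeneity dimension at a time.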
If this is right
- Cross-platform collaboration extends federated learning for GUI agents beyond mobile-only settings to full multi-platform environments.
- Platform and OS differences require priority attention when designing federated systems for GUI agents.
- The benchmark supplies a concrete basis for building scalable, privacy-preserving GUI agents ready for real-world deployment.
- Distinct heterogeneity dimensions call for targeted mitigation strategies in future federated GUI agent work.
Where Pith is reading between the lines
- GUI agent developers could adopt the benchmark to validate federated methods on heterogeneous device fleets before broad release.
- The dominance of platform and OS factors suggests that adaptive algorithms sensitive to these differences may yield further gains.
- Extending the datasets to include live user interactions could test whether the identified heterogeneity patterns hold in uncontrolled conditions.
- The privacy advantages of federated approaches in GUI settings may encourage integration into consumer applications with strict data rules.
Load-bearing premise
The six curated datasets accurately represent real-world cross-platform, cross-device, cross-OS, and cross-source heterogeneity without selection biases or unaccounted confounding variables.
What would settle it
Repeating the experiments on live, uncontrolled devices across platforms would settle it: observing no performance gain from cross-platform collaboration, or finding another heterogeneity type more dominant, would falsify the reported insights.
Original abstract
Training GUI agents with traditional centralized methods faces significant cost and scalability challenges. Federated learning (FL) offers a promising solution, yet its potential is hindered by the lack of benchmarks that capture real-world, cross-platform heterogeneity. To bridge this gap, we introduce FedGUI, the first comprehensive benchmark for developing and evaluating federated GUI agents across mobile, web, and desktop platforms. FedGUI provides a suite of six curated datasets to systematically study four crucial types of heterogeneity: cross-platform, cross-device, cross-OS, and cross-source. Extensive experiments reveal several key insights: First, we show that cross-platform collaboration improves performance, extending prior mobile-only federated learning to diverse GUI environments; Second, we demonstrate the presence of distinct heterogeneity dimensions and identify platform and OS as the most influential factors. FedGUI provides a vital foundation for the community to build more scalable and privacy-preserving GUI agents for real-world deployment. Our code and data are publicly available at https://github.com/wwh0411/FedGUI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FedGUI, the first comprehensive benchmark for federated GUI agents spanning mobile, web, and desktop platforms. It provides six curated datasets to study four heterogeneity types (cross-platform, cross-device, cross-OS, cross-source). Experiments demonstrate that cross-platform collaboration improves performance over mobile-only settings and identify platform and OS as the dominant heterogeneity factors. Code and data are publicly released.
Significance. If the datasets prove representative and the experimental protocols reproducible, FedGUI supplies a much-needed standardized testbed for heterogeneity-aware federated learning in GUI agents. The public code release and ablation-style comparisons across heterogeneity axes are concrete strengths that lower the barrier for follow-on work on scalable, privacy-preserving agents.
major comments (2)
- [§4] §4 (Dataset Construction): The claim that the six datasets isolate the four heterogeneity dimensions without confounding variables rests on task and source selection; however, the manuscript provides only high-level descriptions of curation criteria rather than quantitative diversity metrics or explicit checks against real-world usage distributions. This weakens the downstream assertion that platform and OS are the most influential factors.
- [§5.3] §5.3 (Heterogeneity Ablations): The reported performance gains from cross-platform collaboration are presented without error bars, confidence intervals, or statistical significance tests across runs. Given the stochastic nature of both FL training and GUI agent evaluation, it is unclear whether the observed improvements are robust or could be explained by variance in a subset of the six datasets.
minor comments (3)
- [Table 1] Table 1 and Figure 3: axis labels and legend entries use inconsistent abbreviations for the four heterogeneity types; a single glossary table would improve readability.
- [§6] §6 (Related Work): The discussion of prior mobile-only FL benchmarks is brief; adding one or two sentences contrasting FedGUI's cross-platform scope with the most closely related mobile GUI datasets would better situate the contribution.
- [Abstract] The abstract states that 'platform and OS [are] the most influential factors' but does not report the quantitative ranking or ablation delta that supports this ordering; a single sentence with the relevant numbers would strengthen the abstract.
Simulated Author's Rebuttal
Thank you for your thorough review and positive recommendation for minor revision. We appreciate the constructive feedback on dataset construction and experimental reporting. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§4] §4 (Dataset Construction): The claim that the six datasets isolate the four heterogeneity dimensions without confounding variables rests on task and source selection; however, the manuscript provides only high-level descriptions of curation criteria rather than quantitative diversity metrics or explicit checks against real-world usage distributions. This weakens the downstream assertion that platform and OS are the most influential factors.
Authors: We agree that quantitative metrics would strengthen the claims. In the revised manuscript, we will add quantitative diversity metrics for each dataset, such as statistics on task types, action distributions, UI element variety, and source characteristics. Regarding explicit checks against real-world usage distributions, comprehensive public data on GUI agent usage patterns is limited; we will add a discussion subsection explaining the curation process, its grounding in established benchmarks, and potential limitations. These changes will provide firmer support for the influence of platform and OS factors. revision: yes
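One simple instance of the diversity metrics promised above is the Shannon entropy of each dataset's action-type distribution: higher entropy means interaction types are spread more evenly. A sketch follows; the action names and counts are illustrative, not FedGUI statistics.

```python
# Sketch: Shannon entropy of the action-type distribution as a simple
# per-dataset diversity metric. Counts below are illustrative, not FedGUI's.
import math

def action_entropy(counts):
    """Shannon entropy (bits) of an action-type frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

mobile_counts  = {"CLICK": 700, "TYPE": 200, "SCROLL": 100}
desktop_counts = {"CLICK": 400, "TYPE": 250, "SCROLL": 150,
                  "RIGHT_CLICK": 100, "DOUBLE_CLICK": 100}

print(f"mobile entropy:  {action_entropy(mobile_counts):.2f} bits")
print(f"desktop entropy: {action_entropy(desktop_counts):.2f} bits")
```

Reporting such a number per dataset would let readers check that, for example, desktop data is not dominated by a single action type while mobile data is.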
- Referee: [§5.3] §5.3 (Heterogeneity Ablations): The reported performance gains from cross-platform collaboration are presented without error bars, confidence intervals, or statistical significance tests across runs. Given the stochastic nature of both FL training and GUI agent evaluation, it is unclear whether the observed improvements are robust or could be explained by variance in a subset of the six datasets.
Authors: We acknowledge the need for statistical rigor given the stochastic elements in FL and GUI evaluation. We will revise §5.3 to report results over multiple independent runs (minimum of five seeds), including error bars and 95% confidence intervals. We will also add paired statistical significance tests (e.g., t-tests) comparing cross-platform collaboration against baselines to confirm that gains are robust and not attributable to variance in specific datasets. revision: yes
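The statistical protocol committed to above can be sketched in a few lines: per-seed means, t-based 95% confidence intervals, and a paired t statistic. The success rates below are hypothetical stand-ins, not results from the paper.

```python
# Sketch of the statistical protocol described above: per-seed means, 95%
# confidence intervals, and a paired t statistic. The success rates below
# are hypothetical, not results from the paper.
import math
import statistics as st

T_CRIT_DF4 = 2.776  # two-sided 95% t critical value for df = 4 (5 seeds)

def mean_ci(scores, t_crit=T_CRIT_DF4):
    """Mean and half-width of the t-based confidence interval."""
    sem = st.stdev(scores) / math.sqrt(len(scores))
    return st.fmean(scores), t_crit * sem

def paired_t(a, b):
    """Paired t statistic for per-seed score pairs (a vs. b)."""
    diffs = [x - y for x, y in zip(a, b)]
    return st.fmean(diffs) / (st.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical task success rates over five seeds.
single_platform = [0.412, 0.398, 0.421, 0.405, 0.417]
cross_platform  = [0.447, 0.431, 0.452, 0.439, 0.444]

m1, h1 = mean_ci(single_platform)
m2, h2 = mean_ci(cross_platform)
t_stat = paired_t(cross_platform, single_platform)

print(f"single-platform: {m1:.3f} +/- {h1:.3f}")
print(f"cross-platform:  {m2:.3f} +/- {h2:.3f}")
print(f"paired t = {t_stat:.2f} (reject at 5% if > {T_CRIT_DF4})")
```

Pairing by seed controls for run-to-run variance shared by both conditions, which is exactly the concern the referee raises about FL training stochasticity.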
Circularity Check
No significant circularity; empirical benchmark with independent experimental observations
full rationale
The paper introduces FedGUI as a benchmark suite of six curated datasets and reports empirical results from federated learning experiments across heterogeneity axes. No mathematical derivation, parameter fitting, or predictive model is present whose outputs reduce to the inputs by construction. Claims rest on dataset curation, protocol execution, and ablation-style comparisons, with public code release enabling external verification. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text or abstract.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Federated learning methods can be directly applied to GUI agent training once heterogeneity is accounted for.
invented entities (1)
- FedGUI benchmark and its six curated datasets (no independent evidence)
discussion (0)