MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Byeongung Jo; Insik Shin; Jaeyoung Wi; Joo Hyung Lee; Sangeun Oh; Seungwoo Baek; Sunjae Lee; Tae Hoon Min; Youngmin Im

arxiv: 2512.12634 · v4 · pith:QHECGYVLnew · submitted 2025-12-14 · 💻 cs.AI

Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee

Pith reviewed 2026-05-16 22:56 UTC · model grok-4.3

classification 💻 cs.AI

keywords: mobile GUI agents, benchmarking framework, offline evaluation, multi-path annotations, modular analysis, human agreement, AI agent evaluation

The pith

MobiBench provides a modular offline benchmark for mobile GUI agents that reaches 94.72 percent agreement with human evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current evaluation methods for mobile GUI agents either use single-path offline datasets that penalize valid alternative actions or rely on live online tests that lack scalability and reproducibility. MobiBench introduces multi-path annotations and a modular structure to overcome both issues in a fully offline setting. The framework decomposes agents into components for detailed analysis while preserving high agreement with human judgments. Experiments confirm it reaches 94.72 percent agreement, comparable to engineered online benchmarks, and surfaces insights on techniques, model scales, and design guidelines.

Core claim

MobiBench is the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents. It achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while retaining the scalability and reproducibility of static offline benchmarks, and it supports module-level analysis of agent performance.

What carries the argument

Multi-branch annotations paired with a modular decomposition of agent pipelines that separates perception, reasoning, and action modules for independent scoring.
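As a rough illustration of how multi-branch annotations change step-level scoring, the sketch below accepts an action if it matches any annotated branch for the current screen, whereas single-path scoring accepts only one reference action. The `Action` type, its field names, and the exact-match rule are assumptions made for illustration; they are not MobiBench's actual schema or matching criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action record: an action type plus a target element id."""
    kind: str    # e.g. "tap", "scroll", "input"
    target: str  # e.g. an element identifier on the current screen


def score_step(predicted: Action, valid_branches: set[Action]) -> bool:
    """Multi-path scoring: correct if the action matches ANY annotated branch."""
    return predicted in valid_branches


def score_step_single_path(predicted: Action, reference: Action) -> bool:
    """Single-path scoring: correct only if the action equals the one reference."""
    return predicted == reference


# Two annotated ways to reach settings from a made-up home screen.
branches = {Action("tap", "settings_icon"), Action("tap", "menu_button")}
reference = Action("tap", "settings_icon")
predicted = Action("tap", "menu_button")

multi_ok = score_step(predicted, branches)
single_ok = score_step_single_path(predicted, reference)
print(multi_ok, single_ok)  # prints: True False
```

The same valid alternative is accepted under multi-path scoring and rejected under single-path scoring, which is the penalty the benchmark is designed to avoid.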

If this is right

Different agent techniques can be compared fairly without penalizing valid alternative paths.
Performance bottlenecks can be isolated to specific modules such as perception or planning.
Optimal module configurations can be identified across different model sizes.
Actionable guidelines emerge for building more capable and cost-efficient mobile GUI agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same multi-path and modular structure could transfer to benchmarking GUI agents on web or desktop platforms.
Richer multi-path data might serve as improved training signals for agent models.
Widespread adoption would lower the cost and time of reliable agent evaluation, speeding iteration cycles.

Load-bearing premise

The multi-path annotations capture all valid alternative actions that human evaluators would accept without systematic omissions.

What would settle it

A direct comparison study that collects fresh human ratings on a held-out set of agent trajectories and measures whether MobiBench scores still reach at least 90 percent agreement.
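The check described above reduces to a simple agreement rate between two sets of verdicts on the same trajectories. A minimal sketch follows, assuming each trajectory receives one pass/fail verdict from the benchmark and one from a human rater; the verdict lists are made-up toy data, not results from the paper.

```python
def agreement_rate(benchmark: list[bool], human: list[bool]) -> float:
    """Return the percentage of trajectories where both verdicts coincide."""
    if len(benchmark) != len(human):
        raise ValueError("verdict lists must have the same length")
    if not benchmark:
        raise ValueError("at least one trajectory is required")
    matches = sum(b == h for b, h in zip(benchmark, human))
    return 100.0 * matches / len(benchmark)


# Toy held-out set of ten trajectories (illustrative values only).
benchmark_verdicts = [True, True, False, True, False, True, True, False, True, True]
human_verdicts = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(benchmark_verdicts, human_verdicts)
passes_threshold = rate >= 90.0  # the 90 percent bar named above
print(rate, passes_threshold)  # prints: 90.0 True
```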

Figures

Figures reproduced from arXiv: 2512.12634 by Byeongung Jo, Insik Shin, Jaeyoung Wi, Joo Hyung Lee, Sangeun Oh, Seungwoo Baek, Sunjae Lee, Tae Hoon Min, Youngmin Im.

**Figure 2.** Modular architecture of Mobile GUI Agents

**Figure 3.** Cost efficiency of different module combina…

**Figure 4.** Efficiency of latency-incurring techniques

**Figure 5.** Correlation between screen complexity and path diversity

**Figure 6.** Example techniques for screen parsing.

B.1 Input

B.1.1 A11y Tree. Android’s Accessibility framework enables extraction of the on-screen UI hierarchy by producing XML dumps that encode view structure, attributes, and interaction affordances. We use this mechanism to collect UI snapshots offline and construct a dataset containing serialized UI trees for each interaction state. During evaluation, the agent op…
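To make the A11y-tree input concrete, here is a minimal sketch of reading one serialized UI snapshot and listing its clickable elements. The sample XML and its attribute names (`class`, `text`, `clickable`, `bounds`) mirror the layout commonly produced by Android's `uiautomator dump`; whether MobiBench's dataset uses exactly this layout is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample in the style of a uiautomator dump (not real dataset content).
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.Button" text="Settings" clickable="true" bounds="[40,200][500,320]" />
    <node class="android.widget.TextView" text="Welcome" clickable="false" bounds="[40,80][1040,160]" />
  </node>
</hierarchy>"""

# Bounds are encoded as "[left,top][right,bottom]".
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def clickable_elements(xml_text: str) -> list[dict]:
    """Collect text, class, and pixel bounds for every clickable node."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes with missing or malformed bounds
        left, top, right, bottom = map(int, match.groups())
        found.append({
            "text": node.get("text", ""),
            "class": node.get("class", ""),
            "bounds": (left, top, right, bottom),
        })
    return found


elements = clickable_elements(SAMPLE_DUMP)
print(elements)
```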

Original abstract

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they rely on either single-path offline benchmarks or live online benchmarks. Offline benchmarks using static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking framework for mobile GUI agents, which enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current large foundation models (LFMs), and actionable guidelines for designing more capable and cost-efficient mobile agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MobiBench introduces a multi-path offline benchmark for mobile GUI agents that reaches 94.72% human agreement while adding modular breakdowns.

MobiBench tackles the core problems in evaluating mobile GUI agents by replacing single fixed paths with multiple annotated branches per task and splitting the agent into testable modules. This setup keeps evaluation offline, which improves scalability and reproducibility over live runs, while trying to avoid unfairly penalizing valid alternative actions that single-path benchmarks reject outright.

The headline result is the 94.72% agreement with human judges, which the paper positions as comparable to well-tuned online benchmarks. The module-level experiments also surface concrete observations about component contributions, model-scale effects, and current limitations in large foundation models. These pieces are grounded in direct human comparisons rather than fitted parameters, so the circularity risk stays low. The modular design is a clear practical addition that lets researchers isolate where agents fail instead of treating everything as a black box.

The main soft spot is coverage of the multi-branch annotations. The agreement number only holds if the collected paths include essentially every sequence a human would accept as correct. If the annotation process missed some navigation orders, error recoveries, or equivalent widget choices, then agents that take those routes get scored as wrong even though humans would approve, which would make the fidelity claim look stronger than the data supports. More detail on how paths were gathered and validated would help here.

This work is aimed at groups building or benchmarking mobile agents who need reproducible offline tests. Researchers in that area will find the framework and the component insights directly usable. The empirical grounding and the clear motivation make it worth sending to peer review for a full check on the annotation process and any edge cases in the evaluation.

Referee Report

2 major / 3 minor

Summary. The paper introduces MobiBench, a modular multi-branch offline benchmark for mobile GUI agents. It claims to resolve the unfair penalization of valid alternative actions in single-path offline benchmarks and the poor scalability/reproducibility of online live benchmarks by providing multi-path annotations and component-wise evaluation, reporting 94.72% agreement with human evaluators while enabling module-level analysis of techniques, model scales, and design guidelines.

Significance. If the multi-path annotations prove comprehensive and the agreement metric robust, MobiBench would represent a meaningful advance by delivering scalable, reproducible offline evaluation that matches the fidelity of online benchmarks, while also supplying actionable module-level insights that could guide more efficient GUI agent design.

major comments (2)

[§3 and §4.2] §3 (Benchmark Construction) and §4.2 (Human Agreement Evaluation): the 94.72% agreement claim is load-bearing for the central contribution, yet the manuscript provides insufficient detail on the procedure used to enumerate and validate the completeness of alternative paths (e.g., no quantitative coverage metric, no inter-annotator agreement on path exhaustiveness, and no explicit check for omitted error-recovery or navigation-order variants). This leaves open the possibility that agreement rates partly reflect annotation coverage rather than true behavioral equivalence.
[§5.3] §5.3 (Module-Level Analysis): the reported breakdowns by module and model scale are presented without an ablation that isolates the effect of multi-path versus single-path scoring on per-module performance; without this, it is unclear whether the modular insights are driven by the multi-branch feature or would hold under conventional single-path evaluation.
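One way to quantify the inter-annotator agreement on path exhaustiveness that the first major comment asks for is Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only: the `complete` and `incomplete` labels and the per-task judgments are made-up, and nothing in the text indicates which statistic the authors would report.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # convention: both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Each entry: does this annotator judge the task's annotated path set exhaustive?
annotator_a = ["complete", "complete", "incomplete", "complete", "incomplete", "complete"]
annotator_b = ["complete", "incomplete", "incomplete", "complete", "incomplete", "complete"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # prints: 0.667
```

Here the annotators agree on five of six tasks (observed 0.833) against a chance level of 0.5, giving kappa = 0.667.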

minor comments (3)

[§2] The related-work section (§2) omits several 2024 GUI-agent papers that also explore offline evaluation; adding them would strengthen positioning.
[Figure 2] Figure 2 (benchmark pipeline) would benefit from explicit call-outs for the multi-branch merging step and the exact matching criteria used in path comparison.
[§3.1] A few minor notation inconsistencies appear in the module-interface definitions (e.g., inconsistent use of M_i versus Module_i); a quick pass for uniformity would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of transparency and interpretability that we address below. We have prepared revisions to strengthen the manuscript on both points.

Referee: [§3 and §4.2] §3 (Benchmark Construction) and §4.2 (Human Agreement Evaluation): the 94.72% agreement claim is load-bearing for the central contribution, yet the manuscript provides insufficient detail on the procedure used to enumerate and validate the completeness of alternative paths (e.g., no quantitative coverage metric, no inter-annotator agreement on path exhaustiveness, and no explicit check for omitted error-recovery or navigation-order variants). This leaves open the possibility that agreement rates partly reflect annotation coverage rather than true behavioral equivalence.

Authors: We agree that greater detail on path enumeration and validation is warranted to substantiate the agreement metric. In the revised manuscript we will expand §3 to describe our multi-annotator protocol, report quantitative coverage statistics (average paths per task and saturation curves), provide inter-annotator agreement figures specifically for path exhaustiveness, and document the systematic inclusion of error-recovery and navigation-order variants. These additions will clarify that the observed agreement reflects comprehensive annotation rather than incomplete coverage. (Revision: yes.)
Referee: [§5.3] §5.3 (Module-Level Analysis): the reported breakdowns by module and model scale are presented without an ablation that isolates the effect of multi-path versus single-path scoring on per-module performance; without this, it is unclear whether the modular insights are driven by the multi-branch feature or would hold under conventional single-path evaluation.

Authors: We concur that an ablation isolating multi-path versus single-path scoring is necessary to interpret the module-level findings. In the revised §5.3 we will add a direct comparison that recomputes all module and scale breakdowns under single-path scoring and contrasts the results with the multi-path evaluation. This will show whether the reported insights depend on the multi-branch annotations. (Revision: yes.)
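The coverage statistics promised in the first response (average paths per task and saturation curves) could be computed along the following lines. This is a sketch under stated assumptions: a path is modelled as a tuple of screen or action names, each annotator contributes a set of paths per task, and all task names and paths are hypothetical. The authors' actual protocol is not described in the text.

```python
Path = tuple[str, ...]


def average_paths_per_task(paths_by_task: dict[str, set[Path]]) -> float:
    """Mean number of distinct annotated paths across tasks."""
    return sum(len(paths) for paths in paths_by_task.values()) / len(paths_by_task)


def saturation_curve(paths_per_annotator: list[set[Path]]) -> list[int]:
    """Cumulative count of distinct paths as each annotator's set is added."""
    seen: set[Path] = set()
    curve = []
    for paths in paths_per_annotator:
        seen |= paths
        curve.append(len(seen))
    return curve


# Hypothetical annotations for one task from three annotators.
wifi_annotations = [
    {("home", "settings", "wifi")},
    {("home", "settings", "wifi"), ("home", "quick_panel", "wifi")},
    {("home", "quick_panel", "wifi")},
]
curve = saturation_curve(wifi_annotations)

# Hypothetical merged path sets for two tasks.
merged = {
    "toggle_wifi": {("home", "settings", "wifi"), ("home", "quick_panel", "wifi")},
    "send_message": {("home", "messages", "compose", "send")},
}
avg = average_paths_per_task(merged)
print(curve, avg)  # prints: [1, 2, 2] 1.5
```

A curve that flattens as annotators are added, as in this toy case, suggests that few valid paths remain undiscovered, but it does not prove exhaustiveness.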

Circularity Check

0 steps flagged

No circularity; empirical agreement measured against independent human judgments

The paper presents MobiBench as an empirical benchmarking framework and reports a 94.72% agreement rate with human evaluators. This rate is obtained by direct comparison to external human annotations rather than any fitted parameters, self-citations, or internal derivations. No equations, predictions, or first-principles claims appear in the provided text that reduce to inputs by construction. The multi-path annotation process is described as an engineering choice whose coverage is validated externally via human agreement, keeping the central result independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that human agreement validates the benchmark and that the constructed dataset covers representative tasks and paths. No free parameters are fitted to data in the reported results.

axioms (1)

Domain assumption: Human evaluators provide reliable ground truth for valid agent actions.
The 94.72% agreement metric depends on this assumption being true.

invented entities (1)

MobiBench modular multi-path benchmark · no independent evidence
purpose: To enable scalable offline evaluation with component analysis
Newly introduced framework for mobile GUI agent testing.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG · 2026-05 · unverdicted · novelty 7.0
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG · 2026-05 · unverdicted · novelty 7.0
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI · 2026-04 · unverdicted · novelty 7.0
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

AgentLens: Adaptive Visual Modalities for Human-Agent Interaction in Mobile GUI Agents
cs.HC · 2026-04 · unverdicted · novelty 6.0
AgentLens adaptively deploys Full UI, Partial UI, and GenUI modalities with virtual display overlays for mobile GUI agents, yielding 85.7% user preference and best-in-study usability in a 21-participant evaluation.