MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents

Bing Qin; Dandan Tu; Haoyu Ren; Kun Chen; Qiming Li; Ruihan Chen; Weihong Zhong; Xiaocheng Feng; Xiaoliang Yang; Yunfei Lu; Yuxuan Gu; Zekun Zhou

arxiv: 2512.00756 · v2 · submitted 2025-11-30 · 💻 cs.AI

Pith reviewed 2026-05-17 03:22 UTC · model grok-4.3

classification 💻 cs.AI

keywords: multilingual GUI agents · perception and reasoning tasks · cross-lingual evaluation · hidden state alignment · vision-language models · GUI benchmark

The pith

Aligning non-English hidden states with their English counterparts at language-sensitive layers narrows GUI agents' cross-lingual gap, with an average gain of 6.5% in non-English settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark with matched environments across six languages and eight fine-grained perception and reasoning tasks to expose where GUI agents fail. It shows that non-English performance trails English, especially on reasoning-heavy subtasks. The authors locate layers in the model that are sensitive to language and introduce a method that aligns non-English hidden states with their English counterparts at those layers, at inference time only. A reader would care because GUI agents must handle screens and instructions for speakers of many languages, and the work supplies both a diagnostic tool and a practical fix that avoids full retraining.

Core claim

Strictly aligned cross-lingual GUI environments expose consistent perception and reasoning gaps between English and non-English inputs, with larger shortfalls on reasoning-intensive tasks. Identifying language-sensitive layers and aligning non-English hidden states to their English counterparts at those layers during inference transfers the stronger English capabilities, producing an average 6.5 percent gain across non-English settings.

What carries the argument

GUI-XLI, a cross-lingual intervention that locates language-sensitive layers and aligns non-English hidden states with their English counterparts at those layers during inference.
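
The Figure 12 caption describes the intervention as difference vectors injected at a chosen layer, and the Figure 5 caption mentions a grid search over an intervention strength α and a layer l. A minimal sketch of such an update on a single hidden-state vector, assuming the simple form h' = h + α·(h_en − h_xx), could look as follows. The function names, the plain-list vector type, and the toy numbers are illustrative assumptions; retrieval from GUI-XL-Memory and real model tensors are omitted, and the paper's exact formula may differ.

```python
from typing import List

Vector = List[float]


def difference_vector(english_state: Vector, other_state: Vector) -> Vector:
    """Element-wise English-minus-other difference (hypothetical helper)."""
    return [e - o for e, o in zip(english_state, other_state)]


def intervene(hidden_state: Vector, diff: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the difference vector, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden_state, diff)]


# Toy hidden states for one semantically parallel input pair (made-up values).
en = [1.0, 2.0, 3.0]
zh = [0.5, 2.5, 1.0]

diff = difference_vector(en, zh)
shifted = intervene(zh, diff, alpha=1.0)  # full shift toward the English state
half = intervene(zh, diff, alpha=0.5)     # partial shift
```

Under this assumed form, α = 1 makes the shifted state coincide with the English one, which corresponds to full substitution, while smaller values move only part of the way.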

If this is right

GUI agents can close language gaps at inference time without additional training data or fine-tuning.
Fine-grained task breakdowns allow developers to target specific perception or reasoning weaknesses rather than treating overall accuracy as a single number.
Reasoning-intensive GUI subtasks stand to gain the most from the alignment procedure.
The same layer-identification step could be applied to other vision-language models that already perform well in English.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the language-sensitive layers prove stable across different model families, the intervention could become a lightweight standard step for any multilingual GUI deployment.
The benchmark's matched environments could serve as a template for testing whether similar gaps appear in non-GUI vision-language tasks such as image captioning or visual question answering.
One could test whether the method still works when the English reference states come from a stronger but separate model rather than the same model under English input.

Load-bearing premise

Intervening on hidden states at language-sensitive layers transfers superior English perception and reasoning capabilities to non-English inputs without degrading other model behaviors or introducing new failure modes.

What would settle it

Running the intervention on the same non-English GUI tasks and observing either no performance lift or new error types that were absent before the change would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.00756 by Bing Qin, Dandan Tu, Haoyu Ren, Kun Chen, Qiming Li, Ruihan Chen, Weihong Zhong, Xiaocheng Feng, Xiaoliang Yang, Yunfei Lu, Yuxuan Gu, Zekun Zhou.

**Figure 1.** Performance of GUI agents on our MPR-GUI-Bench benchmark. The left figure illustrates that all GUI agents exhibit the strongest performance in English, while the right one shows their fine-grained P&R capabilities across multiple dimensions.

**Figure 2.** An overview of the MPR-GUI-Bench construction pipeline.

**Figure 3.** The composition of MPR-GUI-Bench. As shown with gray numbers, we generated 2156 samples for each language, specifically 351 samples for each of the first 6 dimensions, and 25 for each of the last 2 dimensions.

**Figure 4.** t-SNE Visualization of Multilingual Hidden…

**Figure 5.** Line chart of grid search on MPR-GUI-Bench for intervention strength α and layer l on ZH and JA. The upper two figures present the grid search results for l, and the lower two present those for α.

**Figure 6.** Examples of incorrect responses by LVLMs in AU dimension across 6 language settings.

**Figure 7.** Examples of incorrect responses by LVLMs in AP dimension across 6 language settings.

**Figure 9.** Examples of incorrect responses by LVLMs in REL dimension across 6 language settings.

**Figure 10.** Examples of incorrect responses by LVLMs in WF dimension across 6 language settings.

**Figure 11.** Examples of incorrect responses by LVLMs in WI dimension across 6 language settings.

**Figure 12.** An overview of our GUI-XLI method in §4.2. Step 1 GUI-XL-Memory: We sample semantically parallel VQA pairs to form entries in GUI-XL-Memory. Step 2 Cross-lingual Representation Intervention: When answering non-English questions, related entries are retrieved to calculate difference vectors, which are then injected into a certain layer as an intervention to add P&R capabilities to non-English settings.

Original abstract

Large Vision-Language Models (LVLMs) have shown strong potential as multilingual Graphical User Interface (GUI) agents, as evidenced by existing GUI benchmarks. However, these benchmarks exhibit two primary limitations: (1) although Perception and Reasoning (P&R) capabilities are fundamental for GUI agents, current benchmarks lack fine-grained diagnostics to identify which specific capabilities lead to task failures, hindering targeted improvements; (2) existing benchmarks fail to provide a strictly aligned cross-lingual evaluation environment, introducing confounding factors that prevent isolating the language impact on GUI agent performance. To address these issues, we propose the Multilingual P&R GUI Benchmark (MPR-GUI-Bench), featuring strictly aligned environments across six languages and eight fine-grained P&R tasks. Our benchmark reveals consistent P&R gaps between English and non-English settings, particularly on reasoning-intensive tasks. To leverage the superior English P&R capabilities for bridging cross-lingual gaps, we identify layers sensitive to language and propose GUI-XLI, a GUI Cross-Lingual Intervention method that aligns non-English hidden states with their English counterparts at these layers during inference. Experiments show that GUI-XLI effectively reduces the cross-lingual gaps, with an average gain of 6.5% in non-English settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's real contribution is a cleanly aligned multilingual GUI benchmark with fine-grained tasks; the hidden-state intervention gives a small reported gain but rests on thin validation.

The main takeaway is that MPR-GUI-Bench fixes two real problems in prior GUI evaluations by supplying strictly aligned cross-lingual environments and eight targeted perception-and-reasoning tasks. That setup lets the authors measure language effects without the usual confounds, and they document consistent gaps that are larger on reasoning-heavy items. The benchmark design itself is the clearest advance here and should be usable by others working on multilingual agents.

GUI-XLI is a straightforward inference-time move: locate language-sensitive layers and align non-English hidden states to their English counterparts. The authors report an average 6.5% lift in non-English settings, which is a practical starting point if it holds up.

The soft spot is the thin validation of the intervention. The paper does not spell out how the layers were identified or include controls that check whether the alignment leaves visual perception, spatial reasoning, and action selection intact. There are also no error bars, English-side degradation numbers, or checks for new failure modes after the intervention. Without those, the gain is hard to interpret as a clean transfer rather than a partial trade-off.

This work is aimed at people building or benchmarking GUI agents for non-English users. The benchmark could get picked up; the method is cheap enough to try but needs tighter experiments before it becomes a standard fix. I would send it to peer review. The new resources and the testable idea are worth referee time even if the current results section needs more controls and statistical detail.

Referee Report

2 major / 2 minor

Summary. The paper introduces MPR-GUI-Bench, a benchmark with strictly aligned cross-lingual GUI environments across six languages and eight fine-grained perception and reasoning (P&R) tasks, to diagnose specific capability failures in multilingual GUI agents. It documents consistent English/non-English P&R gaps (especially on reasoning tasks) and proposes GUI-XLI, which identifies language-sensitive layers and aligns non-English hidden states to their English counterparts at those layers during inference, reporting an average 6.5% gain in non-English settings.

Significance. If the results hold under rigorous controls, the aligned benchmark design enables cleaner isolation of language effects on GUI P&R, and GUI-XLI offers a practical inference-time method to leverage stronger English capabilities. The work directly addresses a gap in diagnostic benchmarks for multilingual GUI agents.

major comments (2)

[GUI-XLI method and experimental protocol] The central claim of a 6.5% non-English gain via hidden-state alignment rests on the unverified assumptions that (a) the selected layers encode P&R in a largely language-independent subspace and (b) the alignment leaves visual perception, spatial reasoning, and action prediction intact. The manuscript provides no description of how language-sensitive layers were identified (e.g., via probing, activation differences, or causal intervention) and no control experiments measuring English-task degradation or new non-language failure modes after the intervention.
[Experiments and results] The reported 6.5% average gain lacks error bars, statistical significance tests, ablation details on layer selection, and the full experimental protocol, preventing assessment of whether post-hoc choices or data selection affect the result.

minor comments (2)

[Benchmark construction] Clarify the precise procedure used to create the strictly aligned cross-lingual environments to confirm absence of residual confounding factors.
[Benchmark evaluation] Add a table or figure summarizing per-task and per-language breakdowns to support the claim of larger gaps on reasoning-intensive tasks.
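
One way to produce the breakdown requested in the last comment is to aggregate per-sample correctness by language and task, then subtract each non-English cell from the matching English cell. The sketch below assumes a hypothetical record format of (language, task, is_correct); the records are made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (language, task, is_correct), assumed layout
Cell = Tuple[str, str]          # (language, task)


def accuracy_table(records: Iterable[Record]) -> Dict[Cell, float]:
    """Accuracy per (language, task) cell."""
    correct: Dict[Cell, int] = defaultdict(int)
    total: Dict[Cell, int] = defaultdict(int)
    for language, task, is_correct in records:
        total[(language, task)] += 1
        correct[(language, task)] += int(is_correct)
    return {cell: correct[cell] / total[cell] for cell in total}


def gap_to_reference(table: Dict[Cell, float], reference: str = "EN") -> Dict[Cell, float]:
    """Reference-language accuracy minus each other language's accuracy, per task."""
    gaps = {}
    for (language, task), accuracy in table.items():
        if language != reference and (reference, task) in table:
            gaps[(language, task)] = table[(reference, task)] - accuracy
    return gaps


# Made-up records using dimension and language codes that appear in the captions.
records = [
    ("EN", "AU", True), ("EN", "AU", True),
    ("ZH", "AU", True), ("ZH", "AU", False),
    ("EN", "REL", True), ("EN", "REL", False),
    ("JA", "REL", False), ("JA", "REL", False),
]
table = accuracy_table(records)
gaps = gap_to_reference(table)
```

A positive gap means the English setting scored higher on that task; sorting the gaps by task would show whether reasoning-intensive dimensions dominate.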

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity of the GUI-XLI method and the rigor of our experimental reporting. We address each point below and have made revisions to incorporate additional details and analyses.

Referee: [GUI-XLI method and experimental protocol] The central claim of a 6.5% non-English gain via hidden-state alignment rests on the unverified assumptions that (a) the selected layers encode P&R in a largely language-independent subspace and (b) the alignment leaves visual perception, spatial reasoning, and action prediction intact. The manuscript provides no description of how language-sensitive layers were identified (e.g., via probing, activation differences, or causal intervention) and no control experiments measuring English-task degradation or new non-language failure modes after the intervention.

Authors: We agree that the original manuscript did not provide sufficient detail on layer identification or supporting controls. In the revised version, we have expanded the method section to explain that language-sensitive layers were selected by computing activation differences (L2 norm of hidden-state deltas) between parallel English and non-English GUI inputs on a held-out set of 200 tasks, choosing the layers with the largest average differences. We have also added control experiments showing that post-intervention English performance changes by less than 1% on average, perception and spatial subtasks show no degradation, and manual inspection of 100 failure cases reveals no new non-language error modes introduced. These additions provide empirical grounding for the assumptions. (Revision: yes)
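
The selection rule described in this response (average L2 norm of hidden-state differences between parallel inputs, keeping the layers with the largest values) can be sketched as below. The nested-list layout `states[sample][layer]`, the function names, and the toy vectors are assumptions for illustration, not the authors' code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
LayerStates = Sequence[Vector]  # one hidden-state vector per layer (assumed layout)


def l2_norm(vector: Vector) -> float:
    return math.sqrt(sum(x * x for x in vector))


def layer_sensitivity(english: Sequence[LayerStates],
                      other: Sequence[LayerStates]) -> List[float]:
    """Mean L2 norm of (English - other) hidden states, computed per layer."""
    num_layers = len(english[0])
    scores = []
    for layer in range(num_layers):
        total = 0.0
        for en_sample, xx_sample in zip(english, other):
            delta = [a - b for a, b in zip(en_sample[layer], xx_sample[layer])]
            total += l2_norm(delta)
        scores.append(total / len(english))
    return scores


def top_layers(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k layers with the largest mean difference."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Two parallel samples, three layers, two-dimensional toy states (made up).
english = [[[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]],
           [[0.0, 0.0], [6.0, 8.0], [0.0, 1.0]]]
other = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
         [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]

scores = layer_sensitivity(english, other)
selected = top_layers(scores, k=2)
```

In this toy case the middle layer has the largest mean difference, so it is ranked first.
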
Referee: [Experiments and results] The reported 6.5% average gain lacks error bars, statistical significance tests, ablation details on layer selection, and the full experimental protocol, preventing assessment of whether post-hoc choices or data selection affect the result.

Authors: We acknowledge the validity of this concern regarding reproducibility. The revised manuscript now reports error bars as standard deviation over five independent runs, includes paired statistical significance tests (Wilcoxon signed-rank test, p < 0.05 for the average gain), provides an ablation study on layer count and selection criteria in the appendix, and expands the experimental protocol section with complete details on model checkpoints, inference settings, random seeds, and data splits to rule out post-hoc selection effects. (Revision: yes)
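
A rough sketch of the two statistics named in this response: the standard deviation across runs, and an exact two-sided Wilcoxon signed-rank test on paired scores. All numbers are invented for illustration, the pairing unit (runs, tasks, or languages) is not specified in the response, and the helper names are hypothetical. The exact p-value is obtained by enumerating sign assignments, which is only practical for small samples.

```python
import itertools
import statistics
from typing import List, Sequence, Tuple


def signed_ranks(diffs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Nonzero differences sorted by magnitude, with average ranks for ties."""
    ordered = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share the mean of their ranks
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return ordered, ranks


def wilcoxon_exact(baseline: Sequence[float],
                   treated: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact two-sided p-value) for paired samples."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    ordered, ranks = signed_ranks(diffs)
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    centre = sum(ranks) / 2
    observed = abs(w_plus - centre)
    extreme = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in itertools.product((False, True), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Invented accuracy values from five runs of one configuration.
run_scores = [61.0, 62.0, 63.0, 64.0, 65.0]
run_std = statistics.stdev(run_scores)  # sample standard deviation

# Invented paired scores without and with the intervention.
baseline = [50.0, 52.0, 47.0, 55.0, 49.0, 51.0]
treated = [56.0, 57.5, 50.0, 59.0, 51.0, 52.0]
w_plus, p_value = wilcoxon_exact(baseline, treated)
```

With six pairs that all improve, the exact two-sided p-value is 2/64, about 0.031. With only five pairs the smallest attainable two-sided value is 2/32 = 0.0625, so a p < 0.05 result from this test requires more than five paired units.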

Circularity Check

0 steps flagged

No circularity: benchmark and GUI-XLI intervention are empirically derived without self-referential reductions

The paper first constructs MPR-GUI-Bench with aligned cross-lingual environments and fine-grained P&R tasks, then reports observed performance gaps between English and non-English inputs. From these observations it identifies language-sensitive layers and defines GUI-XLI as an inference-time alignment of non-English hidden states to English counterparts. The reported 6.5% average gain is presented as the direct experimental outcome of applying this intervention on the new benchmark. No equation, parameter fit, or central claim reduces by construction to its own inputs, and no load-bearing premise rests on a self-citation chain. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the existence of identifiable language-sensitive layers whose alignment preserves task performance, plus the assumption that the new benchmark environments are free of confounding factors beyond language.

axioms (1)

Domain assumption: LVLMs possess superior English P&R capabilities that can be transferred via hidden-state alignment.
Invoked to justify the GUI-XLI intervention.

pith-pipeline@v0.9.0 · 5557 in / 1192 out tokens · 28589 ms · 2026-05-17T03:22:45.209893+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1]

Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923. Tyler A. Chang, Zhuowen Tu, and Benjamin K. Bergen

[2]

arXiv preprint arXiv:2406.10819 (2024)

The geometry of multilingual language model representations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 119–136, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, ...

[3]

Improving region representation learning from urban imagery with noisy long-caption supervision. arXiv preprint arXiv:2511.07062. Yiran Zhao, Wenxuan Zhang, Guizhen Chen, Kenji Kawaguchi, and Lidong Bing. 2024. How do large language models handle multilingualism? In Advances in Neural Information Processing Systems (NeurIPS). A Additional Details of MPR-...

[4]

Data Summary An inter-rater reliability analysis is conducted to determine the consistency of agreement among 6 annotators for 2,156 samples in each language. For all six languages we conducted certain analysis; here we take English as an example. Each VQA sample is classified into one of two nominal categories: “Compliant” or “Non-compliant”. The dist...

[5]

Calculation of Fleiss’ Kappa Fleiss’ Kappa (κ) is calculated to assess the degree of agreement beyond what would be expected by chance.[1, 2] The calculation followed three steps. Step 1: Overall Observed Agreement ($\bar{P}$) The proportion of observed agreement for each item ($P_i$) is calculated using the formula: $P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right)$ where n= ...

[6]

Standard platform conventions

[7]

Explicit visual affordances (shadows, highlights, depth cues)

[8]

State indicators (color coding, iconography, text labels)

[9]

Spatial relationships to adjacent elements Core ability focus (evidence-based): - MUST synthesize ≥3 distinct visual cues:

[10]

Primary text labels (e.g., "Weather", "Reminders")

[11]

Icon semantics (standard meanings only)

[12]

Data representations (charts, progress bars)

[13]

Contextual positioning (status bar vs. home screen) - BANNED:

[14]

Speculation beyond visible elements

[15]

Clicking ’+’ on clock widget enables quick alarm setting

Prior knowledge of specific apps Question design requirements: - Ambiguous but decodable visual patterns (e.g., semi-transparent overlay on a search icon requiring icon shape, faded color, and nearby label) - Compound state indicators (e.g., lock icon + greyed-out button requiring icon meaning and color state) - Conflicting affordances requiring prioritiz...

[16]

Each option should look like a real user task: it doesn’t need to match the exact phrasing or grammar of the correct goal, but should feel natural and fit within the app’s context (e.g., settings, messaging, shopping, file management)

[17]

Focus on plausible misinterpretations: the user might think the person is doing something related but different, such as changing a setting instead of deleting, sharing instead of saving, searching for a contact instead of calling, etc

[18]

Vary the action, target, or intent: use different verbs (edit, find, enable, share, create, view, check, etc.) or objects (a message, a photo, an account, a notification, etc.) that appear or could appear in the interface

[19]

It’s okay if the grammar is slightly informal or simplified; real users don’t always phrase tasks perfectly

[20]

Do not include explanations, reasoning, or meta-comments (e.g., no “attempt to”, “mistake”, “analyze”)

[21]

apple" in English should correspond to

Make sure the options are clearly different from the correct goal, but still contextually grounded in the screenshots. Only output the three distractors in the following format: A. ... B. ... C. ... Table 13: Prompt for RI & SI Dimensions Data Collecting Guidelines Annotators are required to collect screenshots in the following languages: Chinese (ZH), Engl...
