KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
Agents that excel at explicit mobile tasks still fall below 50 percent success when they must infer user preferences or decide when to intervene.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KnowU-Bench evaluates the full proactive decision chain for mobile agents in a live GUI setting. It contains 42 general tasks, 86 personalized tasks, and 64 proactive tasks. Agents receive only behavioral logs rather than explicit profiles and must interact with an LLM user simulator to elicit missing preferences, negotiate consent, execute grounded actions, and refrain from acting after a rejection. Under hybrid rule-based and LLM-as-a-judge scoring, agents that achieve high success on explicit tasks drop below 50 percent success when vague instructions demand preference inference or intervention calibration. The primary bottlenecks are preference acquisition and intervention timing.
What carries the argument
KnowU-Bench, an Android emulation benchmark that hides user profiles behind behavioral logs and pairs agents with an LLM-driven simulator for multi-turn preference elicitation and consent negotiation.
If this is right
- Mobile agents require explicit mechanisms for preference elicitation rather than static context lookup.
- Proactive decision making demands separate calibration of when to intervene, seek consent, or stay silent.
- Evaluation must include live multi-turn interaction instead of offline intent prediction.
- Training objectives should target the gap between interface operation and user-trustworthy assistance.
Where Pith is reading between the lines
- The same preference-acquisition and intervention problems likely appear in web and desktop agents, suggesting a general limitation across agent platforms.
- Incorporating explicit user-state modeling or memory modules could address the identified bottlenecks.
- The benchmark opens the possibility of measuring long-term user satisfaction by extending tasks across multiple sessions.
Load-bearing premise
The LLM-driven user simulator grounded in structured profiles produces clarification dialogues and proactive consent decisions that match actual human behavior.
What would settle it
A controlled comparison in which real users replace the simulator on the same 86 personalized and 64 proactive tasks and the resulting dialogues, consent rates, and success metrics diverge substantially from the simulated ones.
Original abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KnowU-Bench, a benchmark for interactive, proactive, and personalized mobile agents built on a reproducible Android emulation environment. It comprises 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks in which user profiles are hidden from the agent, forcing inference from behavioral logs via an LLM-driven user simulator that generates clarification dialogues and consent interactions. A hybrid rule-based plus LLM-as-Judge protocol evaluates the full proactive chain (GUI execution, consent negotiation, post-rejection restraint). Experiments demonstrate that agents strong on explicit tasks drop below 50% success on vague instructions, even for models such as Claude Sonnet 4.6, with the primary bottlenecks identified as preference acquisition and intervention calibration rather than GUI navigation.
Significance. If the simulator faithfully reproduces real-user clarification and consent patterns, the benchmark would provide a valuable shift from static preference recovery to live, multi-turn elicitation and proactive decision-making in GUI environments. The use of an external Android emulator, structured profile grounding, and hybrid evaluation protocol are concrete strengths that support reproducible testing of trustworthy personal assistance. The reported performance gap, if robust, would usefully redirect research attention from interface operation to preference modeling and restraint.
major comments (3)
- [Evaluation Protocol and User Simulator Description] The central claim that preference acquisition and intervention calibration are the dominant bottlenecks (rather than GUI navigation) rests entirely on metrics produced by the LLM-driven user simulator. No human grounding is reported: there is no side-by-side comparison of simulator-generated clarification frequency, consent thresholds, or rejection patterns against real users on the 86 personalized or 64 proactive tasks. This validation gap directly affects the reliability of the sub-50% degradation figures.
- [Experiments] The performance results (below 50% for frontier models) are presented without error bars, confidence intervals, or statistical significance tests across the task categories. This omission makes it impossible to determine whether the observed gap between explicit and vague-instruction settings is robust or sensitive to simulator stochasticity.
- [Evaluation Protocol] The hybrid rule-based plus LLM-as-Judge scoring protocol is described as calibrated, yet no details are given on how the LLM judge was prompted, how inter-annotator agreement with rule-based checks was measured, or how bias in the judge was assessed on the proactive consent tasks.
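The agreement measurement requested in the last major comment could be computed along the following lines. This is a minimal sketch, not the paper's protocol: the function name `cohens_kappa`, the "pass"/"fail" labels, and the example verdicts are all assumptions made for illustration, and the data is invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    Here the two "raters" would be the rule-based verifier and the
    LLM judge (an assumption; the paper does not describe this step).
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and perfect agreement is reported by convention here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented verdicts on six hypothetical tasks.
rule_based = ["pass", "pass", "fail", "fail", "pass", "fail"]
llm_judge = ["pass", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(rule_based, llm_judge)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one verdict (for example "fail" on vague-instruction tasks) dominates the sample.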
minor comments (2)
- [Abstract and Experiments] The abstract states the task counts (42/86/64) but the main text should include a table breaking down success rates by category and model to allow direct comparison.
- [Benchmark Design] Notation for the proactive decision chain (grounded execution, consent negotiation, restraint) could be formalized with a short diagram or pseudocode for clarity.
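One way the pseudocode requested in the last minor comment might look is sketched below. It is an illustration of the three-way choice described in the abstract (intervene, seek consent, or remain silent) followed by post-rejection restraint, not the benchmark's actual interface. Every name here (`Strategy`, `Outcome`, `run_proactive_episode`, `ask_user`, `execute`) is hypothetical.

```python
from enum import Enum
from typing import Callable


class Strategy(Enum):
    STAY_SILENT = "stay_silent"            # no intervention is warranted
    ASK_CONSENT = "ask_consent"            # propose first, act only if accepted
    ACT_AUTONOMOUSLY = "act_autonomously"  # act without interrupting the user


class Outcome(Enum):
    NO_ACTION = "no_action"
    EXECUTED = "executed"
    RESTRAINED = "restrained"  # user rejected the proposal, agent held back


def run_proactive_episode(
    strategy: Strategy,
    ask_user: Callable[[str], bool],  # hypothetical: True if the user consents
    execute: Callable[[], None],      # hypothetical: performs the GUI actions
    proposal: str,
) -> Outcome:
    """Walk one proactive episode through the decision chain."""
    if strategy is Strategy.STAY_SILENT:
        return Outcome.NO_ACTION
    if strategy is Strategy.ASK_CONSENT and not ask_user(proposal):
        # Post-rejection restraint: the action must not be executed.
        return Outcome.RESTRAINED
    execute()
    return Outcome.EXECUTED
```

In this framing, an agent can fail at three distinct points: picking the wrong strategy, mishandling the consent exchange, or executing after a rejection. Separating the outcomes makes each failure mode countable on its own.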
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on KnowU-Bench. We address each of the major comments below and outline the revisions we will make to the manuscript.
- Referee: [Evaluation Protocol and User Simulator Description] The central claim that preference acquisition and intervention calibration are the dominant bottlenecks (rather than GUI navigation) rests entirely on metrics produced by the LLM-driven user simulator. No human grounding is reported: there is no side-by-side comparison of simulator-generated clarification frequency, consent thresholds, or rejection patterns against real users on the 86 personalized or 64 proactive tasks. This validation gap directly affects the reliability of the sub-50% degradation figures.
Authors: We recognize the importance of human validation for the user simulator to support the reliability of our bottleneck analysis. The simulator is instantiated with structured profiles to generate clarification dialogues and consent interactions, but we have not performed a direct comparison with real-user behaviors. In the revised manuscript, we will include a new subsection in the limitations discussing this gap and its implications for interpreting the sub-50% performance figures. We will also provide additional qualitative analysis of the simulator's outputs to illustrate its grounding. A full human study remains an important direction for future work.
Revision: partial
- Referee: [Experiments] The performance results (below 50% for frontier models) are presented without error bars, confidence intervals, or statistical significance tests across the task categories. This omission makes it impossible to determine whether the observed gap between explicit and vague-instruction settings is robust or sensitive to simulator stochasticity.
Authors: We agree that statistical rigor is necessary. We will re-analyze the experimental results and add error bars representing standard deviations from multiple simulation runs, as well as 95% confidence intervals for the success rates in the general, personalized, and proactive task categories. Furthermore, we will include statistical significance tests (such as Wilcoxon signed-rank tests) comparing performance under explicit versus vague instructions to demonstrate the robustness of the observed degradation.
Revision: yes
- Referee: [Evaluation Protocol] The hybrid rule-based plus LLM-as-Judge scoring protocol is described as calibrated, yet no details are given on how the LLM judge was prompted, how inter-annotator agreement with rule-based checks was measured, or how bias in the judge was assessed on the proactive consent tasks.
Authors: We will provide comprehensive details on the LLM-as-Judge protocol in the revised manuscript. This includes appending the exact prompts used for evaluation, reporting inter-annotator agreement (e.g., agreement percentage and Cohen's kappa) between the rule-based verifier and the LLM judge on a sampled set of tasks, and adding an analysis of potential biases, particularly for consent-related decisions in proactive tasks. Examples of judge decisions will be included to demonstrate calibration.
Revision: yes
- Direct human validation data for the user simulator's clarification and consent patterns, as no such study was conducted in the original work.
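The 95% confidence intervals promised in the rebuttal could be computed per task category with a Wilson score interval, which behaves better than the normal approximation at the small sample sizes involved (42, 86, and 64 tasks). The sketch below is an assumption about how such an interval might be produced, not the authors' analysis. The function name `wilson_interval` is hypothetical, and the success count of 40 is invented purely to exercise the code.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Invented count: 40 successes on the 86 personalized tasks.
low, high = wilson_interval(40, 86)
```

If the resulting interval for a category straddles 50%, the "below 50%" headline for that category would not be statistically secure. The interval also covers only sampling noise over tasks, so run-to-run simulator stochasticity would still need the repeated runs the authors mention.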
Circularity Check
No significant circularity; empirical benchmark results independent of inputs
The paper introduces KnowU-Bench as a new evaluation environment built on Android emulation with 42 general, 86 personalized, and 64 proactive tasks. It instantiates an LLM-driven user simulator from structured profiles to generate clarification and consent interactions, then measures agent performance via hybrid rule-based plus LLM-as-Judge scoring. The central claim of performance degradation below 50% on vague instructions is an empirical observation from running external agents on these tasks, not a mathematical derivation, fitted parameter, or self-referential equation that reduces to the benchmark construction by definition. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the results. The evaluation chain remains self-contained against the external emulation and task definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption An LLM-driven simulator grounded in structured profiles can generate realistic clarification dialogues and consent behaviors that stand in for real users.
Forward citations
Cited by 1 Pith paper
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...