
arxiv: 2605.16024 · v1 · pith:UARXUOVJnew · submitted 2026-05-15 · 💻 cs.AI

ScreenSearch: Uncertainty-Aware OS Exploration

Pith reviewed 2026-05-20 18:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · OS exploration · partial observability · ambiguity reduction · screen deduplication · PUCT search · desktop applications · exploration policies

The pith

Ambiguity reduction alone does not suffice as an exploration objective for desktop GUI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Desktop GUI agents encounter partial observability because visually similar screens can represent different workflow states, so actions that look locally good can produce very different outcomes. ScreenSearch tackles this by maintaining a shared deduplicated state graph built from structural screen features and by defining an ambiguity signal from how often matched actions lead to different next states. The system then uses this signal together with frontier rewards inside a PUCT graph-bandit to decide when to keep probing versus when to commit. Large-scale runs across eleven applications produce over a million screenshots and thirty thousand unique states, yet replay evaluations reveal a clear novelty-ambiguity trade-off: policies that cut ambiguity fastest often discover little new territory. The results indicate that state identity, proposal quality, and ambiguity-aware search must all be considered together.

Core claim

ScreenSearch combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit. The central ambiguity signal is matched-action outcome dispersion on the deduplicated state graph: when similar screens produce different next states under the same action signature, the state is treated as unresolved and scheduled for further probing. Across eleven desktop applications the system collects over one million screenshots and over thirty thousand deduplicated states. On a fixed replay-start slice, policies that reduce ambiguity quickly discover little frontier, showing that ambiguity reduction by itself is not a sufficient exploration objective.
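A minimal sketch of how a matched-action outcome dispersion signal could be computed, assuming Shannon entropy over successor-state counts (the simulated rebuttal further down describes the signal as an entropy; the summary itself gives no equation). The names `TransitionLog`, `record`, `dispersion`, and `state_ambiguity`, and the choice to take the maximum over actions, are illustrative assumptions and not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class TransitionLog:
    """Hypothetical store of observed (state, action_signature, next_state) triples."""

    def __init__(self):
        # (state, action_signature) -> Counter of deduplicated successor states
        self._outcomes = defaultdict(Counter)

    def record(self, state, action_sig, next_state):
        self._outcomes[(state, action_sig)][next_state] += 1

    def dispersion(self, state, action_sig):
        """Entropy (in bits) of the successor distribution for one matched action."""
        counts = self._outcomes.get((state, action_sig))
        if not counts:
            return 0.0  # nothing observed yet: no measured dispersion
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        return sum(-p * math.log2(p) for p in probs)

    def state_ambiguity(self, state):
        """Assumed aggregation: the largest dispersion over actions tried in this state."""
        values = [self.dispersion(s, a) for (s, a) in self._outcomes if s == state]
        return max(values, default=0.0)


log = TransitionLog()
# The same action signature on matched screens reaches two different successors.
for nxt in ["dialog_saved", "dialog_saved", "overwrite_prompt", "overwrite_prompt"]:
    log.record("save_dialog", "click:Save", nxt)
log.record("save_dialog", "click:Cancel", "editor")
```

In this toy log, `click:Save` splits evenly between two successors, so the state would be treated as unresolved, while `click:Cancel` always reaches the same successor and contributes no dispersion.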

What carries the argument

An ambiguity signal defined as matched-action outcome dispersion on a deduplicated state graph, used inside a PUCT graph-bandit to balance probing and committing.
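As a rough illustration of how such a signal might enter a PUCT rule, the sketch below adds an ambiguity bonus to the standard PUCT score (value estimate plus prior-weighted exploration term). The additive form, the `ambiguity_weight` parameter, the default constants, and all names (`EdgeStats`, `puct_score`, `select_action`) are assumptions made for illustration; the summary does not state the exact scoring formula used in ScreenSearch.

```python
import math
from dataclasses import dataclass


@dataclass
class EdgeStats:
    prior: float             # proposal prior P(s, a)
    visits: int = 0          # N(s, a)
    value_sum: float = 0.0   # accumulated reward, e.g. frontier reward
    ambiguity: float = 0.0   # e.g. outcome dispersion in bits (assumed input)

    @property
    def q(self):
        # Mean value; zero for an unvisited edge.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge, parent_visits, c_puct=1.5, ambiguity_weight=0.5):
    """Standard PUCT term plus a hypothetical additive ambiguity bonus."""
    # max(1, ...) keeps the prior informative before any visit (an assumption).
    explore = c_puct * edge.prior * math.sqrt(max(1, parent_visits)) / (1 + edge.visits)
    return edge.q + explore + ambiguity_weight * edge.ambiguity


def select_action(edges, c_puct=1.5, ambiguity_weight=0.5):
    """Pick the action signature with the highest score among a state's edges."""
    parent_visits = sum(e.visits for e in edges.values())
    return max(
        edges,
        key=lambda a: puct_score(edges[a], parent_visits, c_puct, ambiguity_weight),
    )


edges = {
    "open_menu": EdgeStats(prior=0.5, ambiguity=0.0),
    "click_ok": EdgeStats(prior=0.5, ambiguity=1.0),
}
chosen = select_action(edges)
```

With equal priors and no visits, the ambiguous edge is probed first. Setting `ambiguity_weight=0` recovers an ambiguity-agnostic PUCT rule, which is the kind of baseline the referee report asks for.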

Load-bearing premise

The ambiguity signal based on matched-action outcome dispersion accurately reflects true workflow uncertainty rather than noise from retrieval or execution.

What would settle it

Re-run the same exploration with a different screen-retrieval or deduplication method and check whether the novelty-ambiguity trade-off disappears and whether ambiguity-only policies then discover substantial new frontier states.

Figures

Figures reproduced from arXiv: 2605.16024 by Justin Wagle, Michael Solodko.

Figure 1: Complementary exploration signals: novelty expands coverage, while ambiguity reduction resolves aliased states before commitment. Let G_t = (V_t, E_t) denote the global deduplicated state graph accumulated up to search time t, where V_t is the set of discovered deduplicated states and E_t is the set of observed labeled transitions. We represent a realized graph edge as a transition triple (s, σ, s′), where σ …
Figure 2: Screen representation components: shared discrete universes (left) and feature-set extraction
Figure 3: System overview of the data generation pipeline distributed across VMs.
Figure 5: Ambiguity change on the same subset. Lower Δu_t indicates stronger disambiguation, not greater frontier growth. We compare four reactive baselines that act directly from the current screen, and the uncertainty-guided PUCT graph-bandit with the default uniform prior. Every episode uses a fixed 50-action budget (steps 0–49 in the plots), starts from the same replayed state, and receives the same public inform…
Figure 6: Overview of screen similarity and retrieval process.
Figure 7: Cumulative number of unique states discovered over wall-clock time.
Figure 8: Per-step action decision time over total wall-clock time.
Figure 9: Maximum fraction of changed pixels within each deduplicated screen state.
Original abstract

Desktop GUI agents operate under partial observability: visually similar screens can correspond to different underlying workflow states, so locally plausible actions can lead to sharply different outcomes. We frame this as a problem of computer/OS state exploration, where effective behavior requires both expanding the reachable frontier and reducing ambiguity before committing. We present ScreenSearch, a system that combines structural screen retrieval and deduplication with an ambiguity-aware PUCT graph-bandit for large-scale desktop exploration. The retrieval layer converts UIA trees into location-aware structural features, indexes related screens through sparse token search and metadata filters, and maintains a shared deduplicated state graph across VM workers. On top of this graph, we define a scalable ambiguity signal based on matched-action outcome dispersion. If similar screens produce different next states under the same action signature, the state should be probed further rather than treated as resolved. We use this signal together with frontier rewards to drive large-scale exploration and replay-start policy evaluation over the shared graph. Across 11 desktop applications, ScreenSearch collects over 1M screenshots and over 30K deduplicated states, yielding large exploration corpora with substantial cross-application and within-application diversity. On a fixed replay-start slice, we observe a clear novelty-ambiguity trade-off: some policies reduce ambiguity quickly while discovering little frontier. Ambiguity reduction alone is therefore not a sufficient exploration objective. Appendix ablations show that stronger proposal priors can materially improve unique-state discovery during corpus building. These results suggest that state identity, proposal quality, and ambiguity-aware search all matter when deciding when to probe and when to commit.
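The retrieval layer described in the abstract can be pictured with a small sketch. It assumes a UI tree given as nested dictionaries with `type`, `name`, `rect` (x, y, width, height), and `children` keys, a coarse grid for location binning, and a Jaccard threshold for deduplication. All of these choices, along with the names `structural_tokens`, `jaccard`, and `StateIndex`, are hypothetical stand-ins; the paper's actual feature definitions and metadata filters are not given in this summary and are not modeled here.

```python
from collections import defaultdict


def structural_tokens(node, screen_w, screen_h, grid=4):
    """Flatten a UI tree into location-aware tokens: type|name|grid cell."""
    tokens = set()
    stack = [node]
    while stack:
        n = stack.pop()
        x, y, w, h = n["rect"]
        # Bin the element center into a coarse grid so small shifts do not matter.
        cx = min(grid - 1, int((x + w / 2) / screen_w * grid))
        cy = min(grid - 1, int((y + h / 2) / screen_h * grid))
        tokens.add(f'{n["type"]}|{n.get("name", "")}|{cx},{cy}')
        stack.extend(n.get("children", []))
    return frozenset(tokens)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class StateIndex:
    """Toy inverted index: match a screen to an existing state or create a new one."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold          # assumed similarity cutoff
        self.states = []                    # state id -> token set
        self.postings = defaultdict(set)    # token -> ids of states containing it

    def add_or_match(self, tokens):
        # Sparse candidate retrieval: only states sharing at least one token.
        candidates = set()
        for t in tokens:
            candidates |= self.postings.get(t, set())
        best = max(candidates, key=lambda i: jaccard(tokens, self.states[i]), default=None)
        if best is not None and jaccard(tokens, self.states[best]) >= self.threshold:
            return best
        new_id = len(self.states)
        self.states.append(tokens)
        for t in tokens:
            self.postings[t].add(new_id)
        return new_id


def screen(button_name, bx, by):
    button = {"type": "Button", "name": button_name, "rect": (bx, by, 50, 20)}
    return {"type": "Window", "name": "Save", "rect": (0, 0, 1000, 1000), "children": [button]}


index = StateIndex()
id_a = index.add_or_match(structural_tokens(screen("OK", 100, 100), 1000, 1000))
id_b = index.add_or_match(structural_tokens(screen("OK", 110, 105), 1000, 1000))
id_c = index.add_or_match(structural_tokens(screen("Cancel", 800, 800), 1000, 1000))
```

Here the second screen differs only by a small button shift inside the same grid cell, so it maps to the first state, while the third screen yields a new state.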

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ScreenSearch, a system for large-scale desktop OS exploration under partial observability. It combines UIA-tree structural retrieval and deduplication with an ambiguity signal (matched-action outcome dispersion on a shared state graph) and an ambiguity-aware PUCT graph-bandit. The work reports collecting >1M screenshots and >30K deduplicated states across 11 applications, and empirically observes a novelty-ambiguity trade-off across policies on replay-start slices, concluding that ambiguity reduction alone is not a sufficient exploration objective and that state identity, proposal quality, and ambiguity-aware search must be jointly considered.

Significance. If the central empirical observations hold, the manuscript contributes a large-scale, deduplicated exploration corpus and concrete evidence that single-objective strategies are inadequate for GUI agents. The scale of data collection and the shared-graph architecture are clear strengths that could support follow-on work; the explicit trade-off demonstration is a useful falsifiable claim for the community.

major comments (2)
  1. [Abstract (ambiguity signal paragraph) and §4 (method)] The ambiguity signal (described in the abstract as 'matched-action outcome dispersion on the deduplicated state graph') is load-bearing for the novelty-ambiguity trade-off claim. No explicit equation, pseudocode, or threshold definition is provided, nor is there validation that dispersion correlates with human-judged workflow differences rather than VM timing/async noise or retrieval artifacts. Without this, the trade-off could be an artifact of the measurement.
  2. [Results / replay-start evaluation] Results on the fixed replay-start slice report a 'clear novelty-ambiguity trade-off' but supply no quantitative baselines (e.g., random walk, standard PUCT without ambiguity term), error bars, or statistical tests. This makes it impossible to judge the magnitude or reliability of the observed differences that support the multi-objective conclusion.
minor comments (2)
  1. [Abstract] The abstract states 'over 1M screenshots and over 30K deduplicated states' but does not summarize per-application diversity metrics or deduplication false-positive rates.
  2. [Method] Notation for the PUCT exploration constant and ambiguity threshold is introduced without a consolidated table of all free parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the scale of the exploration corpus and the value of the observed trade-off. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract (ambiguity signal paragraph) and §4 (method)] The ambiguity signal (described in the abstract as 'matched-action outcome dispersion on the deduplicated state graph') is load-bearing for the novelty-ambiguity trade-off claim. No explicit equation, pseudocode, or threshold definition is provided, nor is there validation that dispersion correlates with human-judged workflow differences rather than VM timing/async noise or retrieval artifacts. Without this, the trade-off could be an artifact of the measurement.

    Authors: We agree that an explicit formulation is required for clarity and reproducibility. In the revised manuscript we will add a formal definition of the ambiguity signal in §4, including the equation for outcome dispersion (computed as the entropy of the distribution over deduplicated successor states reached by identical action signatures from structurally matched screens) together with pseudocode for its incremental update on the shared graph. We will also state any thresholds applied during policy selection. The current work does not contain a human validation study correlating dispersion scores with workflow differences; we will add an explicit limitations paragraph acknowledging that distinguishing genuine state ambiguity from retrieval or timing artifacts would benefit from such validation in follow-on work. We note that the UIA-tree structural features and metadata filters were chosen precisely to reduce sensitivity to VM timing noise, and the trade-off appears consistently across 11 applications, but we accept that stronger empirical grounding for the signal itself is desirable. revision: partial

  2. Referee: [Results / replay-start evaluation] Results on the fixed replay-start slice report a 'clear novelty-ambiguity trade-off' but supply no quantitative baselines (e.g., random walk, standard PUCT without ambiguity term), error bars, or statistical tests. This makes it impossible to judge the magnitude or reliability of the observed differences that support the multi-objective conclusion.

    Authors: We agree that the absence of explicit baselines and reliability measures weakens the interpretability of the reported trade-off. In the revision we will augment the replay-start evaluation with direct comparisons to random-walk and standard (ambiguity-agnostic) PUCT policies on the same fixed slices. We will report means and standard deviations across multiple independent runs to supply error bars and will include statistical significance tests (paired t-tests) either in the main results table or in the appendix. These additions will allow readers to assess both the magnitude and reliability of the novelty-ambiguity differences. revision: yes
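The reliability measures promised in the second response (means, standard deviations, and paired t-tests across runs on the same fixed slices) could be computed along the following lines. This is a sketch using only the Python standard library; the function names are hypothetical, and the numbers are made-up placeholders for illustration, not results from the paper.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for one policy's per-run metric."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for runs matched by replay start."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder values only: unique states found per run from the same replay starts.
ambiguity_aware = [12, 15, 11, 14]
ambiguity_agnostic = [10, 12, 10, 11]

mean_aware, sd_aware = summarize(ambiguity_aware)
t_stat, dof = paired_t(ambiguity_aware, ambiguity_agnostic)
```

The t statistic would then be compared against a t distribution with `dof` degrees of freedom to obtain a p-value, for example with a statistics package; that step is omitted here to keep the sketch dependency-free.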

Circularity Check

0 steps flagged

No circularity: empirical corpus statistics and policy comparisons


The paper defines an ambiguity signal directly from observed matched-action outcome dispersion on the deduplicated state graph and reports large-scale empirical results (1M+ screenshots, 30K states) plus observed novelty-ambiguity trade-offs across policies. No equations, derivations, or self-citations are invoked that reduce a claimed prediction or uniqueness result to a fitted parameter or prior ansatz defined in terms of the target outcome. The central claim that multiple objectives matter rests on direct data comparisons rather than any self-referential construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard graph-bandit machinery plus domain assumptions about UIA trees and VM replay; no new invented entities or heavily fitted parameters are introduced in the abstract.

free parameters (1)
  • PUCT exploration constant and ambiguity threshold
    Standard bandit hyperparameters that must be chosen or tuned to produce the reported novelty-ambiguity trade-off.
axioms (2)
  • domain assumption: UIA trees yield location-aware structural features sufficient to distinguish workflow states across applications
    Invoked in the retrieval layer description as the basis for indexing and deduplication.
  • domain assumption: Outcome dispersion under matched actions is a reliable proxy for state ambiguity
    Central to the ambiguity signal definition and the decision to probe further.

pith-pipeline@v0.9.0 · 5809 in / 1398 out tokens · 50460 ms · 2026-05-20T18:12:44.702709+00:00 · methodology


