
arxiv: 2606.29472 · v1 · pith:6E6DCBN3new · submitted 2026-06-28 · 💻 cs.AI

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

Pith reviewed 2026-06-30 06:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords computer-use agents · observation interfaces · dynamic tasks · agent perception · multimodal agents · audio transcription · visual narration · benchmarks

The pith

Agent-Computer Observation Interfaces let models gain 17-48 points on dynamic tasks by adding gated audio and persistent narration to screenshots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current computer-use agents receive only periodic screenshots and stay blind to video, animations, transient events, and audio between captures. The paper introduces the Agent-Computer Observation Interface (AOI) as a model-agnostic layer that inserts inter-step keyframe capture, volume-gated audio transcription, and model-generated visual narration stored as text. These components stay silent on static content and revert to the standard loop without loss. On the new DynaCU-Bench of 100 dynamic browser tasks, models from 7B to frontier scale improve 17 to 48 percentage points over screenshot baselines with no retraining, turning many near-impossible tasks into largely solved ones. The largest lift appears on spoken-content subsets, where AOI agents complete every task.

Core claim

The Agent-Computer Observation Interface (AOI) decouples continuous, adaptive observation from discrete actions through three gated components: inter-step keyframe capture, volume-gated audio transcription, and CU-model-generated visual narration that persists as text. The components produce almost nothing on static, silent content, yet they let CU models from 7B to frontier scale gain +17 to +48 percentage points over screenshot baselines on DynaCU-Bench dynamic tasks with zero retraining.

What carries the argument

The Agent-Computer Observation Interface (AOI), a perception layer whose three gated components produce persistent text only when content is dynamic or voiced and otherwise reduce to the standard screenshot loop.
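The gating idea can be pictured with a minimal Python sketch. It is not the authors' implementation: the RMS volume gate, the mean-absolute-difference frame gate, both thresholds, and every name (`volume_gate`, `frame_gate`, `observe_between_steps`) are assumptions made for illustration. In particular, the Figure 13 caption suggests the paper's keyframe gate uses a CLIP-based threshold θ, so the pixel difference here is only a stand-in. Frames are modeled as flat lists of pixel intensities in [0, 1] and audio chunks as lists of samples in [-1, 1].

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence

# Hypothetical thresholds, chosen only for this illustration.
VOLUME_THRESHOLD = 0.02  # RMS amplitude below this counts as silence
FRAME_THRESHOLD = 0.05   # mean absolute pixel change below this counts as static


def volume_gate(samples: Sequence[float], threshold: float = VOLUME_THRESHOLD) -> bool:
    """Return True when the audio chunk is loud enough to be worth transcribing."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold


def frame_gate(prev: Sequence[float], curr: Sequence[float],
               threshold: float = FRAME_THRESHOLD) -> bool:
    """Return True when the frame differs enough from the last kept frame."""
    if len(prev) != len(curr):
        return True  # treat a size change as a change of content
    if not curr:
        return False
    diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
    return diff >= threshold


@dataclass
class Observation:
    keyframes: List[Sequence[float]] = field(default_factory=list)
    audio_chunks: List[Sequence[float]] = field(default_factory=list)


def observe_between_steps(frames, audio_chunks) -> Observation:
    """Collect only the frames and audio chunks that pass their gates."""
    obs = Observation()
    last_kept = frames[0] if frames else []
    for frame in frames[1:]:
        if frame_gate(last_kept, frame):
            obs.keyframes.append(frame)
            last_kept = frame
    for chunk in audio_chunks:
        if volume_gate(chunk):
            obs.audio_chunks.append(chunk)
    return obs
```

On a static, silent interval both lists stay empty, which is the sense in which such a layer reduces to the plain screenshot loop.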

If this is right

  • Spoken-content tasks become fully solvable while screenshot-only agents solve none.
  • The performance value comes mainly from narrating captured frames into persistent text rather than from keyframe selection itself.
  • Individual AOI components must be chosen per model, because on a newer model (Gemini 3 Flash) the keyframe stream regresses performance through image-token dilution.
  • Static control tasks show no degradation, so the interface preserves baseline behavior when content is unchanging.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Observation design choices may matter as much as action design for enabling agents to operate in live environments.
  • The same gated components could allow existing models to handle live meetings or interactive software sessions without further training.

Load-bearing premise

The DynaCU-Bench tasks represent real-world dynamic computer-use scenarios and the measured gains are caused by the AOI components rather than benchmark construction details or model-specific interactions with the narration text.

What would settle it

Running the same models on a fresh set of dynamic browser tasks outside DynaCU-Bench and finding that the 17-48 point gains disappear or that equivalent narration text supplied without the gated AOI structure produces identical results.

Figures

Figures reproduced from arXiv: 2606.29472 by Bojie Li, Noah Shi.

Figure 1. Main results: the AOI lifts every CU model on DynaCU-Bench (100 dynamic tasks), with zero retraining. Green numbers are absolute gains in percentage points over each model’s screenshot baseline (+9 to +48 pp). Full numbers, confidence intervals, and the per-model analysis are in Section 5.
Figure 2. The computer-use agent loop and its blind spots. A CU agent repeats observe → reason → act: it captures a single screenshot s_t (observation space S), the model samples an action a_t from a small grammar A (click/type/scroll/wait/done), the browser executes it, and after a buffer δ the next screenshot is taken. The interval is 3–5 s; between snapshots the agent is blind to anything that moves (video, animati…
Figure 3. Standard CU agents vs. AOI-equipped agents. Left: Current agents observe through periodic screenshots (∼3–5 s apart) and are blind and deaf between observations, missing video, audio, and transient UI events (✗). Right: The AOI continuously monitors screen and audio streams through fast gates (<1 ms) that skip processing for static, silent content. When gates fire, keyframe extraction (shown here with opti…
Figure 4. Example observation record sent to the CU model at step N of a meeting task. The context section provides text from prior steps (audio transcriptions and visual narrations persist after images are pruned). The new section contains the current audio transcription and the post-action screenshot (image). In this case, no keyframes were captured (the slide did not change), so only the screenshot is included as…
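The record layout described in this caption can be sketched as a small data structure. This is an illustrative reading only: the field names, the `build_record` helper, the dictionary layout, and the sample strings are hypothetical and do not come from the paper. The one property taken from the caption is that text from prior steps survives after their images are pruned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepObservation:
    step: int
    screenshot: Optional[bytes]                        # post-action screenshot
    keyframes: List[bytes] = field(default_factory=list)
    transcript: str = ""                               # audio transcription text
    narration: str = ""                                # visual narration text


def prune_images(obs: StepObservation) -> StepObservation:
    """Drop image payloads from an old step but keep its text."""
    return StepObservation(obs.step, None, [], obs.transcript, obs.narration)


def build_record(history: List[StepObservation], current: StepObservation) -> dict:
    """Assemble a text-only context section plus a 'new' section with images."""
    context = []
    for old in map(prune_images, history):
        if old.transcript:
            context.append(f"[step {old.step}] audio: {old.transcript}")
        if old.narration:
            context.append(f"[step {old.step}] visual: {old.narration}")
    images = [img for img in (*current.keyframes, current.screenshot) if img is not None]
    return {
        "context": context,
        "new": {"audio": current.transcript, "images": images},
    }


# Hypothetical two-step meeting example; the strings are invented.
history = [StepObservation(1, b"png1", [b"k1"], "The launch moved.", "Slide shows a timeline.")]
current = StepObservation(2, b"png2", [], "New date is April 28th.")
record = build_record(history, current)
```

With no keyframes captured at the current step, the `images` list holds only the screenshot, mirroring the case the caption describes.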
Figure 5. What the agent actually sees: real frames and narrations from a recorded benchmark run (Carousel task: “How many reviews mention the word ‘excellent’?” Claude Sonnet 4.6 + AOI full, the run reported in …
Figure 6. Per-family AOI gain (AOI full − Standard, in tasks per 10) across five models, families grouped by primary axis (audio, visual, interaction). Green = AOI helps, red = AOI hurts. Gains are positive almost everywhere and largest on the audio families (Podcast, Phone, Meeting), which are impossible from screenshots alone; the lone consistent regression is GAMES, where action latency, not perception, is the bott…
Figure 7. The AOI gain splits into a prompt-format and a perception component, both positive for every model. Bars stack the raw-screenshot baseline, the gain from the AOI’s structured observation record alone (no keyframes/audio), and the further gain from perception (keyframes + audio + narration). The worst-case view that the gain is “just reformatting” is ruled out; the split is model-specific (Claude perception…
Figure 8. Progressive contribution of each AOI component (Claude Sonnet 4.6). The first step (+20 pp) bundles inter-step keyframes with the structured prompt scaffold that bare screenshots lack. The scaffold carries +19 of it (Section 6.1), the keyframe images +1 pp as raw input but +10 pp once narration is present (Figure 9b). ASR adds +6 pp (directional, p = 0.18). Visual narration adds +18 pp, the largest single …
Figure 9. What narration and keyframes contribute. (a) Visual narration lifts Claude from 64% (visual+ASR, no narration) to 82%: discarding the narration text after it is generated keeps 74%, splitting the +18 pp into ∼+10 pp inference-time articulation (†directional, p=0.12) and +8 pp from persisting it as memory (p=0.039). (b) The marginal value of the inter-step keyframe images, measured with audio and narration …
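The numbers in the Figure 8 and Figure 9 captions can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in those captions; the screenshot baseline is not stated there and is derived here by walking the increments backwards, which assumes the three Figure 8 steps are simply additive.

```python
# Success rates are percentages of the 100 DynaCU-Bench tasks (Claude Sonnet 4.6).
full_aoi = 82             # Figure 9: with visual narration
no_narration = 64         # Figure 9: visual + ASR, no narration
narration_discarded = 74  # Figure 9: narration generated, then discarded

# Figure 8 increments, in percentage points.
scaffold_and_keyframes = 20
asr = 6
narration = 18

# Walk the ladder backwards to recover the implied screenshot baseline.
baseline = no_narration - asr - scaffold_and_keyframes

# Rebuild the ladder forwards from that baseline.
ladder = [baseline]
for gain in (scaffold_and_keyframes, asr, narration):
    ladder.append(ladder[-1] + gain)

# Split the narration gain into articulation and persistence.
articulation = narration_discarded - no_narration
persistence = full_aoi - narration_discarded

print(ladder)                     # [38, 58, 64, 82]
print(articulation, persistence)  # 10 8
```

The ladder ends at the 82% that Figure 9 reports, and the two narration parts sum to the +18 pp of Figure 8, so the two captions are mutually consistent under the additivity assumption.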
Figure 10. The AOI’s small +9 pp net on Gemini 3 Flash is a three-way cancellation, not saturation. Audio (+12) and the structured scaffold (+9) help as on every other model, but the keyframe-image stream regresses −12 pp, so the default bundle nets only 36 → 45. Dropping just the keyframes recovers 57/100 (+21 pp over Standard), showing the slack is real and the bundle merely mis-tuned. Underlying per-mode and per-f…
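The cancellation described in the Figure 10 caption is easy to verify. The sketch below treats the three component effects as additive, which is an assumption of this illustration rather than a claim of the paper; the variable names are likewise invented.

```python
# Gemini 3 Flash, percent of the 100 DynaCU-Bench tasks (Figure 10 caption).
standard = 36

# Component effects in percentage points.
audio = 12
scaffold = 9
keyframes = -12  # the keyframe-image stream regresses on this model

default_bundle = standard + audio + scaffold + keyframes
without_keyframes = standard + audio + scaffold

print(default_bundle, default_bundle - standard)        # 45 9
print(without_keyframes, without_keyframes - standard)  # 57 21
```

Both totals match the caption: the default bundle nets +9 pp, while dropping only the keyframes yields 57/100, or +21 pp over Standard.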
Figure 11. Trajectory comparison on two representative tasks (illustrative, cherry-picked to expose the mechanism). Top: Meeting task where the launch date is spoken aloud. The standard agent cannot hear it and exhausts its step budget on 15 wait() steps. The AOI agent transcribes the audio, hears “April 28th,” and acts in 2 steps. Bottom: Screencast task where the answer appears in a terminal recording. The standar…
Figure 12. Observation activity per family. Percentage of steps in which each channel fires; gray is idle (both gates suppress). Carousel is highly visual. Phone and Interview are audio-only. Meeting uses both. ∼61% of steps are idle overall.
Figure 13. CLIP threshold (θ) sensitivity on 40 visual tasks (categories C–F) with Claude Sonnet 4.6. Error bars are 95% Wilson confidence intervals at N=40 (±9.5–12.1 pp). Point estimates remain within an 80–90% band (shaded) across a 15× range of θ, but the CIs overlap heavily, so we cannot distinguish individual thresholds at this sample size. θ values are placed at uniform spacing on the axis (not to scale). The…
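The ±9.5–12.1 pp half-widths quoted in the Figure 13 caption can be reproduced with the standard Wilson score interval. In the sketch below, the function name, the choice of z = 1.96 for a 95% interval, and the mapping of the 80–90% band to 32/40 and 36/40 successes are assumptions of this illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return (low, high) of the Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for solved in (32, 36):  # 80% and 90% of the 40 visual tasks
    low, high = wilson_interval(solved, 40)
    half_width_pp = 100 * (high - low) / 2
    print(f"{solved}/40: [{low:.3f}, {high:.3f}], half-width {half_width_pp:.1f} pp")
```

Under these assumptions the half-widths come out at about 12.1 pp for 32/40 and 9.5 pp for 36/40. Intervals that wide relative to a 10 pp band are why individual thresholds cannot be told apart at this sample size.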
Original abstract

SWE-agent established the action interface as an underexplored design axis for software-engineering agents; we make the analogous case for the observation interface in computer-use (CU) agents. Current CU agents, closed and open-source alike, tie observation to action--one screenshot every 3-5 s, no audio--leaving them blind and deaf between screenshots to video, animations, transient UI events, meetings, and spoken instructions. We introduce the Agent-Computer Observation Interface (AOI), a model-agnostic perception layer that decouples continuous, adaptive observation from discrete actions through three gated components: inter-step keyframe capture, volume-gated audio transcription, and CU-model-generated visual narration that persists as text. Each produces almost nothing on static, silent content, reducing to the standard loop without degrading it. On DynaCU-Bench (100 dynamic browser tasks plus a 50-task static control), CU models from 7B to frontier scale gain +17 to +48 pp over their screenshot baselines with zero retraining, turning tasks that are near-impossible from periodic screenshots into largely solved ones. The gap is starkest on audio: on a spoken-content subset AOI agents solve every task, whereas streaming voice models hear accurately but cannot act on what they hear without the scaffold. The decomposition is as informative as the headline gain: keyframe selection turns out not to matter--the value comes from narrating captured frames into persistent text--and the interface is not a fixed bundle, since on a newer model (Gemini 3 Flash) the keyframe stream actively regresses through image-token dilution, so its components must be selected per model rather than shipped as one configuration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper argues that current computer-use (CU) agents are limited by observation tied to discrete actions (periodic screenshots, no audio), and introduces the Agent-Computer Observation Interface (AOI) as a model-agnostic perception layer with three gated components: inter-step keyframe capture, volume-gated audio transcription, and model-generated visual narration that persists as text. On the new DynaCU-Bench (100 dynamic browser tasks + 50-task static control), AOI yields +17 to +48 pp gains over screenshot baselines across 7B-to-frontier models with zero retraining. The decomposition identifies narration into persistent text as the main driver, while keyframe selection adds little value and the keyframe stream can regress on some models via image-token dilution.

Significance. If the gains are shown to be caused by the AOI components rather than benchmark construction or narration-model interactions, the work would usefully extend the design-space analysis begun by SWE-agent from actions to observations, offering a practical, zero-retraining intervention that turns near-impossible dynamic tasks into largely solved ones. The model-specific component selection finding and the audio-subset result (AOI solves all tasks while streaming voice models cannot act) are actionable. The absence of statistical details, task-authoring descriptions, and explicit ablations currently limits how far these claims can be taken.

major comments (3)
  1. [Abstract / DynaCU-Bench] Abstract and DynaCU-Bench evaluation: the manuscript provides no description of how the 100 dynamic browser tasks were authored, how transient events or spoken instructions were injected, or what exact prompts/context the narration model receives. This is load-bearing for the central claim because the largest gaps occur on the audio subset; without these details it remains possible that narration content systematically supplies information unavailable to the pure screenshot baseline.
  2. [Abstract] Abstract: the decomposition claim that 'keyframe selection turns out not to matter' and that 'the value comes from narrating captured frames into persistent text' is presented without quantitative ablation tables or controls showing the incremental contribution of each gated component; this directly underpins the paper's guidance that components must be selected per model.
  3. [Evaluation results] Evaluation results: no information is supplied on the number of evaluation runs, variance, or statistical significance testing for the reported +17 to +48 pp gains across model scales. This is required to assess whether the headline improvements are reliable or could be artifacts of single-run or implementation-specific effects.
minor comments (1)
  1. [AOI description] The phrase 'each produces almost nothing on static, silent content, reducing to the standard loop without degrading it' would benefit from an explicit statement of the fallback behavior and any measured overhead.
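To make the third major comment concrete: one common way to test a paired before/after comparison on the same tasks is an exact McNemar test over the discordant tasks. The sketch below is an assumption about what such a test could look like, not a description of the procedure behind the p-values in the figure captions; the function name and the example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: tasks solved only by the screenshot baseline.
    c: tasks solved only by the AOI-equipped agent.
    Tasks solved (or failed) by both agents do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as (b, c).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(1, 9)
```

Reporting the discordant counts alongside the headline gains would let readers recompute such a test themselves.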

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional detail will strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the requested information.

Point-by-point responses
  1. Referee: [Abstract / DynaCU-Bench] Abstract and DynaCU-Bench evaluation: the manuscript provides no description of how the 100 dynamic browser tasks were authored, how transient events or spoken instructions were injected, or what exact prompts/context the narration model receives. This is load-bearing for the central claim because the largest gaps occur on the audio subset; without these details it remains possible that narration content systematically supplies information unavailable to the pure screenshot baseline.

    Authors: We agree these details are necessary for reproducibility and to substantiate the audio-subset claims. The current manuscript does not include a full description of task authoring, transient event injection, spoken instruction mechanisms, or the exact prompts and context given to the narration model. In the revised manuscript we will add a dedicated subsection to the DynaCU-Bench description that specifies the task creation process, how dynamic and audio elements are injected, and the complete prompt templates used for visual narration.

    revision: yes

  2. Referee: [Abstract] Abstract: the decomposition claim that 'keyframe selection turns out not to matter' and that 'the value comes from narrating captured frames into persistent text' is presented without quantitative ablation tables or controls showing the incremental contribution of each gated component; this directly underpins the paper's guidance that components must be selected per model.

    Authors: The abstract summarizes our experimental decomposition, which found narration to persistent text as the dominant factor and keyframe selection to add little value (sometimes regressing via token dilution). However, the manuscript does not present explicit quantitative ablation tables showing incremental contributions of each component. We will add these ablation tables and controls to the evaluation section of the revised manuscript to support the per-model component selection guidance.

    revision: yes

  3. Referee: [Evaluation results] Evaluation results: no information is supplied on the number of evaluation runs, variance, or statistical significance testing for the reported +17 to +48 pp gains across model scales. This is required to assess whether the headline improvements are reliable or could be artifacts of single-run or implementation-specific effects.

    Authors: We acknowledge that the number of runs, variance, and statistical significance testing are required to evaluate result reliability. The manuscript currently omits these details. In the revision we will add an evaluation protocol subsection reporting the number of runs per task and model, observed variance, and any statistical significance tests performed.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance deltas on introduced benchmark

full rationale

The paper reports measured success-rate gains (+17 to +48 pp) of AOI-augmented agents versus screenshot baselines on DynaCU-Bench. No equations, derivations, fitted parameters, or uniqueness theorems appear; the central claims are direct experimental comparisons. The decomposition (narration as main driver) is likewise an observed outcome, not a reduction to self-defined inputs. Self-citations, if present, are not load-bearing for any derivation. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the new benchmark captures the relevant failure modes of current screenshot-based agents and that the three gated components are the operative cause of the observed gains.

axioms (1)
  • domain assumption: Periodic screenshot observation is the relevant baseline for computer-use agents.
    The paper positions AOI as an improvement over the standard one-screenshot-every-3-5s loop used by current CU agents.
invented entities (1)
  • Agent-Computer Observation Interface (AOI) (no independent evidence)
    purpose: Decouple continuous adaptive observation from discrete actions via three gated components.
    New perception layer introduced by the paper; no independent evidence outside the reported benchmark gains.

pith-pipeline@v0.9.1-grok · 5831 in / 1342 out tokens · 26344 ms · 2026-06-30T06:57:08.869650+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 25 canonical work pages · 12 internal anchors

  1. [1]

    Surfer 2: The next generation of cross-platform computer use agents

    Mathieu Andreux et al. Surfer 2: The next generation of cross-platform computer use agents. arXiv preprint arXiv:2510.19949,

  2. [2]

    Claude computer use

    Anthropic. Claude computer use. https://docs.anthropic.com/en/docs/agents-and-tools/computer-use, 2024a.
    Anthropic. Model context protocol. https://modelcontextprotocol.io, 2024b.
    Ahmed Awadallah et al. Fara-7B: An efficient agentic model for computer use. arXiv preprint arXiv:2511.19663,

  3. [3]

    GUI-World: A video benchmark and dataset for multimodal GUI-oriented understanding. arXiv preprint arXiv:2406.10819,

    Dongping Chen et al. GUI-World: A video benchmark and dataset for multimodal GUI-oriented understanding. arXiv preprint arXiv:2406.10819,

  4. [4]

    CUA-Suite: 55 hours of real computer-use recordings for agent benchmarking

    Liang Chen et al. CUA-Suite: 55 hours of real computer-use recordings for agent benchmarking. arXiv preprint arXiv:2510.10142,

  5. [5]

    WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    Accessed 2026-05-15; model id gemini-3-flash-preview.
    Hongliang He et al. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919,

  6. [6]

    VideoWebArena: Evaluating long context multimodal agents with video understanding web tasks. arXiv preprint arXiv:2410.19100,

    Lawrence Jang et al. VideoWebArena: Evaluating long context multimodal agents with video understanding web tasks. arXiv preprint arXiv:2410.19100,

  7. [7]

    ScreenLLM: Stateful screen schema for efficient action understanding and prediction. arXiv preprint arXiv:2503.20978,

    Yiqiao Jin et al. ScreenLLM: Stateful screen schema for efficient action understanding and prediction. arXiv preprint arXiv:2503.20978,

  8. [8]

    Microsoft

    Accessed 2026-05-14.
    Microsoft. NLWeb: A conversational interface for websites. https://github.com/microsoft/NLWeb,

  9. [9]

    MemGUI-Bench: Benchmarking memory in GUI agents. arXiv preprint arXiv:2509.12233,

    Joonsuk Park et al. MemGUI-Bench: Benchmarking memory in GUI agents. arXiv preprint arXiv:2509.12233,

  10. [10]

    Accessed 2026-06-28

    URL https://www.19pine.ai/blog/pine-ai-the-most-natural-human-computer-interface-is-your-voice. Accessed 2026-06-28.
    Rui Qian et al. Dispider: Enabling video LLMs with active real-time interaction via disentangled perception, decision, and reaction. arXiv preprint arXiv:2501.03218,

  11. [11]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631,

  12. [12]

    Robust Speech Recognition via Large-Scale Weak Supervision

    arXiv:2212.04356. Alec Radford et al. Learning transferable visual models from natural language supervision. In ICML,

  13. [13]

    Learning Transferable Visual Models From Natural Language Supervision

    arXiv:2103.00020.
    Pascal J. Sager et al. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions. arXiv preprint arXiv:2501.16150,

  14. [14]

    arXiv:2508.03923 [cs.CL] https://arxiv.org/abs/2508.03923

    Linxin Song et al. CoAct-1: Computer-using agents with coding as actions. arXiv preprint arXiv:2508.03923,

  15. [15]

    Cradle: Empowering foundation agents towards general computer control

    Weihao Tan et al. Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186,

  16. [16]

    Adaptive keyframe sampling for long video understanding. arXiv preprint arXiv:2502.21271,

    Xi Tang et al. Adaptive keyframe sampling for long video understanding. arXiv preprint arXiv:2502.21271,

  17. [17]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025a.
    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM agents. In International Conference on Machine Learning (ICML), 2024a.
    Xinyuan Wang...

  18. [18]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Accessed 2026-05-14.
    Tianbao Xie et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972,

  19. [19]

    StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding

    Haolin Yang et al. StreamAgent: Towards anticipatory agents for streaming video understanding. arXiv preprint arXiv:2508.01875,

  20. [20]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441,

  21. [21]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang et al. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793,

  22. [22]

    Mobile-Agent-v3: Fundamental Agents for GUI Automation

    Jiabo Ye et al. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144,

  23. [23]

    VideoGameBench: Can Vision-Language Models complete popular video games?

    Alex L. Zhang et al. VideoGameBench: Can vision-language models complete popular video games? arXiv preprint arXiv:2505.18134,

  24. [24]

    A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235,

    Ce Zhang et al. A simple LLM framework for long-range video question-answering. arXiv preprint arXiv:2312.17235,

  25. [25]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854,

  26. [26]

    Apollo: An exploration of video understanding in large multimodal models

    Orr Zohar et al. Apollo: An exploration of video understanding in large multimodal models. arXiv preprint arXiv:2412.10360,

  27. [27]

    What is the new launch date?

    Model        Standard                  AOI Full
                 Easy    Med     Hard      Easy    Med     Hard
    Claude 4.6   17/30   15/40   6/30      28/30   34/40   20/30
    GPT-5.4      15/30   17/40   5/30      21/30   26/40   10/30
    Gemini 2.5   10/30   9/40    2/30      24/30   33/40   12/30
    EvoCUA-32B   6/30    8/40    4/30      20/30   26/40   9/30
    Fara-7B      9/30    5/40    3/30      19/30   11/40   4/30

    Table 7. Efficiency and cost on the 100-task DynaCU-Bench. Grok-4 is omitted due to inference...
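The per-difficulty counts listed above can be summed to check them against the headline range. The sketch below copies the counts as extracted, reading the fused run `6/3028/3034/4020/30` in the Claude row as 6/30, 28/30, 34/40, 20/30; that reading and the dictionary layout are assumptions of this illustration. Because every model is scored on 30 + 40 + 30 = 100 tasks, a difference in solved tasks is also a difference in percentage points.

```python
# (solved, total) per difficulty tier in the order Easy, Med, Hard.
RESULTS = {
    "Claude 4.6": {"standard": [(17, 30), (15, 40), (6, 30)],
                   "aoi_full": [(28, 30), (34, 40), (20, 30)]},
    "GPT-5.4":    {"standard": [(15, 30), (17, 40), (5, 30)],
                   "aoi_full": [(21, 30), (26, 40), (10, 30)]},
    "Gemini 2.5": {"standard": [(10, 30), (9, 40), (2, 30)],
                   "aoi_full": [(24, 30), (33, 40), (12, 30)]},
    "EvoCUA-32B": {"standard": [(6, 30), (8, 40), (4, 30)],
                   "aoi_full": [(20, 30), (26, 40), (9, 30)]},
    "Fara-7B":    {"standard": [(9, 30), (5, 40), (3, 30)],
                   "aoi_full": [(19, 30), (11, 40), (4, 30)]},
}


def solved(tiers):
    """Total number of solved tasks across the three difficulty tiers."""
    return sum(s for s, _ in tiers)


totals = {m: (solved(r["standard"]), solved(r["aoi_full"])) for m, r in RESULTS.items()}
gains = {m: aoi - std for m, (std, aoi) in totals.items()}

for model, (std, aoi) in totals.items():
    print(f"{model:<11} {std:>3} -> {aoi:>3}  (+{gains[model]} pp)")
```

Under this reading the gains are +44, +20, +48, +37, and +17 pp, so the minimum and maximum match the +17 to +48 pp range quoted in the abstract.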