ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi; Giuseppe Carenini; Peter West; Pranit Chawla; Vibhav Vineet; Yuhang He

arxiv: 2605.11212 · v3 · pith:ADUZIQ2Inew · submitted 2026-05-11 · 💻 cs.CL

Pith reviewed 2026-05-14 20:52 UTC · model grok-4.3

keywords: computer-use agents · visual token reduction · patch selector · multimodal language models · trajectory history · temporal redundancy · GUI agents

The pith

ReVision removes redundant visual patches from agent history screenshots to cut token usage by 46 percent while raising success rates by 3 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Computer-use agents incur high token costs from encoding many visual patches in each screenshot, which limits how much history they can use under fixed context budgets. ReVision trains models on trajectories where a learned patch selector compares representations across consecutive screenshots and drops only the redundant patches. This keeps the spatial structure the vision model requires. On three benchmarks with five history screenshots and the Qwen2.5-VL-7B model, the method reduces average token use by 46 percent and improves success rate by 3 percent over the baseline that retains every patch. The resulting efficiency allows performance to keep rising as more past observations are added, indicating that earlier saturation was caused by token waste rather than useless information.

Core claim

ReVision trains multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by approximately 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, performance continues to improve as more past visual…

What carries the argument

a learned patch selector that compares patch representations across consecutive screenshots to drop redundant patches while preserving spatial structure
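
The idea of comparing corresponding patches across consecutive screenshots can be illustrated with a minimal sketch. This is not the paper's learned selector: it swaps in a fixed cosine-similarity threshold as a stand-in, and the names `keep_mask`, `select_patches`, and the `threshold` value of 0.95 are all hypothetical. Whether ReVision drops patches from the earlier or the later frame is not stated here, so the sketch assumes the later frame is filtered.

```python
import math
from typing import List, Sequence, Tuple

Patch = Sequence[float]  # one patch embedding (hypothetical representation)


def cosine(a: Patch, b: Patch) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_mask(prev: List[Patch], curr: List[Patch],
              threshold: float = 0.95) -> List[bool]:
    """True means the patch changed enough to keep.

    Assumption: a fixed threshold replaces the learned selector.
    """
    if len(prev) != len(curr):
        raise ValueError("both frames must share the same patch grid")
    return [cosine(p, c) < threshold for p, c in zip(prev, curr)]


def select_patches(prev: List[Patch], curr: List[Patch],
                   threshold: float = 0.95) -> List[Tuple[int, Patch]]:
    """Return (grid_index, embedding) pairs for the kept patches.

    Keeping the original grid index is one simple way to retain each
    patch's spatial position after the others are dropped.
    """
    mask = keep_mask(prev, curr, threshold)
    return [(i, patch) for i, (patch, keep) in enumerate(zip(curr, mask)) if keep]


# Toy frames with three patches each; only the middle patch changes.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
kept = select_patches(prev_frame, curr_frame)
```

In this toy case only the patch at grid index 1 survives, because it is the only one whose embedding differs between the two frames. A learned selector would replace the threshold test with a trained decision.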

If this is right

Agents can process longer trajectories without exceeding fixed token budgets.
Performance improves steadily with added visual history once temporal redundancy is removed.
The observed saturation with history in prior work stems from token inefficiency rather than lack of useful past information.
The approach applies across OSWorld, WebTailBench, and AgentNetBench.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar patch-level redundancy reduction could extend to video-based agents in robotics or navigation.
The method might combine with other compression techniques to scale context even further.
Adaptive selection that also considers task relevance could yield larger gains than temporal comparison alone.

Load-bearing premise

The learned patch selector accurately identifies and removes only redundant patches without discarding task-critical visual information required for correct agent actions.

What would settle it

An experiment in which agents using the selected patches fail on tasks that succeed with the full set of patches, or in which adding more history after reduction produces no further performance gains.

Figures

Figures reproduced from arXiv: 2605.11212 by Amirhossein Abaskohi, Giuseppe Carenini, Peter West, Pranit Chawla, Vibhav Vineet, Yuhang He.

**Figure 1.** Token efficiency with ReVision. Left: ReVision removes redundant patches across steps, reducing token accumulation while preserving spatial structure. Right: ReVision achieves higher success rates at a maximum of 100 steps on OSWorld and WebTailBench, with lower token cost across models. Circle size indicates average steps to complete tasks.

**Figure 2.** Overview of ReVision. (a) ReVision removes redundant patches by comparing corresponding tokens across consecutive screenshots, reducing visual tokens while preserving spatial alignment before passing them to the LLM. (b) The model learns to attend to relevant regions in previous images, enabling effective reasoning with reduced visual input.

**Figure 3.** Success rate versus average tokens per step across OSWorld at 100 steps, Agent…

**Figure 4.** Success rate versus average trajectory length (number of steps) for OSWorld…

**Figure 5.** Saturation vs. history length. As the number of history images increases, the No Drop baseline saturates early due to rising token usage, while ReVision removes redundant tokens, delaying saturation and achieving higher performance under a similar budget.

**Figure 6.** Qualitative comparison of token selection strategies. We show patch retention across different methods for two consecutive steps (t−1 and t). For visualization purposes, we use lower-resolution images, resulting in fewer patches and clearer overlays. Random and spiral strategies remove patches indiscriminately, often discarding important UI elements. Pixel-based similarity removes more patches but fails t…

**Figure 7.** Success rate versus average tokens per step across OSWorld at 15 steps, 50 steps, …

**Figure 8.** Success rate versus average trajectory length (number of steps) for WebTailBench…

Original abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. This has resulted in no or very limited improvement in the performance when using history unlike other domains. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReVision cuts visual tokens by nearly half in computer-use agent histories with a learned selector and gets a small success lift, but the proof that it preserves critical patches is still weak.

ReVision gives a practical method for trimming visual tokens in computer-use agent histories by learning which patches to drop across frames. The new piece is the temporal patch selector trained specifically on GUI screenshot sequences. It compares representations from consecutive frames while preserving the spatial structure the underlying model needs. On Qwen2.5-VL-7B with 5 history screenshots, they report 46% average token reduction and a 3% success rate increase over the no-drop baseline across OSWorld, WebTailBench, and AgentNetBench. They also show that with this reduction, performance keeps rising as more history is added, which challenges the usual saturation story.

This is a clear efficiency win for anyone hitting token limits in agent trajectories. The benchmarks are relevant, and the idea of learned redundancy removal beats simple frame dropping.

The soft spot is that we lack details on the selector's training objective and whether it reliably keeps task-critical patches. The 3% lift is small, and without ablations that test dropping important patches or per-trajectory analysis of dropped content against action needs, it's unclear if the gains come from smart selection or just fewer tokens overall. The abstract doesn't include statistical tests or failure case breakdowns either.

This paper is for people building or scaling computer-use agents who need to handle longer visual histories without blowing up the context budget. A reader working on multimodal agents or GUI interaction would get value from the efficiency numbers and the history scaling observation.

It deserves a serious referee. The core claim is testable and addresses a known bottleneck, so full review makes sense to check the methods and strengthen the evidence on the selector. I'd send it to peer review.

Referee Report

3 major / 3 minor

Summary. The paper introduces ReVision, a method that trains multimodal LLMs for computer-use agents by removing temporally redundant visual patches from sequences of history screenshots via a learned patch selector. The selector compares patch representations across consecutive frames while preserving spatial structure. On OSWorld, WebTailBench, and AgentNetBench, using Qwen2.5-VL-7B with 5 history screenshots, it reports ~46% average token reduction and a 3% success-rate improvement over a no-drop baseline. The work further claims that, once redundancy is removed, agent performance continues to scale with additional history observations, implying that prior saturation effects stem from token inefficiency rather than limited utility of past visual information.

Significance. If the empirical claims hold under closer scrutiny, the result directly addresses a core scaling bottleneck for visual history in computer-use agents, where token counts grow linearly with trajectory length. Demonstrating both substantial efficiency gains and a modest performance lift on three distinct benchmarks would support the broader hypothesis that history saturation is an artifact of representation rather than an inherent limit, potentially enabling longer-horizon agents within fixed context budgets.
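
The linear growth of visual tokens with history length, and what a fixed drop rate buys under a fixed budget, can be made concrete with a small calculation. This is a back-of-the-envelope sketch: the budget of 6,000 visual tokens and 1,000 patches per screenshot are made-up numbers, `max_history` is a hypothetical helper, and applying the reported 46% average reduction uniformly at every history length is an assumption the paper does not make.

```python
def visual_tokens(patches_per_screenshot: int, history: int,
                  drop_rate: float = 0.0) -> float:
    """Visual tokens for the current screenshot plus `history` past ones.

    Assumption: the drop rate applies uniformly to every screenshot.
    """
    return (history + 1) * patches_per_screenshot * (1.0 - drop_rate)


def max_history(budget: int, patches_per_screenshot: int,
                drop_rate: float = 0.0) -> int:
    """Largest number of history screenshots that fits in `budget` tokens."""
    per_screenshot = patches_per_screenshot * (1.0 - drop_rate)
    if per_screenshot <= 0:
        raise ValueError("drop_rate must be below 1.0")
    return int(budget // per_screenshot) - 1


# Hypothetical numbers, chosen only to illustrate the arithmetic.
no_drop = max_history(budget=6000, patches_per_screenshot=1000)
with_drop = max_history(budget=6000, patches_per_screenshot=1000, drop_rate=0.46)
```

Under these illustrative numbers the no-drop setting fits 5 history screenshots and the 46% setting fits 10. The real ratio would depend on how redundancy changes as older screenshots are added.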

major comments (3)

§4 (Experiments): The headline 46% token reduction and 3% success lift are reported as aggregate figures without accompanying per-benchmark breakdowns, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This makes it impossible to determine whether the observed lift exceeds run-to-run variance or is driven by a subset of trajectories.
§3.2 (Learned Patch Selector): The training objective and loss used to optimize the patch selector are not specified. Without an explicit term that penalizes removal of low-redundancy but high-action-value patches (e.g., transient UI state changes), it remains unclear whether the selector truly preserves task-critical information or merely reduces tokens in a manner correlated with the training distribution.
§4.3 (Ablation and Analysis): No oracle ablation or per-trajectory inspection is provided that forces removal of ground-truth critical patches (identified via action relevance) and measures the resulting drop in success rate. Such a control is necessary to substantiate the claim that the selector removes only redundant content rather than discarding necessary visual signals.
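
The bootstrap confidence interval requested in the first major comment could be computed along the following lines. This is a minimal sketch of a paired percentile bootstrap, assuming per-task binary success outcomes (1 or 0) are available for both the no-drop baseline and ReVision on the same tasks. The function name, the resample count, and the toy data are hypothetical and are not results from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(baseline: List[int], treated: List[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean paired difference in success.

    Returns (observed_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    # Pair outcomes task by task so task difficulty cancels out.
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and recompute the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return observed, lower, upper


# Toy outcomes for eight tasks (illustrative only, not data from the paper).
baseline_success = [1, 0, 0, 1, 0, 1, 0, 0]
revision_success = [1, 0, 1, 1, 0, 1, 1, 0]
diff, lo, hi = paired_bootstrap_ci(baseline_success, revision_success)
```

If the resulting interval excludes zero, the lift is unlikely to be explained by task-level sampling noise alone. Variance across training seeds would still need separate runs.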

minor comments (3)

[Abstract] The abstract states concrete benchmark gains but does not cite the exact prior works that observed “no or very limited improvement” when adding history; adding 1–2 references would strengthen the motivation.
[§3] Notation for the patch selector (e.g., how spatial structure is preserved after dropping) is introduced without an accompanying equation or diagram in the main text; a small illustrative figure would improve clarity.
[Tables] Table captions should explicitly list the number of trajectories and random seeds used for each reported metric.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each of the major concerns below and will revise the paper to incorporate the suggested improvements where appropriate.

Referee: §4 (Experiments): The headline 46% token reduction and 3% success lift are reported as aggregate figures without accompanying per-benchmark breakdowns, standard deviations across runs, or statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals). This makes it impossible to determine whether the observed lift exceeds run-to-run variance or is driven by a subset of trajectories.

Authors: We agree that disaggregated results and statistical analysis would improve clarity. In the revised manuscript, we will add per-benchmark tables showing token reduction and success rates, include standard deviations from multiple experimental runs, and report p-values from paired t-tests or bootstrap confidence intervals to demonstrate that the improvements are statistically significant.

Revision: yes

Referee: §3.2 (Learned Patch Selector): The training objective and loss used to optimize the patch selector are not specified. Without an explicit term that penalizes removal of low-redundancy but high-action-value patches (e.g., transient UI state changes), it remains unclear whether the selector truly preserves task-critical information or merely reduces tokens in a manner correlated with the training distribution.

Authors: The patch selector is trained using a composite loss consisting of a temporal redundancy term (measuring similarity between patch embeddings across consecutive frames) and a task-specific term that encourages preservation of patches relevant to the agent's action prediction. This is achieved by backpropagating through the agent's success on the trajectories. We will explicitly detail this objective and the loss formulation in the revised §3.2. The end-to-end training with agent performance serves as the mechanism to avoid discarding high-value patches, as removing them would directly reduce success rates during training.

Revision: yes

Referee: §4.3 (Ablation and Analysis): No oracle ablation or per-trajectory inspection is provided that forces removal of ground-truth critical patches (identified via action relevance) and measures the resulting drop in success rate. Such a control is necessary to substantiate the claim that the selector removes only redundant content rather than discarding necessary visual signals.

Authors: We recognize the value of an oracle ablation study. However, constructing ground-truth critical patches would require additional human annotation or a separate model for action relevance, which was not feasible within the scope of this work. Our current evidence for preserving critical information includes the observed 3% success rate improvement over the no-drop baseline and the continued performance scaling with longer histories, which would be unlikely if essential visual signals were being removed. We will expand §4.3 with qualitative analysis of selected vs. dropped patches and discuss this limitation.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical results on held-out benchmarks

The paper introduces ReVision as a trained patch selector that removes redundant visual patches from screenshot trajectories while preserving spatial structure, then reports direct empirical measurements: ~46% average token reduction and +3% success rate on 5-history trajectories using Qwen2.5-VL-7B across OSWorld, WebTailBench, and AgentNetBench. These outcomes are obtained by training the selector and evaluating success rates on held-out benchmark trajectories; no equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing self-citation chain justifies the core claim, and no ansatz or uniqueness theorem is smuggled in. The central efficiency gain is therefore a measured quantity rather than a tautological re-expression of the training objective.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the assumption that a trainable selector can reliably detect redundancy across time while preserving spatial layout; the selector itself is the main added component, learned from trajectory data rather than derived analytically.

axioms (1)

Domain assumption: Multimodal language models can learn to selectively attend to or drop visual patches based on cross-frame similarity comparisons.
Invoked in the training of the patch selector on agent trajectories.

invented entities (1)

Learned patch selector (no independent evidence)
Purpose: To compare patch representations across consecutive screenshots and remove redundant ones while preserving spatial structure.
Core novel component introduced to achieve the reported token reduction.

pith-pipeline@v0.9.0 · 5554 in / 1424 out tokens · 66535 ms · 2026-05-14T20:52:25.854200+00:00 · methodology
