Recognition: 1 theorem link
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Pith reviewed 2026-05-14 23:56 UTC · model grok-4.3
The pith
GUI agent history tokens can be pruned heavily on old frames while keeping recent ones and background regions to cut cost with almost no performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GUI screenshots show a foreground-background split in which background patches encode interface-state transitions; random pruning preserves spatial structure better than deliberately designed strategies; and agents exhibit a recency effect that justifies larger token budgets for recent frames and aggressive compression of distant ones, yielding large cost savings with a negligible accuracy drop.
What carries the argument
Three empirical perspectives on pruning: edge-based foreground-background separation to reveal background value, comparison of random versus semantic pruning for spatial preservation, and recency-based token budget allocation across the screenshot history.
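The review does not spell out how the edge-based separation works; as a rough illustration, the sketch below partitions patches by a simple gradient-magnitude proxy for edge density. The function names, the gradient proxy, and the threshold are illustrative assumptions, not the paper's actual procedure.

```python
def edge_density(patch):
    """Mean absolute horizontal + vertical neighbor difference of a 2D
    grayscale patch; a crude proxy for how 'edgy' the patch is."""
    h, w = len(patch), len(patch[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal gradient
                total += abs(patch[y][x + 1] - patch[y][x])
                count += 1
            if y + 1 < h:  # vertical gradient
                total += abs(patch[y + 1][x] - patch[y][x])
                count += 1
    return total / count if count else 0.0

def split_foreground_background(patches, threshold):
    """Partition patch indices: high edge density suggests foreground
    (widgets, text), low edge density suggests background (flat layout
    regions that may still encode interface-state transitions)."""
    fg = [i for i, p in enumerate(patches) if edge_density(p) >= threshold]
    bg = [i for i, p in enumerate(patches) if edge_density(p) < threshold]
    return fg, bg
```

Under the paper's claim, the `bg` indices would not simply be discarded: even flat background patches can signal that the interface state changed between frames.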
If this is right
- Background patches supply auxiliary transition signals that improve agent reasoning when foreground is pruned.
- Random pruning delivers higher task success than semantic or attention-based pruning at identical token counts.
- Recency-weighted budgets let total visual tokens drop sharply while success rates stay within a few percent of the full-history baseline.
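The last two bullets can be sketched concretely. The paper's exact allocation schedule is not given in the review; the geometric decay, the `decay` parameter, and the helper names below are illustrative assumptions, showing only the shape of a recency-weighted budget combined with uniform random pruning inside each frame.

```python
import random

def recency_budgets(num_frames, total_budget, decay=0.5, min_tokens=1):
    """Split a total visual-token budget across the screenshot history so
    the newest frame gets the largest share and older frames decay
    geometrically (budgets[0] = newest frame)."""
    weights = [decay ** age for age in range(num_frames)]  # age 0 = newest
    s = sum(weights)
    return [max(min_tokens, int(round(total_budget * w / s))) for w in weights]

def random_prune(token_indices, budget, rng=None):
    """Keep a uniform random subset of a frame's tokens. Random selection
    tends to spread kept tokens across the image, which is the spatial
    advantage the paper attributes to random pruning."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    if budget >= len(token_indices):
        return list(token_indices)
    return sorted(rng.sample(token_indices, budget))
```

For a 4-frame history with a budget of 100 tokens and decay 0.5, this yields roughly [53, 27, 13, 7]: the distant frames are compressed heavily while the newest frame keeps most of its tokens.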
Where Pith is reading between the lines
- The same uneven temporal allocation might extend to other sequential visual tasks such as video game agents or robotic camera streams.
- If background regions carry transition signals, future pruning designs could explicitly protect low-frequency layout features rather than discard them as noise.
Load-bearing premise
The patterns seen with edge separation and the tested agent setups will hold for other GUI applications, models, and task types.
What would settle it
Run the same pruning experiments on a new suite of GUI tasks with different interface styles and observe whether background regions still improve reasoning or whether random selection loses its spatial advantage.
Original abstract
In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI Agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical study on token pruning for historical screenshots in GUI visual agents based on MLLMs. It distills three insights: (1) GUI screenshots exhibit a foreground-background semantic composition where background regions capture interface-state transitions (probed via edge-based separation); (2) random pruning has an inherent advantage in preserving spatial structure compared to designed strategies under fixed budgets; (3) GUI agents show a recency effect, so allocating larger token budgets to recent screenshots and compressing distant ones reduces cost while keeping performance nearly unchanged.
Significance. If the three observations hold beyond the tested cases, they supply actionable heuristics for pruning in GUI agents, directly addressing the token explosion from high-resolution screenshots and enabling lower-cost navigation without major accuracy loss. The recency allocation and random-pruning spatial benefit are especially practical for real-time MLLM agents.
Major comments (2)
- [Experiments / Results] The central claims rest on observations from specific setups (edge-based partitioning, pruning comparisons, recency allocation) yet the manuscript reports no cross-application validation, no tests on additional MLLMs, and no statistical controls for task distribution. This directly undermines the assertion that the insights supply 'practical guidance' for diverse GUI agents.
- [Abstract] The abstract and reported findings supply no experimental details, metrics, baselines, or statistical tests, making it impossible to verify whether the data actually support the three stated claims.
Minor comments (2)
- [Throughout] Notation for token budgets, pruning ratios, and foreground/background partitions should be introduced with explicit definitions and an early table or figure for clarity.
- [Figures] Figure captions and axis labels for any pruning-performance plots need to state the exact MLLM, screenshot resolution, and task set used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We provide detailed responses to each major comment below and indicate the revisions made to the manuscript.
Point-by-point responses
Referee: [Experiments / Results] The central claims rest on observations from specific setups (edge-based partitioning, pruning comparisons, recency allocation) yet the manuscript reports no cross-application validation, no tests on additional MLLMs, and no statistical controls for task distribution. This directly undermines the assertion that the insights supply 'practical guidance' for diverse GUI agents.
Authors: We agree that our experiments are based on specific setups and do not include cross-application validation or tests on additional MLLMs. We have added a Limitations section to the revised manuscript that explicitly discusses the scope of our findings and the need for future validation across more diverse GUI applications and models. Additionally, we have included statistical controls by reporting means with standard deviations and performing significance tests on the key comparisons. While we acknowledge this as a limitation, the consistent observations across the tested tasks support the practical insights for GUI visual agents in similar settings.
Revision: partial
Referee: [Abstract] The abstract and reported findings supply no experimental details, metrics, baselines, or statistical tests, making it impossible to verify whether the data actually support the three stated claims.
Authors: We agree and have revised the abstract to incorporate key experimental details, including the metrics used (task completion rate and token efficiency), the baselines for pruning strategies, and a summary of the main findings with supporting evidence from our experiments.
Revision: yes
Circularity Check
No significant circularity in purely empirical observations
Full rationale
The paper reports direct experimental observations on GUI screenshots: edge-based foreground/background partitioning, comparisons of pruning strategies including random pruning, and recency-based token budget allocation. No equations, derivations, fitted parameters, or predictions are presented that reduce to the inputs by construction. All three insights are framed as empirical findings from specific setups, with no self-citation load-bearing chains or ansatzes smuggled in. The derivation chain is self-contained observational work.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Edge-based separation can partition screenshots into meaningful foreground and background regions.
Lean theorems connected to this paper
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Linked passage: "GUI screenshots exhibit a distinctive foreground-background semantic composition... random pruning possesses an inherent advantage in preserving spatial structure... GUI Agents exhibit a recency effect"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. DivPrune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, and Jun Tang. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[3] Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024.
[4] Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, and Liqiang Nie. Less is more: Empowering GUI agent with context-aware simplification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[5] Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. Variation-aware vision token dropping for faster large vision-language models. arXiv preprint arXiv:2509.01552, 2025.
[6] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, 2024.
[7] Filippos Christianos, Georgios Papoudakis, Matthieu Zimmer, Thomas Coste, Zhihao Wu, Jingxuan Chen, Khyati Khandelwal, James Doran, Xidong Feng, Jiacheng Liu, et al. Pangu-Agent: A fine-tunable generalist agent with structured reasoning. arXiv preprint arXiv:2312.14878, 2023.
[8] Jinhong Deng, Wen Li, Joey Tianyi Zhou, and Yang He. SCOPE: Saliency-coverage oriented token pruning for efficient multimodal LLMs. arXiv preprint arXiv:2510.24214, 2025.
[9] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, 2023.
[10] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243, 2024.
[11] Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833, 2025.
[12] Lianyu Hu, Fanhua Shang, Liang Wan, and Wei Feng. iLLaVA: An image is worth fewer than 1/3 input tokens in large multimodal models. arXiv preprint arXiv:2412.06263, 2024.
[13] Wei Li, William E. Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. In Advances in Neural Information Processing Systems, 2024.
[14] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
[15] Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. Video compression commander: Plug-and-play inference acceleration for video large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1910–1924, 2025.
[16] Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, and Honggang Chen. Global compression commander: Plug-and-play inference acceleration for high-resolution large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 7350–7358, 2026.
[17] Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. GUIOdyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[18] Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. GUI agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, 2025.
[20] Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, and Sepp Hochreiter. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309, 2024.
[21] Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous AI agents. arXiv preprint arXiv:2408.07199, 2024.
[22] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. AndroidInTheWild: A large-scale dataset for Android device control. In Advances in Neural Information Processing Systems, 2023.
[23] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G2: Gaussian reward modeling for GUI grounding. arXiv preprint arXiv:2507.15846, 2025.
[24] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024.
[25] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[26] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. GUI agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024.
[27] Yuxin Wen, Qingqing Cao, Qichen Fu, Sachin Mehta, and Mahyar Najibi. Efficient vision-language models by summarizing visual tokens into compact registers. arXiv preprint arXiv:2410.14072, 2024.
[28] Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Linfeng Zhang. Token pruning in multimodal large language models: Are we solving the right problem? In Findings of the Association for Computational Linguistics: ACL 2025, 2025.
[29] Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, and Linfeng Zhang. Stop looking for "important tokens" in multimodal language models: Duplication matters more. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[30] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024.
[31] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[32] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
[33] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. In International Conference on Machine Learning, 2025.
[34] Nahid Zokaei, Sanjay Manohar, Masud Husain, and Eva Feredoes. Causal evidence for a privileged working memory state in early visual cortex. Journal of Neuroscience, 2014.
[35] Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, and Xuming Hu. Don't just chase "highlighted tokens" in MLLMs: Revisiting visual holistic context retention. arXiv preprint arXiv:2510.02912, 2025.