Recognition: 2 Lean theorem links
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Pith reviewed 2026-05-15 10:27 UTC · model grok-4.3
The pith
Bio-inspired non-uniform sampling lets vision-language models keep 82-97% of full accuracy using only 1-5% of image pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMind achieves adaptive visual representations for frozen VLMs through a Bio-inspired Adaptive Sampling Strategy (BASS): a Mobius-parameterized module performs non-uniform sampling while preserving global structure, and closed-loop semantic feedback aligns the sampling with textual task information at test time. This retains up to 82%, 92%, and 97% of full-resolution performance with 1%, 3%, and 5% of the pixels, respectively, and yields average gains of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA over uniform sampling at tight budgets.
What carries the argument
A Mobius-parameterized non-uniform sampling module, the Bio-inspired Adaptive Sampling Strategy (BASS), combined with closed-loop semantic feedback from the frozen VLM that adjusts perceptual saliency.
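To make the mechanism concrete, here is a minimal sketch of Mobius-warped sampling: a uniform grid is pushed through z = (aw + b)/(cw + d), which concentrates samples near a foveal fixed point while keeping the periphery covered. The parameter values, the unit-square coordinate frame, and the nearest-pixel rounding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mobius_sample(image, budget=0.01, a=1.0, b=0.0, c=0.35, d=1.0):
    """Pick a pixel-budget subset of `image` through a Mobius-warped grid.

    A uniform complex grid w over [-1, 1]^2 is mapped through
    z = (a*w + b) / (c*w + d), a Mobius transformation; for the default
    (assumed) parameters this densifies samples near the image center.
    """
    h, w_px = image.shape[:2]
    n = max(1, int(budget * h * w_px))       # pixel budget
    side = int(np.ceil(np.sqrt(n)))          # near-square sampling grid

    u = np.linspace(-1.0, 1.0, side)
    grid = (u[None, :] + 1j * u[:, None]).ravel()[:n]

    z = (a * grid + b) / (c * grid + d)      # ad - bc != 0 keeps it invertible

    # Map back to pixel indices; clipping catches points warped off-image.
    xs = np.clip(np.round((z.real + 1) / 2 * (w_px - 1)).astype(int), 0, w_px - 1)
    ys = np.clip(np.round((z.imag + 1) / 2 * (h - 1)).astype(int), 0, h - 1)
    return np.stack([ys, xs], axis=1)        # (n, 2) sampled (row, col) pairs

coords = mobius_sample(np.zeros((336, 336, 3)), budget=0.01)  # 1% budget
```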
If this is right
- Existing VLMs can operate under severe pixel or bandwidth constraints while retaining most task performance without retraining.
- Plug-and-play deployment is possible across any frozen VLM architecture since no model weights or layers are modified.
- Non-uniform sampling outperforms uniform reduction at every tested budget, showing that adaptive allocation is the key efficiency lever.
- The method scales to both scene-level and region-guided VQA, indicating broad applicability within current benchmarks.
Where Pith is reading between the lines
- Real-time mobile or edge VLM applications could reduce latency and power use by routing only the adaptively sampled pixels through the vision encoder.
- Extending the feedback loop to incorporate user-provided text prompts before sampling might further improve task alignment in interactive settings.
- The same non-uniform principle could be tested on video inputs by adding temporal consistency constraints to the Mobius sampling across frames.
Load-bearing premise
The sampling decisions driven by Mobius parameterization and semantic feedback will always identify and preserve the exact visual details required for each task without systematic omissions or biases that uniform sampling would avoid.
What would settle it
Run the same VQA benchmarks on a new dataset containing questions whose answers depend on fine details distributed uniformly across low-saliency background regions; if accuracy falls below uniform sampling at the 1-5% pixel budgets, the claim fails.
Original abstract
Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
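As a rough illustration of the closed-loop feedback, the sketch below adapts sampler parameters at test time with SPSA (simultaneous perturbation stochastic approximation), which the paper's excerpts associate with CSF. The parameter vector and the score function are stand-ins; in the method proper the score would come from the frozen VLM's response to the textual query.

```python
import numpy as np

def spsa_step(theta, score_fn, a=0.05, c=0.1, rng=np.random):
    """One SPSA ascent step on a black-box score.

    The gradient is estimated from just two evaluations under a random
    +/- perturbation, so no backpropagation through the VLM is needed.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher directions
    g_hat = (score_fn(theta + c * delta) - score_fn(theta - c * delta)) / (2 * c) * delta
    return theta + a * g_hat

# Usage: adapt hypothetical Mobius parameters (a, b, c, d) at test time.
theta = np.array([1.0, 0.0, 0.35, 1.0])
score = lambda t: -np.sum((t - 1.0) ** 2)  # placeholder for a VLM answer score
for _ in range(20):
    theta = spsa_step(theta, score)
```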
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LLMind, a training-free framework for adaptive visual representations in VLMs that mimics human foveated vision and cortical magnification via a Bio-inspired Adaptive Sampling Strategy (BASS) using Mobius-parameterized non-uniform sampling, augmented by closed-loop semantic feedback (CSF) for test-time alignment with VLM textual queries. It claims substantial efficiency gains on VQA benchmarks, retaining 82%/92%/97% of full-resolution performance at 1%/3%/5% pixel budgets with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA over uniform sampling.
Significance. If the central claims hold under rigorous verification, the work offers a lightweight, plug-and-play approach to reducing pixel budgets in VLMs without retraining, which could meaningfully improve inference efficiency for resource-constrained deployments. The bio-inspired framing and training-free design are strengths that distinguish it from parameter-heavy alternatives, though the absence of machine-checked elements or reproducible code limits immediate impact.
major comments (4)
- [Results] Results section: The retention figures (82% at 1%, 92% at 3%, 97% at 5%) and benchmark improvements are stated without error bars, standard deviations across runs, statistical tests, or detailed baseline configurations (e.g., exact uniform sampling implementation and VLM backbone variants), undermining verification of the headline performance claims.
- [Method (BASS)] BASS module description: The Mobius parameterization for non-uniform sampling is presented as a new construct without derivation from the paper's own equations or ablation on parameter sensitivity; this leaves open whether it systematically under-samples spatially diffuse elements (e.g., counting or spatial-relation queries) at 1% budgets, as the skeptic concern notes.
- [Experiments] Evaluation protocol: No per-question-type or per-scene-complexity breakdowns are provided to test the assumption that CSF feedback preserves all task-critical details; aggregate averages alone cannot rule out bias against distributed scene information.
- [Ablations] Ablation and comparison: The manuscript lacks ablations isolating BASS versus CSF contributions or comparisons to other non-uniform sampling methods beyond uniform baselines, making it impossible to attribute gains specifically to the bio-inspired components.
minor comments (2)
- [Abstract] Abstract: The claim of providing 'the first systematic analysis' of bio-inspired methods is not supported by explicit enumeration of prior works analyzed or the analysis methodology.
- [Abstract] Notation: BASS and CSF acronyms are used before full expansion in the abstract; ensure first-use definitions are consistent throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional rigor and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that improve verifiability without altering the core claims.
Point-by-point responses
- Referee: [Results] Results section: The retention figures (82% at 1%, 92% at 3%, 97% at 5%) and benchmark improvements are stated without error bars, standard deviations across runs, statistical tests, or detailed baseline configurations (e.g., exact uniform sampling implementation and VLM backbone variants), undermining verification of the headline performance claims.
Authors: We agree that error bars and statistical details enhance credibility. LLMind is largely deterministic given fixed inputs, but CSF introduces minor stochasticity. In revision we will report means and standard deviations over 5 random seeds for all headline numbers, add a statistical significance test (paired t-test) against uniform sampling, and explicitly describe the uniform baseline (random pixel subsampling without replacement) plus the exact VLM variants (LLaVA-1.5-7B, etc.) used. revision: yes
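A minimal sketch of the promised statistics, assuming per-seed accuracies are available; the numbers are placeholders, and scipy's paired t-test matches the test named above:

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-seed accuracies on the same split under both samplers.
llmind  = np.array([61.2, 61.5, 60.9, 61.3, 61.1])  # 5 random seeds
uniform = np.array([41.0, 41.4, 40.7, 41.2, 40.9])

print(f"LLMind  {llmind.mean():.2f} +/- {llmind.std(ddof=1):.2f}")
print(f"Uniform {uniform.mean():.2f} +/- {uniform.std(ddof=1):.2f}")

# Paired test: the same seeds/splits are scored under both conditions.
t, p = ttest_rel(llmind, uniform)
print(f"t = {t:.2f}, p = {p:.1e}")
```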
- Referee: [Method (BASS)] BASS module description: The Mobius parameterization for non-uniform sampling is presented as a new construct without derivation from the paper's own equations or ablation on parameter sensitivity; this leaves open whether it systematically under-samples spatially diffuse elements (e.g., counting or spatial-relation queries) at 1% budgets, as the skeptic concern notes.
Authors: The Mobius parameterization is derived from the cortical magnification equation M(r) = 1/(1 + k*r) where r is eccentricity; we will insert a short derivation subsection (new Eq. 3-5) showing the closed-form mapping from uniform to non-uniform coordinates. We will also add a parameter-sensitivity ablation varying k and the scaling factor. On diffuse elements, VQAv2 and A-OKVQA contain counting/spatial questions; the 82% retention at 1% already reflects performance on these, and we will add a short discussion noting that no systematic drop was observed. revision: partial
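For intuition, a density-matched radial warp under the stated law M(r) = 1/(1 + k*r) has a closed form. The sketch below is our reconstruction of such a mapping, not the paper's Eq. 3-5:

```python
import numpy as np

def radial_warp(u, k=8.0, R=1.0):
    """Map uniform radii u in [0, 1] to eccentricities r in [0, R] so that
    sample density follows M(r) = 1 / (1 + k*r).

    Normalizing M over [0, R] gives the CDF F(r) = ln(1 + k*r) / ln(1 + k*R);
    inverting F(r) = u yields the expression below, which packs samples
    densely at small eccentricity (the fovea) and sparsely at the periphery.
    """
    return ((1.0 + k * R) ** u - 1.0) / k

r = radial_warp(np.linspace(0.0, 1.0, 10), k=8.0)  # denser near r = 0
```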
- Referee: [Experiments] Evaluation protocol: No per-question-type or per-scene-complexity breakdowns are provided to test the assumption that CSF feedback preserves all task-critical details; aggregate averages alone cannot rule out bias against distributed scene information.
Authors: We concur that category-level analysis is needed. In the revised manuscript we will add a new table (or figure) breaking down VQAv2 and Seed-Bench results by question type (counting, spatial, color, object, etc.) and by scene complexity (simple vs. cluttered). This will directly test whether CSF maintains accuracy on distributed-information queries. revision: yes
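The promised breakdown reduces to a group-by over per-question records; a toy sketch with placeholder data:

```python
import pandas as pd

# Placeholder per-question records: question type and correctness (0/1)
# under each sampler; real records would come from the benchmark harness.
df = pd.DataFrame({
    "qtype":   ["counting", "spatial", "color", "object", "counting", "spatial"],
    "llmind":  [1, 1, 1, 1, 0, 1],
    "uniform": [0, 1, 1, 0, 0, 0],
})

# Accuracy by question type: the table the revision commits to.
print(df.groupby("qtype")[["llmind", "uniform"]].mean())
```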
- Referee: [Ablations] Ablation and comparison: The manuscript lacks ablations isolating BASS versus CSF contributions or comparisons to other non-uniform sampling methods beyond uniform baselines, making it impossible to attribute gains specifically to the bio-inspired components.
Authors: We will add a dedicated ablation subsection comparing four variants: uniform, BASS-only, CSF-only, and full LLMind on all three benchmarks. We will also include two additional non-uniform baselines (saliency-map sampling from a pre-trained model and standard foveated grid sampling) to allow direct attribution of gains to the bio-inspired Mobius + closed-loop design. revision: yes
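Both added baselines are straightforward to specify; a sketch, with the saliency map standing in for a pre-trained model's output:

```python
import numpy as np

def sample_pixels(h, w, budget, saliency=None, rng=np.random.default_rng(0)):
    """Draw a pixel-budget subset without replacement.

    saliency=None       -> the uniform baseline (random subsampling).
    saliency=(h, w) map -> the saliency-weighted non-uniform baseline:
                           pixels drawn with probability proportional to
                           their (assumed non-negative) saliency value.
    """
    n = int(budget * h * w)
    p = None
    if saliency is not None:
        flat = saliency.ravel().clip(min=0)
        p = flat / flat.sum()
    idx = rng.choice(h * w, size=n, replace=False, p=p)
    return np.stack(np.unravel_index(idx, (h, w)), axis=1)  # (n, 2) coords

# Uniform vs. saliency-weighted sampling at a 1% budget on a 336x336 image.
sal = np.random.default_rng(1).random((336, 336))
uniform_coords = sample_pixels(336, 336, 0.01)
weighted_coords = sample_pixels(336, 336, 0.01, saliency=sal)
```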
Circularity Check
No circularity: BASS and CSF introduced as novel constructs, claims rest on empirical evaluation
full rationale
The paper proposes LLMind as a training-free framework with two new modules: Bio-inspired Adaptive Sampling Strategy (BASS) using a Mobius-parameterized non-uniform sampler, and closed-loop semantic feedback (CSF) for test-time alignment. These are presented as original designs inspired by human vision rather than derived from any prior equations, fitted parameters, or self-citations within the paper. Retention figures (82%/92%/97% at 1%/3%/5% pixels) and benchmark gains are stated as experimental outcomes on VQAv2, Seed-Bench, and A-OKVQA, not as quantities forced by construction from the sampling equations themselves. No load-bearing step reduces to self-definition, renaming of known results, or uniqueness imported via author citation. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Mobius transformation parameters (the a, b, c, d in z = (aw + b)/(cw + d))
axioms (1)
- domain assumption: Human vision employs foveated encoding and cortical magnification for adaptive, resource-efficient perception
invented entities (2)
- BASS (Bio-inspired Adaptive Sampling Strategy) module: no independent evidence
- CSF (closed-loop semantic feedback) mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Möbius transformation ... z = (a w + b)/(c w + d) ... BASS module ... closed-loop semantic feedback (CSF) via SPSA"
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "cortical magnification ... foveated encoding ... non-uniform sampling while preserving global scene structure"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.