Recognition: 2 Lean theorem links
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
Pith reviewed 2026-05-15 10:27 UTC · model grok-4.3
The pith
Bio-inspired non-uniform sampling lets vision-language models keep 82-97% of full accuracy using only 1-5% of image pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMind achieves adaptive visual representations for frozen VLMs through a Bio-inspired Adaptive Sampling Strategy (BASS): a Mobius-parameterized module performs non-uniform sampling while preserving global structure, and closed-loop semantic feedback aligns the sampling with textual task information at test time. This retains up to 82%, 92%, and 97% of full-resolution performance with 1%, 3%, and 5% of the pixels, respectively, and yields average gains of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA over uniform sampling at tight budgets.
What carries the argument
A Mobius-parameterized non-uniform sampling module, the Bio-inspired Adaptive Sampling Strategy (BASS), combined with closed-loop semantic feedback from the frozen VLM that adjusts perceptual saliency.
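To make the mechanism concrete, here is a minimal sketch of Mobius-warped sampling: a uniform grid is pushed through z = (aw + b)/(cw + d), which concentrates samples near a foveal fixed point while keeping the periphery covered. The parameter values, the unit-square coordinate frame, and the nearest-pixel rounding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mobius_sample(image, budget=0.01, a=1.0, b=0.0, c=0.35, d=1.0):
    """Pick a pixel-budget subset of `image` through a Mobius-warped grid.

    A uniform complex grid w over [-1, 1]^2 is mapped through
    z = (a*w + b) / (c*w + d), a Mobius transformation; for the default
    (assumed) parameters this densifies samples near the image center.
    """
    h, w_px = image.shape[:2]
    n = max(1, int(budget * h * w_px))       # pixel budget
    side = int(np.ceil(np.sqrt(n)))          # near-square sampling grid

    u = np.linspace(-1.0, 1.0, side)
    grid = (u[None, :] + 1j * u[:, None]).ravel()[:n]

    z = (a * grid + b) / (c * grid + d)      # ad - bc != 0 keeps it invertible

    # Map back to pixel indices; clipping catches points warped off-image.
    xs = np.clip(np.round((z.real + 1) / 2 * (w_px - 1)).astype(int), 0, w_px - 1)
    ys = np.clip(np.round((z.imag + 1) / 2 * (h - 1)).astype(int), 0, h - 1)
    return np.stack([ys, xs], axis=1)        # (n, 2) sampled (row, col) pairs

coords = mobius_sample(np.zeros((336, 336, 3)), budget=0.01)  # 1% budget
```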
If this is right
- Existing VLMs can operate under severe pixel or bandwidth constraints while retaining most task performance without retraining.
- Plug-and-play deployment is possible across any frozen VLM architecture since no model weights or layers are modified.
- Non-uniform sampling outperforms uniform reduction at every tested budget, showing that adaptive allocation is the key efficiency lever.
- The method scales to both scene-level and region-guided VQA, indicating broad applicability within current benchmarks.
Where Pith is reading between the lines
- Real-time mobile or edge VLM applications could reduce latency and power use by routing only the adaptively sampled pixels through the vision encoder.
- Extending the feedback loop to incorporate user-provided text prompts before sampling might further improve task alignment in interactive settings.
- The same non-uniform principle could be tested on video inputs by adding temporal consistency constraints to the Mobius sampling across frames.
Load-bearing premise
The sampling decisions driven by Mobius parameterization and semantic feedback will always identify and preserve the exact visual details required for each task without systematic omissions or biases that uniform sampling would avoid.
What would settle it
Run the same VQA benchmarks on a new dataset containing questions whose answers depend on fine details distributed uniformly across low-saliency background regions; if accuracy falls below uniform sampling at the 1-5% pixel budgets, the claim fails.
Original abstract
Vision-Language Models (VLMs) typically assume a uniform spatial fidelity across the entire field of view of visual inputs, dedicating equal precision to even the uninformative regions. By contrast, human vision is neither uniform nor static; it is adaptive, selective, and resource-efficient. In light of this, we present the first systematic analysis of bio-inspired visual representation methods, providing insights for more efficient and adaptive VLMs. We propose LLMind (Looking Like the Mind), a novel training-free framework that mimics foveated encoding and cortical magnification in human vision to achieve adaptive, efficient representations for VLMs under tight pixel budgets. Our key idea is to explore a Bio-inspired Adaptive Sampling Strategy (BASS), enabling a Mobius-parameterized module that performs non-uniform sampling while preserving global scene structure. On top of BASS, we introduce closed-loop semantic feedback (CSF) via test-time adaptation to align perceptual saliency with textual information from the frozen VLM. We evaluate LLMind against uniform and other sampling baselines across diverse scene-level and region-guided visual question answering benchmarks. The results show dramatic gains, with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA compared to uniform sampling under tight pixel budgets. More surprisingly, LLMind retains up to 82%, 92%, and 97% of the full-resolution performance using only 1%, 3%, and 5% of the pixels, respectively. Moreover, LLMind is lightweight, plug-and-play, and compatible with existing VLMs without requiring architectural changes.
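As a rough illustration of the closed-loop feedback, the sketch below adapts sampler parameters at test time with SPSA (simultaneous perturbation stochastic approximation), which the paper's excerpts associate with CSF. The parameter vector and the score function are stand-ins; in the method proper the score would come from the frozen VLM's response to the textual query.

```python
import numpy as np

def spsa_step(theta, score_fn, a=0.05, c=0.1, rng=np.random):
    """One SPSA ascent step on a black-box score.

    The gradient is estimated from just two evaluations under a random
    +/- perturbation, so no backpropagation through the VLM is needed.
    """
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher directions
    g_hat = (score_fn(theta + c * delta) - score_fn(theta - c * delta)) / (2 * c) * delta
    return theta + a * g_hat

# Usage: adapt hypothetical Mobius parameters (a, b, c, d) at test time.
theta = np.array([1.0, 0.0, 0.35, 1.0])
score = lambda t: -np.sum((t - 1.0) ** 2)  # placeholder for a VLM answer score
for _ in range(20):
    theta = spsa_step(theta, score)
```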
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LLMind, a training-free framework for adaptive visual representations in VLMs that mimics human foveated vision and cortical magnification via a Bio-inspired Adaptive Sampling Strategy (BASS) using Mobius-parameterized non-uniform sampling, augmented by closed-loop semantic feedback (CSF) for test-time alignment with VLM textual queries. It claims substantial efficiency gains on VQA benchmarks, retaining 82%/92%/97% of full-resolution performance at 1%/3%/5% pixel budgets with average improvements of +20% on VQAv2, +38% on Seed-Bench, and +37% on A-OKVQA over uniform sampling.
Significance. If the central claims hold under rigorous verification, the work offers a lightweight, plug-and-play approach to reducing pixel budgets in VLMs without retraining, which could meaningfully improve inference efficiency for resource-constrained deployments. The bio-inspired framing and training-free design are strengths that distinguish it from parameter-heavy alternatives, though the absence of machine-checked elements or reproducible code limits immediate impact.
major comments (4)
- [Results] Results section: The retention figures (82% at 1%, 92% at 3%, 97% at 5%) and benchmark improvements are stated without error bars, standard deviations across runs, statistical tests, or detailed baseline configurations (e.g., exact uniform sampling implementation and VLM backbone variants), undermining verification of the headline performance claims.
- [Method (BASS)] BASS module description: The Mobius parameterization for non-uniform sampling is presented as a new construct without derivation from the paper's own equations or ablation on parameter sensitivity; this leaves open whether it systematically under-samples spatially diffuse elements (e.g., counting or spatial-relation queries) at 1% budgets, as the skeptic concern notes.
- [Experiments] Evaluation protocol: No per-question-type or per-scene-complexity breakdowns are provided to test the assumption that CSF feedback preserves all task-critical details; aggregate averages alone cannot rule out bias against distributed scene information.
- [Ablations] Ablation and comparison: The manuscript lacks ablations isolating BASS versus CSF contributions or comparisons to other non-uniform sampling methods beyond uniform baselines, making it impossible to attribute gains specifically to the bio-inspired components.
minor comments (2)
- [Abstract] Abstract: The claim of providing 'the first systematic analysis' of bio-inspired methods is not supported by explicit enumeration of prior works analyzed or the analysis methodology.
- [Abstract] Notation: BASS and CSF acronyms are used before full expansion in the abstract; ensure first-use definitions are consistent throughout.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional rigor and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that improve verifiability without altering the core claims.
Point-by-point responses
- Referee: [Results] Results section: The retention figures (82% at 1%, 92% at 3%, 97% at 5%) and benchmark improvements are stated without error bars, standard deviations across runs, statistical tests, or detailed baseline configurations (e.g., exact uniform sampling implementation and VLM backbone variants), undermining verification of the headline performance claims.
Authors: We agree that error bars and statistical details enhance credibility. LLMind is largely deterministic given fixed inputs, but CSF introduces minor stochasticity. In revision we will report means and standard deviations over 5 random seeds for all headline numbers, add a statistical significance test (paired t-test) against uniform sampling, and explicitly describe the uniform baseline (random pixel subsampling without replacement) plus the exact VLM variants (LLaVA-1.5-7B, etc.) used. revision: yes
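A minimal sketch of the promised statistics, assuming per-seed accuracies are available; the numbers are placeholders, and scipy's paired t-test matches the test named above:

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-seed accuracies on the same split under both samplers.
llmind  = np.array([61.2, 61.5, 60.9, 61.3, 61.1])  # 5 random seeds
uniform = np.array([41.0, 41.4, 40.7, 41.2, 40.9])

print(f"LLMind  {llmind.mean():.2f} +/- {llmind.std(ddof=1):.2f}")
print(f"Uniform {uniform.mean():.2f} +/- {uniform.std(ddof=1):.2f}")

# Paired test: the same seeds/splits are scored under both conditions.
t, p = ttest_rel(llmind, uniform)
print(f"t = {t:.2f}, p = {p:.1e}")
```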
- Referee: [Method (BASS)] BASS module description: The Mobius parameterization for non-uniform sampling is presented as a new construct without derivation from the paper's own equations or ablation on parameter sensitivity; this leaves open whether it systematically under-samples spatially diffuse elements (e.g., counting or spatial-relation queries) at 1% budgets, as the skeptic concern notes.
Authors: The Mobius parameterization is derived from the cortical magnification equation M(r) = 1/(1 + k*r) where r is eccentricity; we will insert a short derivation subsection (new Eq. 3-5) showing the closed-form mapping from uniform to non-uniform coordinates. We will also add a parameter-sensitivity ablation varying k and the scaling factor. On diffuse elements, VQAv2 and A-OKVQA contain counting/spatial questions; the 82% retention at 1% already reflects performance on these, and we will add a short discussion noting that no systematic drop was observed. revision: partial
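For intuition, a density-matched radial warp under the stated law M(r) = 1/(1 + k*r) has a closed form. The sketch below is our reconstruction of such a mapping, not the paper's Eq. 3-5:

```python
import numpy as np

def radial_warp(u, k=8.0, R=1.0):
    """Map uniform radii u in [0, 1] to eccentricities r in [0, R] so that
    sample density follows M(r) = 1 / (1 + k*r).

    Normalizing M over [0, R] gives the CDF F(r) = ln(1 + k*r) / ln(1 + k*R);
    inverting F(r) = u yields the expression below, which packs samples
    densely at small eccentricity (the fovea) and sparsely at the periphery.
    """
    return ((1.0 + k * R) ** u - 1.0) / k

r = radial_warp(np.linspace(0.0, 1.0, 10), k=8.0)  # denser near r = 0
```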
- Referee: [Experiments] Evaluation protocol: No per-question-type or per-scene-complexity breakdowns are provided to test the assumption that CSF feedback preserves all task-critical details; aggregate averages alone cannot rule out bias against distributed scene information.
Authors: We concur that category-level analysis is needed. In the revised manuscript we will add a new table (or figure) breaking down VQAv2 and Seed-Bench results by question type (counting, spatial, color, object, etc.) and by scene complexity (simple vs. cluttered). This will directly test whether CSF maintains accuracy on distributed-information queries. revision: yes
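The promised breakdown reduces to a group-by over per-question records; a toy sketch with placeholder data:

```python
import pandas as pd

# Placeholder per-question records: question type and correctness (0/1)
# under each sampler; real records would come from the benchmark harness.
df = pd.DataFrame({
    "qtype":   ["counting", "spatial", "color", "object", "counting", "spatial"],
    "llmind":  [1, 1, 1, 1, 0, 1],
    "uniform": [0, 1, 1, 0, 0, 0],
})

# Accuracy by question type: the table the revision commits to.
print(df.groupby("qtype")[["llmind", "uniform"]].mean())
```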
- Referee: [Ablations] Ablation and comparison: The manuscript lacks ablations isolating BASS versus CSF contributions or comparisons to other non-uniform sampling methods beyond uniform baselines, making it impossible to attribute gains specifically to the bio-inspired components.
Authors: We will add a dedicated ablation subsection comparing four variants: uniform, BASS-only, CSF-only, and full LLMind on all three benchmarks. We will also include two additional non-uniform baselines (saliency-map sampling from a pre-trained model and standard foveated grid sampling) to allow direct attribution of gains to the bio-inspired Mobius + closed-loop design. revision: yes
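Both added baselines are straightforward to specify; a sketch, with the saliency map standing in for a pre-trained model's output:

```python
import numpy as np

def sample_pixels(h, w, budget, saliency=None, rng=np.random.default_rng(0)):
    """Draw a pixel-budget subset without replacement.

    saliency=None       -> the uniform baseline (random subsampling).
    saliency=(h, w) map -> the saliency-weighted non-uniform baseline:
                           pixels drawn with probability proportional to
                           their (assumed non-negative) saliency value.
    """
    n = int(budget * h * w)
    p = None
    if saliency is not None:
        flat = saliency.ravel().clip(min=0)
        p = flat / flat.sum()
    idx = rng.choice(h * w, size=n, replace=False, p=p)
    return np.stack(np.unravel_index(idx, (h, w)), axis=1)  # (n, 2) coords

# Uniform vs. saliency-weighted sampling at a 1% budget on a 336x336 image.
sal = np.random.default_rng(1).random((336, 336))
uniform_coords = sample_pixels(336, 336, 0.01)
weighted_coords = sample_pixels(336, 336, 0.01, saliency=sal)
```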
Circularity Check
No circularity: BASS and CSF introduced as novel constructs, claims rest on empirical evaluation
full rationale
The paper proposes LLMind as a training-free framework with two new modules: Bio-inspired Adaptive Sampling Strategy (BASS) using a Mobius-parameterized non-uniform sampler, and closed-loop semantic feedback (CSF) for test-time alignment. These are presented as original designs inspired by human vision rather than derived from any prior equations, fitted parameters, or self-citations within the paper. Retention figures (82%/92%/97% at 1%/3%/5% pixels) and benchmark gains are stated as experimental outcomes on VQAv2, Seed-Bench, and A-OKVQA, not as quantities forced by construction from the sampling equations themselves. No load-bearing step reduces to self-definition, renaming of known results, or uniqueness imported via author citation. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Mobius transformation parameters (the a, b, c, d in z = (aw + b)/(cw + d))
axioms (1)
- domain assumption: Human vision employs foveated encoding and cortical magnification for adaptive, resource-efficient perception
invented entities (2)
- BASS (Bio-inspired Adaptive Sampling Strategy) module: no independent evidence
- CSF (closed-loop semantic feedback) mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Möbius transformation ... z = (a w + b)/(c w + d) ... BASS module ... closed-loop semantic feedback (CSF) via SPSA"
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "cortical magnification ... foveated encoding ... non-uniform sampling while preserving global scene structure"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.