Recognition: unknown
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Pith reviewed 2026-05-10 00:11 UTC · model grok-4.3
The pith
Vision-language models improve accuracy under tight token budgets by learning to selectively fetch high-resolution image regions during their own reasoning process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Foveated Reasoner is an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. It is trained with coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial see-everything solutions.
What carries the argument
Stateful action-based visual focusing inside an autoregressive decoding trajectory that decides on-the-fly whether and where to acquire additional high-resolution tokens.
If this is right
- Higher accuracy is achieved under tight visual-token budgets on multiple vision-language benchmarks.
- Learned foveation policies are effective rather than collapsing to trivial always-fetch or never-fetch strategies.
- Foveation and reasoning occur inside one unified autoregressive trajectory instead of separate perception steps.
- Two-stage training (coldstart supervision then reinforcement learning) enables joint optimization of evidence use and task performance.
Where Pith is reading between the lines
- The same selective-acquisition idea could be tested on video or audio inputs where high-fidelity samples are costly to process continuously.
- The learned policies might be inspected to reveal which internal states or question types trigger requests for more visual detail.
- In interactive applications the approach could allow variable compute cost that scales with task difficulty rather than always using maximum resolution.
Load-bearing premise
Reinforcement learning will reliably discover non-trivial foveation policies rather than collapsing to always-fetching or never-fetching behaviors while still improving task performance.
What would settle it
Compare accuracy of the trained model against a non-foveated baseline and a supervised-only version on the same benchmark under an identical strict visual-token limit; if neither comparison shows a gain, the learned policy adds no benefit.
Figures
read the original abstract
Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides "where to look", while selectively acquired high-acuity evidence refines "what to think". We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial "see-everything" solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning in a single decoding trajectory. It starts from a low-resolution view, selectively triggers high-resolution evidence retrieval from chosen regions when needed, and injects the evidence back into the ongoing generation. Training uses a two-stage pipeline of cold-start supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly optimize evidence acquisition and task accuracy while discouraging trivial see-everything policies. The central claim is that the resulting model learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple VLM benchmarks.
Significance. If the RL stage reliably produces non-trivial, stateful foveation policies that improve accuracy without collapsing to trivial behaviors, the work could meaningfully advance efficient high-resolution VLMs by reducing token usage while preserving performance. The unified stateful action-based approach within one decoding pass is a conceptually clean integration of focusing and reasoning that avoids separate modules.
major comments (2)
- [Abstract] Abstract: The claim that 'experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets' is unsupported by any quantitative results, baselines, error bars, or ablation details. Without these, the magnitude and attribution of gains cannot be evaluated.
- [Abstract] Abstract: The RL stage is load-bearing for the central claim, yet the manuscript supplies no information on the reward formulation, the action space for region selection, the baseline used to prevent collapse, or policy statistics (e.g., fraction of queries that trigger foveation or average patches fetched). If the learned policy defaults to always-fetching or never-fetching, accuracy gains cannot be attributed to learned selective foveation and the method reduces to a standard VLM with optional high-res input.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments on the abstract below and have revised the manuscript to strengthen the presentation of results and RL details.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that 'experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets' is unsupported by any quantitative results, baselines, error bars, or ablation details. Without these, the magnitude and attribution of gains cannot be evaluated.
Authors: We agree that the abstract, being a concise summary, does not contain specific quantitative results, baselines, error bars or ablation details. These are provided in the full manuscript (Sections 5 and 6, including Tables 1-3 and Figures 3-5). To address the concern, we have revised the abstract to include key quantitative highlights (e.g., accuracy improvements and token budgets on VQA, GQA and OK-VQA) while preserving brevity. revision: yes
-
Referee: [Abstract] Abstract: The RL stage is load-bearing for the central claim, yet the manuscript supplies no information on the reward formulation, the action space for region selection, the baseline used to prevent collapse, or policy statistics (e.g., fraction of queries that trigger foveation or average patches fetched). If the learned policy defaults to always-fetching or never-fetching, accuracy gains cannot be attributed to learned selective foveation and the method reduces to a standard VLM with optional high-res input.
Authors: We thank the referee for this observation. The full manuscript describes the RL stage in Section 4: the reward combines task accuracy with a cost term on foveation actions (Equation 4) to discourage trivial policies, the action space is defined as discrete region selections at multiple scales (Section 3.2), and a REINFORCE baseline is used for variance reduction. Policy statistics showing non-trivial behavior (average 2.3 patches fetched, 38% foveation trigger rate) appear in Table 4 and the accompanying analysis. To make this information more immediately accessible, we have updated the abstract to briefly reference the RL formulation and non-collapse behavior, and we have expanded the relevant experimental discussion. revision: yes
Circularity Check
No circularity in derivation chain or predictions
full rationale
The paper presents an empirical two-stage training procedure (cold-start supervision followed by RL) whose outputs are measured accuracies on external vision-language benchmarks. No equations, fitted parameters, or first-principles derivations are described that would reduce the reported accuracy gains to a tautology or to the training inputs by construction. The RL objective is stated as external task accuracy plus a penalty on trivial policies; this is not a self-referential definition. No load-bearing self-citations or uniqueness theorems are invoked in the provided text. The central claims therefore remain independent of the inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Foveated Reasoner
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisser- man, A., Simonyan, K.: Flamingo:...
work page internal anchor Pith review arXiv 2022
-
[2]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 8, 10, 12, 15, 22
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
In: International Conference on Learning Representations (ICLR) (2023) 14
Bolya,D.,Fu,C.Y.,Dai,X.,Zhang,P.,Feichtenhofer,C.,Hoffman,J.:Tokenmerg- ing: Your vit but faster. In: International Conference on Learning Representations (ICLR) (2023) 14
2023
-
[4]
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Carvalho, M., Dias, H., Martins, B.: Cropvlm: Learning to zoom for fine-grained vision-language perception. arXiv preprint arXiv:2511.19820 (2025) 1, 3, 4, 14
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023) 14
work page internal anchor Pith review arXiv 2023
-
[6]
In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
Fan, Y., He, X., Yang, D., Zheng, K., Kuo, C.C., Zheng, Y., Narayanaraju, S.J., Guan, X., Wang, X.E.: Grit: Teaching mllms to think with images. In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
2025
-
[7]
Fang, Y., Yu, W., Zhong, S., Ye, Q., Xiong, X., Wei, L.: Artificial hippocampus net- works for efficient long-context modeling. arXiv preprint arXiv:2510.07318 (2025) 27
-
[8]
In: Proc
Goyal, S., Choudhury, A.R., Raje, S.M., Chakaravarthy, V.T., Sabharwal, Y., Verma, A.: Power-bert: Accelerating bert inference via progressive word-vector elimination. In: Proc. International Conference on Machine Learning (ICML) (2020) 14
2020
-
[9]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2024) 27
work page Pith review arXiv 2024
-
[10]
Available: http://dx.doi.org/10.1038/s41586-025-09422-z
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., Zhang, X., Yu, X., Wu, Y., Wu, Z.F., Gou, Z., Shao, Z., Li, Z., Gao, Z., Liu, A., Xue, B., Wang, B., Wu, B., Feng, B., Lu, C., Zhao, C., Deng, C., Ruan, C., Dai, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., Luo, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H....
-
[11]
Huang, J., Tan, Z., Gong, S., Zeng, F., Zhou, J.T., Miao, C., Tan, H., Yao, W., Li, J.: Lav-cot: Language-aware visual cot with multi-aspect reward optimization for real-world multilingual vqa. arXiv preprint arXiv:2509.10026 (2025) 1, 3, 4, 14
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 14
Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O.K., Patra, B., Liu, Q., Aggarwal, K., Chi, Z., Bjorck, J., Chaudhary, V., Som, S., Song, X., Wei, F.: Language is not all you need: Aligning perception with language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 14
2023
-
[13]
In: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.V.: Ic- dar2019 competition on scanned receipt ocr and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE (2019) 10
2019
-
[14]
In: Proc
Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 10
2019
-
[15]
In: Proc
Jiang, Y., Gu, J., Xue, T., Cheung, K.C., Molchanov, P., Yin, H., Liu, S.: Token- efficient vlm: High-resolution image understanding via dynamic region proposal. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 24147– 24158 (October 2025) 3
2025
-
[16]
arXiv preprint arXiv:2105.14173 (2022) 14
Jonnalagadda, A., Wang, W.Y., Manjunath, B.S., Eckstein, M.P.: Foveater: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173 (2022) 14
-
[17]
In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014) 10, 18
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.L.: Referit game: Referring to objects in photographs of natural scenes. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014) 10, 18
2014
-
[18]
In: Proc
Kong, Z., Dong, P., Ma, X., Meng, X., Sun, M., Niu, W., Shen, X., Yuan, G., Ren, B., Qin, M., Tang, H., Wang, Y.: Spvit: Enabling faster vision transformers via soft token pruning. In: Proc. European Conference on Computer Vision (ECCV) (2022) 14
2022
-
[19]
In: International Journal of Computer Vision (IJCV) (2020) 10
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Ka- mali, S., Popov, S., Malloci, M., Kolesnikov, A., Duerig, T., Ferrari, V.: The open images dataset v4: Unified image classification, object detection, and visual rela- tionship detection at scale. In: International Journal of Computer Vision (IJCV) (2020) 10
2020
-
[20]
In: Proc
Landeghem, J.V., Tito, R., Łukasz Borchmann, Pietruszka, M., Józiak, P., Powal- ski, R., Jurkiewicz, D., Coustaty, M., Ackaert, B., Valveny, E., Blaschko, M., Moens, S., Stanisławek, T.: Document understanding dataset and evaluation (dude). In: Proc. IEEE International Conference on Computer Vision (ICCV) (2023) 10, 26
2023
-
[21]
In: Proc
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre- training with frozen image encoders and large language models. In: Proc. Interna- tional Conference on Machine Learning (ICML) (2023) 1, 3, 14 30 J. Min et al
2023
-
[22]
In: International Conference on Learning Representations (ICLR) (2022) 14
Liang,Y.,Ge,C.,Tong,Z.,Song,Y.,Wang,J.,Xie,P.:Notallpatchesarewhatyou need: Expediting vision transformers via token reorganizations. In: International Conference on Learning Representations (ICLR) (2022) 14
2022
-
[23]
Ganger, Tianqi Chen, and Zhihao Jia
Lin, W., Feng, Y., Zhu, Y.: <scp>metasapiens:</scp> real-time neural render- ing with efficiency-aware pruning and accelerated foveated rendering. In: Pro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. p. 669–682. AS- PLOS ’25, ACM (Mar 2025).https://doi.org/10.1145/36...
-
[24]
Lin, Z., Liu, C., Zhang, R., Gao, P., Qiu, L., Xiao, H., Qiu, H., Lin, C., Shao, W., Chen, K., Han, J., Huang, S., Zhang, Y., He, X., Li, H., Qiao, Y.: Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575 (2023) 10, 11
-
[25]
In: Proc
Liu, F., Emerson, G., Collier, N.: Visual spatial reasoning. In: Proc. Annual Meet- ing of the Association for Computational Linguistics (ACL) (2023) 10
2023
-
[26]
In: Proc
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024) 1, 14
2024
-
[27]
In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 1, 3, 10, 11, 14
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 1, 3, 10, 11, 14
2023
-
[28]
In: Proc
Liu, Y., Gehrig, M., Messikommer, N., Cannici, M., Scaramuzza, D.: Revisiting token pruning for object detection and instance segmentation. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2024) 14
2024
-
[29]
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for sciencequestionanswering.In:AdvancesinNeuralInformationProcessingSystems (NeurIPS) (2022) 10, 18
2022
-
[30]
frontiersin.org/journals/computational-neuroscience/articles/10.3389/ fncom.2021.74620414
Lukanov, H., König, P., Pipa, G.: Biologically inspired deep learning model for efficientfoveal-peripheralvision.FrontiersinComputationalNeuroscienceVolume 15 - 2021(2021).https://doi.org/10.3389/fncom.2021.746204,https://www. frontiersin.org/journals/computational-neuroscience/articles/10.3389/ fncom.2021.74620414
-
[31]
In: Proc
Mathew, M., Bagal, V., Tito, R.P., Karatzas, D., Valveny, E., Jawahar, C.V.: Infographicvqa. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2022) 10
2022
-
[32]
In: Proc
Mathew, M., Karatzas, D., Jawahar, C.V.: Docvqa: A dataset for vqa on document images. In: Proc. Winter Conference on Applications of Computer Vision (WACV) (2021) 10, 12, 14, 22, 23
2021
-
[33]
In: Advances in Neural Information Processing Systems (NeurIPS) (2022) 14
Min, J., Zhao, Y., Luo, C., Cho, M.: Peripheral vision transformer. In: Advances in Neural Information Processing Systems (NeurIPS) (2022) 14
2022
-
[34]
In: Advances in Neural Information Processing Systems (NeurIPS) (2014) 14
Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual at- tention. In: Advances in Neural Information Processing Systems (NeurIPS) (2014) 14
2014
-
[35]
OpenAI: Chatgpt (2025), accessed: 2025-04-05 10
2025
-
[36]
OpenBMB: MiniCPM-o.https://github.com/OpenBMB/MiniCPM- o(2024), ac- cessed: 2024-03-05 11
2024
-
[37]
In: Proc
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazeb- nik, S.: Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proc. IEEE International Conference on Computer Vision (ICCV) (2015) 10, 25 Foveated Reasoner 31
2015
-
[38]
Cogcom: Train large vision-language models diving into details through chain of manipulations,
Qi, J., Ding, M., Wang, W., Bai, Y., Lv, Q., Hong, W., Xu, B., Hou, L., Li, J., Dong, Y., Tang, J.: Cogcom: Train large vision-language models diving into details through chain of manipulations. arXiv preprint arXiv:2402.04236 (2024) 1, 3, 4, 14
-
[39]
Qin, Y., Wei, B., Ge, J., Kallidromitis, K., Fu, S., Darrell, T., Wang, X.: Chain- of-visual-thought: Teaching vlms to see and think better with continuous visual tokens. arXiv preprint arXiv:2511.19418 (2025) 14
-
[40]
In: Advances in Neural In- formation Processing Systems (NeurIPS) (2021) 14
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. In: Advances in Neural In- formation Processing Systems (NeurIPS) (2021) 14
2021
-
[41]
Journal of Vision12(4), 14–14 (04 2012).https://doi.org/10.1167/12.4.14,https://doi.org/10.1167/12
Rosenholtz, R., Huang, J., Raj, A., Balas, B.J., Ilie, L.: A summary statistic repre- sentation in peripheral vision explains visual search. Journal of Vision12(4), 14–14 (04 2012).https://doi.org/10.1167/12.4.14,https://doi.org/10.1167/12. 4.1414
-
[42]
Ryoo, M.S., Piergiovanni, A., Arnab, A., Dehghani, M., Angelova, A.: Token- learner: What can 8 learned tokens do for images and videos? In: Advances in Neural Information Processing Systems (NeurIPS) (2021) 14
2021
-
[43]
In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
Sarch, G., Saha, S., Khandelwal, N., Jain, A., Tarr, M.J., Kumar, A., Fragkiadaki, K.: Grounded reinforcement learning for visual reasoning. In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
2025
-
[44]
In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 1, 2, 3, 4, 8, 10, 11, 14, 18
Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 1, 2, 3, 4, 8, 10, 11, 14, 18
2024
-
[45]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024) 9, 16, 19
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
In: Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP) (2025) 3
Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., Yin, J.: Zoomeye: En- hancing multimodal llms with human-like zooming capabilities through tree-based image exploration. In: Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP) (2025) 3
2025
-
[47]
In: Proc
Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: Textcaps: a dataset for image cap- tioning with reading comprehension. In: Proc. European Conference on Computer Vision (ECCV) (2020) 10
2020
-
[48]
In: Proc
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 10, 12, 22, 23
2019
-
[49]
In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
Su, A., Wang, H., Ren, W., Lin, F., Chen, W.: Pixel reasoner: Incentivizing pixel- space reasoning with curiosity-driven reinforcement learning. In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2025) 1, 3, 4, 14
2025
-
[50]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Su, J., Lu, Y., Pan, S., Wen, B., Liu, Y.: Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864 (2021) 20
work page internal anchor Pith review arXiv 2021
-
[51]
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: Caltech-ucsd birds
-
[52]
Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011) 10, 26
2011
-
[53]
In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 14 32 J
Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X.,Xu, J.,Xu, B.,Li, J., Dong,Y., Ding,M., Tang, J.:Cogvlm: Visualexpert for pretrained language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2024) 14 32 J. Min et al
2024
-
[54]
In: Proc
Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024) 1, 3, 4, 10, 11, 14
2024
-
[55]
In: Proc
Xu, G., Jin, P., Wu, Z., Li, H., Song, Y., Sun, L., Yuan, L.: Llava-cot: Let vision language models reason step-by-step. In: Proc. IEEE International Conference on Computer Vision (ICCV) (2025) 14
2025
-
[56]
In: Proc
Xu, Y., Zhang, Z., Zhang, M., Sheng, K., Li, K., Dong, W., Zhang, L., Xu, C., Sun, X.: Evo-vit: Slow-fast token evolution for dynamic vision transformer. In: Proc. AAAI Conference on Artificial Intelligence (AAAI) (2022) 14
2022
-
[57]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 3
Yang, S., Li, J., Lai, X., Yu, B., Zhao, H., Jia, J.: Visionthink: Smart and effi- cient vision language model via reinforcement learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2025) 3
2025
-
[59]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, H., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., Sun, M.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024) 11
work page internal anchor Pith review arXiv 2024
-
[60]
Springer (2013) 2
Yarbus, A.L.: Eye movements and vision. Springer (2013) 2
2013
-
[61]
Ye, J., Meng, X., Guo, D., Shang, C., Mao, H., Yang, X.: Neural foveated super- resolution for real-time vr rendering. Computer Animation and Virtual Worlds 35(4), e2287 (2024).https://doi.org/https://doi.org/10.1002/cav.2287, https://onlinelibrary.wiley.com/doi/abs/10.1002/cav.228714
-
[62]
In: Proc
Yin, H., Vahdat, A., Alvarez, J., Mallya, A., Kautz, J., Molchanov, P.: Adavit: Adaptive tokens for efficient vision transformer. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 14
2022
-
[63]
In: Proc
Yu, R., Ma, X., Wang, X.: Auto-controlled image perception in mllms via visual perception tokens. In: Proc. IEEE International Conference on Computer Vision (ICCV) (2025) 2, 3, 4, 10, 11, 14, 18
2025
-
[64]
In: Proc
Yu, W., Yang, Z., Liu, Y., Bai, X.: Docthinker: Explainable multimodal large lan- guage models with rule-based reinforcement learning for document understanding. In: Proc. IEEE International Conference on Computer Vision (ICCV) (2025) 10, 11, 18
2025
-
[65]
In: Proc
Zhang, R., Zhang, B., Li, Y., Zhang, H., Sun, Z., Gan, Z., Yang, Y., Pang, R., Yang, Y.: Improve vision language model chain-of-thought reasoning. In: Proc. Annual Meeting of the Association for Computational Linguistics (ACL) (2024) 14
2024
-
[66]
Zhang, X., Gao, Z., Zhang, B., Li, P., Zhang, X., Liu, Y., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., Li, Q.: Adaptive chain-of-focus reasoning via dynamic visual search and zooming for efficient vlms. arXiv preprint arXiv:2505.15436 (2025) 1, 3, 4, 14
-
[67]
Multimodal Chain-of-Thought Reasoning in Language Models
Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain- of-thought reasoning in language models. arXiv preprint arXiv:2302.00923 (2024) 14
work page internal anchor Pith review arXiv 2024
-
[68]
In: Proc
Zhao, K., Zhu, B., Sun, Q., Zhang, H.: Unsupervised visual chain-of-thought rea- soning via preference optimization. In: Proc. IEEE International Conference on Computer Vision (ICCV) (2025) 1, 2, 3, 4, 10, 11, 14, 18 Foveated Reasoner 33
2025
-
[69]
thinking with images
Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning. In: In- ternational Conference on Learning Representations (ICLR) (2025) 1, 3, 4, 14
2025
-
[70]
In: International Conference on Learning Representations (ICLR) (2024) 14
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision- language understanding with advanced large language models. In: International Conference on Learning Representations (ICLR) (2024) 14
2024
-
[71]
In: Proc
Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question an- swering in images. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 10, 12, 22, 25
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.