pith. sign in

arxiv: 2605.30126 · v1 · pith:A3Q3C6L7new · submitted 2026-05-28 · 💻 cs.CV · cs.AI· cs.CL· cs.LG

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CLcs.LG
keywords vision-language modelsvisual token compressionelastic queriespool-anchored resamplingefficient inferencematryoshka representationsspatial grounding
0
0 comments X

The pith

PARCEL anchors elastic query tokens to spatial pool tokens to improve efficiency in vision-language models at multiple budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PARCEL to fix problems in elastic visual-token compression for large vision-language models, where aggressive reduction of tokens hurts performance. It sets up spatial pool tokens as low-frequency layout anchors and uses them to condition elastic query tokens so the queries capture complementary features instead of repeating spatial information. This lets one trained model run at different token counts while beating earlier matryoshka-style baselines on 27 benchmarks. The approach keeps the train-once-deploy-anywhere property and shifts the performance-efficiency trade-off forward.

Core claim

PARCEL is a visual tokenization architecture that dynamically partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on those anchors through Pool-Conditioned Query Resampling, which encourages the query tokens to focus on complementary visual features rather than redundant spatial mapping.

What carries the argument

Pool-Conditioned Query Resampling, which uses spatial pool tokens to condition elastic query tokens and partition low-frequency layout from higher-detail features.

If this is right

  • A single model can be deployed at any chosen visual-token budget without retraining.
  • Spatial grounding remains stronger than in query-only resampling under heavy compression.
  • Fine-grained detail is preserved better than in spatial-only pooling because aliasing is reduced.
  • The same architecture improves the Pareto frontier across the full range of tested token counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring idea could be tested on audio or video token sequences that also face quadratic costs.
  • Lower token budgets enabled by this method might make on-device vision-language inference practical for new hardware.
  • Future work could measure how much the pool tokens themselves can be further compressed once queries are properly conditioned.

Load-bearing premise

Spatial pool tokens will serve as effective low-frequency anchors that steer conditioned query tokens toward complementary features instead of redundant ones.

What would settle it

A head-to-head evaluation at low visual-token budgets where PARCEL fails to match or exceed matryoshka baselines on the reported benchmarks would falsify the performance claim.

read the original abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes PARCEL, a visual tokenization architecture for large vision-language models that partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on them via Pool-Conditioned Query Resampling. This is claimed to resolve representational conflicts between spatial-only and query-only compression methods, yielding consistent outperformance over matryoshka baselines on the performance-efficiency Pareto frontier across visual-token budgets on 27 benchmarks while preserving the train-once-deploy-anywhere property.

Significance. If the central mechanism is validated and the reported gains hold under scrutiny, the work would address a practical bottleneck in LVLM inference by improving token compression without sacrificing spatial grounding or requiring per-budget retraining. The emphasis on complementary rather than redundant feature extraction between pool and query tokens is a conceptually coherent attempt to advance beyond existing nested pooling and resampling approaches.

major comments (2)
  1. [Abstract] Abstract: The central claim that PARCEL 'improves the performance-efficiency Pareto frontier' and 'consistently outperforming existing matryoshka baselines' is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, preventing assessment of whether the data support the claim.
  2. [Abstract] Abstract: The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information; this mechanism is load-bearing for attributing any gains to the proposed architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and indicate planned revisions to strengthen the presentation of claims and mechanism.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that PARCEL 'improves the performance-efficiency Pareto frontier' and 'consistently outperforming existing matryoshka baselines' is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, preventing assessment of whether the data support the claim.

    Authors: We agree the abstract would be stronger with concrete quantitative anchors. The manuscript reports results across 27 benchmarks at multiple token budgets (e.g., 32–256 tokens) with consistent gains over matryoshka baselines; we will revise the abstract to include one or two representative metrics (average improvement and example budgets) while preserving length constraints. revision: yes

  2. Referee: [Abstract] Abstract: The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information; this mechanism is load-bearing for attributing any gains to the proposed architecture.

    Authors: Abstracts are high-level summaries and conventionally omit equations, diagrams, and detailed ablation references; these appear in Sections 3–5 of the manuscript. We will revise the abstract wording to more clearly signal that the complementary-feature claim is supported by the ablations and analyses presented in the body, but we do not plan to insert equations or diagrams into the abstract itself. revision: partial

Circularity Check

0 steps flagged

No circularity: new architecture proposed without self-referential derivations or fitted predictions

full rationale

The paper proposes PARCEL as a novel visual tokenization method that partitions feature extraction between pool tokens (low-frequency anchors) and conditioned elastic queries. No equations, parameter fits, or derivations are presented in the provided text that reduce the claimed performance gains to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on the architectural description and empirical evaluations across benchmarks rather than any self-definitional or fitted-input reduction. This is a standard empirical architecture paper whose derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the domain assumption that pool tokens can act as effective anchors for conditioning without introducing new representational conflicts, plus the new method entity itself.

axioms (1)
  • domain assumption Spatial pool tokens can serve as low-frequency layout anchors that resolve the conflict between spatial and query compression when used to condition elastic queries.
    Invoked directly in the abstract description of how PARCEL partitions feature extraction labor.
invented entities (1)
  • Pool-Conditioned Query Resampling no independent evidence
    purpose: To encourage query tokens to capture complementary features by conditioning them on pool anchors.
    New component introduced to address the limitations of prior compression methods.

pith-pipeline@v0.9.1-grok · 5770 in / 1251 out tokens · 36749 ms · 2026-06-29T08:11:10.280573+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

134 extracted references · 41 canonical work pages · 19 internal anchors

  1. [1]

    Acharya, K

    M. Acharya, K. Kafle, and C. Kanan. Tal- lyqa: Answering complex counting ques- tions. InProceedings of the AAAI conference on artificial intelligence, volume 33, pages 8076–8084, 2019

  2. [2]

    Agrawal, K

    H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson. Nocaps: Novel object captioning at scale. InProceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019

  3. [3]

    S. R. Alvar, G. Singh, M. Akbari, and Y. Zhang. Divprune: Diversity-based vi- sual token pruning for large multimodal models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- ternRecognition(CVPR),pages9392–9401, June 2025

  4. [4]

    X. An, Y. Xie, K. Yang, W. Zhang, X. Zhao, Z. Cheng, Y. Wang, S. Xu, C. Chen, D. Zhu, C.Wu,H.Tan,C.Li,J.Yang,J.Yu,X.Wang, B. Qin, Y. Wang, Z. Yan, Z. Feng, Z. Liu, B. Li, and J. Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training, 2025. URLhttps: //arxiv.org/abs/2509.23661

  5. [5]

    K. H. I. Arif, J. Yoon, D. S. Nikolopou- los, H. Vandierendonck, D. John, and B. Ji. Hired: Attention-guided token dropping for efficient inference of high- resolution vision-language models.Pro- ceedings of the AAAI Conference on Arti- ficial Intelligence, 39(2):1773–1781, Apr

  6. [6]

    URL https://ojs.aaai.org/index

    doi: 10.1609/aaai.v39i2.32171. URL https://ojs.aaai.org/index. php/AAAI/article/view/32171

  7. [7]

    Azulay and Y

    A. Azulay and Y. Weiss. Why do deep con- volutional networks generalize so poorly to small image transformations?Journal of Machine Learning Research, 20(184):1– 25, 2019

  8. [8]

    Bachmann, J

    R. Bachmann, J. Allardice, D. Mizrahi, E. Fini, O. F. Kar, E. Amirloo, A. El-Nouby, A. Zamir, and A. Dehghan. Flextok: Re- sampling images into 1d token sequences of flexible length. InForty-second Inter- national Conference on Machine Learning,

  9. [9]

    11 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

    URLhttps://openreview.net/ forum?id=DgdOkUUBzf. 11 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

  10. [10]

    C. Baek, J. Song, S. Kim, and K. Kong. An empirical study of attention and diversity for adaptive visual token pruning in mul- timodal reasoning models. InNeurIPS 2025 Workshop on Efficient Reasoning,

  11. [11]

    URLhttps://openreview.net/ forum?id=j2NkINd3pw

  12. [12]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zh...

  13. [13]

    I.Bello,H.Pham,Q.V.Le,M.Norouzi,and S. Bengio. Neural combinatorial optimiza- tion with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

  14. [14]

    Beyer, X

    L. Beyer, X. Zhai, and A. Kolesnikov. Better plain vit baselines for imagenet-1k.arXiv preprint arXiv:2205.01580, 2022

  15. [15]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neu- mann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  16. [16]

    A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas. Scene text visual question answering. InProceedings of the IEEE/CVF internationalconferenceoncomputervision, pages 4291–4301, 2019

  17. [17]

    R. N. Bracewell. The fourier transform. Scientific American, 260(6):86–95, 1989

  18. [18]

    Bradbury, R

    J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, Y. Katariya, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transfor- mations of Python+NumPy programs,

  19. [19]

    URL http://github.com/ jax-ml/jax

  20. [20]

    Bulat, Y

    A. Bulat, Y. Ouali, and G. Tzimiropou- los. Fwd2bot: Lvlm visual token com- pression with double forward bottleneck,

  21. [21]

    URL https://arxiv.org/abs/ 2503.21757

  22. [22]

    M. Cai, J. Yang, J. Gao, and Y. J. Lee. Matryoshka multimodal mod- els. InThe Thirteenth International Conference on Learning Representations,

  23. [23]

    URLhttps://openreview.net/ forum?id=Uhj5OxAz7I

  24. [24]

    Cappellazzo, M

    U. Cappellazzo, M. Kim, P. Ma, H. Chen, X. Liu, S. Petridis, and M. Pantic. Mome: Mixture of matryoshka experts for audio-visual speech recognition, 2025. URL https://arxiv.org/abs/2510. 04136

  25. [25]

    J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal llm. InProceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 13817–13827, June 2024

  26. [26]

    Changpinyo, D

    S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for vqa are image captions. InProceedings of the 2022 conference of the north american chapter of the associa- tion for computational linguistics: human language technologies, pages 1947–1963, 2022

  27. [27]

    Collectinghighly parallel data for paraphrase evaluation

    D.ChenandW.B.Dolan. Collectinghighly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguis- tics: human language technologies, pages 190–200, 2011

  28. [28]

    J. Chen, L. Ye, J. He, Z.-Y. Wang, D. Khashabi, and A. Yuille. Efficient large multi-modal models via visual context compression. In A. Globerson, L. Mackey, 12 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural I...

  29. [29]

    L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang. An image is worth 1/2 tokens after layer 2: Plug-and- playinferenceaccelerationforlargevision- language models. InEuropean Conference onComputerVision, pages19–35.Springer, 2024

  30. [30]

    T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton. Pix2seq: A language model- ing framework for object detection.arXiv preprint arXiv:2109.10852, 2021

  31. [31]

    X. Chen, H. Fang, T.-Y. Lin, R. Vedan- tam, S. Gupta, P. Dollár, and C. L. Zit- nick. Microsoft coco captions: Data collec- tion and evaluation server.arXiv preprint arXiv:1504.00325, 2015

  32. [32]

    X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. Pali: A jointly-scaled mul- tilingual language-image model.arXiv preprint arXiv:2209.06794, 2022

  33. [33]

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, et al. Pali-x: On scaling up a multilingual vi- sion and language model.arXiv preprint arXiv:2305.18565, 2023

  34. [34]

    X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J.Wu, P.Voigtlaender, B.Mustafa, S.Good- man, I. Alabdulmohsin, P. Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger.arXiv preprint arXiv:2310.09199, 2023

  35. [35]

    Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu, L. Gu, X. Wang, Q. Li, Y. Ren, Z. Chen, J. Luo, J. Wang, T. Jiang, B. Wang, C. He, B. Shi, X. Zhang, H. Lv, Y. Wang, W. Shao, P. Chu, Z. Tu, T. He, Z. Wu, H. Deng, J. Ge, K. Chen, K. Zhang, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. Expandin...

  36. [36]

    URL https://arxiv.org/abs/ 2412.05271

  37. [37]

    M. Endo, X. Wang, and S. Yeung-Levy. Feather the throttle: Revisiting visual to- ken pruning for vision-language model ac- celeration. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22826–22835, 2025

  38. [38]

    R. C. Gonzalez and R. E. Woods.Digital Image Processing. Addison-Wesley, 1992

  39. [39]

    Introduction to cloud tpu

    Google Cloud. Introduction to cloud tpu. https://cloud.google.com/ tpu/docs/intro-to-tpu, 2026. Ac- cessed: 2026-05-06

  40. [40]

    Goyal, T

    Y. Goyal, T. Khot, D. Summers-Stay, D. Ba- tra, and D. Parikh. Making the v in vqa matter: Elevating the role of image under- standing in visual question answering. In Proceedings of the IEEE conference on com- puter vision and pattern recognition, pages 6904–6913, 2017

  41. [41]

    X. Guo, J. Zhang, and K. Wang. Adaptive- voco: Complexity-aware visual token com- pression for vision-language models. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Sig- nalProcessing(ICASSP),pages2396–2400,

  42. [42]

    11465113

    doi: 10.1109/ICASSP55912.2026. 11465113

  43. [43]

    Grauman, J

    D.Gurari, Q.Li, A.J.Stangl, A.Guo, C.Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceed- ings of the IEEE conference on computer vi- sion and pattern recognition, pages 3608– 3617, 2018

  44. [44]

    He and H

    J. He and H. Chen. Energy-driven adaptive visual token pruning for efficient vision- language models, 2026. URLhttps:// arxiv.org/abs/2603.05950. 13 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

  45. [45]

    T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. InFindings of the Association for Computational Linguistics: EMNLP 2021, pages 3258–3264, 2021

  46. [46]

    Hu, Z.-Y

    W. Hu, Z.-Y. Dou, L. H. Li, A. Ka- math, N. Peng, and K.-W. Chang. Ma- tryoshka query transformer for large vision-language models.Advances in Neu- ral Information Processing Systems, 37: 50168–50188, 2024

  47. [47]

    Huang, H

    X. Huang, H. Zhou, and K. Han. PruneVid: Visual token pruning for efficient video large language models. In W. Che, J. Nabende, E. Shutova, and M. T. Pile- hvar, editors,Findings of the Associa- tion for Computational Linguistics: ACL 2025, pages 19959–19973, Vienna, Aus- tria, July 2025. Association for Computa- tional Linguistics. ISBN 979-8-89176-256-

  48. [49]

    org/2025.findings-acl.1024/

    URL https://aclanthology. org/2025.findings-acl.1024/

  49. [50]

    Huang, F

    Y. Huang, F. Ma, Y. Shao, J. Guo, Z. YU, L. Cui, and Q. Tian. Nüwa: Mending the spatial integrity torn by VLM token pruning. InThe Fourteenth International Conference on Learning Representations,

  50. [51]

    URLhttps://openreview.net/ forum?id=C9yclwdquU

  51. [52]

    D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reason- ing and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  52. [53]

    Jahagirdar, W

    S. Jahagirdar, W. Bousselham, A. Kukleva, and H. Kuehne. When llava meets objects: Token composition for vision-language- models.arXiv preprint arXiv:2602.04864, 2026

  53. [54]

    Kazemzadeh, V

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language pro- cessing (EMNLP), pages 787–798, 2014

  54. [55]

    Kembhavi, M

    A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. InEuropean con- ference on computer vision, pages 235–251. Springer, 2016

  55. [56]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  56. [57]

    Z. Kong, Y. Li, F. Zeng, L. Xin, S. Mes- sica, X. Lin, P. Zhao, M. Kellis, H. Tang, and M. Zitnik. Token reduction should go beyond efficiency in generative models – from vision, language to multimodality,

  57. [58]

    URL https://arxiv.org/abs/ 2505.18227

  58. [59]

    Krishna, K

    R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE in- ternational conference on computer vision, pages 706–715, 2017

  59. [60]

    Kudugunta, A

    S. Kudugunta, A. Kusupati, T. Dettmers, K. Chen, I. Dhillon, Y. Tsvetkov, H. Ha- jishirzi, S. Kakade, A. Farhadi, and P. Jain. Matformer: Nested transformer for elastic inference.Advances in Neural Information Processing Systems, 37:140535–140564, 2024

  60. [61]

    Kusupati, G

    A. Kusupati, G. Bhatt, A. Rege, M. Walling- ford, A. Sinha, V. Ramanujan, W. Howard- Snyder, K. Chen, S. Kakade, P. Jain, et al. Matryoshka representation learning.Ad- vancesinNeuralInformationProcessingSys- tems, 35:30233–30249, 2022

  61. [62]

    K. Li, S. Goyal, J. D. Semedo, and J. Z. Kolter. Inference optimal VLMs need fewer visual tokens and more parame- ters. InThe Thirteenth International Conference on Learning Representations,

  62. [63]

    URLhttps://openreview.net/ forum?id=6VhDQP7WGX

  63. [64]

    W. Li, Y. Yuan, J. Liu, D. Tang, S. Wang, J. Qin, J. Zhu, and L. Zhang. Tokenpacker: 14 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding Efficient visual projector for multimodal llm, 2024. URL https://arxiv.org/ abs/2407.02392

  64. [65]

    li and X

    X. li and X. Song. Efficient vision- language reasoning via adaptive token pruning. In1st Workshop on VLM4RWD @ NeurIPS 2025, 2025. URL https://openreview.net/forum? id=Vbqemx4YCC

  65. [66]

    Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generating natural language description for mobile user interface elements. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 5495–5510, 2020

  66. [67]

    Y. Li, F. Wang, Y. Li, M. Chen, M. Zhao, and L. Lan. Semantic-geometric dual com- pression: Training-free visual token reduc- tion for ultra-high-resolution remote sens- ing understanding, 2026. URL https: //arxiv.org/abs/2604.11122

  67. [68]

    Z. Li, Y. Li, F. Fang, R. Takezoe, Z.-H. Bo, C. Qian, M. Guang, G. Zhang, and K. Long. Qmop: Query guided mixture-of- projector for efficient visual token com- pression, 2026. URL https://arxiv. org/abs/2603.21232

  68. [69]

    C. Liao, W. Wang, Z. Wen, X. Zheng, Y. Wang, H. He, Y. Lyu, L. Jiang, X. Zou, Y. Fu, B. Ren, L. Zhang, and X. Hu. Are we using the right benchmark: An evaluation framework for visual token compression methods.arXiv preprint arXiv:2510.07143, 2026

  69. [70]

    H. Liao, Z. Jiang, Y. Hao, Y. Tan, S. He, B. Wang, J. Zhao, K. Xu, and K. Liu. Re- sadapt: Adaptive resolution for efficient multimodal reasoning, 2026. URLhttps: //arxiv.org/abs/2603.28610

  70. [71]

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common ob- jects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  71. [72]

    F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. InProceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, pages 10467–10485, 2021

  72. [73]

    H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024

  73. [74]

    T. Liu, L. Shi, R. Hong, Y. Hu, Q. Yin, and L. Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024

  74. [75]

    Z. Liu, L. Zhu, B. Shi, Z. Zhang, Y. Lou, S. Yang, H. Xi, S. Cao, Y. Gu, D. Li, X. Li, H. Tang, Y. Fang, Y. Chen, C.-Y. Hsieh, D.-A. Huang, A.-C. Cheng, J. Hu, S. Liu, R. Krishna, P. Molchanov, J. Kautz, H. Yin, S. Han, and Y. Lu. Nvila: Efficient frontier visual language models. InProceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Re...

  75. [76]

    Lobry, D

    S. Lobry, D. Marcos, J. Murray, and D. Tuia. Rsvqa: Visual question answering for re- mote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12): 8555–8566, 2020

  76. [77]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. Decoupled weightdecayregularization.arXivpreprint arXiv:1711.05101, 2017

  77. [78]

    H. Lou, C. Fan, Z. Liu, Y. Wu, and X. Wang. Llava-sp: Enhancing visual representation withvisualspatialtokensformllms. InPro- ceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 22014–22024, October 2025

  78. [79]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science 15 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding question answering.Advances in neural information processing systems, 35:2507– ...

  79. [80]

    J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous ob- jectdescriptions. InProceedingsoftheIEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  80. [81]

    Marino, M

    K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

Showing first 80 references.