PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Alessio Tonioni; Bernt Schiele; Federico Tombari; Muhammad Ferjad Naeem; Selim Kuzucu; Vasile Lup

arXiv: 2605.30126 · v1 · pith:A3Q3C6L7new · submitted 2026-05-28 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3

keywords: vision-language models · visual token compression · elastic queries · pool-anchored resampling · efficient inference · matryoshka representations · spatial grounding

The pith

PARCEL anchors elastic query tokens to spatial pool tokens to improve efficiency in vision-language models at multiple budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PARCEL to fix problems in elastic visual-token compression for large vision-language models, where aggressive reduction of tokens hurts performance. It sets up spatial pool tokens as low-frequency layout anchors and uses them to condition elastic query tokens, so the queries capture complementary features instead of repeating spatial information. This lets one trained model run at different token counts while beating earlier matryoshka-style baselines across 27 benchmarks. The approach keeps the train-once-deploy-anywhere property and improves the performance-efficiency Pareto frontier.

Core claim

PARCEL is a visual tokenization architecture that dynamically partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on those anchors through Pool-Conditioned Query Resampling, which encourages the query tokens to focus on complementary visual features rather than redundant spatial mapping.

What carries the argument

Pool-Conditioned Query Resampling, which uses spatial pool tokens to condition elastic query tokens and partition low-frequency layout from higher-detail features.
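The mechanism can be made concrete with a small, dependency-free sketch. Everything here is an assumption made for illustration: the abstract gives no equations, so the conditioning step (each query first reads the pool tokens, then adds that context to itself before attending to the patch features), the prefix slicing used for elasticity, and the names `avg_pool`, `attend`, and `pool_conditioned_resample` are hypothetical and should not be read as the authors' implementation. Learned projections are omitted.

```python
import math
import random

def avg_pool(grid, k):
    """Average-pool an H x W grid of D-dim patch features with a k x k window."""
    H, W, D = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, H, k):
        for j in range(0, W, k):
            cell = [grid[a][b] for a in range(i, i + k) for b in range(j, j + k)]
            pooled.append([sum(v[d] for v in cell) / len(cell) for d in range(D)])
    return pooled

def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                    for d in range(len(values[0]))])
    return out

def pool_conditioned_resample(grid, queries, k, num_queries):
    """Return pool tokens followed by `num_queries` conditioned query tokens."""
    patches = [tok for row in grid for tok in row]
    pool_tokens = avg_pool(grid, k)        # low-frequency layout anchors
    active = queries[:num_queries]         # assumed: elastic budget = nested prefix
    # Assumed conditioning: each query reads the pool anchors first ...
    context = attend(active, pool_tokens, pool_tokens)
    conditioned = [[a + b for a, b in zip(q, c)] for q, c in zip(active, context)]
    # ... and only then resamples the full-resolution patch features.
    query_tokens = attend(conditioned, patches, patches)
    return pool_tokens + query_tokens

random.seed(0)
H, W, D = 8, 8, 16
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(32)]
for budget in (4, 16, 32):
    tokens = pool_conditioned_resample(grid, queries, k=4, num_queries=budget)
    print(budget, "queries ->", len(tokens), "visual tokens")
```

Because each query is processed independently in this sketch, the tokens produced at a small budget are a prefix of those produced at a larger one, which is one simple way to obtain the nested, single-model behaviour the summary describes.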

If this is right

A single model can be deployed at any chosen visual-token budget without retraining.
Spatial grounding remains stronger than in query-only resampling under heavy compression.
Fine-grained detail is preserved better than in spatial-only pooling because aliasing is reduced.
The same architecture improves the Pareto frontier across the full range of tested token counts.
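To give a sense of why the visual-token budget matters, the following back-of-the-envelope sketch estimates attention cost as a function of token count. The formula counts only the two quadratic matrix products in self-attention (roughly 4·n²·d FLOPs per layer) and ignores projections and MLP blocks, which scale linearly in n. The model width, depth, and budget values are hypothetical placeholders, not figures from the paper.

```python
def attention_flops(num_visual, num_text, d_model, num_layers):
    """Rough FLOPs for the attention score and value products only.

    Per layer, QK^T costs about 2*n*n*d and the weighted sum over V another
    2*n*n*d, so 4*n*n*d in total, where n is the sequence length.
    """
    n = num_visual + num_text
    return 4 * n * n * d_model * num_layers

# Hypothetical model size; only the ratio between budgets matters here.
D_MODEL, LAYERS = 4096, 32
for budget in (32, 64, 128, 256):
    flops = attention_flops(budget, num_text=0, d_model=D_MODEL, num_layers=LAYERS)
    print(f"{budget:>4} visual tokens: {flops:.3e} attention FLOPs")
```

With no text tokens, going from 32 to 256 visual tokens multiplies this quadratic term by 64, which is why a single model that can be served at several budgets is attractive.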

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring idea could be tested on audio or video token sequences that also face quadratic costs.
Lower token budgets enabled by this method might make on-device vision-language inference practical for new hardware.
Future work could measure how much the pool tokens themselves can be further compressed once queries are properly conditioned.

Load-bearing premise

Spatial pool tokens will serve as effective low-frequency anchors that steer conditioned query tokens toward complementary features instead of redundant ones.

What would settle it

A head-to-head evaluation at low visual-token budgets where PARCEL fails to match or exceed matryoshka baselines on the reported benchmarks would falsify the performance claim.
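The falsification test above amounts to a per-budget comparison, which can be sketched as follows. The function name and the scores are hypothetical: the numbers are invented placeholders for illustration only and are not results from the paper.

```python
def budgets_where_baseline_wins(candidate, baseline, tolerance=0.0):
    """Return the shared token budgets at which the baseline beats the candidate."""
    shared = sorted(set(candidate) & set(baseline))
    return [b for b in shared if baseline[b] > candidate[b] + tolerance]

# Placeholder scores keyed by visual-token budget; NOT results from the paper.
placeholder_candidate = {32: 0.61, 64: 0.66, 128: 0.70}
placeholder_baseline = {32: 0.58, 64: 0.67, 128: 0.69}

losing = budgets_where_baseline_wins(placeholder_candidate, placeholder_baseline)
print("budgets where the claim would fail:", losing)
```

An empty list would be consistent with the "consistently outperforming" claim at the tested budgets; any non-empty list, especially at the lowest budgets, would count against it. The `tolerance` argument is a stand-in for whatever noise margin a real evaluation would justify.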

Original abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
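The abstract's remark that pooling "behaves as an imperfect low-pass filter and induces spectral aliasing" can be illustrated with a one-dimensional toy signal. This is a textbook signal-processing demonstration, not code from the paper; the helper name `avg_pool_1d` is hypothetical.

```python
import math

def avg_pool_1d(signal, k):
    """Non-overlapping average pooling: a box filter followed by stride-k subsampling."""
    return [sum(signal[i:i + k]) / k for i in range(0, len(signal) - k + 1, k)]

# A cosine with a period of 3 samples, i.e. a frequency of 1/3 cycle per sample.
fine_detail = [math.cos(2 * math.pi * n / 3) for n in range(24)]
pooled = avg_pool_1d(fine_detail, k=2)

# After stride-2 pooling, the highest representable frequency is 1/4 cycle per
# original sample, so the 1/3 component should have been filtered out. The box
# filter only halves its amplitude, and it reappears with a period of 3 pooled
# samples (6 original samples): fine detail masquerading as coarser structure.
print([round(v, 2) for v in pooled[:6]])
```

An ideal low-pass filter would have returned a flat signal here. The residual oscillation at the wrong frequency is the aliasing that the abstract attributes to spatial-only compression.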

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PARCEL's pool-conditioned resampling tries to split spatial anchoring from query detail, but the abstract gives no evidence that the conditioning actually produces complementary features rather than redundancy.

The main takeaway is that PARCEL claims to fix the tension between spatial pooling (which loses fine detail under heavy compression) and query resampling (which loses spatial grounding) by letting pool tokens condition the elastic queries. If the mechanism works, it would give a better performance-efficiency curve while keeping the single-model multi-budget setup.

The paper states the problem cleanly and positions the new Pool-Conditioned Query Resampling as the fix that encourages queries to pick up what the pools miss. That framing is reasonable and directly targets a known limitation in matryoshka-style compression. The emphasis on preserving train-once-deploy-anywhere is also practical.

The weak part is the missing support for the central claim. The abstract asserts gains across 27 benchmarks but shows no numbers, no error bars, no equations for the resampling, and no ablation or attention analysis that would confirm queries avoid duplicating the pool information. Without that, it is impossible to tell whether any improvement comes from the conditioning or from other training choices. The stress-test note is on target here.

This is for people working on efficient inference for vision-language models who already follow matryoshka and nested compression work. A reader in that niche would get the architectural idea and could test it, but would need the full methods and results to judge whether the claims hold.

I would send it to peer review. The problem is real, the proposed split of labor is distinct from the baselines mentioned, and referees can check the experiments directly.

Referee Report

2 major / 0 minor

Summary. The paper proposes PARCEL, a visual tokenization architecture for large vision-language models that partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on them via Pool-Conditioned Query Resampling. This is claimed to resolve representational conflicts between spatial-only and query-only compression methods, yielding consistent outperformance over matryoshka baselines on the performance-efficiency Pareto frontier across visual-token budgets on 27 benchmarks while preserving the train-once-deploy-anywhere property.

Significance. If the central mechanism is validated and the reported gains hold under scrutiny, the work would address a practical bottleneck in LVLM inference by improving token compression without sacrificing spatial grounding or requiring per-budget retraining. The emphasis on complementary rather than redundant feature extraction between pool and query tokens is a conceptually coherent attempt to advance beyond existing nested pooling and resampling approaches.

major comments (2)

[Abstract] The central claim that PARCEL 'improves the performance-efficiency Pareto frontier', 'consistently outperforming existing matryoshka baselines', is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, so a reader cannot assess whether the data support it.
[Abstract] The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and it references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information. This mechanism is load-bearing for attributing any gains to the proposed architecture.
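One simple form the feature-diversity metric requested above could take is sketched below: for each query token, find its highest cosine similarity to any pool token, then average over queries. This is a hypothetical diagnostic proposed here for illustration, not a metric reported in the paper, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query_pool_redundancy(query_tokens, pool_tokens):
    """Mean over queries of the highest cosine similarity to any pool token."""
    best = [max(cosine(q, p) for p in pool_tokens) for q in query_tokens]
    return sum(best) / len(best)

pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
redundant = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]       # same directions as the pool
complementary = [[0.0, 0.0, 1.0], [0.0, 0.0, -2.0]]  # orthogonal to the pool

print("redundant queries:    ", query_pool_redundancy(redundant, pool))
print("complementary queries:", query_pool_redundancy(complementary, pool))
```

A score near 1 would suggest the queries duplicate pool information, while a score near 0 indicates they occupy different directions. A low score alone does not show that the extra content is useful, so it would complement rather than replace the ablations the referee asks for.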

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and indicate planned revisions to strengthen the presentation of claims and mechanism.

Referee: [Abstract] The central claim that PARCEL 'improves the performance-efficiency Pareto frontier', 'consistently outperforming existing matryoshka baselines', is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, so a reader cannot assess whether the data support it.

Authors: We agree the abstract would be stronger with concrete quantitative anchors. The manuscript reports results across 27 benchmarks at multiple token budgets (e.g., 32–256 tokens) with consistent gains over matryoshka baselines; we will revise the abstract to include one or two representative metrics (average improvement and example budgets) while respecting length constraints.
Revision: yes

Referee: [Abstract] The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and it references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information. This mechanism is load-bearing for attributing any gains to the proposed architecture.

Authors: Abstracts are high-level summaries and conventionally omit equations, diagrams, and detailed ablation references; these appear in Sections 3–5 of the manuscript. We will revise the abstract wording to signal more clearly that the complementary-feature claim is supported by the ablations and analyses presented in the body, but we do not plan to insert equations or diagrams into the abstract itself.
Revision: partial

Circularity Check

0 steps flagged

No circularity: new architecture proposed without self-referential derivations or fitted predictions

The paper proposes PARCEL as a novel visual tokenization method that partitions feature extraction between pool tokens (low-frequency anchors) and conditioned elastic queries. No equations, parameter fits, or derivations are presented in the provided text that reduce the claimed performance gains to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on the architectural description and empirical evaluations across benchmarks rather than any self-definitional or fitted-input reduction. This is a standard empirical architecture paper whose derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that pool tokens can act as effective anchors for conditioning without introducing new representational conflicts, plus the new method entity itself.

axioms (1)

Domain assumption: Spatial pool tokens can serve as low-frequency layout anchors that resolve the conflict between spatial and query compression when used to condition elastic queries.
Invoked directly in the abstract description of how PARCEL partitions feature extraction labor.

invented entities (1)

Pool-Conditioned Query Resampling (no independent evidence)
purpose: To encourage query tokens to capture complementary features by conditioning them on pool anchors.
New component introduced to address the limitations of prior compression methods.

pith-pipeline@v0.9.1-grok · 5770 in / 1251 out tokens · 36749 ms · 2026-06-29T08:11:10.280573+00:00 · methodology



I.Bello,H.Pham,Q.V.Le,M.Norouzi,and S. Bengio. Neural combinatorial optimiza- tion with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[14] [14]

Beyer, X

L. Beyer, X. Zhai, and A. Kolesnikov. Better plain vit baselines for imagenet-1k.arXiv preprint arXiv:2205.01580, 2022

work page arXiv 2022

[15] [15]

PaliGemma: A versatile 3B VLM for transfer

L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neu- mann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, and D. Karatzas. Scene text visual question answering. InProceedings of the IEEE/CVF internationalconferenceoncomputervision, pages 4291–4301, 2019

2019

[17] [17]

R. N. Bracewell. The fourier transform. Scientific American, 260(6):86–95, 1989

1989

[18] [18]

Bradbury, R

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, Y. Katariya, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transfor- mations of Python+NumPy programs,

[19] [19]

URL http://github.com/ jax-ml/jax

[20] [20]

Bulat, Y

A. Bulat, Y. Ouali, and G. Tzimiropou- los. Fwd2bot: Lvlm visual token com- pression with double forward bottleneck,

[21] [21]

URL https://arxiv.org/abs/ 2503.21757

work page arXiv

[22] [22]

M. Cai, J. Yang, J. Gao, and Y. J. Lee. Matryoshka multimodal mod- els. InThe Thirteenth International Conference on Learning Representations,

[23] [23]

URLhttps://openreview.net/ forum?id=Uhj5OxAz7I

[24] [24]

Cappellazzo, M

U. Cappellazzo, M. Kim, P. Ma, H. Chen, X. Liu, S. Petridis, and M. Pantic. Mome: Mixture of matryoshka experts for audio-visual speech recognition, 2025. URL https://arxiv.org/abs/2510. 04136

2025

[25] [25]

J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal llm. InProceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), pages 13817–13827, June 2024

2024

[26] [26]

Changpinyo, D

S. Changpinyo, D. Kukliansy, I. Szpektor, X. Chen, N. Ding, and R. Soricut. All you may need for vqa are image captions. InProceedings of the 2022 conference of the north american chapter of the associa- tion for computational linguistics: human language technologies, pages 1947–1963, 2022

2022

[27] [27]

Collectinghighly parallel data for paraphrase evaluation

D.ChenandW.B.Dolan. Collectinghighly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguis- tics: human language technologies, pages 190–200, 2011

2011

[28] [28]

J. Chen, L. Ye, J. He, Z.-Y. Wang, D. Khashabi, and A. Yuille. Efficient large multi-modal models via visual context compression. In A. Globerson, L. Mackey, 12 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural I...

work page doi:10.52202/079017-2353 2024

[29] [29]

L. Chen, H. Zhao, T. Liu, S. Bai, J. Lin, C. Zhou, and B. Chang. An image is worth 1/2 tokens after layer 2: Plug-and- playinferenceaccelerationforlargevision- language models. InEuropean Conference onComputerVision, pages19–35.Springer, 2024

2024

[30] [30]

T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton. Pix2seq: A language model- ing framework for object detection.arXiv preprint arXiv:2109.10852, 2021

work page arXiv 2021

[31] [31]

X. Chen, H. Fang, T.-Y. Lin, R. Vedan- tam, S. Gupta, P. Dollár, and C. L. Zit- nick. Microsoft coco captions: Data collec- tion and evaluation server.arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[32] [32]

X. Chen, X. Wang, S. Changpinyo, A. J. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. Pali: A jointly-scaled mul- tilingual language-image model.arXiv preprint arXiv:2209.06794, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[33] [33]

X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu, C. R. Ruiz, S. Goodman, X. Wang, Y. Tay, et al. Pali-x: On scaling up a multilingual vi- sion and language model.arXiv preprint arXiv:2305.18565, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

X. Chen, X. Wang, L. Beyer, A. Kolesnikov, J.Wu, P.Voigtlaender, B.Mustafa, S.Good- man, I. Alabdulmohsin, P. Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger.arXiv preprint arXiv:2310.09199, 2023

work page arXiv 2023

[35] [35]

Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu, L. Gu, X. Wang, Q. Li, Y. Ren, Z. Chen, J. Luo, J. Wang, T. Jiang, B. Wang, C. He, B. Shi, X. Zhang, H. Lv, Y. Wang, W. Shao, P. Chu, Z. Tu, T. He, Z. Wu, H. Deng, J. Ge, K. Chen, K. Zhang, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. Expandin...

[36] [36]

URL https://arxiv.org/abs/ 2412.05271

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

M. Endo, X. Wang, and S. Yeung-Levy. Feather the throttle: Revisiting visual to- ken pruning for vision-language model ac- celeration. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 22826–22835, 2025

2025

[38] [38]

R. C. Gonzalez and R. E. Woods.Digital Image Processing. Addison-Wesley, 1992

1992

[39] [39]

Introduction to cloud tpu

Google Cloud. Introduction to cloud tpu. https://cloud.google.com/ tpu/docs/intro-to-tpu, 2026. Ac- cessed: 2026-05-06

2026

[40] [40]

Goyal, T

Y. Goyal, T. Khot, D. Summers-Stay, D. Ba- tra, and D. Parikh. Making the v in vqa matter: Elevating the role of image under- standing in visual question answering. In Proceedings of the IEEE conference on com- puter vision and pattern recognition, pages 6904–6913, 2017

2017

[41] [41]

X. Guo, J. Zhang, and K. Wang. Adaptive- voco: Complexity-aware visual token com- pression for vision-language models. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Sig- nalProcessing(ICASSP),pages2396–2400,

2026

[42] [42]

11465113

doi: 10.1109/ICASSP55912.2026. 11465113

work page doi:10.1109/icassp55912.2026 2026

[43] [43]

Grauman, J

D.Gurari, Q.Li, A.J.Stangl, A.Guo, C.Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceed- ings of the IEEE conference on computer vi- sion and pattern recognition, pages 3608– 3617, 2018

2018

[44] [44]

He and H

J. He and H. Chen. Energy-driven adaptive visual token pruning for efficient vision- language models, 2026. URLhttps:// arxiv.org/abs/2603.05950. 13 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

work page arXiv 2026

[45] [45]

T.-Y. Hsu, C. L. Giles, and T.-H. Huang. Scicap: Generating captions for scientific figures. InFindings of the Association for Computational Linguistics: EMNLP 2021, pages 3258–3264, 2021

2021

[46] [46]

Hu, Z.-Y

W. Hu, Z.-Y. Dou, L. H. Li, A. Ka- math, N. Peng, and K.-W. Chang. Ma- tryoshka query transformer for large vision-language models.Advances in Neu- ral Information Processing Systems, 37: 50168–50188, 2024

2024

[47] [47]

Huang, H

X. Huang, H. Zhou, and K. Han. PruneVid: Visual token pruning for efficient video large language models. In W. Che, J. Nabende, E. Shutova, and M. T. Pile- hvar, editors,Findings of the Associa- tion for Computational Linguistics: ACL 2025, pages 19959–19973, Vienna, Aus- tria, July 2025. Association for Computa- tional Linguistics. ISBN 979-8-89176-256-

2025

[48] [49]

org/2025.findings-acl.1024/

URL https://aclanthology. org/2025.findings-acl.1024/

2025

[49] [50]

Huang, F

Y. Huang, F. Ma, Y. Shao, J. Guo, Z. YU, L. Cui, and Q. Tian. Nüwa: Mending the spatial integrity torn by VLM token pruning. InThe Fourteenth International Conference on Learning Representations,

[50] [51]

URLhttps://openreview.net/ forum?id=C9yclwdquU

[51] [52]

D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reason- ing and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

2019

[52] [53]

Jahagirdar, W

S. Jahagirdar, W. Bousselham, A. Kukleva, and H. Kuehne. When llava meets objects: Token composition for vision-language- models.arXiv preprint arXiv:2602.04864, 2026

work page arXiv 2026

[53] [54]

Kazemzadeh, V

S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language pro- cessing (EMNLP), pages 787–798, 2014

2014

[54] [55]

Kembhavi, M

A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. InEuropean con- ference on computer vision, pages 235–251. Springer, 2016

2016

[55] [56]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[56] [57]

Z. Kong, Y. Li, F. Zeng, L. Xin, S. Mes- sica, X. Lin, P. Zhao, M. Kellis, H. Tang, and M. Zitnik. Token reduction should go beyond efficiency in generative models – from vision, language to multimodality,

[57] [58]

URL https://arxiv.org/abs/ 2505.18227

work page arXiv

[58] [59]

Krishna, K

R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. InProceedings of the IEEE in- ternational conference on computer vision, pages 706–715, 2017

2017

[59] [60]

Kudugunta, A

S. Kudugunta, A. Kusupati, T. Dettmers, K. Chen, I. Dhillon, Y. Tsvetkov, H. Ha- jishirzi, S. Kakade, A. Farhadi, and P. Jain. Matformer: Nested transformer for elastic inference.Advances in Neural Information Processing Systems, 37:140535–140564, 2024

2024

[60] [61]

Kusupati, G

A. Kusupati, G. Bhatt, A. Rege, M. Walling- ford, A. Sinha, V. Ramanujan, W. Howard- Snyder, K. Chen, S. Kakade, P. Jain, et al. Matryoshka representation learning.Ad- vancesinNeuralInformationProcessingSys- tems, 35:30233–30249, 2022

2022

[61] [62]

K. Li, S. Goyal, J. D. Semedo, and J. Z. Kolter. Inference optimal VLMs need fewer visual tokens and more parame- ters. InThe Thirteenth International Conference on Learning Representations,

[62] [63]

URLhttps://openreview.net/ forum?id=6VhDQP7WGX

[63] [64]

W. Li, Y. Yuan, J. Liu, D. Tang, S. Wang, J. Qin, J. Zhu, and L. Zhang. Tokenpacker: 14 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding Efficient visual projector for multimodal llm, 2024. URL https://arxiv.org/ abs/2407.02392

work page arXiv 2024

[64] [65]

li and X

X. li and X. Song. Efficient vision- language reasoning via adaptive token pruning. In1st Workshop on VLM4RWD @ NeurIPS 2025, 2025. URL https://openreview.net/forum? id=Vbqemx4YCC

2025

[65] [66]

Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generating natural language description for mobile user interface elements. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 5495–5510, 2020

2020

[66] [67]

Y. Li, F. Wang, Y. Li, M. Chen, M. Zhao, and L. Lan. Semantic-geometric dual com- pression: Training-free visual token reduc- tion for ultra-high-resolution remote sens- ing understanding, 2026. URL https: //arxiv.org/abs/2604.11122

work page internal anchor Pith review Pith/arXiv arXiv 2026

[67] [68]

Z. Li, Y. Li, F. Fang, R. Takezoe, Z.-H. Bo, C. Qian, M. Guang, G. Zhang, and K. Long. Qmop: Query guided mixture-of- projector for efficient visual token com- pression, 2026. URL https://arxiv. org/abs/2603.21232

work page arXiv 2026

[68] [69]

C. Liao, W. Wang, Z. Wen, X. Zheng, Y. Wang, H. He, Y. Lyu, L. Jiang, X. Zou, Y. Fu, B. Ren, L. Zhang, and X. Hu. Are we using the right benchmark: An evaluation framework for visual token compression methods.arXiv preprint arXiv:2510.07143, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[69] [70]

H. Liao, Z. Jiang, Y. Hao, Y. Tan, S. He, B. Wang, J. Zhao, K. Xu, and K. Liu. Re- sadapt: Adaptive resolution for efficient multimodal reasoning, 2026. URLhttps: //arxiv.org/abs/2603.28610

work page arXiv 2026

[70] [71]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common ob- jects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

2014

[71] [72]

F. Liu, E. Bugliarello, E. M. Ponti, S. Reddy, N. Collier, and D. Elliott. Visually grounded reasoning across languages and cultures. InProceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, pages 10467–10485, 2021

2021

[72] [73]

H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, June 2024

2024

[73] [74]

T. Liu, L. Shi, R. Hong, Y. Hu, Q. Yin, and L. Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model.arXiv preprint arXiv:2411.10803, 2024

work page arXiv 2024

[74] [75]

Z. Liu, L. Zhu, B. Shi, Z. Zhang, Y. Lou, S. Yang, H. Xi, S. Cao, Y. Gu, D. Li, X. Li, H. Tang, Y. Fang, Y. Chen, C.-Y. Hsieh, D.-A. Huang, A.-C. Cheng, J. Hu, S. Liu, R. Krishna, P. Molchanov, J. Kautz, H. Yin, S. Han, and Y. Lu. Nvila: Efficient frontier visual language models. InProceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Re...

2025

[75] [76]

Lobry, D

S. Lobry, D. Marcos, J. Murray, and D. Tuia. Rsvqa: Visual question answering for re- mote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12): 8555–8566, 2020

2020

[76] [77]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter. Decoupled weightdecayregularization.arXivpreprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[77] [78]

H. Lou, C. Fan, Z. Liu, Y. Wu, and X. Wang. Llava-sp: Enhancing visual representation withvisualspatialtokensformllms. InPro- ceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 22014–22024, October 2025

2025

[78] [79]

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science 15 PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding question answering.Advances in neural information processing systems, 35:2507– ...

2022

[79] [80]

J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous ob- jectdescriptions. InProceedingsoftheIEEE conference on computer vision and pattern recognition, pages 11–20, 2016

2016

[80] [81]

Marino, M

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

2019