PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3
The pith
PARCEL anchors elastic query tokens to spatial pool tokens to improve efficiency in vision-language models at multiple budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PARCEL is a visual tokenization architecture that dynamically partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on those anchors through Pool-Conditioned Query Resampling, which encourages the query tokens to focus on complementary visual features rather than redundant spatial mapping.
What carries the argument
Pool-Conditioned Query Resampling, which uses spatial pool tokens to condition elastic query tokens and partition low-frequency layout from higher-detail features.
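The source gives no equations for the conditioning operation, so the following is only a minimal sketch of one way the idea could be wired up, not the authors' method. It assumes average pooling for the pool tokens, a residual cross-attention read of the pool tokens as the "conditioning" step, and unprojected single-head attention; the names `avg_pool`, `attend`, and `pool_conditioned_resample` are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values):
    """Single-head dot-product attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values)) for i in range(d)])
    return out

def avg_pool(grid, window):
    """Average-pool an H x W grid of vectors; assumes H and W are divisible by window."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for r in range(0, h, window):
        for c in range(0, w, window):
            cell = [grid[rr][cc] for rr in range(r, r + window) for cc in range(c, c + window)]
            pooled.append([sum(v[i] for v in cell) / len(cell) for i in range(d)])
    return pooled

def pool_conditioned_resample(grid, queries, window):
    # Pool tokens: grid-aligned, low-frequency layout summary.
    pool_tokens = avg_pool(grid, window)
    # Assumed conditioning: each query reads the pool tokens and adds the result.
    context = attend(queries, pool_tokens)
    conditioned = [[qi + ci for qi, ci in zip(q, c)] for q, c in zip(queries, context)]
    # Conditioned queries then resample the full patch grid.
    patches = [vec for row in grid for vec in row]
    query_tokens = attend(conditioned, patches)
    return pool_tokens + query_tokens

random.seed(0)
H = W = 4
D = 8
grid = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(W)] for _ in range(H)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(3)]
tokens = pool_conditioned_resample(grid, queries, window=2)
print(len(tokens), len(tokens[0]))  # 4 pool tokens + 3 query tokens, each of width 8
```

In this toy setup the output sequence holds the pool tokens first and the query tokens after them. Nothing here enforces that the query tokens are actually complementary to the pool tokens; that is the property the referee report asks the paper to validate.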
If this is right
- A single model can be deployed at any chosen visual-token budget without retraining.
- Spatial grounding remains stronger than in query-only resampling under heavy compression.
- Fine-grained detail is preserved better than in spatial-only pooling because aliasing is reduced.
- The same architecture improves the Pareto frontier across the full range of tested token counts.
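The "any chosen budget without retraining" property is typically obtained in matryoshka-style models by making smaller token sets nested inside larger ones. The sketch below illustrates that nesting under an assumed policy (keep every pool token, then take a prefix of the query tokens); the source does not state how PARCEL divides a budget, and `select_tokens` is a hypothetical helper.

```python
def select_tokens(pool_tokens, query_tokens, budget):
    """Return `budget` tokens: all pool tokens, then a prefix of the query tokens.

    Assumed policy for illustration only; the paper's actual split is not given.
    """
    n_queries = budget - len(pool_tokens)
    if n_queries < 0 or n_queries > len(query_tokens):
        raise ValueError("budget is outside the supported range")
    return pool_tokens + query_tokens[:n_queries]

pool = [f"p{i}" for i in range(4)]
queries = [f"q{i}" for i in range(12)]

small = select_tokens(pool, queries, budget=8)
large = select_tokens(pool, queries, budget=16)

# Nesting: the small-budget sequence is a prefix of the large-budget one.
print(small == large[:8])
```

Because every smaller budget is a prefix of the larger one, a single set of weights can serve all budgets, which is what makes a train-once deployment possible in this style of design.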
Where Pith is reading between the lines
- The same anchoring idea could be tested on audio or video token sequences that also face quadratic costs.
- Lower token budgets enabled by this method might make on-device vision-language inference practical for new hardware.
- Future work could measure how much the pool tokens themselves can be further compressed once queries are properly conditioned.
Load-bearing premise
Spatial pool tokens will serve as effective low-frequency anchors that steer conditioned query tokens toward complementary features instead of redundant ones.
What would settle it
A head-to-head evaluation at low visual-token budgets where PARCEL fails to match or exceed matryoshka baselines on the reported benchmarks would falsify the performance claim.
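The falsification test above amounts to a per-budget comparison. As a rough sketch, the helper below lists the low budgets at which a candidate scores below a baseline. The function name `falsifying_budgets` is hypothetical, and the scores are made-up placeholders chosen only to exercise the code; they are not results from the paper.

```python
def falsifying_budgets(candidate, baseline, max_budget):
    """Budgets <= max_budget, present in both dicts, where candidate < baseline."""
    shared = sorted(set(candidate) & set(baseline))
    return [b for b in shared if b <= max_budget and candidate[b] < baseline[b]]

# Placeholder scores keyed by visual-token budget (NOT from the paper).
method_a = {32: 0.50, 64: 0.60, 128: 0.70}
method_b = {32: 0.55, 64: 0.58, 128: 0.69}

print(falsifying_budgets(method_a, method_b, max_budget=64))
```

An empty list at the chosen cutoff would be consistent with the claim for that benchmark; any non-empty list would be a counterexample at the listed budgets.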
Original abstract
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
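The aliasing argument in the abstract can be illustrated with a one-dimensional toy case. This is a sketch under simplifying assumptions (a single cosine, window-2 average pooling with stride 2), not a reproduction of the paper's analysis; `avg_pool_1d` is a hypothetical helper.

```python
import math

def avg_pool_1d(signal, window):
    """Non-overlapping average pooling (stride equals window)."""
    return [sum(signal[i:i + window]) / window
            for i in range(0, len(signal) - window + 1, window)]

n = 16
freq = 3 / 8  # cycles per sample; above the 1/4 limit representable after stride-2 pooling
signal = [math.cos(2 * math.pi * freq * i) for i in range(n)]

pooled = avg_pool_1d(signal, window=2)
print([round(v, 3) for v in pooled])
```

The pooled values are not zero: they follow a pattern of roughly 0.146, 0.354, -0.146, -0.354 that repeats every four pooled samples, that is, every eight input samples, even though the input oscillates with a period of 8/3 samples. The box filter attenuates the fast component without removing it, and the remainder shows up as a slower oscillation that was never in the input.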
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PARCEL, a visual tokenization architecture for large vision-language models that partitions feature extraction by establishing spatial pool tokens as low-frequency layout anchors and conditioning elastic query tokens on them via Pool-Conditioned Query Resampling. This is claimed to resolve representational conflicts between spatial-only and query-only compression methods, yielding consistent outperformance over matryoshka baselines on the performance-efficiency Pareto frontier across visual-token budgets on 27 benchmarks while preserving the train-once-deploy-anywhere property.
Significance. If the central mechanism is validated and the reported gains hold under scrutiny, the work would address a practical bottleneck in LVLM inference by improving token compression without sacrificing spatial grounding or requiring per-budget retraining. The emphasis on complementary rather than redundant feature extraction between pool and query tokens is a conceptually coherent attempt to advance beyond existing nested pooling and resampling approaches.
major comments (2)
- [Abstract] Abstract: The central claim that PARCEL 'improves the performance-efficiency Pareto frontier' and 'consistently outperforming existing matryoshka baselines' is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, preventing assessment of whether the data support the claim.
- [Abstract] Abstract: The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information; this mechanism is load-bearing for attributing any gains to the proposed architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and indicate planned revisions to strengthen the presentation of claims and mechanism.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that PARCEL 'improves the performance-efficiency Pareto frontier' and 'consistently outperforming existing matryoshka baselines' is asserted without any quantitative results, tables, error bars, specific benchmark names, or token-budget values, preventing assessment of whether the data support the claim.
  Authors: We agree the abstract would be stronger with concrete quantitative anchors. The manuscript reports results across 27 benchmarks at multiple token budgets (e.g., 32–256 tokens) with consistent gains over matryoshka baselines; we will revise the abstract to include one or two representative metrics (average improvement and example budgets) while preserving length constraints.
  Revision: yes
- Referee: [Abstract] Abstract: The description of Pool-Conditioned Query Resampling provides no equations, architectural diagram, or formal definition of the conditioning operation, and references no ablations, attention-map analyses, or feature-diversity metrics to validate that query tokens extract complementary features rather than duplicating pool information; this mechanism is load-bearing for attributing any gains to the proposed architecture.
  Authors: Abstracts are high-level summaries and conventionally omit equations, diagrams, and detailed ablation references; these appear in Sections 3–5 of the manuscript. We will revise the abstract wording to more clearly signal that the complementary-feature claim is supported by the ablations and analyses presented in the body, but we do not plan to insert equations or diagrams into the abstract itself.
  Revision: partial
Circularity Check
No circularity: new architecture proposed without self-referential derivations or fitted predictions
Full rationale
The paper proposes PARCEL as a novel visual tokenization method that partitions feature extraction between pool tokens (low-frequency anchors) and conditioned elastic queries. No equations, parameter fits, or derivations are presented in the provided text that reduce the claimed performance gains to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on the architectural description and empirical evaluations across benchmarks rather than any self-definitional or fitted-input reduction. This is a standard empirical architecture paper whose derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Spatial pool tokens can serve as low-frequency layout anchors that resolve the conflict between spatial and query compression when used to condition elastic queries.
invented entities (1)
- Pool-Conditioned Query Resampling: no independent evidence