SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Pith reviewed 2026-05-15 14:51 UTC · model grok-4.3
The pith
SparseVLM prunes visual tokens in VLMs using text attention scores without any training or added parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SparseVLM is a text-guided, training-free token optimization mechanism with four parts. It selects relevant text tokens and uses them to rate the significance of visual tokens through self-attention matrices. It prunes visual tokens with a strategy meant to maximize sparsity while retaining information. It applies a rank-based strategy that adaptively sets the sparsification ratio for each layer. Finally, it recycles pruned tokens by compressing them into more compact representations.
What carries the argument
Text-guided self-attention scoring of visual tokens combined with per-layer rank-based adaptive pruning and token recycling.
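To make the scoring step concrete, here is a minimal Python sketch of text-guided scoring followed by top-k pruning. It assumes that a visual token's score is the mean attention it receives from the selected text tokens. The function names, the toy attention matrix, the token layout, and the fixed k are all illustrative assumptions and are not taken from the paper or its released code.

```python
def score_visual_tokens(attn, text_idx, visual_idx):
    """Mean attention that the chosen text tokens (rows) pay to each visual token (columns)."""
    scores = []
    for v in visual_idx:
        total = sum(attn[t][v] for t in text_idx)
        scores.append(total / len(text_idx))
    return scores


def keep_top_k(visual_idx, scores, k):
    """Return the k highest-scoring visual token positions, in their original order."""
    ranked = sorted(range(len(visual_idx)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [visual_idx[i] for i in kept]


# Toy causal attention matrix: rows are query positions, columns are key positions.
# Hypothetical layout: positions 0-3 are visual tokens, positions 4-5 are text tokens.
attn = [
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00, 0.00, 0.00],
    [0.30, 0.30, 0.40, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25, 0.00, 0.00],
    [0.05, 0.40, 0.05, 0.30, 0.20, 0.00],
    [0.10, 0.30, 0.05, 0.35, 0.10, 0.10],
]
visual_idx = [0, 1, 2, 3]
text_idx = [4, 5]  # assumed to be the "relevant" text tokens

scores = score_visual_tokens(attn, text_idx, visual_idx)
kept = keep_top_k(visual_idx, scores, k=2)
print(kept)
```

In this toy case the two visual tokens that the text tokens attend to most are kept and the other two become candidates for pruning or recycling. A real implementation would read the attention matrix from a decoder layer and would choose k per layer rather than fixing it.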
If this is right
- Various VLMs achieve substantial reductions in FLOPs and inference latency on image and video understanding tasks.
- LLaVA specifically sees 54 percent fewer FLOPs and 37 percent lower CUDA latency while keeping 97 percent accuracy.
- The approach requires no extra parameters or fine-tuning on any training data.
- Sparsification ratios adapt automatically per layer rather than using a fixed global value.
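One way to picture a per-layer adaptive ratio is to tie the pruning budget to the numerical rank of the text-to-visual attention block: the further the rank falls below the number of visual tokens, the more tokens are treated as redundant. The sketch below is an assumption about what "rank-based" could mean, not a transcription of the paper's rule. The helper names, the `scale` factor, and the toy matrix are hypothetical.

```python
def matrix_rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def tokens_to_prune(text_to_visual, scale=1.0):
    """Assumed rule: prune scale * (number of visual tokens - rank of the block)."""
    n_visual = len(text_to_visual[0])
    redundant = n_visual - matrix_rank(text_to_visual)
    return int(scale * redundant)


# Rows: selected text tokens. Columns: visual tokens. The second row is twice the first.
text_to_visual = [
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.4, 0.6, 0.8],
    [0.4, 0.3, 0.2, 0.1],
]
n_prune = tokens_to_prune(text_to_visual)
print(n_prune)
```

Because the budget is recomputed from each layer's own attention block, layers whose attention is close to low-rank would prune more and layers with richer attention would prune less, which is the behavior the bullet above describes.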
Where Pith is reading between the lines
- The same text-guided scoring idea could be tested on other multimodal settings such as audio-language or video-only models.
- Token recycling may prove more important than simple pruning when scaling to deeper or wider VLMs.
- The method might reduce energy use enough to enable on-device inference for longer video sequences.
- Extending the recycling step to preserve cross-layer information could further limit accuracy loss.
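The recycling idea in these bullets can be sketched as replacing many pruned token embeddings with a few averaged ones. The grouping rule below (contiguous chunks of near-equal size) is a deliberately simple stand-in chosen for illustration. The source only says that pruned tokens are compressed into more compact representations, so the function name, the chunking rule, and the toy embeddings are assumptions.

```python
def recycle_tokens(pruned, n_slots):
    """Compress pruned token embeddings into at most n_slots averaged embeddings."""
    if not pruned or n_slots <= 0:
        return []
    n_slots = min(n_slots, len(pruned))
    # Split the pruned tokens into n_slots contiguous chunks of near-equal size.
    base, extra = divmod(len(pruned), n_slots)
    merged, start = [], 0
    for slot in range(n_slots):
        size = base + (1 if slot < extra else 0)
        chunk = pruned[start:start + size]
        dim = len(chunk[0])
        # Element-wise mean of the embeddings in this chunk.
        merged.append([sum(vec[d] for vec in chunk) / size for d in range(dim)])
        start += size
    return merged


# Five hypothetical 2-dimensional embeddings of pruned visual tokens.
pruned = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
compact = recycle_tokens(pruned, n_slots=2)
print(len(compact))
```

Averaging keeps a coarse summary of what was pruned but discards token-level detail, which is why a recycling step can limit accuracy loss without undoing a bad pruning decision.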
Load-bearing premise
Self-attention scores between selected text tokens and visual tokens reliably identify which visual tokens can be pruned or recycled without losing task-critical information.
What would settle it
Running the method on LLaVA or similar VLMs on standard image or video benchmarks and measuring accuracy retention below 97 percent of the unpruned baseline would falsify the central performance claim.
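The check described above reduces to a small calculation. The sketch below assumes that "accuracy retention" means the mean of per-benchmark ratios of pruned accuracy to baseline accuracy; the paper may aggregate differently. The benchmark names and scores are placeholders for illustration only and are not results from the paper.

```python
def accuracy_retention(baseline, pruned):
    """Mean of per-benchmark pruned/baseline accuracy ratios."""
    ratios = [pruned[name] / baseline[name] for name in baseline]
    return sum(ratios) / len(ratios)


# Placeholder numbers, not measurements.
baseline = {"bench_a": 80.0, "bench_b": 60.0, "bench_c": 50.0}
pruned = {"bench_a": 78.0, "bench_b": 57.0, "bench_c": 49.0}

retention = accuracy_retention(baseline, pruned)
claim_survives = retention >= 0.97
print(round(retention, 4), claim_survives)
```

With these placeholder scores the mean retention falls just under the 97 percent threshold, so the claim would count as falsified under this aggregation. The aggregation rule matters: a ratio of means and a mean of ratios can land on opposite sides of the threshold.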
Original abstract
In vision-language models (VLMs), visual tokens usually bear a significant amount of computational overhead despite sparsity of information in them when compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. Differently, we propose a text-guided training-free token optimization mechanism dubbed SparseVLM that eliminates the need of extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices and, then, prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA when equipped with SparseVLM achieves 54% reduction in FLOPs, 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SparseVLM, a training-free visual token sparsification method for VLMs. It selects relevant text tokens, scores visual tokens via self-attention matrices, applies a rank-based adaptive sparsification ratio per layer, and recycles pruned tokens into compact representations. Experiments report that equipping LLaVA with SparseVLM yields 54% FLOP reduction, 37% CUDA latency decrease, and 97% accuracy retention across image and video understanding tasks, with similar gains on other VLMs.
Significance. If the central efficiency claims hold under broader validation, the work is significant as a practical, parameter-free approach that avoids fine-tuning costs and extra parameters. The open-sourced code at the provided GitHub link is a clear strength for reproducibility. The method directly targets the known computational imbalance between visual and text tokens in VLMs.
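The imbalance between visual and text tokens can be illustrated with a rough per-layer FLOPs estimate. The sketch below uses a commonly cited approximation for one transformer layer, 4nd^2 + 2n^2d + 2ndm, where n is the token count, d the hidden size, and m the feed-forward intermediate size. Treat the formula as an assumption rather than the paper's accounting. The hidden and intermediate sizes match a LLaMA-7B-class decoder and 576 is the visual token count of LLaVA-1.5; the text length of 64 and the pruned budget of 192 visual tokens are hypothetical.

```python
def layer_flops(n_tokens, hidden, ffn):
    """Approximate FLOPs of one transformer layer (assumed formula)."""
    attention_proj = 4 * n_tokens * hidden * hidden   # Q, K, V, and output projections
    attention_mix = 2 * n_tokens * n_tokens * hidden  # score computation and weighted sum
    feed_forward = 2 * n_tokens * hidden * ffn        # two feed-forward projections
    return attention_proj + attention_mix + feed_forward


hidden, ffn = 4096, 11008
n_text = 64          # hypothetical prompt length
n_visual_full = 576  # LLaVA-1.5 visual tokens
n_visual_kept = 192  # hypothetical budget after pruning

full = layer_flops(n_text + n_visual_full, hidden, ffn)
sparse = layer_flops(n_text + n_visual_kept, hidden, ffn)
saving = 1.0 - sparse / full
print(f"per-layer FLOPs saving: {saving:.1%}")
```

Since visual tokens make up most of the sequence in this setting, removing them shrinks every term of the estimate, and the quadratic attention term shrinks fastest. Whole-model savings depend on which layers are pruned and by how much, so this per-layer figure is not comparable to the paper's reported 54% reduction.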
major comments (2)
- [Method] Method section (pruning and scoring procedure): The claim that self-attention scores between selected text tokens and visual tokens reliably identify prunable tokens is load-bearing for the 97% accuracy retention result. No analysis is provided of cases where task-critical visual information receives low initial attention (e.g., fine-grained details in multi-step reasoning), and the post-pruning recycling step cannot recover information already discarded by the initial decision.
- [Experiments] Experiments section (LLaVA results): The reported 54% FLOP / 37% latency / 97% accuracy figures lack ablations isolating the contribution of the rank-based adaptive ratio versus the recycling mechanism, and no edge-case evaluation is shown for tasks where attention may under-score necessary tokens. This leaves moderate uncertainty about robustness beyond the tested settings.
minor comments (2)
- [Abstract] Abstract and §4: The abstract states results on 'various VLMs' but provides concrete numbers only for LLaVA; the main text should explicitly tabulate per-model breakdowns to support the broader claim.
- [Method] Implementation details: While code is linked, the manuscript should include a brief pseudocode or parameter table for the layer-wise sparsification ratios and text-token selection heuristic to aid readers without immediate code access.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation. We address each major comment below with clarifications and proposed revisions to improve the manuscript.
Point-by-point responses
-
Referee: [Method] Method section (pruning and scoring procedure): The claim that self-attention scores between selected text tokens and visual tokens reliably identify prunable tokens is load-bearing for the 97% accuracy retention result. No analysis is provided of cases where task-critical visual information receives low initial attention (e.g., fine-grained details in multi-step reasoning), and the post-pruning recycling step cannot recover information already discarded by the initial decision.
Authors: We agree that the attention-based scoring mechanism is central to the approach and that the lack of explicit analysis on potential failure modes (such as low-attention critical tokens in fine-grained or multi-step tasks) is a limitation in the current manuscript. Our experiments across diverse image and video benchmarks demonstrate 97% average accuracy retention, indicating that such cases do not substantially degrade performance in the evaluated settings. However, we will revise the Method and Discussion sections to include a dedicated analysis of this aspect, with qualitative examples drawn from the tested tasks illustrating token importance scores and the role of recycling in aggregating pruned information. We will also explicitly note that recycling provides a compact representation of discarded tokens but cannot recover all details lost in the initial pruning decision. revision: yes
-
Referee: [Experiments] Experiments section (LLaVA results): The reported 54% FLOP / 37% latency / 97% accuracy figures lack ablations isolating the contribution of the rank-based adaptive ratio versus the recycling mechanism, and no edge-case evaluation is shown for tasks where attention may under-score necessary tokens. This leaves moderate uncertainty about robustness beyond the tested settings.
Authors: We appreciate the call for more granular ablations to isolate component contributions. The current results reflect the full method, but we will add new ablation studies in the revised Experiments section: one comparing the rank-based adaptive sparsification ratio against fixed-ratio variants, and another evaluating performance with and without the token recycling step. These will quantify their individual impacts on the reported FLOP, latency, and accuracy metrics. For edge-case robustness, our evaluation already covers a range of tasks including fine-grained visual question answering and multi-step video reasoning where attention under-scoring could occur; we will expand the discussion to explicitly address this, highlighting that significant accuracy drops would have been observed if under-scoring were prevalent. We are open to evaluating any additional specific edge-case tasks suggested by the referee. revision: yes
Axiom & Free-Parameter Ledger
free parameters (1)
- layer-wise sparsification ratios
axioms (1)
- domain assumption: Self-attention matrices between text and visual tokens accurately reflect the importance of visual tokens for downstream reasoning.
Forward citations
Cited by 22 Pith papers
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
-
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
-
DINO-VO: Learning Where to Focus for Enhanced State Estimation
DINO-VO achieves state-of-the-art monocular visual odometry accuracy and generalization by training a differentiable patch selector together with multi-task features and inverse-depth bundle adjustment.
-
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
-
Do Vision Language Models Need to Process Image Tokens?
Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.
-
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
-
Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.