arxiv: 2410.04417 · v4 · submitted 2024-10-06 · 💻 cs.CV

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang

Pith reviewed 2026-05-15 14:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual token sparsification · training-free pruning · self-attention guided selection · efficient vision-language models · token recycling · LLaVA optimization · image and video understanding

The pith

SparseVLM prunes visual tokens in VLMs using text attention scores without any training or added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SparseVLM as a training-free mechanism to reduce computational overhead in vision-language models by eliminating redundant visual tokens. It selects relevant text tokens and uses their self-attention scores with visual tokens to determine importance, then prunes accordingly while recycling discarded tokens into compact forms. A rank-based approach sets the pruning ratio adaptively for each layer. This yields large efficiency gains across VLMs on image and video tasks, for instance cutting LLaVA's FLOPs by 54 percent and latency by 37 percent while retaining 97 percent of original accuracy. A reader would care because the method removes the usual costs of learning pruning networks or fine-tuning, making deployment of these models more practical.

Core claim

SparseVLM is a text-guided training-free token optimization mechanism that selects relevant text tokens to rate the significance of visual tokens using self-attention matrices, prunes visual tokens using a proposed strategy to maximize sparsity while retaining information, introduces a rank-based strategy to adaptively determine the sparsification ratio for each layer, and includes a token recycling method that compresses pruned tokens into more compact representations.

What carries the argument

Text-guided self-attention scoring of visual tokens combined with per-layer rank-based adaptive pruning and token recycling.
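
To make the scoring idea concrete, here is a minimal Python sketch. It is an illustration, not the authors' implementation. It assumes a single square attention matrix (already averaged over heads), that the relevant text tokens were chosen upstream, and a fixed keep budget `k`. The function names `score_visual_tokens` and `keep_top_k` are hypothetical.

```python
from typing import List


def score_visual_tokens(
    attn: List[List[float]], text_idx: List[int], visual_idx: List[int]
) -> List[float]:
    """Average the attention that the selected text tokens (rows)
    pay to each visual token (column)."""
    if not text_idx:
        raise ValueError("need at least one selected text token")
    return [
        sum(attn[t][v] for t in text_idx) / len(text_idx) for v in visual_idx
    ]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return positions of the k highest-scoring visual tokens,
    restored to their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy sequence (made-up numbers): tokens 0-2 are visual, tokens 3-4 are text.
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0, 0.0],
    [0.5, 0.1, 0.2, 0.2, 0.0],
    [0.3, 0.1, 0.4, 0.1, 0.1],
]
scores = score_visual_tokens(attn, text_idx=[3, 4], visual_idx=[0, 1, 2])
print(keep_top_k(scores, k=2))  # [0, 2]
```

In this toy case the middle visual token receives the least attention from the text tokens, so it is the one dropped. A fixed `k` is used only to keep the sketch short; the summary above says the method sets the budget adaptively per layer.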

If this is right

Various VLMs achieve substantial reductions in FLOPs and inference latency on image and video understanding tasks.
LLaVA specifically sees 54 percent fewer FLOPs and 37 percent lower CUDA latency while keeping 97 percent accuracy.
The approach requires no extra parameters or fine-tuning on any training data.
Sparsification ratios adapt automatically per layer rather than using a fixed global value.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same text-guided scoring idea could be tested on other multimodal settings such as audio-language or video-only models.
Token recycling may prove more important than simple pruning when scaling to deeper or wider VLMs.
The method might reduce energy use enough to enable on-device inference for longer video sequences.
Extending the recycling step to preserve cross-layer information could further limit accuracy loss.
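
For readers unfamiliar with what "recycling" means here, the following Python sketch shows one simple way to compress pruned tokens into a few merged tokens. It is a stand-in, not the paper's procedure: it seeds groups with the highest-scoring pruned tokens, assigns every pruned token to its nearest seed, and averages each group. The name `recycle`, the seeding rule, and the averaging rule are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recycle(pruned: List[Vector], scores: List[float], k: int) -> List[Vector]:
    """Compress pruned token embeddings into at most k merged tokens."""
    if not pruned or k <= 0:
        return []
    k = min(k, len(pruned))
    # Seeds: the k pruned tokens with the highest importance scores.
    order = sorted(range(len(pruned)), key=lambda i: scores[i], reverse=True)
    seeds = [pruned[i] for i in order[:k]]
    # Assign every pruned token to its nearest seed.
    groups: List[List[Vector]] = [[] for _ in range(k)]
    for tok in pruned:
        nearest = min(range(k), key=lambda j: sq_dist(tok, seeds[j]))
        groups[nearest].append(tok)
    # Average each non-empty group into one compact token.
    merged: List[Vector] = []
    for group in groups:
        if group:
            dim = len(group[0])
            merged.append(
                [sum(v[d] for v in group) / len(group) for d in range(dim)]
            )
    return merged


# Made-up 2-D embeddings and scores for four pruned tokens.
pruned = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
scores = [0.9, 0.1, 0.8, 0.2]
print(recycle(pruned, scores, k=2))  # [[0.0, 1.0], [10.0, 1.0]]
```

The merged tokens keep a coarse summary of what was pruned, but the individual embeddings are gone. That matches the caveat raised later on this page: recycling aggregates discarded information and cannot restore detail lost at the pruning step.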

Load-bearing premise

Self-attention scores between selected text tokens and visual tokens reliably identify which visual tokens can be pruned or recycled without losing task-critical information.

What would settle it

Running the method on LLaVA or similar VLMs on standard image or video benchmarks and measuring accuracy retention below 97 percent of the unpruned baseline would falsify the central performance claim.
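
The check described above reduces to simple arithmetic, sketched here in Python. The sketch assumes the 97 percent figure is a mean of per-benchmark retention ratios, which is one reading of the claim. The benchmark names and accuracy values are made up, and `mean_retention` and `claim_survives` are hypothetical helper names.

```python
from typing import Dict, Tuple


def mean_retention(results: Dict[str, Tuple[float, float]]) -> float:
    """results maps benchmark name -> (baseline accuracy, pruned accuracy).
    Returns the mean of pruned / baseline across benchmarks."""
    if not results:
        raise ValueError("no benchmark results supplied")
    ratios = []
    for name, (baseline, pruned) in results.items():
        if baseline <= 0:
            raise ValueError(f"baseline accuracy for {name} must be positive")
        ratios.append(pruned / baseline)
    return sum(ratios) / len(ratios)


def claim_survives(
    results: Dict[str, Tuple[float, float]], threshold: float = 0.97
) -> bool:
    """True if mean retention meets the threshold, False if the claim is falsified."""
    return mean_retention(results) >= threshold


# Made-up numbers for two placeholder benchmarks.
holds = {"bench_a": (80.0, 78.0), "bench_b": (60.0, 59.0)}
fails = {"bench_a": (80.0, 72.0), "bench_b": (60.0, 57.0)}
print(claim_survives(holds), claim_survives(fails))  # True False
```

A mean can hide a large drop on a single benchmark, so a stricter variant would also report the minimum per-benchmark ratio.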

Original abstract

In vision-language models (VLMs), visual tokens usually bear a significant amount of computational overhead despite sparsity of information in them when compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. Differently, we propose a text-guided training-free token optimization mechanism dubbed SparseVLM that eliminates the need of extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices and, then, prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA when equipped with SparseVLM achieves 54% reduction in FLOPs, 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SparseVLM gives a workable training-free pruning method that cuts visual token overhead in VLMs by roughly half while holding most accuracy.

The core point here is that SparseVLM prunes visual tokens in a training-free way by scoring them against selected text tokens via self-attention, then applies a rank-based adaptive ratio per layer and recycles some pruned tokens into compact forms. This produces the reported 54% FLOP drop and 37% latency cut on LLaVA at 97% retained accuracy, plus similar gains on other image and video tasks. The combination of text-guided scoring, layer-wise adaptation, and recycling is the main novelty relative to earlier pruning approaches that usually require training or fixed ratios.

The paper does a decent job showing the method works across several VLMs without extra parameters, and releasing the code helps reproducibility. The numbers are concrete enough to be useful for people running inference on limited hardware.

The soft spot is the reliance on attention scores as a proxy for token importance. If a visual detail matters for downstream reasoning but starts with low text attention, the initial pruning decision could drop it before recycling has a chance to help. The abstract does not show extensive ablations on fine-grained or multi-step tasks where this might surface, so the robustness claim rests on the main benchmarks.

Overall this is aimed at engineers and researchers who need faster VLM inference without retraining. Readers working on deployment or efficiency papers would find it worth reading. It deserves peer review because the empirical gains are clear and the method is simple to test, even if some edge cases need more checking.

Referee Report

2 major / 2 minor

Summary. The paper proposes SparseVLM, a training-free visual token sparsification method for VLMs. It selects relevant text tokens, scores visual tokens via self-attention matrices, applies a rank-based adaptive sparsification ratio per layer, and recycles pruned tokens into compact representations. Experiments report that equipping LLaVA with SparseVLM yields 54% FLOP reduction, 37% CUDA latency decrease, and 97% accuracy retention across image and video understanding tasks, with similar gains on other VLMs.

Significance. If the central efficiency claims hold under broader validation, the work is significant as a practical, parameter-free approach that avoids fine-tuning costs and extra parameters. The open-sourced code at the provided GitHub link is a clear strength for reproducibility. The method directly targets the known computational imbalance between visual and text tokens in VLMs.

major comments (2)

[Method] Method section (pruning and scoring procedure): The claim that self-attention scores between selected text tokens and visual tokens reliably identify prunable tokens is load-bearing for the 97% accuracy retention result. No analysis is provided of cases where task-critical visual information receives low initial attention (e.g., fine-grained details in multi-step reasoning), and the post-pruning recycling step cannot recover information already discarded by the initial decision.
[Experiments] Experiments section (LLaVA results): The reported 54% FLOP / 37% latency / 97% accuracy figures lack ablations isolating the contribution of the rank-based adaptive ratio versus the recycling mechanism, and no edge-case evaluation is shown for tasks where attention may under-score necessary tokens. This leaves moderate uncertainty about robustness beyond the tested settings.

minor comments (2)

[Abstract] Abstract and §4: The abstract states results on 'various VLMs' but provides concrete numbers only for LLaVA; the main text should explicitly tabulate per-model breakdowns to support the broader claim.
[Method] Implementation details: While code is linked, the manuscript should include a brief pseudocode or parameter table for the layer-wise sparsification ratios and text-token selection heuristic to aid readers without immediate code access.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below with clarifications and proposed revisions to improve the manuscript.

Referee: [Method] Method section (pruning and scoring procedure): The claim that self-attention scores between selected text tokens and visual tokens reliably identify prunable tokens is load-bearing for the 97% accuracy retention result. No analysis is provided of cases where task-critical visual information receives low initial attention (e.g., fine-grained details in multi-step reasoning), and the post-pruning recycling step cannot recover information already discarded by the initial decision.

Authors: We agree that the attention-based scoring mechanism is central to the approach and that the lack of explicit analysis on potential failure modes (such as low-attention critical tokens in fine-grained or multi-step tasks) is a limitation in the current manuscript. Our experiments across diverse image and video benchmarks demonstrate 97% average accuracy retention, indicating that such cases do not substantially degrade performance in the evaluated settings. However, we will revise the Method and Discussion sections to include a dedicated analysis of this aspect, with qualitative examples drawn from the tested tasks illustrating token importance scores and the role of recycling in aggregating pruned information. We will also explicitly note that recycling provides a compact representation of discarded tokens but cannot recover all details lost in the initial pruning decision.

revision: yes

Referee: [Experiments] Experiments section (LLaVA results): The reported 54% FLOP / 37% latency / 97% accuracy figures lack ablations isolating the contribution of the rank-based adaptive ratio versus the recycling mechanism, and no edge-case evaluation is shown for tasks where attention may under-score necessary tokens. This leaves moderate uncertainty about robustness beyond the tested settings.

Authors: We appreciate the call for more granular ablations to isolate component contributions. The current results reflect the full method, but we will add new ablation studies in the revised Experiments section: one comparing the rank-based adaptive sparsification ratio against fixed-ratio variants, and another evaluating performance with and without the token recycling step. These will quantify their individual impacts on the reported FLOP, latency, and accuracy metrics. For edge-case robustness, our evaluation already covers a range of tasks including fine-grained visual question answering and multi-step video reasoning where attention under-scoring could occur; we will expand the discussion to explicitly address this, highlighting that significant accuracy drops would have been observed if under-scoring were prevalent. We are open to evaluating any additional specific edge-case tasks suggested by the referee.

revision: yes

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that attention scores capture token importance and that visual information is sufficiently sparse to allow aggressive pruning with minimal loss.

free parameters (1)

layer-wise sparsification ratios
Determined adaptively via the rank-based strategy; exact thresholds or scaling factors are not detailed in the abstract.
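
One plausible reading of a rank-based rule is that the rank of the text-to-visual attention block estimates how many visual tokens carry independent information, so the surplus is a candidate pruning budget for that layer. The Python sketch below illustrates that reading only. The scaling factor `scale`, the tolerance `tol`, and both function names are hypothetical, and the paper's actual rule may differ.

```python
from typing import List


def matrix_rank(rows: List[List[float]], tol: float = 1e-9) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]  # work on a copy
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Pick the row at or below `rank` with the largest entry in this column.
        pivot = max(
            range(rank, len(m)), key=lambda r: abs(m[r][col]), default=None
        )
        if pivot is None or abs(m[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == len(m):
            break
    return rank


def tokens_to_prune(text_to_visual: List[List[float]], scale: float = 1.0) -> int:
    """Assumed rule: prune scale * (number of visual tokens - rank)."""
    n_visual = len(text_to_visual[0])
    redundancy = n_visual - matrix_rank(text_to_visual)
    return max(0, int(scale * redundancy))


# Made-up attention block: 2 selected text tokens (rows) x 4 visual tokens (columns).
block = [
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
]
print(tokens_to_prune(block))             # 2
print(tokens_to_prune(block, scale=0.5))  # 1
```

Because the budget is recomputed from each layer's own attention block, layers with more redundant attention patterns would prune more under this rule, with no fixed global ratio.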

axioms (1)

domain assumption: Self-attention matrices between text and visual tokens accurately reflect the importance of visual tokens for downstream reasoning.
This assumption underpins the token significance rating and pruning decision.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
cs.AI 2026-05 unverdicted novelty 7.0

Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
cs.RO 2026-04 unverdicted novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-04 unverdicted novelty 7.0

DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 6.0

RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
cs.CV 2026-04 conditional novelty 6.0

Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
DINO-VO: Learning Where to Focus for Enhanced State Estimation
cs.CV 2026-04 unverdicted novelty 6.0

DINO-VO achieves state-of-the-art monocular visual odometry accuracy and generalization by training a differentiable patch selector together with multi-task features and inverse-depth bundle adjustment.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling
cs.CV 2026-03 unverdicted novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
cs.RO 2026-05 unverdicted novelty 5.0

AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV 2026-05 unverdicted novelty 5.0

Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
Do Vision Language Models Need to Process Image Tokens?
cs.CV 2026-04 unverdicted novelty 5.0

Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
cs.CV 2026-03 unverdicted novelty 5.0

AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 4.0

RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies
cs.LG 2026-03 unverdicted novelty 2.0

The paper surveys and taxonomizes inference optimization methods for large vision-language models across four categories while noting limitations and open problems.

Reference graph

Works this paper leans on

113 extracted references · 113 canonical work pages · cited by 20 Pith papers · 11 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 2022

work page 2022
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen- VL : A frontier large vision-language model with versatile abilities. arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Token merging: Your vit but faster

Bolya, D., Fu, C.-Y., Dai, X., Zhang, P., Feichtenhofer, C., and Hoffman, J. Token merging: Your vit but faster. In International Conference on Learning Representations, 2023

work page 2023
[4]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020

work page 2020
[5]

Cai, M., Yang, J., Gao, J., and Lee, Y. J. Matryoshka multimodal models. In International Conference on Learning Representations, 2025

work page 2025
[6]

Honeybee: Locality-enhanced projector for multimodal llm

Cha, J., Kang, W., Mun, J., and Roh, B. Honeybee: Locality-enhanced projector for multimodal llm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[7]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., and Chang, B. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, 2024 a

work page 2024
[8]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024 b

work page 2024
[9]

Instruct BLIP : Towards general-purpose vision-language models with instruction tuning

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instruct BLIP : Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 2023

work page 2023
[10]

Flash A ttention: Fast and memory-efficient exact attention with io-awareness

Dao, T., Fu, D., Ermon, S., Rudra, A., and R \'e , C. Flash A ttention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 2022

work page 2022
[11]

Glm: General language model pretraining with autoregressive blank infilling

Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., and Tang, J. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2022

work page 2022
[12]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al. MME : A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

work page 2017
[14]

Hudson, D. A. and Manning, C. D. GQA : A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

work page 2019
[15]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Jang, Y., Song, Y., Yu, Y., Kim, Y., and Kim, G. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

work page 2017
[16]

Videopoet: A large language model for zero-shot video generation

Kondratyuk, D., Yu, L., Gu, X., Lezama, J., Huang, J., Hornung, R., Adam, H., Akbari, H., Alon, Y., Birodkar, V., et al. Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning, 2024

work page 2024
[17]

Seed-bench: Benchmarking multimodal large language models

Li, B., Ge, Y., Ge, Y., Wang, G., Wang, R., Zhang, R., and Shan, Y. Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024 a

work page 2024
[18]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023 a

work page 2023
[19]

X., and Wen, J.-R

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., and Wen, J.-R. Evaluating object hallucination in large vision-language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2023 b

work page 2023
[20]

LLaMA-VID : An image is worth 2 tokens in large language models

Li, Y., Wang, C., and Jia, J. LLaMA-VID : An image is worth 2 tokens in large language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024 b

work page 2024
[22]

Video-llava: Learning united visual representation by alignment before projection

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., and Yuan, L. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024
[23]

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024 a

work page 2024
[24]

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in Neural Information Processing Systems, 2024 b

work page 2024
[25]

Mmbench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, 2024 c

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, 2024 c

work page 2024
[26]

A convnet for the 2020s

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A convnet for the 2020s. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022

work page 2022
[27]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 2022

work page 2022
[28]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Maaz, M., Rasheed, H., Khan, S., and Khan, F. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024
[29]

Vision: A computational investigation into the human representation and processing of visual information

Marr, D. Vision: A computational investigation into the human representation and processing of visual information. MIT press, 2010

work page 2010
[31]

Language models are unsupervised multitask learners

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

work page 2019
[32]

Clustering by fast search and find of density peaks

Rodriguez, A. Clustering by fast search and find of density peaks. Science, 2014

work page 2014
[34]

Towards VQA models that can read

Singh, A., Natarjan, V., Shah, M., Jiang, Y., Chen, X., Parikh, D., and Rohrbach, M. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

work page 2019
[35]

Stewart, G. W. On the early history of the singular value decomposition. SIAM review, 1993

work page 1993
[38]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 2017

work page 2017
[39]

Q., Wang, Q., Gao, Y., Xu, Q., Xu, T., Hu, Y., Chen, E., and Shou, M

Wu, S., Chen, J., Lin, K. Q., Wang, Q., Gao, Y., Xu, Q., Xu, T., Hu, Y., Chen, E., and Shou, M. Z. Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation. Advances in Neural Information Processing Systems, 2024

work page 2024
[40]

Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[41]

Video question answering via gradually refined attention over appearance and motion

Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., and Zhuang, Y. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the ACM international conference on Multimedia, 2017

work page 2017
[42]

DeCo : Decoupling token compression from semantic abstraction in multimodal large language models

Yao, L., Li, L., Ren, S., Wang, L., Liu, Y., Sun, X., and Hou, L. DeCo : Decoupling token compression from semantic abstraction in multimodal large language models. arXiv:2405.20985, 2024

work page arXiv 2024
[43]

VoCo-LLaMA : Towards vision compression with large language models

Ye, X., Gan, Y., Huang, X., Ge, Y., Shan, Y., and Tang, Y. VoCo-LLaMA : Towards vision compression with large language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[44]

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning, 2024
[45]

Yu, Z., Xu, D., Yu, J., Yu, T., Zhao, Z., Zhuang, Y., and Tao, D. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019
[46]

Zhang, Y., Huang, T., Fan, C.-K., Dong, H., Li, J., Wang, J., Cheng, K., Zhang, S., Guo, H., et al. Unveiling the tapestry of consistency in large vision-language models. Advances in Neural Information Processing Systems, 2024a
[47]

Zhang, Y., Huang, T., Liu, J., Jiang, T., Cheng, K., and Zhang, S. Freekd: Knowledge distillation via semantic frequency prompt. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024b
[48]

Zhu, B., Lin, B., Ning, M., Yan, Y., Cui, J., HongFa, W., Pang, Y., Jiang, W., Zhang, J., Li, Z., et al. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In International Conference on Learning Representations, 2024a
[49]

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In International Conference on Learning Representations, 2024b
[50]

Bengio, Y. and LeCun, Y. Scaling Learning Algorithms Towards
[51]

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets
[52]

Deep learning. 2016
[53]

A diagram is worth a dozen images. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, 2016
[54]

MM-Vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning
[55]

SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
[56]

Attention is all you need. Advances in Neural Information Processing Systems
[57]

Language models are unsupervised multitask learners. OpenAI blog
[58]

GPT-4 technical report. arXiv:2303.08774
[59]

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. Deepseek
[60]

LLaMA: Open and efficient foundation language models. arXiv:2302.13971
[61]

Instruction tuning with GPT-4. arXiv:2304.03277
[62]

Language models are few-shot learners. Advances in Neural Information Processing Systems
[63]

A convnet for the 2020s. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
[64]

Visual instruction tuning. Advances in Neural Information Processing Systems
[65]

Mini-gemini: Mining the potential of multi-modality vision language models. arXiv:2403.18814
[66]

GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
[67]

Minigpt-4: Enhancing vision-language understanding with advanced large language models. In International Conference on Learning Representations
[68]

Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing
[69]

Qwen technical report. arXiv:2309.16609
[70]

Gemini: A family of highly capable multimodal models. arXiv:2312.11805
[71]

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
[72]

Improved baselines with visual instruction tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
[73]

ShareGPT4V: Improving large multi-modal models with better captions. arXiv:2311.12793
[74]

CogVLM: Visual expert for pretrained language models. arXiv:2311.03079
[75]

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-
[76]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
[77]

mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv:2304.14178
[78]

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv:2404.16821
[79]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning
[80]

Learning transferable visual models from natural language supervision. In International Conference on Machine Learning
[81]

Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning
[82]

Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation. Advances in Neural Information Processing Systems
[83]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv:2403.15388
[84]

Li, Y., Wang, C., and Jia, J.
[85]

Yao, L., Li, L., Ren, S., Wang, L., Liu, Y., Sun, X., and Hou, L.
