Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu · 2024 · arXiv 2410.06169

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

cs.CV · 2025-03-18 · unverdicted · novelty 5.0

TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% of performance after pruning 88.9% of visual tokens and delivering a 154% speedup on long responses for LLaVA-1.5-7B.

citing papers explorer

Showing 4 of 4 citing papers.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV · 2026-04-03 · conditional · none · ref 13
Counting to Four is still a Chore for VLMs cs.CV · 2026-04-11 · unverdicted · none · ref 16
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling cs.CV · 2026-04-18 · unverdicted · none · ref 60
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models cs.CV · 2025-03-18 · unverdicted · none · ref 67
