LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan · 2024 · arXiv 2406.18139

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

cs.CV · 2026-04-28 · conditional · novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training convergence.

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

cs.CV · 2026-04-20 · conditional · novelty 6.0

Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.

Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

Structure-aware sequential compression of the KV cache during prefill reduces peak memory usage in MLLMs while keeping generative performance nearly intact.

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster decoding.

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens, and supports streaming via a detachable KV cache.

citing papers explorer

Showing 6 of 6 citing papers.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
cs.CV · 2026-04-28 · conditional · none · ref 44

The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
cs.LG · 2026-05-07 · unverdicted · none · ref 27

Geometry-Guided 3D Visual Token Pruning for Video-Language Models
cs.CV · 2026-04-20 · conditional · none · ref 27

Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines
cs.CV · 2026-04-17 · unverdicted · none · ref 3

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
cs.LG · 2026-04-14 · unverdicted · none · ref 51

POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV · 2026-04-13 · unverdicted · none · ref 83
