CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Bo Yu; Cathy Liu; Chenxi Liu; Fengze Yang; Xuewen Luo

arxiv: 2605.28115 · v1 · pith:QRZBX2FRnew · submitted 2026-05-27 · 💻 cs.AI

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Fengze Yang , Bo Yu , Xuewen Luo , Cathy Liu , Chenxi Liu This is my paper

Pith reviewed 2026-06-29 12:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords vision-language modelstoken reductionKV-cache efficiencyinference optimizationmultimodal reasoningsequence compactnessvisual token compression

0 comments

The pith

CIVIC keeps visual sequences compact through every stage of a VLM to shrink KV-cache memory to one-third and lower inference latency without accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a single consistent compact pathway for visual tokens, enforced from the vision encoder onward, can convert token count reductions into actual lower memory use and faster wall-clock inference. Existing pruning methods add structural costs after the fact that cancel out the theoretical savings. CIVIC achieves the reductions by holding the sequence short across encoder, projection, prefill, and KV-cache, supported by text-aligned distillation and an adaptive retention rule. Readers would care because high-resolution images currently make these models expensive to run, so genuine hardware-level savings could expand where they can be deployed.

Core claim

CIVIC is a path-consistent compact visual inference framework that maintains compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache. By avoiding non-contiguous memory access and localized unmerging overheads, and by using text-aligned KL distillation together with an adaptive spatial retention floor, the method shrinks KV-cache memory to approximately one-third of the baseline, reduces end-to-end inference latency, and preserves accuracy on multimodal reasoning and visual grounding benchmarks when tested on the Qwen3-VL architecture.

What carries the argument

The path-consistent compact visual inference framework that enforces contiguous sequence reduction from encoder through KV-cache.

If this is right

KV-cache memory usage falls to roughly one-third of the baseline size.
End-to-end inference latency decreases because pruning overhead is eliminated.
Accuracy on multimodal reasoning and visual grounding benchmarks stays equivalent to the full-sequence model.
The same compact pathway works inside the Qwen3-VL architecture when paired with text-aligned KL distillation and an adaptive spatial retention floor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same end-to-end compactness principle could be tested on video or audio sequence models that face analogous memory bottlenecks.
Hardware measurements on additional accelerator types would show whether the reported latency gains hold beyond the tested setup.
Longer visual contexts might benefit disproportionately if the retention floor can be made to scale with sequence length.
Token handling in multimodal systems could shift from post-processing fixes to an integrated design choice from the start.

Load-bearing premise

That a contiguous compact sequence can be kept across all stages without losing the fine-grained spatial and semantic details required for accurate downstream tasks.

What would settle it

A side-by-side run on a visual grounding benchmark in which CIVIC shows measurably lower accuracy than the baseline while its recorded KV-cache size and latency fail to scale with the reported sequence reduction.

Figures

Figures reproduced from arXiv: 2605.28115 by Bo Yu, Cathy Liu, Chenxi Liu, Fengze Yang, Xuewen Luo.

**Figure 1.** Figure 1: Comparison between theoretical compression and practical inference behavior across recent efficient VLM [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of the proposed CIVIC pipeline. Existing methods typically apply post-hoc runtime compression [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Relative benchmark performance of efficient VLM methods normalized by the dense baseline on MMMU, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Relative KV-cache memory utilization across token reduction configurations. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Granular wall-clock execution latency (ms) decomposed across pipeline stages. The path-consistent framework [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation analysis of CIVIC’s core architectural axes. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CIVIC pushes end-to-end contiguous token compactness in VLMs to turn reductions into actual KV-cache and latency wins on Qwen3-VL, but the abstract leaves the supporting numbers and controls thin.

read the letter

CIVIC's main claim is that keeping a single compact visual sequence all the way from encoder through projection, prefill, and KV-cache delivers measurable hardware savings—roughly one-third the cache memory and lower end-to-end latency—while text-aligned KL distillation plus an adaptive spatial retention floor keeps accuracy intact on multimodal and grounding benchmarks.

What stands out as new is the insistence on path consistency across every stage rather than applying pruning after the model is already running. That framing directly targets the overheads that usually kill wall-clock gains from post-hoc methods.

The paper does a clear job naming the practical gap: theoretical FLOP cuts often fail to produce speed or memory relief because of scattered memory access and unmerging costs. The proposed fix is simple enough that an engineer could test the core idea without much extra machinery.

The weak spots sit in the evidence. No accuracy deltas, no baseline tables, no ablation on the distillation or the retention floor appear in the abstract, so it is impossible to judge whether the claimed preservation of fine-grained localization actually holds or whether the floor is carrying most of the load. The circularity risk in the distillation step also cannot be checked without the equations and training details. If those sections in the full paper are thin or rely on post-selection, the central efficiency story weakens.

This is for people who ship high-resolution VLMs on constrained hardware and need concrete memory and latency numbers more than another theoretical pruning trick. A reader already working on Qwen-style models or KV-cache optimizations would get the most out of trying the framework.

I would send it to peer review. The problem is real, the proposed mechanism is coherent, and a referee can verify the experiments. If the numbers and ablations check out, the work is worth having in the literature.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CIVIC, a path-consistent compact visual inference framework for vision-language models. It maintains contiguous sequence reductions across the vision encoder, projection layer, LLM prefill, and KV-cache to convert theoretical token pruning into actual wall-clock hardware gains. On the Qwen3-VL architecture, the method is reported to reduce KV-cache memory to roughly one-third of baseline while lowering end-to-end inference latency, using text-aligned KL distillation and an adaptive spatial retention floor to preserve accuracy on multimodal reasoning and visual grounding benchmarks.

Significance. If the end-to-end compactness claim holds with the reported hardware metrics, the work would address a recognized practical limitation in current VLM token-reduction literature, where post-hoc pruning often fails to deliver proportional latency or memory savings due to structural overhead. The emphasis on contiguous pathways and hardware-measurable outcomes could inform more deployable efficiency techniques for high-resolution multimodal models.

major comments (2)

[Abstract] The abstract asserts that CIVIC 'successfully translates sequence reductions into genuine physical hardware efficiency' with a 3x KV-cache reduction and no accuracy loss, yet no quantitative tables, baseline comparisons, error bars, or ablation results are referenced. Without these, the central hardware-efficiency claim cannot be evaluated for robustness against post-hoc implementation choices.
[Abstract] The description of the 'adaptive spatial retention floor' and 'text-aligned KL distillation' is given at a high level only. The manuscript should specify the exact formulation (e.g., the retention threshold schedule or the KL target distribution) and demonstrate that these components are not fitted in a manner that circularly depends on the evaluation benchmarks.

minor comments (1)

[Abstract] The abstract uses the phrase 'approximately one-third' for KV-cache reduction; providing the precise measured ratio and the input resolution at which it was obtained would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our manuscript. We address each major comment below and outline the revisions we will make.

read point-by-point responses

Referee: [Abstract] The abstract asserts that CIVIC 'successfully translates sequence reductions into genuine physical hardware efficiency' with a 3x KV-cache reduction and no accuracy loss, yet no quantitative tables, baseline comparisons, error bars, or ablation results are referenced. Without these, the central hardware-efficiency claim cannot be evaluated for robustness against post-hoc implementation choices.

Authors: We agree that the abstract would benefit from explicit pointers to the supporting evidence. In the revised manuscript we will update the abstract to reference Table 2 (KV-cache memory and reduction ratios), Table 3 (end-to-end latency on Qwen3-VL), Figure 5 (accuracy on multimodal reasoning and grounding benchmarks with error bars), and the ablation studies in Section 5. These tables already contain baseline comparisons against token-pruning and KV-cache compression methods as well as controls for implementation overhead. revision: yes
Referee: [Abstract] The description of the 'adaptive spatial retention floor' and 'text-aligned KL distillation' is given at a high level only. The manuscript should specify the exact formulation (e.g., the retention threshold schedule or the KL target distribution) and demonstrate that these components are not fitted in a manner that circularly depends on the evaluation benchmarks.

Authors: Section 3.2 defines text-aligned KL distillation as the KL divergence between the student’s next-token distribution and the teacher distribution conditioned on text tokens only (Equation 4). Section 3.3 specifies the adaptive spatial retention floor as a per-layer threshold derived from cumulative attention scores with a floor of 0.3 and a linear schedule based on token importance (Algorithm 1). Hyperparameter selection was performed on a held-out validation split of the training data; the evaluation benchmarks were never used for tuning, as documented in Appendix B. We will add a one-sentence summary of these formulations to the abstract. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description present CIVIC as an engineering framework that maintains contiguous compact representations across encoder, projection, prefill and KV-cache stages, augmented by text-aligned KL distillation and an adaptive spatial retention floor. No equations, parameter-fitting procedures, self-citations, or uniqueness theorems are exhibited that would reduce any claimed prediction or result to its own inputs by construction. The efficiency claims rest on empirical translation of sequence reduction into measured hardware metrics on external benchmarks rather than on any self-referential derivation. The central mechanism is therefore self-contained and independent of the patterns that would trigger a positive circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the adaptive spatial retention floor is mentioned but its exact parameterization and fitting procedure are not described.

pith-pipeline@v0.9.1-grok · 5703 in / 996 out tokens · 45189 ms · 2026-06-29T12:10:11.360602+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 2 internal anchors

[1]

Visiontrim: Unified vision token compression for training-free mllm acceleration.arXiv preprint arXiv:2601.22674,

Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, and Jianke Zhu. Visiontrim: Unified vision token compression for training-free mllm acceleration.arXiv preprint arXiv:2601.22674,

work page arXiv
[2]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, 2025a. Lianyu Hu, Fanhua Shang, Wei Feng, and Liang Wan. Lightvlm: Acceleraing lar...

work page arXiv
[3]

Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration.arXiv preprint arXiv:2410.23317,

Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration.arXiv preprint arXiv:2410.23317,

work page arXiv
[4]

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, and Huan Li. Hybridkv: Hybrid kv cache compression for efficient multimodal large language model inference.arXiv preprint arXiv:2604.05887,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Dymu: Dynamic merging and virtual unmerging for efficient vision-language models.arXiv preprint arXiv:2504.17040, 2025a

Zhenhailong Wang et al. Dymu: Dynamic merging and virtual unmerging for efficient vision-language models.arXiv preprint arXiv:2504.17040, 2025a. Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, and Jiangmiao Pang. Vflowopt: A token pruning framework for lmms with visual information flow-guided optimization.arXiv preprint arXiv:2505.?????, 2025b. ...

work page arXiv
[7]

Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache.arXiv preprint arXiv:2601.04359,

Kunyang Li, Mubarak Shah, and Yuzhang Shang. Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache.arXiv preprint arXiv:2601.04359,

work page arXiv
[8]

Internvl-x: Advancing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307,

Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, and Huo Cao. Internvl-x: Advancing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307,

work page arXiv
[9]

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and 1 others

Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. Llava-mini: Efficient image and video large multimodal models with one vision token.arXiv preprint arXiv:2501.03895,

work page arXiv
[10]

Leo-mini: An efficient multimodal large language model using conditional token reduction and mixture of multi-modal experts.arXiv preprint arXiv:2504.04653, 2025b

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, and Krzysztof Czarnecki. Leo-mini: An efficient multimodal large language model using conditional token reduction and mixture of multi-modal experts.arXiv preprint arXiv:2504.04653, 2025b. Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, and Xu Yang. Flash: Latent-aware semi-autoregressive specul...

work page arXiv

[1] [1]

Visiontrim: Unified vision token compression for training-free mllm acceleration.arXiv preprint arXiv:2601.22674,

Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, and Jianke Zhu. Visiontrim: Unified vision token compression for training-free mllm acceleration.arXiv preprint arXiv:2601.22674,

work page arXiv

[2] [2]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19792–19802, 2025a. Lianyu Hu, Fanhua Shang, Wei Feng, and Liang Wan. Lightvlm: Acceleraing lar...

work page arXiv

[3] [3]

Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration.arXiv preprint arXiv:2410.23317,

Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration.arXiv preprint arXiv:2410.23317,

work page arXiv

[4] [4]

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, and Huan Li. Hybridkv: Hybrid kv cache compression for efficient multimodal large language model inference.arXiv preprint arXiv:2604.05887,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Dymu: Dynamic merging and virtual unmerging for efficient vision-language models.arXiv preprint arXiv:2504.17040, 2025a

Zhenhailong Wang et al. Dymu: Dynamic merging and virtual unmerging for efficient vision-language models.arXiv preprint arXiv:2504.17040, 2025a. Sihan Yang, Runsen Xu, Chenhang Cui, Tai Wang, Dahua Lin, and Jiangmiao Pang. Vflowopt: A token pruning framework for lmms with visual information flow-guided optimization.arXiv preprint arXiv:2505.?????, 2025b. ...

work page arXiv

[7] [7]

Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache.arXiv preprint arXiv:2601.04359,

Kunyang Li, Mubarak Shah, and Yuzhang Shang. Packcache: A training-free acceleration method for unified autoregressive video generation via compact kv-cache.arXiv preprint arXiv:2601.04359,

work page arXiv

[8] [8]

Internvl-x: Advancing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307,

Dongchen Lu, Yuyao Sun, Zilu Zhang, Leping Huang, Jianliang Zeng, Mao Shu, and Huo Cao. Internvl-x: Advancing and accelerating internvl series with efficient visual token compression.arXiv preprint arXiv:2503.21307,

work page arXiv

[9] [9]

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and 1 others

Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. Llava-mini: Efficient image and video large multimodal models with one vision token.arXiv preprint arXiv:2501.03895,

work page arXiv

[10] [10]

Leo-mini: An efficient multimodal large language model using conditional token reduction and mixture of multi-modal experts.arXiv preprint arXiv:2504.04653, 2025b

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, and Krzysztof Czarnecki. Leo-mini: An efficient multimodal large language model using conditional token reduction and mixture of multi-modal experts.arXiv preprint arXiv:2504.04653, 2025b. Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, and Xu Yang. Flash: Latent-aware semi-autoregressive specul...

work page arXiv