TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Congyang Ou; Dawei Yan; Haokui Zhang; Peng Wang; Qingsen Yan; Rong Xiao; Ying Li; Yu Zhang

arxiv: 2602.04657 · v3 · pith:BLQDQJEInew · submitted 2026-02-04 · 💻 cs.CV

Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

Pith reviewed 2026-05-16 07:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · token reduction · gradient saliency · inference efficiency · LLaVA · non-maximum suppression · training-free compression

The pith

TRIO reduces visual tokens in vision-language models to 11 percent while retaining 97 percent of performance by keeping the tokens that matter most for leaving the final output unchanged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TRIO to accelerate vision-language models by compressing visual tokens according to their contribution to preserving the model's output rather than relying on token similarity. Vision tokens are reordered using token-level gradient saliency from a layer-local proxy loss that approximates the constraint from the current layer to the final result, then the top tokens are kept via non-maximum suppression. The method requires no training, works with or without an encoder compressor, and is compatible with FlashAttention. On LLaVA-Next-7B it keeps only 11.1 percent of tokens yet maintains 97.2 percent of original accuracy together with substantial speed and memory gains. A sympathetic reader cares because the approach offers a practical, plug-in route to efficient deployment of large multimodal models.

Core claim

TRIO transforms visual token compression into the problem of preserving output result invariance and selects tokens primarily by their importance to this goal: vision tokens are reordered with the guidance of token-level gradient saliency generated by a designed layer-local proxy loss, a coarse constraint from the current layer to the final result, after which the most valuable vision tokens are retained following the non-maximum suppression principle.
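The selection step can be sketched as follows, given saliency scores from any source. This is a minimal illustration, not the authors' implementation: the summary does not say how suppression is defined, so the cosine-similarity test, the `sim_threshold` parameter, the backfill rule, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nms_select(features, saliency, keep, sim_threshold=0.9):
    """Greedy NMS-style pick (assumed rule): visit tokens by descending saliency
    and skip any token too similar to one already kept."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) == keep:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in kept):
            kept.append(i)
    # Assumed fallback: if suppression left too few tokens, backfill by saliency.
    for i in order:
        if len(kept) >= keep:
            break
        if i not in kept:
            kept.append(i)
    return sorted(kept)  # restore the original token order


# Hypothetical 2-d token features; token 1 is a near-duplicate of token 0.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.5, 0.5]]
saliency = [0.9, 0.8, 0.3, 0.5]
selected = nms_select(features, saliency, keep=2)
```

In this toy input, token 1 has the second-highest saliency but is suppressed as a near-duplicate of token 0, so `selected` is `[0, 3]`. A plain top-2 by saliency would have kept the redundant pair `[0, 1]` instead.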

What carries the argument

The layer-local proxy loss that produces token-level gradient saliency as a coarse constraint from the current layer to the final output.
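To make "token-level gradient saliency" concrete, here is a toy sketch. It is not the paper's proxy loss, whose form this summary does not give. The assumed stand-in is a squared error between the gate-weighted mean of the tokens and a hypothetical `target` vector, and saliency is the magnitude of the loss gradient with respect to a per-token gate at its "on" value of 1. All names are illustrative.

```python
def proxy_loss(tokens, gates, target):
    """Toy proxy (assumption): squared error between the gate-weighted
    mean token and a target vector."""
    n, d = len(tokens), len(target)
    pooled = [sum(g * t[k] for g, t in zip(gates, tokens)) / n for k in range(d)]
    return 0.5 * sum((p - y) ** 2 for p, y in zip(pooled, target))


def gate_saliency(tokens, target):
    """|dL/dg_i| at g = 1 for every token, from the closed-form gradient
    of the toy loss: (pooled - target) . token_i / n."""
    n, d = len(tokens), len(target)
    pooled = [sum(t[k] for t in tokens) / n for k in range(d)]
    residual = [p - y for p, y in zip(pooled, target)]
    return [abs(sum(r * x for r, x in zip(residual, t)) / n) for t in tokens]


# Hypothetical hidden states for three tokens and a hypothetical target.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
target = [1.0, 0.0]
scores = gate_saliency(tokens, target)
```

The gradient with respect to a gate is a first-order estimate of how much the loss moves when that token is switched off, which is why a larger magnitude is read as "more important to keep". The all-zero token scores 0 here, since gating it changes nothing.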

If this is right

TRIO can be deployed independently as an encoder-free method or combined with encoder-side compressors such as VisionZip.
The approach is training-free and directly compatible with FlashAttention.
On LLaVA-Next-7B it yields 2.75 times prefill speedup, 2.14 times inference speedup, 6.22 times lower FLOPs, and 6.05 times reduced KV cache.
Retaining only 11.1 percent of visual tokens still preserves 97.2 percent of original performance.
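A quick arithmetic sketch puts these figures side by side. The 2880-token visual budget and the 100-token prompt below are illustrative assumptions, not numbers stated in this summary; only the 11.1 percent retention rate comes from the text.

```python
def kept_tokens(total_visual_tokens, retention=0.111):
    """Number of visual tokens left after pruning, rounded to the nearest integer."""
    return round(total_visual_tokens * retention)


def visual_only_reduction(retention=0.111):
    """Best-case shrink factor if visual tokens were the only cost."""
    return 1.0 / retention


def sequence_reduction(visual, text, retention=0.111):
    """Shrink factor of the full sequence when text tokens are never pruned."""
    return (visual + text) / (retention * visual + text)


# Assumed sizes, for illustration only.
kept = kept_tokens(2880)
best_case = visual_only_reduction()
with_text = sequence_reduction(visual=2880, text=100)
```

Keeping 11.1 percent of tokens caps any per-token saving at roughly 9 times, so the reported 6.22 times FLOPs and 6.05 times KV-cache reductions sit below that ceiling. One plausible reason, not confirmed by this summary, is that text tokens and any layers that run before pruning still pay full cost, which is what `sequence_reduction` illustrates.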

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Output-invariance gradients appear to be a stronger token-importance signal than inter-token similarity heuristics used in prior work.
The same layer-local proxy idea could be tested on other multimodal architectures beyond the LLaVA family.
Adaptive choice of which layer supplies the proxy loss might further improve the accuracy-speed trade-off.
The method opens a path to real-time multimodal inference on edge devices with limited memory bandwidth.

Load-bearing premise

The layer-local proxy loss produces token-level gradient saliency that reliably identifies which tokens can be removed without materially changing the final model output.

What would settle it

Running TRIO on LLaVA-Next-7B at the stated 11.1 percent token retention rate and measuring whether accuracy falls substantially below 97.2 percent of the unpruned baseline.
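A sketch of how such a check might be scored is below. The summary does not say how the 97.2 percent figure is aggregated, so averaging per-benchmark ratios is an assumption. The benchmark names and scores are placeholders, not results, and the 1-point tolerance is a hypothetical stand-in for "substantially".

```python
def relative_performance(baseline, pruned):
    """Mean of per-benchmark pruned/baseline ratios, as a percentage (assumed aggregation)."""
    ratios = [pruned[name] / baseline[name] for name in baseline]
    return 100.0 * sum(ratios) / len(ratios)


def contradicts_claim(measured_pct, claimed_pct=97.2, tolerance_pct=1.0):
    """True if the measured figure falls more than the tolerance below the claim."""
    return measured_pct < claimed_pct - tolerance_pct


# Placeholder scores for two hypothetical benchmarks; not real measurements.
baseline_scores = {"bench_a": 80.0, "bench_b": 60.0}
pruned_scores = {"bench_a": 78.0, "bench_b": 57.0}
measured = relative_performance(baseline_scores, pruned_scores)
verdict = contradicts_claim(measured)
```

With these placeholder numbers the ratios are 0.975 and 0.95, giving 96.25 percent, which falls inside the assumed tolerance and so would not count against the claim.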

Original abstract

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead. Our code is available at https://github.com/ocy1/TRIO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TRIO reframes token pruning around output invariance via a layer-local proxy loss and gets strong efficiency numbers on one model, but the evaluation scope is narrow.

TRIO shifts the focus in visual token reduction for vision-language models from similarity-based heuristics to directly targeting output invariance through an inference-objective lens. The authors design a layer-local proxy loss to generate gradient saliency scores for each token, reorder the tokens accordingly, and apply non-maximum suppression to pick the most important ones. This is all training-free. On LLaVA-Next-7B, the results show retaining just 11.1% of visual tokens while keeping 97.2% of original performance. That comes with a 2.75 times prefill speedup, 2.14 times overall inference speedup, 6.22 times lower FLOPs, and a 6.05 times smaller KV cache. The method works standalone or combined with encoder compression like VisionZip, and it's set up to play nice with FlashAttention.

The novelty lies in using this proxy loss for saliency instead of cross-token similarities, which the abstract positions as addressing limitations in compression and deployment. Releasing the code is a plus for anyone wanting to test it.

Where the paper is weaker is the scope of the evaluation. The abstract highlights results on one model without spelling out multiple benchmarks, controls, or specific ablations on how the proxy loss was designed. The concern that the layer-local constraint might miss global token importance due to later-layer mixing is worth checking against the full experiments. If the proxy really approximates the end-to-end effect well, that's fine, but it needs evidence.

This kind of work is for folks building or deploying VLMs who care about cutting down compute and memory without retraining. A reader would get practical ideas and a starting point from the code and numbers. It has enough substance to go to peer review rather than a desk reject, so an editor should send it out for feedback on the method's robustness.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces TRIO, a training-free visual token compression method for VLMs that reorders tokens according to saliency scores computed from gradients of a designed layer-local proxy loss (a coarse constraint from the current layer onward) and then applies non-maximum suppression to retain the most important tokens. The central claim is that this inference-objective approach preserves output invariance, demonstrated on LLaVA-Next-7B by retaining only 11.1% of visual tokens while achieving 97.2% of original performance together with 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV-cache overhead; the method is also stated to be compatible with FlashAttention and usable either encoder-free or in combination with encoder compression techniques such as VisionZip.

Significance. If the proxy-loss saliency reliably identifies tokens whose removal leaves the final output distribution essentially unchanged, TRIO would supply a practical, training-free compression technique that directly targets inference objectives rather than relying on heuristic similarity measures, and its compatibility with existing pipelines could accelerate deployment of large VLMs.

major comments (3)

[Abstract] The headline result (11.1% tokens retained at 97.2% performance on LLaVA-Next-7B) is presented without any description of the concrete benchmarks, number of evaluation runs, or experimental controls used to measure “original performance,” rendering it impossible to judge whether the reported invariance holds under standard VLM evaluation protocols.
[Method (proxy loss)] The claim that gradients from the single-layer proxy loss produce token saliency scores such that every token outside the top-ranked, post-NMS set can be dropped while leaving the final answer distribution unchanged is load-bearing, yet the manuscript supplies neither an ablation of the proxy-loss design nor a comparison against full end-to-end gradients, leaving the weakest assumption (that local gradients suffice for global output sensitivity) unexamined.
[Experiments] No quantitative evidence is given that the layer-local constraint remains accurate when later attention blocks introduce strong cross-token mixing or when tasks require fine-grained visual details that become salient only after multiple layers, which directly risks the central invariance claim.

minor comments (1)

[Abstract] The statement that TRIO is “compatible with FlashAttention” is not accompanied by any implementation detail on how the reordering and NMS steps interact with the attention kernel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Referee: [Abstract] The headline result (11.1% tokens retained at 97.2% performance on LLaVA-Next-7B) is presented without any description of the concrete benchmarks, number of evaluation runs, or experimental controls used to measure “original performance,” rendering it impossible to judge whether the reported invariance holds under standard VLM evaluation protocols.

Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will expand it to name the concrete benchmarks (the standard LLaVA-Next evaluation suite: VQAv2, GQA, TextVQA, POPE, MME, etc.), state that “original performance” is the full-token baseline measured under identical decoding settings, and note that reported numbers are averages over three independent runs with standard deviation. These additions will allow readers to assess the invariance claim against established VLM protocols.
revision: yes

Referee: [Method (proxy loss)] The claim that gradients from the single-layer proxy loss produce token saliency scores such that every token outside the top-ranked, post-NMS set can be dropped while leaving the final answer distribution unchanged is load-bearing, yet the manuscript supplies neither an ablation of the proxy-loss design nor a comparison against full end-to-end gradients, leaving the weakest assumption (that local gradients suffice for global output sensitivity) unexamined.

Authors: The layer-local proxy loss is deliberately formulated as a lightweight, inference-time approximation that avoids the prohibitive cost of full back-propagation. While the original submission did not contain a dedicated ablation, we will add one in the revision that (i) compares the chosen proxy formulation against two alternative local losses and (ii) reports token-ranking agreement and final-output KL divergence versus full end-to-end gradients on a held-out subset of 200 samples. This will directly test the local-to-global sensitivity assumption.
revision: yes

Referee: [Experiments] No quantitative evidence is given that the layer-local constraint remains accurate when later attention blocks introduce strong cross-token mixing or when tasks require fine-grained visual details that become salient only after multiple layers, which directly risks the central invariance claim.

Authors: We acknowledge the need for explicit validation of the approximation under deeper mixing and fine-grained tasks. The revised experiments section will include (i) layer-wise output-distribution divergence curves when TRIO is applied at different depths and (ii) results on fine-grained benchmarks (e.g., detailed visual grounding and high-resolution captioning subsets). These quantitative measurements will either corroborate the layer-local design or highlight its limitations, which we will discuss transparently.
revision: yes
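The two diagnostics named in the second response, token-ranking agreement and final-output KL divergence, can be sketched as follows. This is one reasonable reading rather than the authors' protocol: top-k set overlap is assumed as the agreement measure, and all function names and inputs are illustrative.

```python
import math


def top_k_indices(scores, k):
    """Indices of the k highest scores, as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def topk_overlap(scores_a, scores_b, k):
    """Fraction of the top-k tokens that two saliency rankings share."""
    return len(top_k_indices(scores_a, k) & top_k_indices(scores_b, k)) / k


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical saliency scores from a local proxy and from full gradients.
local_scores = [0.9, 0.1, 0.5, 0.3]
full_scores = [0.8, 0.2, 0.1, 0.6]
agreement = topk_overlap(local_scores, full_scores, k=2)

# Hypothetical next-token distributions with and without pruning.
divergence = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

An overlap near 1 together with a KL divergence near 0 would support the local-to-global assumption. In this toy input the rankings share one of their top two tokens, so `agreement` is 0.5.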

Circularity Check

0 steps flagged

No significant circularity; heuristic proxy-gradient method is self-contained.

The paper defines TRIO as a training-free procedure that reorders visual tokens using saliency scores from gradients of a hand-designed layer-local proxy loss, then applies NMS. No equations reduce any claimed prediction or performance metric back to fitted parameters by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The reported speedups and retention ratios are empirical measurements on LLaVA-Next-7B, not tautological outputs of the input definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the assumption that a coarse layer-local proxy loss can stand in for full inference-objective importance; no free parameters or new entities are introduced in the abstract description.

axioms (1)

Domain assumption: layer-local proxy loss gradients provide a sufficient signal for preserving final output invariance when selecting tokens.
This is the central mechanism that allows training-free selection without full back-propagation.
