Token Merging: Your ViT But Faster
Pith reviewed 2026-05-12 20:46 UTC · model grok-4.3
The pith
Merging similar tokens lets off-the-shelf Vision Transformers run twice as fast with almost no accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Token Merging gradually combines similar tokens in a transformer using a general, lightweight matching algorithm that is as fast as pruning while being more accurate. This yields 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x on video with only a 0.2-0.3% accuracy drop, and the method merges object parts into single tokens even across multiple frames of video.
What carries the argument
The Token Merging (ToMe) procedure, which applies a bipartite soft matching algorithm to identify and merge the most similar tokens at each layer of the transformer.
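To make the mechanism concrete, here is a minimal PyTorch sketch of one bipartite-soft-matching step. This is not the authors' code: the paper matches on the attention keys K and uses size-weighted means, while this sketch matches on raw token features and averages unweighted, so all names are illustrative.

```python
import torch

def bipartite_soft_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r of the N tokens in x (shape (N, d)), returning (N - r, d)."""
    # 1. Alternately partition the tokens into two sets A and B.
    a, b = x[0::2], x[1::2]
    # 2. Cosine similarity between every A-token and every B-token.
    a_n = a / (a.norm(dim=-1, keepdim=True) + 1e-6)
    b_n = b / (b.norm(dim=-1, keepdim=True) + 1e-6)
    scores = a_n @ b_n.T                       # (|A|, |B|)
    # 3. Each A-token proposes one edge to its most similar B-token.
    best_val, best_dst = scores.max(dim=-1)    # (|A|,)
    # 4. Keep only the r highest-scoring edges; the rest of A survives.
    order = best_val.argsort(descending=True)
    merged, kept = order[:r], order[r:]
    # 5. Fold each merged A-token into its B-partner by averaging;
    #    index_add_ handles several A-tokens landing on one B-token.
    acc = b.clone()
    cnt = torch.ones(b.shape[0], 1)
    acc.index_add_(0, best_dst[merged], a[merged])
    cnt.index_add_(0, best_dst[merged], torch.ones(r, 1))
    return torch.cat([a[kept], acc / cnt], dim=0)
```

Because only the r best edges are removed and nothing else in the block changes, the step can be dropped into a pretrained model without retraining, which is what makes the off-the-shelf results possible.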
If this is right
- Off-the-shelf ToMe can double the throughput of ViT-L at 512 and ViT-H at 518 resolution on image tasks with 0.2-0.3% accuracy drop.
- 2.2x throughput increase for ViT-L on video tasks with comparable accuracy retention.
- ToMe applied during training improves training speed up to 2x for MAE fine-tuning on video.
- Training with ToMe yields 2x throughput for ViT-B on audio with only a 0.4% mAP drop.
- ToMe merges parts of objects into single tokens, observable even across video frames.
Where Pith is reading between the lines
- This approach implies that much of the token information in ViTs is redundant and can be reduced dynamically.
- Extensions could include adapting the merging for other transformer architectures beyond vision.
- Combining ToMe with hardware-specific optimizations might yield even greater efficiency gains.
- The qualitative observation of merging object parts suggests ToMe could aid in understanding what information transformers prioritize.
Load-bearing premise
Similar tokens identified by the matching algorithm can be merged without losing critical information for the downstream task.
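Two bookkeeping details in the paper make this premise plausible: merged tokens are combined as size-weighted means, and attention is made "proportional" by biasing each key's logit with the log of its token size, so a token standing in for s patches still counts s times in the softmax. A schematic rendering, with variable names of our choosing rather than the authors':

```python
import torch

def weighted_merge(xa, xb, sa, sb):
    """Size-weighted mean of two tokens; sizes add, so no mass is lost."""
    return (xa * sa + xb * sb) / (sa + sb), sa + sb

def proportional_attention(q, k, v, size):
    """Attention whose logits are offset by log(size) per key (size is a
    float tensor of per-token patch counts), so merged tokens keep the
    influence of every patch they absorbed."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5 + size.log()[None, :]
    return logits.softmax(dim=-1) @ v
```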
What would settle it
Applying ToMe to a standard ViT model on an image classification benchmark like ImageNet and observing an accuracy drop larger than 1% would falsify the negligible-loss claim.
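A sketch of that test, assuming the interface documented in the public reference implementation (tome.patch.timm and the per-layer merge count model.r are the repo's names, not verified here), a standard ImageNet validation DataLoader called loader, and an illustrative timm checkpoint rather than the paper's exact one:

```python
import timm, tome, torch

baseline = timm.create_model("vit_large_patch16_384", pretrained=True).eval()
patched = timm.create_model("vit_large_patch16_384", pretrained=True).eval()
tome.patch.timm(patched)   # wrap the blocks with token merging; no retraining
patched.r = 16             # merge 16 tokens per layer, the method's one knob

@torch.no_grad()
def top1(model, loader):
    hits = seen = 0
    for images, labels in loader:
        hits += (model(images).argmax(-1) == labels).sum().item()
        seen += labels.numel()
    return hits / seen

# The negligible-loss claim fails if top1(baseline, loader) - top1(patched,
# loader) exceeds roughly one percentage point at the published settings.
```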
original abstract
We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Token Merging (ToMe), a simple method to increase the throughput of existing Vision Transformer (ViT) models without retraining. ToMe progressively merges similar tokens inside transformer blocks via a lightweight bipartite matching algorithm based on cosine similarity. Off-the-shelf application yields approximately 2x throughput on large image models (ViT-L@512, ViT-H@518) and 2.2x on video (ViT-L) with 0.2-0.3% accuracy drop; when used during training it further reduces the accuracy penalty and enables 2x throughput on audio with 0.4% mAP drop. The method is shown to merge coherent object parts across frames, and the only free variable (merge ratio) is fixed per model size across ImageNet, Kinetics, and AudioSet.
Significance. If the empirical results hold, the work is significant for practical acceleration of large ViTs. It supplies a training-free, modality-agnostic technique that is competitive with state-of-the-art efficiency methods while preserving accuracy, with fully specified merge schedules, per-layer token counts, and wall-clock timings on identical hardware. The qualitative evidence that merges correspond to semantic parts rather than critical singletons strengthens the central claim that similar tokens can be safely combined.
major comments (2)
- [§4] §4 (Experiments): The reported 0.2-0.3% accuracy drops and throughput numbers lack error bars, standard deviations across multiple runs, or statistical significance tests. Without these, it is difficult to determine whether the small drops are robust or within measurement noise, especially given the reader's note on missing full experimental details and baselines.
- [§3.2] §3.2 (Matching algorithm): The claim that the matching routine is 'as fast as pruning' requires an explicit complexity or timing breakdown of the bipartite matching step relative to the rest of the forward pass; the current description does not quantify the overhead for the reported token counts.
minor comments (3)
- Figure captions for the qualitative token-merge visualizations should explicitly state the layer, merge ratio, and dataset used so readers can reproduce the observed object-part merges.
- The related-work section should cite the most recent token-pruning and distillation baselines that post-date the initial arXiv version to ensure the 'competitive with state-of-the-art' claim is up to date.
- Notation for the per-layer token reduction schedule (e.g., how many tokens are merged at each block) could be presented in a single summary table for easier reference across image, video, and audio experiments.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and recommendation for minor revision. The comments are constructive and we address each one below with specific plans for the revised manuscript.
point-by-point responses
Referee: [§4] §4 (Experiments): The reported 0.2-0.3% accuracy drops and throughput numbers lack error bars, standard deviations across multiple runs, or statistical significance tests. Without these, it is difficult to determine whether the small drops are robust or within measurement noise, especially given the reader's note on missing full experimental details and baselines.
Authors: We acknowledge the value of error bars for assessing small accuracy differences. Our reported results are from single runs, which is standard practice for large-scale ViT experiments on ImageNet, Kinetics, and AudioSet given the prohibitive cost of multiple independent trainings. However, the 0.2-0.4% drops are consistent across model sizes (ViT-B/L/H), input resolutions, and three modalities, providing indirect evidence of robustness beyond measurement noise. In the revision we will explicitly note that all numbers are single-run results, expand the experimental details section to address the reader's note on baselines, and add a discussion of cross-experiment consistency. We cannot retroactively add error bars without new compute, but the consistency across settings supports the claims. revision: partial
Referee: [§3.2] §3.2 (Matching algorithm): The claim that the matching routine is 'as fast as pruning' requires an explicit complexity or timing breakdown of the bipartite matching step relative to the rest of the forward pass; the current description does not quantify the overhead for the reported token counts.
Authors: We agree that an explicit breakdown improves clarity. The ToMe matching uses a greedy bipartite matching on cosine similarities with O(N log N) complexity via sorting (or O(N) with bucket sort approximations), which is asymptotically comparable to top-k pruning. For the token counts in our experiments (e.g., 196 down to ~50 tokens per layer), this step is dominated by the O(N^2) attention and adds negligible wall-clock time. In the revision we will add a dedicated paragraph in §3.2 with the complexity analysis, plus empirical timing tables in the appendix showing the matching overhead is <1% of total forward-pass time on the same hardware used for throughput measurements. revision: yes
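The arithmetic behind that response is easy to reproduce. Under a constant schedule, a ViT-B/16 at 224px starts from 196 patch tokens and sheds r per block, so the sort-based matching term grows like N log N per block against the quadratic attention term. A back-of-envelope sketch with our own numbers, not the authors' timing tables:

```python
import math

def schedule(n0=196, r=16, depth=12):
    """Per-block token counts plus the ratio of matching cost to attention
    cost under a constant-r merge schedule (unit-free, schematic)."""
    counts, attn, match = [], 0.0, 0.0
    n = n0
    for _ in range(depth):
        counts.append(n)
        attn += n * n                  # self-attention is O(N^2) per block
        match += n * math.log2(n)      # greedy sort-based matching, O(N log N)
        n -= r                         # r tokens merged inside each block
    return counts, match / attn

counts, ratio = schedule()
print(counts)          # [196, 180, 164, ..., 36, 20]
print(f"{ratio:.1%}")  # ~5% in this unit-free count; real attention carries
                       # an extra factor of d, so measured overhead is far lower
```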
Circularity Check
No significant circularity; purely algorithmic and empirical
full rationale
The paper presents Token Merging (ToMe) as an off-the-shelf algorithmic procedure that applies bipartite matching on cosine similarities to gradually merge tokens inside standard ViT blocks. No derivation chain reduces a claimed result to its own inputs: the matching routine is a standard, parameter-light algorithm whose correctness is not asserted via self-citation or via fitted parameters relabeled as predictions. Throughput and accuracy numbers come from direct wall-clock measurements on fixed hardware with a single, publicly stated merge ratio per model size; these measurements are falsifiable outside the paper and do not rely on any internal loop or ansatz carried over from the authors' prior work. The method is therefore self-contained as an engineering optimization whose central claims rest on transparent implementation details and reproducible benchmarks rather than on tautological relabeling of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Similar tokens identified by the matching algorithm can be merged without significant loss of task-relevant information.
Forward citations
Cited by 26 Pith papers
- LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
  LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
  Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
- Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
  Dynamic token selection and training only 1.6 million parameters instead of over 300 million reduces computation by 48-55% and improves accuracy over prior state-of-the-art on the NuScenes dataset.
- Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding
  DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...
- ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
  ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.
- Elastic Attention Cores for Scalable Vision Transformers
  VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
  LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
  K-Token Merging compresses LLM inputs by merging blocks of K token embeddings in latent space, achieving up to 75% length reduction with minimal performance drop on reasoning, classification, and code tasks.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
- Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
  DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
- Small Vision-Language Models are Smart Compressors for Long Video Understanding
  Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
- Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
  STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
  OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
- VLMaxxing through FrameMogging: Training-Free Anti-Recomputation for Video Vision-Language Models
  Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
- FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices
  Fed-FSTQ reduces uplink traffic by 46x and improves time-to-accuracy by 52% in federated LLM fine-tuning using Fisher-guided token quantization and selection.
- Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
  SEPatch3D accelerates ViT-based 3D object detectors up to 57% faster than StreamPETR via dynamic patch sizing and cross-granularity enhancement while keeping comparable accuracy on nuScenes and Argoverse 2.
- SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models
  SVD-Prune selects vision tokens via SVD leverage scores to keep performance high even when pruning to only 16-32 tokens.
- Do Vision Language Models Need to Process Image Tokens?
  Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.
- SAT: Selective Aggregation Transformer for Image Super-Resolution
  SAT introduces density and isolation-based token aggregation to enable efficient global attention in super-resolution transformers, claiming up to 0.22 dB PSNR gain and 27% FLOP reduction over PFT.
- Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
  Deep layers of speech language models show high token redundancy that can be compressed via training-free similarity pooling, reducing prefilling costs by 27% while preserving task performance.
- ZAYA1-VL-8B Technical Report
  ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
- CATP: Confidence-Aware Token Pruning for Camouflaged Object Detection
  CATP prunes low-confidence tokens in COD Transformers and uses dual-path compensation to cut computation while preserving segmentation accuracy on boundary regions.
Reference graph
Works this paper leans on
- [1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, and Judy Hoffman. Hydra Attention: Efficient attention with many heads. arXiv:2209.07484 [cs.CV].
- [2] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking Attention with Performers. arXiv:2009.14794 [cs.LG].
- [3] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG].
- [4] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics Human Action Video Dataset. arXiv:1705.06950 [cs.CV].
- [5] Gyuwan Kim and Kyunghyun Cho. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. arXiv:2010.07003 [cs.CL].
- [6] Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. Learned Token Pruning for Transformers. arXiv:2107.00910 [cs.CL].
- [7] Carlos Lassance, Maroua Maachou, Joohee Park, and Stéphane Clinchant. A Study on Token Pruning for ColBERT. arXiv:2112.06540 [cs.CL].
- [8] Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. Token Pooling in Vision Transformers. arXiv:2110.03860 [cs.CV].
- [9] Zhuoran Song, Yihong Xu, Zhezhi He, Li Jiang, Naifeng Jing, and Xiaoyao Liang. CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction. arXiv:2203.04570.
- [10] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768 [cs.LG].
- [11] Hao Yu and Jianxin Wu. A Unified Pruning Framework for Vision Transformers. arXiv:2111.15127 [cs.CV].