pith. machine review for the scientific record.

arxiv: 2210.09461 · v3 · submitted 2022-10-17 · 💻 cs.CV

Recognition: 2 Lean theorem links

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman

Pith reviewed 2026-05-12 20:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords token merging · vision transformer · model acceleration · throughput · ViT · efficient inference · transformer optimization

The pith

Merging similar tokens lets off-the-shelf Vision Transformers run twice as fast with almost no accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Token Merging (ToMe) to increase the throughput of existing Vision Transformer models without any training. It gradually combines similar tokens using a lightweight matching algorithm that matches the speed of pruning but with better accuracy. Off-the-shelf use achieves roughly doubled speed on large ViT models for images and video, with accuracy drops of only 0.2 to 0.3 percent. The method also speeds up training and works on audio tasks when used during fine-tuning. A sympathetic reader would care because it makes powerful but computationally heavy models more accessible for practical deployment without redesign or retraining.

Core claim

Token Merging gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. This allows 2x the throughput of state-of-the-art ViT-L and ViT-H models on images and 2.2x on video with only a 0.2-0.3% accuracy drop, and it can merge object parts into one token even over multiple frames of video.

What carries the argument

The Token Merging (ToMe) procedure, which applies a bipartite soft matching algorithm to identify and merge the most similar tokens at each layer of the transformer.
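
Read as pseudocode, the matching step fits in a few lines. The sketch below is an illustrative NumPy reconstruction, not the authors' implementation: tokens are split alternately into two sets, each set-A token proposes its most similar set-B partner under cosine similarity, and only the r highest-scoring pairs are merged (here by an unweighted mean; the paper additionally tracks token sizes and uses a size-weighted mean with proportional attention).

```python
import numpy as np

def bipartite_soft_matching(x, r):
    """Merge the r most similar token pairs in x (shape [N, D]).

    Illustrative sketch of ToMe's matching: alternate tokens go to
    sets A and B, each A token is matched to its most similar B token
    by cosine similarity, and only the r highest-similarity edges are
    merged (unweighted mean here, for simplicity).
    """
    a, b = x[0::2].copy(), x[1::2].copy()
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = an @ bn.T                     # cosine similarities, [|A|, |B|]
    best = scores.argmax(axis=1)           # each A token's best B partner
    best_val = scores.max(axis=1)
    merge_idx = np.argsort(-best_val)[:r]  # the r most similar A tokens
    for i in merge_idx:                    # fold each one into its partner
        b[best[i]] = (b[best[i]] + a[i]) / 2.0
    keep = np.setdiff1d(np.arange(len(a)), merge_idx)
    return np.concatenate([a[keep], b], axis=0)  # N - r tokens remain
```

Each call removes exactly r tokens, so the whole procedure exposes a single knob for trading accuracy against throughput.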

If this is right

  • Off-the-shelf, ToMe doubles the throughput of ViT-L at 512 and ViT-H at 518 resolution on image tasks with only a 0.2-0.3% accuracy drop.
  • ToMe delivers a 2.2x throughput increase for ViT-L on video tasks with comparable accuracy retention.
  • Applied during training, ToMe improves training speed by up to 2x for MAE fine-tuning on video.
  • Training with ToMe yields 2x throughput on audio models with only a 0.4% mAP drop.
  • ToMe merges parts of objects into single tokens, observable even across video frames.
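
The throughput claims above follow from simple arithmetic on the merge schedule. A hedged sketch, assuming a constant schedule of r tokens merged per block and counting only the quadratic attention cost (linear-cost projections and MLP terms are ignored, so this overstates the attention-side savings relative to wall-clock speedup):

```python
def token_schedule(n_tokens, r, depth):
    """Token count entering each of `depth` blocks when every block
    merges away r tokens (ToMe's constant schedule), floored at 1."""
    return [max(n_tokens - i * r, 1) for i in range(depth)]

def attention_cost_ratio(n_tokens, r, depth):
    """Total attention FLOPs (taken as quadratic in token count)
    relative to the unmerged model."""
    sched = token_schedule(n_tokens, r, depth)
    return sum(t * t for t in sched) / (depth * n_tokens ** 2)

# Example: a 224px ViT-L/16 sees 196 patch tokens + 1 class token = 197
# across 24 blocks; r = 8 leaves 13 tokens entering the final block and
# cuts the quadratic attention cost to roughly a third of baseline.
```

The linear-cost terms shrink with the average token count instead (here about half of baseline), so the combined estimate lands near the reported ~2x throughput.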

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach implies that much of the token information in ViTs is redundant and can be reduced dynamically.
  • Extensions could include adapting the merging for other transformer architectures beyond vision.
  • Combining ToMe with hardware-specific optimizations might yield even greater efficiency gains.
  • The qualitative observation of merging object parts suggests ToMe could aid in understanding what information transformers prioritize.

Load-bearing premise

Similar tokens identified by the matching algorithm can be merged without losing critical information for the downstream task.
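
A toy example makes the premise concrete: if two tokens are near-duplicates, replacing them with a single token of recorded size 2 leaves a mean-pooled readout unchanged. A minimal NumPy sketch (not the authors' code), with size weighting standing in for the paper's proportional attention:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
tokens[1] = tokens[0].copy()            # an exactly duplicated token

# Merge the duplicate pair into one token and record its "size" = 2.
merged = np.vstack([(tokens[0] + tokens[1]) / 2.0, tokens[2:]])
sizes = np.ones(len(merged))
sizes[0] = 2.0

# A size-weighted mean over 7 tokens equals the plain mean over 8.
pooled_before = tokens.mean(axis=0)
pooled_after = (sizes[:, None] * merged).sum(axis=0) / sizes.sum()
```

When merged tokens are only approximately similar, the pooled representation shifts by an amount bounded by their disagreement; the reported 0.2-0.4% drops are, in effect, an empirical measure of that shift.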

What would settle it

Applying ToMe to a standard ViT model on an image classification benchmark like ImageNet and observing an accuracy drop larger than 1% would falsify the negligible-loss claim.

read the original abstract

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train. ToMe gradually combines similar tokens in a transformer using a general and light-weight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case. ToMe can also easily be applied during training, improving in practice training speed up to 2x for MAE fine-tuning on video. Training with ToMe further minimizes accuracy drop, leading to 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges object parts into one token, even over multiple frames of video. Overall, ToMe's accuracy and speed are competitive with state-of-the-art on images, video, and audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Token Merging (ToMe), a simple method to increase the throughput of existing Vision Transformer (ViT) models without retraining. ToMe progressively merges similar tokens inside transformer blocks via a lightweight bipartite matching algorithm based on cosine similarity. Off-the-shelf application yields approximately 2x throughput on large image models (ViT-L@512, ViT-H@518) and 2.2x on video (ViT-L) with 0.2-0.3% accuracy drop; when used during training it further reduces the accuracy penalty and enables 2x throughput on audio with 0.4% mAP drop. The method is shown to merge coherent object parts across frames, and the only free variable (merge ratio) is fixed per model size across ImageNet, Kinetics, and AudioSet.

Significance. If the empirical results hold, the work is significant for practical acceleration of large ViTs. It supplies a training-free, modality-agnostic technique that is competitive with state-of-the-art efficiency methods while preserving accuracy, with fully specified merge schedules, per-layer token counts, and wall-clock timings on identical hardware. The qualitative evidence that merges correspond to semantic parts rather than critical singletons strengthens the central claim that similar tokens can be safely combined.

major comments (2)
  1. [§4] §4 (Experiments): The reported 0.2-0.3% accuracy drops and throughput numbers lack error bars, standard deviations across multiple runs, or statistical significance tests. Without these, it is difficult to determine whether the small drops are robust or within measurement noise, especially given the reader's note on missing full experimental details and baselines.
  2. [§3.2] §3.2 (Matching algorithm): The claim that the matching routine is 'as fast as pruning' requires an explicit complexity or timing breakdown of the bipartite matching step relative to the rest of the forward pass; the current description does not quantify the overhead for the reported token counts.
minor comments (3)
  1. Figure captions for the qualitative token-merge visualizations should explicitly state the layer, merge ratio, and dataset used so readers can reproduce the observed object-part merges.
  2. The related-work section should cite the most recent token-pruning and distillation baselines that post-date the initial arXiv version to ensure the 'competitive with state-of-the-art' claim is up to date.
  3. Notation for the per-layer token reduction schedule (e.g., how many tokens are merged at each block) could be presented in a single summary table for easier reference across image, video, and audio experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation for minor revision. The comments are constructive and we address each one below with specific plans for the revised manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported 0.2-0.3% accuracy drops and throughput numbers lack error bars, standard deviations across multiple runs, or statistical significance tests. Without these, it is difficult to determine whether the small drops are robust or within measurement noise, especially given the reader's note on missing full experimental details and baselines.

    Authors: We acknowledge the value of error bars for assessing small accuracy differences. Our reported results are from single runs, which is standard practice for large-scale ViT experiments on ImageNet, Kinetics, and AudioSet given the prohibitive cost of multiple independent trainings. However, the 0.2-0.4% drops are consistent across model sizes (ViT-B/L/H), input resolutions, and three modalities, providing indirect evidence of robustness beyond measurement noise. In the revision we will explicitly note that all numbers are single-run results, expand the experimental details section to address the reader's note on baselines, and add a discussion of cross-experiment consistency. We cannot retroactively add error bars without new compute, but the consistency across settings supports the claims. revision: partial

  2. Referee: [§3.2] §3.2 (Matching algorithm): The claim that the matching routine is 'as fast as pruning' requires an explicit complexity or timing breakdown of the bipartite matching step relative to the rest of the forward pass; the current description does not quantify the overhead for the reported token counts.

    Authors: We agree that an explicit breakdown improves clarity. ToMe's matching computes a cosine-similarity matrix between the two token partitions, costing O((N/2)^2 D), a small fraction of the O(N^2 D) attention cost, and then selects the r highest-similarity edges by a single sort in O(N log N), asymptotically comparable to top-k pruning. For the token counts in our experiments (e.g., 196 tokens at input, shrinking toward ~50 in later layers), this step is dominated by attention and adds negligible wall-clock time. In the revision we will add a dedicated paragraph in §3.2 with the complexity analysis, plus empirical timing tables in the appendix showing the matching overhead is <1% of total forward-pass time on the same hardware used for throughput measurements. revision: yes
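
The sub-1% overhead claim is plausible on a back-of-the-envelope FLOP count. A hedged sketch with our own constants (multiply-adds counted as 2 FLOPs; these are not measurements from the paper):

```python
def matching_flops(n, d):
    """Cost of the cosine-similarity matrix between the two token
    halves: an (n/2) x d times d x (n - n/2) matmul."""
    half = n // 2
    return 2 * half * (n - half) * d

def block_flops(n, d):
    """Rough cost of one transformer block: QKV + output projections
    (8nd^2), QK^T and attn@V (4n^2d), and a 4x-wide MLP (16nd^2)."""
    return 8 * n * d * d + 4 * n * n * d + 16 * n * d * d

# For ViT-L (d = 1024) at 197 tokens, the matching matmul is roughly
# 0.4% of one block's cost, consistent with a <1% forward-pass share.
```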

Circularity Check

0 steps flagged

No significant circularity; purely algorithmic and empirical

full rationale

The paper presents Token Merging (ToMe) as an off-the-shelf algorithmic procedure that applies bipartite matching on cosine similarities to gradually merge tokens inside standard ViT blocks. No derivation chain exists that reduces a claimed result to its own inputs by construction: the matching routine is a standard, parameter-light algorithm whose correctness is not asserted via self-citation or fitted parameters renamed as predictions. Throughput and accuracy numbers are obtained from direct wall-clock measurements on fixed hardware with a single, publicly stated merge ratio per model size; these measurements are falsifiable outside the paper and do not rely on any internal loop or ansatz smuggled through prior self-work. The method is therefore self-contained as an engineering optimization whose central claims rest on transparent implementation details and reproducible benchmarks rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that similarity-based token merging preserves sufficient information for downstream tasks; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Similar tokens identified by the matching algorithm can be merged without significant loss of task-relevant information
    This assumption underpins the claim that accuracy drops remain small (0.2-0.4%) across images, video, and audio.

pith-pipeline@v0.9.0 · 5510 in / 1200 out tokens · 46878 ms · 2026-05-12T20:46:15.799352+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

    cs.CV 2026-05 conditional novelty 7.0

    LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on mul...

  2. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.

  3. Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

    cs.AI 2026-04 unverdicted novelty 7.0

    Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...

  4. Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

    cs.CV 2026-04 unverdicted novelty 7.0

    Dynamic token selection and training only 1.6 million parameters instead of over 300 million reduces computation by 48-55% and improves accuracy over prior state-of-the-art on the NuScenes dataset.

  5. Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DualComp uses a lightweight router to split visual token compression into a semantic stream with size-adaptive clustering and a geometric stream with path-tracing recovery, enabling low-cost high-fidelity UHR remote s...

  6. ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

    cs.CV 2026-04 unverdicted novelty 7.0

    ESOM is a training-free streaming model for open-world video anomaly detection with dynamic definitions that achieves real-time single-GPU efficiency and state-of-the-art results on a new benchmark.

  7. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  8. LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

    cs.CV 2026-05 unverdicted novelty 6.0

    LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...

  9. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  10. VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

  11. Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

    cs.CL 2026-04 conditional novelty 6.0

    K-Token Merging compresses LLM inputs by merging blocks of K token embeddings in latent space, achieving up to 75% length reduction with minimal performance drop on reasoning, classification, and code tasks.

  12. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  13. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.

  14. Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.

  15. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  16. Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation

    cs.IR 2026-04 unverdicted novelty 6.0

    STAMP mitigates semantic dilution in SID-based generative recommendation via adaptive input pruning and densified output supervision, delivering 1.23-1.38x speedup and 17-55% VRAM savings with maintained or improved accuracy.

  17. OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.

  18. VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

  19. FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

    cs.LG 2026-04 unverdicted novelty 5.0

    Fed-FSTQ reduces uplink traffic by 46x and improves time-to-accuracy by 52% in federated LLM fine-tuning using Fisher-guided token quantization and selection.

  20. Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

    cs.CV 2026-04 conditional novelty 5.0

    SEPatch3D accelerates ViT-based 3D object detectors up to 57% faster than StreamPETR via dynamic patch sizing and cross-granularity enhancement while keeping comparable accuracy on nuScenes and Argoverse 2.

  21. SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    SVD-Prune selects vision tokens via SVD leverage scores to keep performance high even when pruning to only 16-32 tokens.

  22. Do Vision Language Models Need to Process Image Tokens?

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual representations in VLMs converge quickly to stable low-complexity forms while text continues evolving, with task-dependent needs for sustained image token access.

  23. SAT: Selective Aggregation Transformer for Image Super-Resolution

    cs.CV 2026-04 unverdicted novelty 5.0

    SAT introduces density and isolation-based token aggregation to enable efficient global attention in super-resolution transformers, claiming up to 0.22 dB PSNR gain and 27% FLOP reduction over PFT.

  24. Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    Deep layers of speech language models show high token redundancy that can be compressed via training-free similarity pooling, reducing prefilling costs by 27% while preserving task performance.

  25. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  26. CATP: Confidence-Aware Token Pruning for Camouflaged Object Detection

    cs.CV 2026-04 unverdicted novelty 4.0

    CATP prunes low-confidence tokens in COD Transformers and uses dual-path compensation to cut computation while preserving segmentation accuracy on boundary regions.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 25 Pith papers · 4 internal anchors

  1. [1]

    Hydra attention: Efficient attention with many heads

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, and Judy Hoffman. Hydra attention: Efficient attention with many heads. arXiv:2209.07484 [cs.CV].

  2. [2]

    Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv:2009.14794 [cs.LG].

  3. [3]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. arXiv:2205.14135 [cs.LG].

  4. [4]

    The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950 [cs.CV].

  5. [5]

    Length-adaptive transformer: Train once with length drop, use anytime with search

Gyuwan Kim and Kyunghyun Cho. Length-adaptive transformer: Train once with length drop, use anytime with search. arXiv:2010.07003 [cs.CL].

  6. [6]

    Learned token pruning for transformers

Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, and Kurt Keutzer. Learned token pruning for transformers. arXiv:2107.00910 [cs.CL].

  7. [7]

    A study on token pruning for colbert

Carlos Lassance, Maroua Maachou, Joohee Park, and Stéphane Clinchant. A study on token pruning for ColBERT. arXiv:2112.06540 [cs.CL].

  8. [8]

    Token pooling in vision transformers

Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. Token pooling in vision transformers. arXiv:2110.03860 [cs.CV].

  9. [9]

    Cp-vit: Cascade vision transformer pruning via progressive sparsity prediction

Zhuoran Song, Yihong Xu, Zhezhi He, Li Jiang, Naifeng Jing, and Xiaoyao Liang. CP-ViT: Cascade vision transformer pruning via progressive sparsity prediction. arXiv:2203.04570 [cs.CV].

  10. [10]

    Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv:2006.04768 [cs.LG].

  11. [11]

A unified pruning framework for vision transformers

    Hao Yu and Jianxin Wu. A unified pruning framework for vision transformers. arXiv:2111.15127 [cs.CV].
