pith. machine review for the scientific record.

arxiv: 2605.11683 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords: token merging · vision transformers · reinforcement learning · dynamic inference · computational efficiency · actor-critic · token reduction · ImageNet

The pith

A reinforcement learning agent learns to merge tokens dynamically in Vision Transformers during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DORA as an online RL framework that treats token merging in each ViT block as a sequential decision problem. A lightweight policy selects merges based on current features and layer context, trained offline with a dense reward that penalizes deviation from the original model's outputs. This replaces static heuristics or fixed ratios with input-adaptive choices, aiming to reduce quadratic attention cost while keeping accuracy loss negligible. If the approach works as claimed, ViTs could deliver higher throughput on diverse inputs without per-model retraining or large accuracy penalties.
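The per-block decision loop the paper describes can be sketched in a few lines. This is a toy illustration, not the paper's architecture: the actor here is a simple redundancy heuristic standing in for the learned policy, and the merge operator is a greedy ToMe-style pair average; both are assumptions on our part.

```python
import numpy as np

def actor_policy(tokens, layer_idx, num_layers):
    """Toy stand-in for DORA's lightweight actor head: map the current
    feature state plus layer context to a per-block merge action
    (here, how many token pairs to merge; 0 bypasses reduction)."""
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    n = len(tokens)
    redundancy = (sim.sum() - n) / (n * n - n)   # mean off-diagonal cosine
    depth = layer_idx / max(num_layers - 1, 1)   # layer-specific context
    return max(int(n * 0.25 * redundancy * depth), 0)

def merge_most_similar(tokens, n_pairs):
    """Greedy ToMe-style merge: average the most similar token pair."""
    for _ in range(n_pairs):
        normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0
        tokens = np.vstack([np.delete(tokens, [i, j], axis=0), merged])
    return tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(16, 8))  # redundant tokens
for layer in range(4):                        # four toy "Transformer blocks"
    action = actor_policy(x, layer, num_layers=4)
    if action > 0:                            # zero action = bypass this block
        x = merge_most_similar(x, action)
print(x.shape)  # fewer than the original 16 tokens remain
```

The point of the sketch is the control flow: the action is recomputed per block from the current features, so redundant inputs get merged aggressively while informative ones pass through untouched.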

Core claim

DORA models token merging as a Markov Decision Process solved by an asymmetric Actor-Critic agent: a high-capacity critic enables stable offline training on distillation-based rewards, while a minimal actor head runs online to decide merges per block. On ImageNet-1K this yields up to a 12.66 percent token merging rate at under a 0.05 percent accuracy drop, and up to 76 percent better compute savings than prior methods under matched accuracy.

What carries the argument

Asymmetric Actor-Critic RL policy that outputs per-block merging actions from feature states, optimized by a non-linear distillation penalty in the reward.
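The abstract specifies only that the reward is dense with a "non-linear distillation-based penalty." One plausible shape is sketched below; the coefficients `alpha`, `beta`, and the exponential penalty form are our assumptions, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dense_reward(logits_merged, logits_original, flops_saved,
                 alpha=1.0, beta=5.0):
    """Illustrative dense reward: pay for compute savings, charge a
    non-linear distillation penalty for drifting from the unmerged
    model's output distribution."""
    p = softmax(logits_original)
    q = softmax(logits_merged)
    kl = float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
    penalty = np.expm1(beta * kl)  # ~linear for tiny KL, harsh for large KL
    return alpha * flops_saved - penalty

ref = np.array([2.1, 0.0, -1.1])
r_faithful = dense_reward(np.array([2.0, 0.1, -1.0]), ref, flops_saved=0.12)
r_drifted = dense_reward(np.array([-1.0, 2.0, 0.5]), ref, flops_saved=0.12)
assert r_faithful > r_drifted  # same savings, faithful outputs score higher
```

A convex penalty of this kind makes small deviations nearly free while punishing large drift sharply, which is one way to encode the paper's near-lossless merging objective.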

If this is right

  • The method improves the accuracy-efficiency frontier across ViT-Tiny through ViT-Large scales.
  • It delivers over 430 percent relative efficiency gains on out-of-distribution sets such as ImageNet-A and ImageNet-C.
  • Dynamic per-input decisions remove reliance on predefined masks or fixed ratios used by earlier token-reduction techniques.
  • The online inference cost stays low because only the actor head is retained after offline training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the policy transfers across architectures, similar RL agents could adaptively prune tokens in other attention-heavy models such as those used for video or language.
  • Deployment pipelines with varying hardware budgets might adopt a single trained agent rather than multiple static configurations.
  • The dense reward design could be extended to include latency or memory measurements directly, turning the agent into a hardware-aware scheduler.

Load-bearing premise

The lightweight policy trained offline on one set of images and models will maintain its accuracy-efficiency performance on new inputs, different ViT sizes, and other tasks without extra tuning or undetected accuracy loss.

What would settle it

Run DORA on a held-out ViT variant or dataset such as ImageNet-O and measure whether accuracy drops more than 0.05 percent at the reported merging rates or whether FLOPs savings fall below the best static baseline at equal accuracy.
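The proposed test reduces to a two-part check, sketched below as a hypothetical decision rule. All numbers in the usage example are illustrative, not results from the paper.

```python
def passes_falsification(acc_base, acc_dora, flops_base, flops_dora,
                         flops_best_static, max_drop=0.05):
    """DORA survives the test if, on the held-out setting, the accuracy
    drop stays within 0.05 points AND its FLOPs savings beat the best
    static baseline at equal accuracy."""
    drop = acc_base - acc_dora                        # accuracy points
    savings_dora = 1.0 - flops_dora / flops_base
    savings_static = 1.0 - flops_best_static / flops_base
    return drop <= max_drop and savings_dora > savings_static

# Survives: 0.03-point drop, 12% savings vs the static baseline's 7%.
ok = passes_falsification(81.20, 81.17, 1.00, 0.88, 0.93)
# Fails: a 0.15-point drop exceeds the negligible-drop constraint.
bad = passes_falsification(81.20, 81.05, 1.00, 0.88, 0.93)
print(ok, bad)
```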

Figures

Figures reproduced from arXiv: 2605.11683 by Kaixuan He, Song Chen, Yi Kang.

Figure 1: Overall architecture of the proposed DORA framework. The framework employs a decoupled design, consisting of an offline training phase and an online inference phase. During the online phase, a pre-trained lightweight Actor network dynamically generates token merging masks, enabling input-adaptive acceleration for Transformers. In the offline phase, a high-capacity Critic network assists in training and opt…
Figure 2: Computational savings comparison of token reduction methods under strictly aligned Top-1 accuracy constraints. (a) Normalized FLOPs results on the ViT architectures. (b) Normalized FLOPs results on the DeiT architectures.
Figure 3: Visualization of the dynamic merging process. Red bounding boxes indicate the specific Transformer blocks where the RL agent executes token merging, with the corresponding block index annotated in the top-right corner. Unmarked images represent blocks where the agent opted for zero-merging actions (bypassing the reduction step). …8, and 9 for ViT-Tiny, and at blocks 3, 5, 6, and 15 for ViT-Large. This non-u…
Original abstract

Vision Transformers (ViTs) incur significant computational overhead due to the quadratic complexity of self-attention relative to the token sequence length. While existing token reduction methods mitigate this issue, they predominantly rely on fixed heuristic metrics, predefined ratios, or static offline masks, which lack the adaptability to capture input-dependent redundancy during inference. In this paper, we propose DORA (Dynamic Online Reinforcement Agent), the first reinforcement learning (RL)-driven online inference framework for dynamic token merging in ViTs. We formulate the merging process as a sequential Markov Decision Process (MDP), where a lightweight RL agent determines the merging strategy for each Transformer block based on the current feature state and layer-specific context. To balance computational efficiency and feature fidelity, the agent is optimized via a dense reward function incorporating a non-linear distillation-based penalty. We implement an asymmetric Actor-Critic architecture that utilizes a high-capacity Critic for stable offline training while retaining a minimal Actor head for low-computation online inference. Evaluations across multiple ViT scales (Tiny to Large) demonstrate that DORA improves the accuracy-efficiency Pareto front compared to current baselines. Under strict negligible accuracy-drop constraints (<= 0.05%), DORA achieves up to a 12.66% token merging rate, and delivers up to a 569.7% relative improvement over the most efficient baseline. On ImageNet-1K, under aligned accuracy constraints, DORA achieves up to a 76% relative improvement in computational savings compared to state-of-the-art methods. Furthermore, on out-of-distribution (OOD) benchmarks such as ImageNet-A and ImageNet-C, DORA attains a relative efficiency advantage of over 430%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DORA, the first RL-driven online inference framework for dynamic token merging in Vision Transformers. It formulates merging as a sequential MDP where a lightweight RL agent selects merges per Transformer block using current features and layer context; an asymmetric Actor-Critic is trained offline with a dense distillation-based reward and deployed with only the minimal actor at inference. Experiments across ViT scales on ImageNet-1K and OOD sets (ImageNet-A/C) report up to 12.66% merging rate and large relative efficiency gains under a strict <=0.05% accuracy-drop constraint, plus up to 76% better computational savings versus SOTA under aligned accuracy.

Significance. If the reported Pareto-front improvements and generalization hold, the work would meaningfully advance token-reduction methods by replacing fixed heuristics with input-adaptive RL decisions while preserving deployment efficiency via the asymmetric architecture. The offline high-capacity critic plus lightweight actor is a concrete strength that directly addresses inference cost, and the dense reward formulation offers a principled way to trade efficiency against feature fidelity.

major comments (2)
  1. [Abstract and §4 (Experiments)] The headline claims of a 12.66% merging rate and a 569.7% relative improvement under the <=0.05% accuracy-drop constraint are load-bearing for the central contribution, yet the manuscript supplies no protocol details, baseline definitions, number of random seeds, or statistical tests; without these, the link between the MDP policy and the numerical gains cannot be verified.
  2. [§3.2 (MDP formulation)] The asymmetric Actor-Critic trains the critic offline on trajectories collected under the dense reward, then deploys only the actor. The layer-specific context does not include uncertainty estimation, online fine-tuning, or explicit distribution-shift bounds, so averaged accuracy figures can mask per-image or per-scale degradation on unseen inputs or ViT variants (Tiny vs. Large), directly undermining the negligible-accuracy-drop guarantee.
minor comments (2)
  1. [§3] Notation for the reward function and state representation should be introduced with an explicit equation early in §3 to improve readability.
  2. [Figures and Tables] Figure captions and table headers would benefit from explicit mention of the exact accuracy-drop threshold used for each reported point.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights areas where additional rigor and transparency will strengthen the manuscript. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The headline claims of a 12.66% merging rate and a 569.7% relative improvement under the <=0.05% accuracy-drop constraint are load-bearing for the central contribution, yet the manuscript supplies no protocol details, baseline definitions, number of random seeds, or statistical tests; without these, the link between the MDP policy and the numerical gains cannot be verified.

    Authors: We agree that the experimental protocol requires more explicit documentation to enable verification. In the revised manuscript, we will expand §4 with: (i) precise definitions and references for all baselines together with the exact formulas used to compute merging rates and relative improvements, (ii) the number of random seeds employed (three seeds were run for every reported result, with mean and standard deviation), and (iii) statistical tests including standard deviations across seeds and paired t-tests for key comparisons. These additions will make the connection between the MDP policy, reward design, and reported gains fully reproducible and verifiable. revision: yes

  2. Referee: [§3.2 (MDP formulation)] The asymmetric Actor-Critic trains the critic offline on trajectories collected under the dense reward, then deploys only the actor. The layer-specific context does not include uncertainty estimation, online fine-tuning, or explicit distribution-shift bounds, so averaged accuracy figures can mask per-image or per-scale degradation on unseen inputs or ViT variants (Tiny vs. Large), directly undermining the negligible-accuracy-drop guarantee.

    Authors: The asymmetric design deliberately keeps only the lightweight actor at inference to preserve deployment efficiency while still allowing input-dependent decisions via the current features and layer context. Experiments already cover multiple ViT scales (Tiny to Large) and OOD sets (ImageNet-A/C), showing consistent average performance under the stated accuracy constraint. Nevertheless, we acknowledge that explicit uncertainty estimation, online fine-tuning, or distribution-shift bounds are absent and that averaged metrics alone may obscure per-image or per-scale variation. In the revision we will therefore (i) add a limitations paragraph in §3.2 discussing these aspects, (ii) report per-image accuracy variance and worst-case degradation in the supplementary material, and (iii) break down results by ViT scale. These changes increase transparency without altering the core method. revision: partial

Circularity Check

0 steps flagged

No derivation chain; empirical RL method with no self-referential equations or fitted predictions

Full rationale

The paper proposes DORA as an RL-based online token merging framework for ViTs, formulated as an MDP with an asymmetric Actor-Critic setup and a dense distillation reward. No mathematical derivation, first-principles result, or predictive equation is presented that reduces to its own inputs by construction. Performance claims (e.g., 12.66% merging rate under <=0.05% accuracy drop) rest entirely on empirical evaluations across ViT scales and benchmarks, not on quantities defined in terms of the method's fitted parameters or self-citations. The reader's assessment of score 2.0 aligns with this: absent any load-bearing self-citation chain, ansatz smuggling, or renaming of known results, the work is self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL modeling assumptions and empirical training procedures rather than new axioms or invented physical entities.

free parameters (1)
  • Reward-function coefficients and RL hyperparameters
    Typical dense reward weights and training hyperparameters are required to balance efficiency and fidelity; their specific values are not stated in the abstract.
axioms (1)
  • domain assumption Token merging decisions can be modeled as a sequential Markov Decision Process whose state is the current feature representation and layer context.
    Explicitly stated in the abstract as the formulation of the merging process.
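The ledger's one axiom can be made concrete as a minimal state/transition sketch. The state and transition structure follow the abstract; the merge operator itself is a toy averaging rule we assume for illustration, since the abstract does not specify one.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MergeState:
    """State of the assumed MDP: current token features + layer context."""
    tokens: np.ndarray   # (N, D) features entering the current block
    layer_idx: int       # layer-specific context available to the policy

def average_adjacent(tokens, n_pairs):
    """Toy merge operator: average the first two tokens, n_pairs times."""
    for _ in range(n_pairs):
        tokens = np.vstack([(tokens[0] + tokens[1]) / 2.0, tokens[2:]])
    return tokens

def step(state, action):
    """One MDP transition: apply the merge action (0 = bypass) and
    advance to the next Transformer block."""
    tokens = average_adjacent(state.tokens, action) if action > 0 else state.tokens
    return MergeState(tokens, state.layer_idx + 1)

s = MergeState(np.ones((8, 4)), layer_idx=0)
s = step(s, action=2)   # merge two pairs in block 0
s = step(s, action=0)   # zero action: bypass block 1
assert s.tokens.shape == (6, 4) and s.layer_idx == 2
```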

pith-pipeline@v0.9.0 · 5601 in / 1344 out tokens · 58667 ms · 2026-05-13T01:08:26.691038+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 1 internal anchor

  1. [1]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  2. [2]

    Training data-efficient Image Transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient Image Transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, pages 10347–10357, 2021

  3. [3]

    Swin Transformer: Hierarchical Vision Transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021

  4. [4]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023

  5. [5]

    AdapTiV: Sign-similarity based image-adaptive token merging for Vision Transformer acceleration

    Seungjae Yoo, Hangyeol Kim, and Joo-Young Kim. AdapTiV: Sign-similarity based image-adaptive token merging for Vision Transformer acceleration. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, pages 64–75, 2024

  6. [6]

    DynamicViT: Efficient Vision Transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient Vision Transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, volume 34, pages 13937–13949, 2021

  7. [7]

    Not all patches are what you need: Expediting Vision Transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting Vision Transformers via token reorganizations. In International Conference on Learning Representations, 2022

  8. [8]

    V-Pruner: A fast and globally-informed token pruning framework for Vision Transformer

    Guangzhen Yao, Jiayun Zheng, Zezhou Wang, Wenxin Zhang, Renda Han, Chuangxin Zhao, Zeyu Zhang, and Runhao Liu. V-Pruner: A fast and globally-informed token pruning framework for Vision Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 34396–34404, 2026

  9. [9]

    A-ViT: Adaptive tokens for efficient Vision Transformer

    Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-ViT: Adaptive tokens for efficient Vision Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10809–10818, 2022

  10. [10]

    Accelerating Transformers with spectrum-preserving token merging

    Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Zou, Binh T. Nguyen, and Mathias Niepert. Accelerating Transformers with spectrum-preserving token merging. In Advances in Neural Information Processing Systems, volume 37, 2024

  11. [11]

    Learning to merge tokens via decoupled embedding for efficient Vision Transformers

    Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient Vision Transformers. In Advances in Neural Information Processing Systems, volume 37, 2024

  12. [12]

    Zero-TPrune: Zero-shot token pruning through leveraging of the attention graph in pre-trained Transformers

    Hongjie Wang, Bhishma Dedhia, and Niraj K. Jha. Zero-TPrune: Zero-shot token pruning through leveraging of the attention graph in pre-trained Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16070–16079, 2024

  13. [13]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  14. [14]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  15. [15]

    Asymmetric actor-critic for multi-turn LLM agents

    Shuli Jiang, Zhaoyang Zhang, Yi Zhang, Shuo Yang, Wei Xia, and Stefano Soatto. Asymmetric actor-critic for multi-turn LLM agents. arXiv preprint arXiv:2604.00304, 2026

  16. [16]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  17. [17]

    Natural adversarial examples

    Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021

  18. [18]

    The many faces of robustness: A critical analysis of Out-of-Distribution generalization

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of Out-of-Distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8349, 2021

  19. [19]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2019

  20. [20]

    How to train your ViT? Data, augmentation, and regularization in Vision Transformers

    Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in Vision Transformers. Transactions on Machine Learning Research, 2022

  21. [21]

    Is space-time attention all you need for video understanding?

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning, pages 813–824, 2021

  22. [22]

    Video Swin Transformer

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3192–3201, 2022