The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Pith reviewed 2026-05-13 18:46 UTC · model grok-4.3
The pith
Discrete action tokenization creates a compression gap that blocks vision encoder scaling gains in VLA models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions remain continuous, the vision encoder is the binding constraint, so encoder improvements raise performance directly. When actions are discretized through a fixed-capacity codebook, the codebook itself becomes the binding constraint, and upstream encoder improvements cannot propagate past it no matter how rich the representation becomes. This principle is demonstrated on the LIBERO benchmark by a factorial comparison in which encoder upgrades improve Diffusion Policy by more than 21 percentage points while OAT gains remain substantially smaller, and by an encoder-quality gradient experiment in which Diffusion Policy tracks encoder quality monotonically while OAT remains flat.
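The min-bottleneck claim can be sketched as a toy calculation. The function, the bit budgets, and the codebook numbers below are illustrative assumptions, not values from the paper; the point is only that a `min()` over stage capacities makes encoder upgrades invisible once the codebook binds.

```python
from math import log2

def pipeline_capacity(encoder_bits, codebook_size, tokens_per_chunk=1):
    """Effective information rate (bits per action chunk) in the toy model:
    capped by the tightest stage, min(encoder capacity, codebook capacity).
    A codebook with K codes over T tokens carries at most T * log2(K) bits."""
    codebook_bits = tokens_per_chunk * log2(codebook_size)
    return min(encoder_bits, codebook_bits)

# Upgrading the encoder (32 -> 64 bits) helps only while the encoder binds.
for enc_bits in (32, 64):
    continuous = enc_bits  # no codebook stage: encoder is the bottleneck
    discrete = pipeline_capacity(enc_bits, codebook_size=256, tokens_per_chunk=4)
    print(f"encoder={enc_bits}  continuous={continuous}  discrete={discrete}")
```

With 4 tokens of 256 codes, the codebook passes at most 32 bits per chunk, so the discrete pipeline stays at 32 bits while the continuous one tracks the encoder from 32 to 64.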
What carries the argument
The Compression Gap principle: the claim that scaling is governed by whichever stage imposes the tightest information bottleneck in the visuomotor pipeline.
If this is right
- Encoder upgrades will raise performance in continuous-action models but leave discrete-action models largely unchanged.
- Increasing codebook size partially restores the benefit of better encoders, confirming the bottleneck location.
- Uniform scaling of model size or data will not overcome the limit imposed by a fixed action codebook.
- Identifying the binding bottleneck in a pipeline is required before further scaling investments can be effective.
Where Pith is reading between the lines
- Hybrid continuous-discrete action representations may allow scaling to continue without immediate codebook expansion.
- The same bottleneck logic could apply to other tokenization choices in multimodal models beyond robotics.
- Variable-capacity or learned codebooks that adapt to task demands would be a direct way to relax the observed constraint.
Load-bearing premise
That the performance differences between continuous and discrete pipelines are driven primarily by the location of the information bottleneck rather than by differences in training dynamics, optimizer behavior, or other unmeasured factors.
What would settle it
A controlled run in which codebook capacity is increased while the vision encoder is held fixed and performance is measured to check whether sensitivity to encoder quality is restored in proportion to the capacity increase.
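The proposed settling experiment can be previewed in the same toy min-bottleneck model (all numbers are illustrative assumptions, not the paper's): hold an encoder pair fixed, sweep codebook size K, and check how much of the encoder upgrade survives discretization.

```python
from math import log2

def capacity(encoder_bits, codebook_size, tokens=4):
    # Toy min-bottleneck model: tightest stage caps the information rate.
    return min(encoder_bits, tokens * log2(codebook_size))

# Fixed encoder pair (48 vs. 96 bits); sweep codebook capacity K.
for K in (256, 65536, 2**24):
    gain = capacity(96, K) - capacity(48, K)
    print(f"K={K:8d}  recovered encoder gain = {gain:.0f} bits")
```

In this sketch the recovered gain is zero while the codebook binds both encoders, partial once it exceeds the weaker encoder's capacity, and full once it exceeds the stronger one's, which is the proportional restoration the proposed experiment would look for.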
Original abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that scaling vision encoders in Vision-Language-Action (VLA) models fails to improve performance when actions are represented as discrete tokens from a fixed-capacity codebook, due to an information bottleneck termed the 'Compression Gap.' In contrast, continuous action representations (e.g., Diffusion Policy) allow encoder improvements to propagate. This is supported by three experiments on the LIBERO benchmark: a factorial comparison showing >21pp gains for continuous policies but attenuated gains for discrete OAT; an encoder-quality gradient where only continuous policies track encoder improvements monotonically; and a codebook-size ablation that partially restores encoder sensitivity.
Significance. If the central claim is substantiated, the work has clear significance for Physical AI scaling strategies by highlighting that uniform model scaling is ineffective without addressing pipeline-specific bottlenecks. The multiple lines of evidence, particularly the codebook-size ablation providing a causal test, represent a strength. The result would usefully caution against assuming vision-encoder scaling benefits transfer across action representations.
major comments (2)
- [§4.1] §4.1 Factorial Experiment: the comparison of Diffusion Policy (continuous) against OAT (discrete) confounds action discretization with differences in loss (diffusion vs. cross-entropy), sampling procedure, and training dynamics. Without a controlled swap that holds architecture, optimizer, and loss fixed while toggling only discretization, the attenuated encoder gains cannot be attributed solely to the codebook bottleneck.
- [§4.2] §4.2 Encoder Quality Gradient: the claim that OAT remains flat across four encoders while Diffusion Policy tracks quality monotonically would be stronger with explicit reporting of per-encoder performance deltas, confidence intervals, and controls for total parameter count or training compute to rule out capacity confounds.
minor comments (2)
- [Abstract] Abstract: the reported 'over 21 percentage points' improvement for Diffusion Policy lacks a precise baseline condition or task-averaged metric; adding this detail would improve clarity.
- [§3] Notation: the term 'Compression Gap' is introduced as a new principle but is not formally defined with an equation or information-theoretic bound; a short definition in §3 would aid reproducibility.
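The referee's note that the Compression Gap is never formally defined suggests one possible information-theoretic sketch, consistent with the abstract's bottleneck framing. This is our assumption about what such a definition might look like, not the paper's own notation.

```latex
% One plausible formalization of the Compression Gap (a sketch, not the
% paper's definition). Let I(z; a^*) be the mutual information between the
% encoder representation z and the optimal action a^*, and let C be the
% capacity of a codebook with K codes over T action tokens.
\[
  \mathrm{Gap} \;=\; \max\bigl(0,\; I(z; a^{*}) - C\bigr),
  \qquad C = T \log_2 K .
\]
% Encoder upgrades raise I(z; a^*); once I(z; a^*) > C, further increases
% widen the Gap rather than improving the downstream policy.
```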
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§4.1] §4.1 Factorial Experiment: the comparison of Diffusion Policy (continuous) against OAT (discrete) confounds action discretization with differences in loss (diffusion vs. cross-entropy), sampling procedure, and training dynamics. Without a controlled swap that holds architecture, optimizer, and loss fixed while toggling only discretization, the attenuated encoder gains cannot be attributed solely to the codebook bottleneck.
Authors: We agree that the factorial experiment compares policies that differ in multiple aspects beyond discretization, including the objective function and inference procedure. These differences are, however, tightly coupled to the choice of action representation, as discrete tokens require a classification loss and categorical sampling. To isolate the effect of the codebook, we rely primarily on the codebook-size ablation (§4.3), which varies only the codebook capacity while keeping the rest of the OAT architecture fixed and demonstrates a partial restoration of encoder scaling benefits. We will revise §4.1 to explicitly acknowledge the confounding factors and to highlight the ablation as the primary causal evidence for the Compression Gap. revision: partial
-
Referee: [§4.2] §4.2 Encoder Quality Gradient: the claim that OAT remains flat across four encoders while Diffusion Policy tracks quality monotonically would be stronger with explicit reporting of per-encoder performance deltas, confidence intervals, and controls for total parameter count or training compute to rule out capacity confounds.
Authors: We concur that reporting per-encoder deltas, confidence intervals, and explicit controls for compute would improve the robustness of the results. In the revised version, we will add a supplementary table listing the success rates for each of the four encoders under both policies, including mean and standard deviation across three random seeds. We will also confirm that all experiments used identical training hyperparameters, batch sizes, and step counts to ensure comparable compute budgets, thereby ruling out capacity confounds. revision: yes
Circularity Check
No circularity: empirical validation via controlled experiments stands independent of definitions
full rationale
The manuscript advances the Compression Gap as an explanatory principle and supports it with three external empirical lines (factorial encoder upgrade experiment, encoder-quality gradient across four encoders, and codebook-capacity ablation) on the LIBERO benchmark. These comparisons contrast continuous versus discrete action heads through observable performance deltas rather than through any internal equations that reduce by construction to fitted inputs or self-referential definitions. No mathematical derivation chain, self-citation load-bearing step, or renaming of known results is present; the bottleneck claim is tested against data rather than asserted tautologically.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the codebook used for discrete action tokenization has a fixed capacity that becomes the binding information constraint in the visuomotor pipeline.
invented entities (1)
- Compression Gap: no independent evidence
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
- [2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817.
- [3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
- [4] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior Generation with Latent Actions. arXiv preprint arXiv:2403.03181, 2024.
- [5] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple. arXiv preprint. URL https://arxiv.org/abs/2602.04215.
- [6] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
- [7] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv preprint arXiv:2501.09747.
- [8] Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond Language Modeling: An Exploration of Multimodal Pretraining. arXiv preprint arXiv:2603.03276.
- [9] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786.