The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Pith reviewed 2026-05-13 18:46 UTC · model grok-4.3
The pith
Discrete action tokenization creates a compression gap that blocks vision encoder scaling gains in VLA models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions remain continuous, the vision encoder is the binding constraint, so encoder improvements raise performance directly. When actions are discretized through a fixed-capacity codebook, the codebook itself becomes the binding constraint, and upstream encoder improvements cannot propagate past it no matter how rich the representation becomes. This principle is demonstrated on the LIBERO benchmark by a factorial comparison in which encoder upgrades improve Diffusion Policy by more than 21 percentage points while OAT gains remain substantially smaller, and by an encoder-quality gradient experiment in which Diffusion Policy tracks encoder quality monotonically while OAT remains flat.
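The min-bottleneck claim can be sketched as a toy calculation. The function, the bit budgets, and the codebook numbers below are illustrative assumptions, not values from the paper; the point is only that a `min()` over stage capacities makes encoder upgrades invisible once the codebook binds.

```python
from math import log2

def pipeline_capacity(encoder_bits, codebook_size, tokens_per_chunk=1):
    """Effective information rate (bits per action chunk) in the toy model:
    capped by the tightest stage, min(encoder capacity, codebook capacity).
    A codebook with K codes over T tokens carries at most T * log2(K) bits."""
    codebook_bits = tokens_per_chunk * log2(codebook_size)
    return min(encoder_bits, codebook_bits)

# Upgrading the encoder (32 -> 64 bits) helps only while the encoder binds.
for enc_bits in (32, 64):
    continuous = enc_bits  # no codebook stage: encoder is the bottleneck
    discrete = pipeline_capacity(enc_bits, codebook_size=256, tokens_per_chunk=4)
    print(f"encoder={enc_bits}  continuous={continuous}  discrete={discrete}")
```

With 4 tokens of 256 codes, the codebook passes at most 32 bits per chunk, so the discrete pipeline stays at 32 bits while the continuous one tracks the encoder from 32 to 64.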
What carries the argument
The Compression Gap principle: the claim that scaling is governed by whichever stage imposes the tightest information bottleneck in the visuomotor pipeline.
If this is right
- Encoder upgrades will raise performance in continuous-action models but leave discrete-action models largely unchanged.
- Increasing codebook size partially restores the benefit of better encoders, confirming the bottleneck location.
- Uniform scaling of model size or data will not overcome the limit imposed by a fixed action codebook.
- Identifying the binding bottleneck in a pipeline is required before further scaling investments can be effective.
Where Pith is reading between the lines
- Hybrid continuous-discrete action representations may allow scaling to continue without immediate codebook expansion.
- The same bottleneck logic could apply to other tokenization choices in multimodal models beyond robotics.
- Variable-capacity or learned codebooks that adapt to task demands would be a direct way to relax the observed constraint.
Load-bearing premise
That the performance differences between continuous and discrete pipelines are driven primarily by the location of the information bottleneck rather than by differences in training dynamics, optimizer behavior, or other unmeasured factors.
What would settle it
A controlled run in which codebook capacity is increased while the vision encoder is held fixed and performance is measured to check whether sensitivity to encoder quality is restored in proportion to the capacity increase.
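The proposed settling experiment can be previewed in the same toy min-bottleneck model (all numbers are illustrative assumptions, not the paper's): hold an encoder pair fixed, sweep codebook size K, and check how much of the encoder upgrade survives discretization.

```python
from math import log2

def capacity(encoder_bits, codebook_size, tokens=4):
    # Toy min-bottleneck model: tightest stage caps the information rate.
    return min(encoder_bits, tokens * log2(codebook_size))

# Fixed encoder pair (48 vs. 96 bits); sweep codebook capacity K.
for K in (256, 65536, 2**24):
    gain = capacity(96, K) - capacity(48, K)
    print(f"K={K:8d}  recovered encoder gain = {gain:.0f} bits")
```

In this sketch the recovered gain is zero while the codebook binds both encoders, partial once it exceeds the weaker encoder's capacity, and full once it exceeds the stronger one's, which is the proportional restoration the proposed experiment would look for.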
Original abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that scaling vision encoders in Vision-Language-Action (VLA) models fails to improve performance when actions are represented as discrete tokens from a fixed-capacity codebook, due to an information bottleneck termed the 'Compression Gap.' In contrast, continuous action representations (e.g., Diffusion Policy) allow encoder improvements to propagate. This is supported by three experiments on the LIBERO benchmark: a factorial comparison showing >21pp gains for continuous policies but attenuated gains for discrete OAT; an encoder-quality gradient where only continuous policies track encoder improvements monotonically; and a codebook-size ablation that partially restores encoder sensitivity.
Significance. If the central claim is substantiated, the work has clear significance for Physical AI scaling strategies by highlighting that uniform model scaling is ineffective without addressing pipeline-specific bottlenecks. The multiple lines of evidence, particularly the codebook-size ablation providing a causal test, represent a strength. The result would usefully caution against assuming vision-encoder scaling benefits transfer across action representations.
major comments (2)
- [§4.1] §4.1 Factorial Experiment: the comparison of Diffusion Policy (continuous) against OAT (discrete) confounds action discretization with differences in loss (diffusion vs. cross-entropy), sampling procedure, and training dynamics. Without a controlled swap that holds architecture, optimizer, and loss fixed while toggling only discretization, the attenuated encoder gains cannot be attributed solely to the codebook bottleneck.
- [§4.2] §4.2 Encoder Quality Gradient: the claim that OAT remains flat across four encoders while Diffusion Policy tracks quality monotonically would be stronger with explicit reporting of per-encoder performance deltas, confidence intervals, and controls for total parameter count or training compute to rule out capacity confounds.
minor comments (2)
- [Abstract] Abstract: the reported 'over 21 percentage points' improvement for Diffusion Policy lacks a precise baseline condition or task-averaged metric; adding this detail would improve clarity.
- [§3] Notation: the term 'Compression Gap' is introduced as a new principle but is not formally defined with an equation or information-theoretic bound; a short definition in §3 would aid reproducibility.
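The referee's note that the Compression Gap is never formally defined suggests one possible information-theoretic sketch, consistent with the abstract's bottleneck framing. This is our assumption about what such a definition might look like, not the paper's own notation.

```latex
% One plausible formalization of the Compression Gap (a sketch, not the
% paper's definition). Let I(z; a^*) be the mutual information between the
% encoder representation z and the optimal action a^*, and let C be the
% capacity of a codebook with K codes over T action tokens.
\[
  \mathrm{Gap} \;=\; \max\bigl(0,\; I(z; a^{*}) - C\bigr),
  \qquad C = T \log_2 K .
\]
% Encoder upgrades raise I(z; a^*); once I(z; a^*) > C, further increases
% widen the Gap rather than improving the downstream policy.
```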
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [§4.1] §4.1 Factorial Experiment: the comparison of Diffusion Policy (continuous) against OAT (discrete) confounds action discretization with differences in loss (diffusion vs. cross-entropy), sampling procedure, and training dynamics. Without a controlled swap that holds architecture, optimizer, and loss fixed while toggling only discretization, the attenuated encoder gains cannot be attributed solely to the codebook bottleneck.
Authors: We agree that the factorial experiment compares policies that differ in multiple aspects beyond discretization, including the objective function and inference procedure. These differences are, however, tightly coupled to the choice of action representation, as discrete tokens require a classification loss and categorical sampling. To isolate the effect of the codebook, we rely primarily on the codebook-size ablation (§4.3), which varies only the codebook capacity while keeping the rest of the OAT architecture fixed and demonstrates a partial restoration of encoder scaling benefits. We will revise §4.1 to explicitly acknowledge the confounding factors and to highlight the ablation as the primary causal evidence for the Compression Gap. revision: partial
-
Referee: [§4.2] §4.2 Encoder Quality Gradient: the claim that OAT remains flat across four encoders while Diffusion Policy tracks quality monotonically would be stronger with explicit reporting of per-encoder performance deltas, confidence intervals, and controls for total parameter count or training compute to rule out capacity confounds.
Authors: We concur that reporting per-encoder deltas, confidence intervals, and explicit controls for compute would improve the robustness of the results. In the revised version, we will add a supplementary table listing the success rates for each of the four encoders under both policies, including mean and standard deviation across three random seeds. We will also confirm that all experiments used identical training hyperparameters, batch sizes, and step counts to ensure comparable compute budgets, thereby ruling out capacity confounds. revision: yes
Circularity Check
No circularity: empirical validation via controlled experiments stands independent of definitions
full rationale
The manuscript advances the Compression Gap as an explanatory principle and supports it with three external empirical lines (factorial encoder upgrade experiment, encoder-quality gradient across four encoders, and codebook-capacity ablation) on the LIBERO benchmark. These comparisons contrast continuous versus discrete action heads through observable performance deltas rather than through any internal equations that reduce by construction to fitted inputs or self-referential definitions. No mathematical derivation chain, self-citation load-bearing step, or renaming of known results is present; the bottleneck claim is tested against data rather than asserted tautologically.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the codebook used for discrete action tokenization has a fixed capacity that becomes the binding information constraint in the visuomotor pipeline.
invented entities (1)
- Compression Gap: no independent evidence
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164.
- [2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817.
- [3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
- [4] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior Generation with Latent Actions. arXiv preprint arXiv:2403.03181, 2024.
- [5] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite Scalar Quantization: VQ-VAE Made Simple. arXiv preprint. URL https://arxiv.org/abs/2602.04215.
- [6] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
- [7] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv preprint arXiv:2501.09747.
- [8] Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, et al. Beyond Language Modeling: An Exploration of Multimodal Pretraining. arXiv preprint arXiv:2603.03276.
- [9] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv preprint arXiv:2502.14786.