arxiv: 2606.19784 · v1 · pith:LJMQBTG6new · submitted 2026-06-18 · 💻 cs.RO

EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models

Pith reviewed 2026-06-26 17:27 UTC · model grok-4.3

classification 💻 cs.RO
keywords EquiVLA · SO(2) equivariance · Vision-Language-Action models · robot manipulation · Diffusion Transformer · equivariant networks · LIBERO benchmark · CALVIN benchmark

The pith

EquiVLA adds an end-to-end SO(2) equivariance chain to vision-language-action models and lifts success rates on manipulation benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EquiVLA as a general method to impose rotational symmetry on VLA models that pair a frozen vision-language backbone with a flow-matching action head. It splits the symmetry into two parts: EquiPerceptor creates approximately equivariant visual features from standard ViT outputs, while EquiActor supplies an exactly equivariant diffusion transformer for action sequences. The resulting chain maps rotated camera views to correspondingly rotated actions without retraining for every orientation. Experiments report higher success on four LIBERO suites, longer successful sequences on CALVIN, and better real-robot performance on Mobile ALOHA. A reader would care because the approach promises more data-efficient policies that generalize across rotations rather than memorizing each new viewpoint.

Core claim

EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models. It introduces EquiPerceptor, which produces approximately SO(2)-equivariant visual representations from frozen ViT features, and EquiActor, an exactly SO(2)-equivariant flow-matching Diffusion Transformer action head. Together they establish an approximate SO(2) equivariance chain from camera observations to predicted action sequences, delivering 92.6 percent average success on LIBERO, an average sequence length of 4.03 on CALVIN, and 72 percent real-robot success.
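
The phrase "approximate SO(2) equivariance chain" can be made concrete with a short worked inequality. This is a sketch in generic notation, not the paper's own statement: P stands for EquiPerceptor, A for EquiActor, ρ_z and ρ_a are hypothetical group representations on features and actions, and ε and L are an assumed perception error bound and an assumed Lipschitz constant for A. Language and wrist-camera inputs are omitted for brevity.

```latex
% Illustrative notation only; symbols are assumptions, not taken from the paper.
% Exact equivariance of the action head A:
\[
  A\bigl(\rho_z(g)\, z\bigr) = \rho_a(g)\, A(z)
  \qquad \forall g \in \mathrm{SO}(2).
\]
% Approximate equivariance of the perception module P, with error at most \varepsilon:
\[
  \bigl\| P(g \cdot o) - \rho_z(g)\, P(o) \bigr\| \le \varepsilon .
\]
% If A is additionally L-Lipschitz (an assumption), the policy \pi = A \circ P satisfies
\[
  \bigl\| \pi(g \cdot o) - \rho_a(g)\, \pi(o) \bigr\|
  = \bigl\| A\bigl(P(g \cdot o)\bigr) - A\bigl(\rho_z(g)\, P(o)\bigr) \bigr\|
  \le L\, \varepsilon .
\]
```

Under these assumptions the end-to-end error is controlled entirely by the perception error, which is why an exactly equivariant head combined with approximately equivariant features yields an approximate chain overall.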

What carries the argument

The EquiPerceptor plus EquiActor pair that converts frozen ViT features into an approximate-to-exact SO(2) equivariance chain for action prediction.

If this is right

  • The framework applies to any VLA architecture that freezes a vision-language backbone and uses a flow-matching Diffusion Transformer action head.
  • Policies generalize across rotational configurations with less additional data.
  • An approximate equivariance holds from raw camera observations through to full action sequences.
  • Reported results reach 92.6 percent average success on LIBERO (vs. 78.1 percent baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and 72 percent success on real-robot tasks (vs. 54 percent).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular split could be tested with other group symmetries such as translations or discrete rotations if analogous equivariant layers are substituted.
  • If the symmetry benefit holds, data collection for new robot setups could focus on fewer orientations and still cover the full rotation group.
  • The framework may interact with existing data-augmentation strategies, so measuring their combined effect on sample efficiency would be a natural next experiment.
  • Extending the chain to handle partial occlusions or changing camera intrinsics would test whether the equivariance remains useful under realistic visual variation.

Load-bearing premise

The performance gains result from the imposed rotational symmetry rather than from differences in model capacity or training details that were not fully ablated.

What would settle it

An experiment that keeps parameter count and training schedule identical but removes the equivariance operations from both EquiPerceptor and EquiActor, then checks whether success rates fall back to the non-equivariant baseline levels.

Figures

Figures reproduced from arXiv: 2606.19784 by An Thai Le, Duy M. H. Nguyen, Gia-Binh Nguyen, Long Dinh, Minh Duc Nguyen, Minh N. Vu, Ngo Anh Vien, Pham Tri Quang, Quang-Tan Nguyen, Thien-Loc Ha, Trong-Bao Ho.

Figure 1: EquiVLA architecture. (Top) EquiPerceptor symmetrizes the frozen VLM's visual pipeline via Token-level Frame Averaging, producing equivariant tokens z_eq and invariant tokens z_inv. The latter are fed into the frozen VLM alongside the wrist image and language instruction to produce context tokens z_ctx, fused with z_eq via the Equivariant Adapter to form the visual representation passed to EquiActor. (Bot…

Figure 2: Real-robot tasks. For each task, the first and last frames show the initial and goal states, respectively. See Appendix A for full task descriptions.

Figure 3: Representative real-robot rollouts on the five Mobile ALOHA tasks. Each row shows six …

Figure 4: Reports the per-group equivariance error ϵ_eq at each C8 rotation angle, complementing the aggregate statistics in …

Figure 5: Example LIBERO-Object scenes at each rotation angle used in the controlled orientation …

Figure 6: Success rate on LIBERO-Object under controlled scene rotations from …
original abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm for generalist robot manipulation, yet they lack geometric inductive biases: policies trained at specific orientations require substantially more data to generalize across rotational configurations. We present EquiVLA, the first general framework for end-to-end SO(2)-equivariant VLA models, applicable to any architecture coupling a frozen vision-language backbone with a flow-matching Diffusion Transformer action head. EquiVLA introduces EquiPerceptor, which produces approximately SO(2)-equivariant visual representations from frozen ViT features; and EquiActor, an exactly SO(2)-equivariant flow-matching Diffusion Transformer action head. Together, they establish an approximate SO(2) equivariance chain from camera observations to predicted action sequences. Instantiated on GR00T N1.5 and evaluated across four LIBERO suites, CALVIN ABCD→D, and five real-robot tasks on Mobile ALOHA, EquiVLA achieves 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and improves real-robot success from 54% to 72%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents EquiVLA as the first general framework for end-to-end SO(2)-equivariant Vision-Language-Action models. It introduces EquiPerceptor to generate approximately SO(2)-equivariant visual representations from frozen ViT features of a vision-language backbone, and EquiActor as an exactly SO(2)-equivariant flow-matching Diffusion Transformer for the action head. The framework is instantiated on GR00T N1.5 and evaluated on LIBERO, CALVIN, and real-robot tasks on Mobile ALOHA, reporting 92.6% average success on LIBERO (vs. 78.1% baseline), an average sequence length of 4.03 on CALVIN (vs. 3.45), and 72% real-robot success (vs. 54%).

Significance. If the gains are attributable to the approximate-to-exact SO(2) equivariance chain, the work would meaningfully advance VLA models by adding geometric inductive biases that improve rotational generalization. The multi-benchmark evaluation spanning simulation suites and real-robot tasks on Mobile ALOHA is a positive aspect of the empirical contribution.

major comments (2)
  1. [§3.2] §3.2 (EquiPerceptor construction): The method produces approximately SO(2)-equivariant features from a frozen non-equivariant ViT backbone, but provides no quantitative verification (e.g., feature transformation tests under explicit SO(2) rotations of input images) that the output representations transform as required; this verification is load-bearing for the end-to-end equivariance chain asserted in the abstract and for attributing gains to symmetry.
  2. [§5] §5 (Experiments, LIBERO and CALVIN results): The reported deltas (92.6% vs 78.1% on LIBERO; 4.03 vs 3.45 on CALVIN) compare against baselines without ablations that exactly match parameter count, training schedule, and optimizer while disabling the group-equivariant layers in both EquiPerceptor and EquiActor; without these controls the performance improvements cannot be attributed to the equivariance mechanism rather than capacity or training differences.
minor comments (1)
  1. [Abstract, §1] The abstract and §1 would benefit from an explicit statement of the precise group representation used for the SO(2) action on image features and actions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3.2] §3.2 (EquiPerceptor construction): The method produces approximately SO(2)-equivariant features from a frozen non-equivariant ViT backbone, but provides no quantitative verification (e.g., feature transformation tests under explicit SO(2) rotations of input images) that the output representations transform as required; this verification is load-bearing for the end-to-end equivariance chain asserted in the abstract and for attributing gains to symmetry.

    Authors: We agree that quantitative verification of the approximate SO(2) equivariance in EquiPerceptor is important for supporting the claimed chain and for attributing performance gains. In the revised manuscript we will add explicit feature transformation tests: we will rotate input images by multiples of 90 degrees, extract EquiPerceptor features, apply the corresponding group action to the features, and report the L2 discrepancy between the transformed and directly computed features, averaged over the evaluation sets. revision: yes

  2. Referee: [§5] §5 (Experiments, LIBERO and CALVIN results): The reported deltas (92.6% vs 78.1% on LIBERO; 4.03 vs 3.45 on CALVIN) compare against baselines without ablations that exactly match parameter count, training schedule, and optimizer while disabling the group-equivariant layers in both EquiPerceptor and EquiActor; without these controls the performance improvements cannot be attributed to the equivariance mechanism rather than capacity or training differences.

    Authors: We acknowledge that the current baselines do not include parameter-matched ablations that isolate the effect of the group-equivariant layers. In the revision we will add controlled ablations on both LIBERO and CALVIN: we will replace the SO(2)-equivariant layers in EquiPerceptor and EquiActor with standard (non-equivariant) counterparts while preserving identical parameter counts, training schedules, optimizers, and data, and report the resulting performance to better attribute gains to the symmetry mechanism. revision: yes
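
The feature transformation test promised in the first response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `pooled_features` and `biased_features` are hypothetical toy extractors standing in for EquiPerceptor and a non-equivariant baseline, and the group action on features is assumed to be the same spatial rotation applied to the feature grid.

```python
import numpy as np

def pooled_features(image, grid=4):
    """Toy stand-in extractor: average-pool an HxW image to a grid x grid map."""
    h, w = image.shape
    return image.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))

def biased_features(image, grid=4):
    """Toy non-equivariant extractor: pooled features plus a fixed positional ramp."""
    ramp = np.arange(grid * grid, dtype=float).reshape(grid, grid)
    return pooled_features(image, grid) + ramp

def equivariance_error(extract, images, k):
    """Mean L2 discrepancy between 'rotate then extract' and 'extract then rotate'.

    k counts 90-degree counter-clockwise turns. The feature-space group
    action is assumed to be the same rotation of the feature grid.
    """
    errors = []
    for img in images:
        direct = extract(np.rot90(img, k))        # rotate the input, then extract
        transformed = np.rot90(extract(img), k)   # extract, then rotate the features
        errors.append(np.linalg.norm(direct - transformed))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16, 16))  # synthetic stand-in for an evaluation set
for k in range(4):
    print(k * 90,
          equivariance_error(pooled_features, images, k),
          equivariance_error(biased_features, images, k))
```

In this toy setting the pooled extractor shows an error at floating-point level for every rotation, while the positionally biased one shows a clearly nonzero error at 90, 180, and 270 degrees. A real test would substitute the actual feature extractor and the feature-space representation the paper defines.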

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents EquiVLA as an empirical framework that combines an approximate-equivariant EquiPerceptor module on frozen ViT features with an exactly equivariant EquiActor Diffusion Transformer head, then reports benchmark improvements on LIBERO, CALVIN, and real-robot tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce any claimed result to its own inputs by construction. The performance numbers are presented as direct empirical comparisons, not as outputs of a closed mathematical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the new modules EquiPerceptor and EquiActor plus the assumption that their combination yields a usable equivariance chain; no free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Frozen ViT features can be transformed into approximately SO(2)-equivariant representations via EquiPerceptor without retraining the backbone.
    This premise is required for the approximate equivariance chain to hold while keeping the vision-language backbone frozen.
invented entities (2)
  • EquiPerceptor (no independent evidence)
    purpose: Produce approximately SO(2)-equivariant visual representations from frozen ViT features
    New module introduced to bridge the frozen backbone to the equivariant action head.
  • EquiActor (no independent evidence)
    purpose: Exactly SO(2)-equivariant flow-matching Diffusion Transformer action head
    New module introduced to enforce exact equivariance on action prediction.
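
The domain assumption in the ledger, that a frozen non-equivariant feature map can be made equivariant without retraining, can be illustrated with plain group averaging over the four 90-degree rotations. This is a generic sketch, not the paper's Token-level Frame Averaging: `frozen_backbone` is a hypothetical toy function, and the discrete group C4 is used here as a simple stand-in for SO(2).

```python
import numpy as np

def frozen_backbone(image):
    """Hypothetical stand-in for a frozen, non-equivariant feature map (HxW -> HxW)."""
    h, w = image.shape
    position_bias = np.linspace(0.0, 1.0, h * w).reshape(h, w)
    return np.tanh(image) + position_bias  # the position bias breaks equivariance

def c4_symmetrize(f, image):
    """Average f over the four 90-degree rotations, undoing each rotation on the output."""
    outputs = [np.rot90(f(np.rot90(image, k)), -k) for k in range(4)]
    return np.mean(outputs, axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
rotated = np.rot90(image, 1)

# Gap between "rotate then extract" and "extract then rotate", before and after averaging.
raw_gap = np.linalg.norm(
    frozen_backbone(rotated) - np.rot90(frozen_backbone(image), 1))
sym_gap = np.linalg.norm(
    c4_symmetrize(frozen_backbone, rotated)
    - np.rot90(c4_symmetrize(frozen_backbone, image), 1))
print(raw_gap, sym_gap)
```

The averaged map is equivariant to 90-degree rotations up to floating-point error even though the wrapped function is not, and the wrapped function's weights are never touched. Rotations that are not multiples of 90 degrees would need interpolation on a pixel grid, which this sketch does not cover.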
