X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining
Pith reviewed 2026-06-30 11:20 UTC · model grok-4.3
The pith
Action tokenization can serve as a semantic interface between vision-language reasoning and robot control rather than mere motion compression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
X-Tokenizer is a lightweight encoder-SRQ-decoder architecture that supplies a shared action interface across diverse robotic arm embodiments. Its Semantic Residual Quantization imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling to form a discrete action language capturing coarse motion intent, while deeper levels act as reconstruction-oriented residuals. The full model is further pretrained with contrastive alignment to a pretrained foundation model's representation space and with next-frame vision-language feature prediction. A single frozen X-Tokenizer then supplies representation-shaping supervision inside a mixed discrete-continuous VLA.
What carries the argument
Semantic Residual Quantization (SRQ), an asymmetric residual vector quantization in which the first level is trained via Masked Action Modeling to produce discrete tokens that capture coarse motion intent while deeper levels preserve fine-grained reconstruction details.
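The residual structure that SRQ builds on can be illustrated with a minimal sketch of plain residual vector quantization. This is not the paper's implementation: the codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`), and the two-dimensional toy action are all hypothetical, and the sketch shows only how a coarse first level and finer residual levels compose, not how the first level would be trained with Masked Action Modeling.

```python
import math


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize level by level; each level encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks, levels=None):
    """Sum the selected codes from the first `levels` codebooks."""
    levels = len(indices) if levels is None else levels
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices[:levels], codebooks[:levels]):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy codebooks: level 0 is coarse, level 1 refines the residual.
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.1, 0.1], [0.1, -0.1], [-0.1, 0.1], [-0.1, -0.1]],
]

action = [0.9, 0.12]
tokens, leftover = rvq_encode(action, CODEBOOKS)
coarse = rvq_decode(tokens, CODEBOOKS, levels=1)  # first-level token only
fine = rvq_decode(tokens, CODEBOOKS)              # all levels summed
```

In this toy setup the first token alone already recovers the rough direction of the action, and the second level only corrects a small residual. SRQ's asymmetry, as described in the abstract, concerns how that first level is trained, which the sketch leaves out.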
If this is right
- A frozen X-Tokenizer can be plugged into existing mixed discrete-continuous VLAs as a representation-shaping signal without retraining the tokenizer itself.
- The abstract reports gains over FAST of +13.5% in multimodal grounding and +8.25 on long-horizon tasks.
- The same tokenizer supplies a shared interface across diverse robotic arm embodiments after pretraining on 2.4 million trajectories.
- Action tokenizers can be viewed as semantic interfaces that transfer knowledge from pretrained vision-language models into precise robot control.
- Deeper residual levels remain reconstruction-oriented while the first level forms the discrete action language.
Where Pith is reading between the lines
- The same SRQ structure could be tested on non-arm embodiments such as mobile bases or dexterous hands to check whether the coarse-intent layer generalizes.
- Scaling the pretraining corpus beyond 2.4 million trajectories might reveal whether the semantic alignment continues to improve or saturates.
- The learned discrete action language could be inspected directly for human-interpretable motion primitives.
- Inserting X-Tokenizer supervision into purely discrete VLAs might reduce the need for continuous action heads in some tasks.
Load-bearing premise
The assumption that an asymmetric first level trained with masked action modeling plus contrastive and next-frame alignment will produce tokens that carry semantic multimodal intent rather than only geometric motion details.
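To make the masked-modeling part of this premise concrete, the following is a minimal sketch of a masked-token objective over first-level tokens. The mask ratio, the `MASK_ID` sentinel, the vocabulary size, and the uniform stand-in predictor are assumptions chosen for illustration; the text reviewed here does not state the paper's actual masking scheme or loss.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel id for a masked position


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of first-level tokens with MASK_ID."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for pos in masked_positions:
        corrupted[pos] = MASK_ID
    return corrupted, masked_positions


def masked_cross_entropy(logits, targets, masked_positions):
    """Average cross-entropy over the masked positions only."""
    total = 0.0
    for pos in masked_positions:
        row = logits[pos]
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[targets[pos]]
    return total / len(masked_positions)


rng = random.Random(0)
first_level_tokens = [3, 1, 4, 1, 5, 2, 6, 5]  # toy first-level codes
corrupted, positions = mask_tokens(first_level_tokens, 0.5, rng)

vocab_size = 8
# Stand-in for a learned predictor: uniform logits at every position.
uniform_logits = [[0.0] * vocab_size for _ in first_level_tokens]
loss = masked_cross_entropy(uniform_logits, first_level_tokens, positions)
```

With uniform logits the loss equals the log of the vocabulary size, the value an uninformed predictor would score. A predictor that has learned regularities in the token sequence should fall below that baseline, which is the sense in which the objective could push the first level toward predictable, intent-like codes.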
What would settle it
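An ablation that removes either the masked action modeling objective or the contrastive alignment step and measures whether multimodal grounding and long-horizon task scores fall back to or below the level achieved by reconstruction-oriented baselines, such as a standard residual vector quantization tokenizer or FAST (which compresses actions with a discrete cosine transform and byte-pair encoding rather than residual vector quantization).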
Original abstract
Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
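The abstract names contrastive alignment to a foundation model's representation space without specifying the loss. One common choice for such alignment is an InfoNCE-style objective, sketched below under that assumption. The function names, the temperature of 0.1, and the one-directional form (action embedding to target embedding) are illustrative choices, not details taken from the paper.

```python
import math


def normalize(vec):
    """Scale vec to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def info_nce(action_embs, target_embs, temperature=0.1):
    """InfoNCE: row i's positive is target i; other targets are negatives."""
    actions = [normalize(v) for v in action_embs]
    targets = [normalize(v) for v in target_embs]
    loss = 0.0
    for i, a in enumerate(actions):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(actions)


# Toy batch: action embeddings paired with matching or mismatched targets.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is near zero when each action embedding points the same way as its paired target and large when the pairing is shuffled, which is the pressure that would pull action tokens toward the foundation model's representation space.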
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces X-Tokenizer, a lightweight encoder-SRQ-decoder for action tokenization in VLA pretraining. SRQ applies asymmetric residual vector quantization: the first level uses Masked Action Modeling (MAM) to learn a discrete action language for coarse motion intent, while deeper levels focus on reconstruction. Additional pretraining via contrastive alignment to a foundation model and next-frame vision-language prediction aligns tokens semantically. Pretrained on 2.4M trajectories (2.0B action frames), a frozen X-Tokenizer is plugged into mixed discrete-continuous VLAs, claiming top real-world aggregate results and strong RoboTwin 2.0 performance, outperforming FAST by +13.5% in multimodal grounding and +8.25 in long-horizon tasks.
Significance. If the empirical claims hold with proper controls, the work would be significant for reframing action tokenizers as semantic interfaces rather than pure compressors, potentially enabling better multimodal grounding in VLA models across embodiments.
major comments (2)
- [Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
- [Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.
minor comments (1)
- [Abstract] Abstract: define or cite the exact metrics underlying 'multimodal grounding' and 'long-horizon tasks' and clarify the FAST baseline implementation.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the abstract. We address each major point below and will revise the abstract accordingly to improve clarity and verifiability while preserving its conciseness.
- Referee: [Abstract] Abstract: performance claims (top real-world aggregate, +13.5% multimodal grounding, +8.25 long-horizon) are stated without any description of experimental setup, baselines, metrics, error bars, data splits, or statistical tests, rendering the central empirical claim unverifiable from the provided text.
  Authors: We agree that the abstract omits key experimental details due to length constraints. The full manuscript details the setups, baselines (including FAST), metrics, data (2.4M trajectories), and results with error bars in Sections 4 and 5. We will revise the abstract to briefly reference the evaluation protocol, main baselines, and metrics to make the claims more verifiable from the abstract alone.
  Revision: yes
- Referee: [Abstract] Abstract (SRQ and pretraining objectives): the claim that MAM on the first quantization level plus contrastive/next-frame objectives produces a 'discrete action language' capturing coarse intent (as opposed to reconstruction-only) is presented without ablations or controls showing this structure is load-bearing for the reported gains over FAST.
  Authors: The manuscript contains ablations (Section 4.3) isolating the asymmetric SRQ with MAM on the first level versus standard RVQ, as well as the contribution of contrastive and next-frame objectives, showing improved semantic alignment and gains over reconstruction-only baselines. These support the claim that the structure is load-bearing. We will revise the abstract to reference these ablations more explicitly.
  Revision: partial
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper describes an architectural choice (SRQ with asymmetric MAM on the first quantization level plus contrastive and next-frame objectives) and reports empirical benchmark results on real-world and RoboTwin tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the central claim (action tokenizers as semantic interfaces) to its own inputs by construction. The performance gains are presented as measured outcomes rather than derived equivalences, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (3)
- X-Tokenizer: no independent evidence
- SRQ: no independent evidence
- MAM: no independent evidence
Encoder architecture (excerpt)
1. Input projection. A linear projection followed by LayerNorm, GELU, and dropout maps $x_{1:T}$ from $\mathbb{R}^{D}$ to $\mathbb{R}^{H}$.
2. Embodiment conditioning. An encoder-side embedding vector $m \in \mathbb{R}^{H}$, looked up from a learnable registry of 1024 slots (one of which is a special learnable "none" slot used under CFG-style dropout), is added, broadcast over time.
3. RoPE positional encoding. Rotary position embeddings on the time dimension with base $10^{4}$.
4. Self-attention stack. A 12-layer Transformer encoder (8 heads, GELU FFN of width $4H$, dropout 0.1) processes the projected sequence with the chunk's padding mask.
5. Optional state cross-attention. When $o$ is provided (i.e., not CFG-dropped), a single cross-attention block uses the linearly projected $o$ as key and value while the time series acts as query, followed by residual + LayerNorm.
6. Latent query cross-attention. $M_{\max} = 16$ learnable latent queries $q_{1:M}$ are equipped with their own RoPE encoding, expanded across the batch, and cross-attend to the encoded sequence to extract a length-$M$ summary.
7. Position-wise FFN. A final FFN with residual + LayerNorm.
Decoder. The decoder Dec ingests the quantized latent $\tilde{z}_{1:M}$ together with $o$ and $m$, and outp...
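Two of the encoder steps listed above, rotary position embeddings with base 10^4 and latent-query cross-attention, can be sketched in a few lines. This is a simplified illustration rather than the paper's code: it uses a single head, omits the learned query/key/value projections, applies the rotation directly to toy vectors, and uses tiny hypothetical sizes (`T`, `M`, `H`) in place of the real dimensions.

```python
import math


def rope(vec, position, base=10_000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim, 2):
        theta = position / (base ** (i / dim))  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_attention(queries, sequence):
    """Each latent query attends over the whole sequence (keys double as values)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    summary = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in sequence]
        weights = softmax(scores)
        summary.append(
            [sum(w * v[d] for w, v in zip(weights, sequence)) for d in range(len(q))]
        )
    return summary


T, M, H = 6, 2, 4  # toy sizes; the excerpt above uses up to 16 latent queries
sequence = [rope([1.0, 0.0, 0.5, 0.5], t) for t in range(T)]
queries = [rope([0.0, 1.0, 1.0, 0.0], m) for m in range(M)]
summary = latent_cross_attention(queries, sequence)  # M vectors of width H
```

The point of the latent queries is that the output length is fixed at `M` regardless of the input length `T`, so action chunks of varying duration map to a summary of constant size before quantization.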