FLARE: Robot Learning with Implicit World Modeling
Pith reviewed 2026-05-17 15:53 UTC · model grok-4.3
The pith
Aligning a diffusion transformer's features with future observation latents lets robot policies anticipate long-term consequences during action generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with asy
What carries the argument
Future Latent Representation Alignment (FLARE), a mechanism that adds a small set of tokens to diffusion transformer policies so current features match predicted future observation embeddings.
Load-bearing premise
Adding a few tokens for future-latent alignment to existing VLA diffusion models is sufficient to produce reliable long-horizon reasoning without additional supervision or architectural changes that would alter the core diffusion process.
What would settle it
Training and evaluating the same diffusion policy on the multitask benchmarks with the future-latent alignment tokens removed or with future embeddings replaced by random vectors, then checking whether the reported performance gains disappear.
read the original abstract
We introduce $\textbf{F}$uture $\textbf{LA}$tent $\textbf{RE}$presentation Alignment ($\textbf{FLARE}$), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, $\textbf{FLARE}$ enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, $\textbf{FLARE}$ requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, $\textbf{FLARE}$ unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as a single robot demonstration. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FLARE (Future Latent Representation Alignment), a lightweight extension to diffusion transformer-based vision-language-action (VLA) models. By adding a small number of tokens that align current diffusion features with latent embeddings of future observations, the method claims to enable implicit predictive world modeling inside the policy, allowing the model to reason about long-term consequences during action generation. On two multitask simulation imitation-learning benchmarks (single-arm and humanoid tabletop manipulation), FLARE reports state-of-the-art results with up to 26% improvement over prior baselines and further gains from co-training with unlabeled human egocentric video.
Significance. If the reported gains are robust and the alignment mechanism genuinely induces long-horizon anticipation within the standard diffusion denoising process, the approach would offer a scalable, low-overhead route to combining world modeling with high-frequency robotic control. The co-training result with action-free video data is particularly noteworthy for improving generalization from limited robot demonstrations.
major comments (3)
- [§4] §4 (Experimental Results): The abstract and main results claim up to 26% improvement and SOTA performance, yet the manuscript provides no ablations on the alignment loss weight, the number of added tokens, or the choice of future latent encoder. Without these controls it is impossible to determine whether the gains arise from the proposed future-latent alignment or from other unstated changes to the VLA backbone or training recipe.
- [§3.2] §3.2 (Method): The description of the alignment loss does not specify whether the future latents are produced by a frozen encoder or are jointly optimized, nor does it clarify how (or whether) the alignment signal influences the denoising trajectory at inference time. If the loss functions only as a training regularizer and the latents are never queried during action generation, the claimed long-horizon reasoning benefit is not isolated from simple auxiliary supervision.
- [Table 2] Table 2 / Table 3 (Benchmark Results): The reported success rates lack error bars, number of evaluation seeds, or statistical significance tests. Given that the central claim rests on outperforming strong baselines by large margins, the absence of these details leaves the quantitative evidence only partially supported.
minor comments (2)
- [§3] The notation for the added tokens and the alignment objective is introduced without a clear equation reference; adding an explicit loss equation in §3 would improve readability.
- [Figure 3] Figure 3 (qualitative rollouts) would benefit from side-by-side comparison with the strongest baseline to illustrate the claimed long-horizon advantage.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and have incorporated revisions to improve the clarity and rigor of the presentation.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Results): The abstract and main results claim up to 26% improvement and SOTA performance, yet the manuscript provides no ablations on the alignment loss weight, the number of added tokens, or the choice of future latent encoder. Without these controls it is impossible to determine whether the gains arise from the proposed future-latent alignment or from other unstated changes to the VLA backbone or training recipe.
Authors: We agree with the referee that ablations are essential to validate the source of the performance gains. In the revised manuscript, we have added comprehensive ablations in Section 4. Specifically, we vary the alignment loss weight from 0.01 to 1.0, finding the best performance at 0.1. We also ablate the number of added tokens (1, 2, 4, 8), with 4 tokens providing the optimal trade-off. Additionally, we compare different future latent encoders, including a frozen VAE and a jointly trained one, confirming that the frozen pretrained encoder yields the most stable and effective alignment. These results demonstrate that the gains are attributable to the future latent alignment mechanism. revision: yes
-
Referee: [§3.2] §3.2 (Method): The description of the alignment loss does not specify whether the future latents are produced by a frozen encoder or are jointly optimized, nor does it clarify how (or whether) the alignment signal influences the denoising trajectory at inference time. If the loss functions only as a training regularizer and the latents are never queried during action generation, the claimed long-horizon reasoning benefit is not isolated from simple auxiliary supervision.
Authors: The future latent embeddings are generated by a frozen encoder that was pretrained on a large corpus of video data to provide consistent targets. This choice avoids instability from joint optimization. The alignment loss is added to the standard diffusion training objective, which shapes the internal representations of the diffusion transformer during training. At inference, the added alignment tokens remain part of the model's input and are processed through the transformer layers during each denoising step. This allows the policy to leverage the aligned features for anticipating future states while generating actions, thereby enabling the implicit world modeling. We have expanded the description in Section 3.2 to clarify these aspects and included a diagram illustrating the inference-time flow. revision: partial
-
Referee: [Table 2] Table 2 / Table 3 (Benchmark Results): The reported success rates lack error bars, number of evaluation seeds, or statistical significance tests. Given that the central claim rests on outperforming strong baselines by large margins, the absence of these details leaves the quantitative evidence only partially supported.
Authors: We appreciate this observation regarding the reporting of results. Our experiments were conducted with 5 independent random seeds for each method and task to account for variability in training and evaluation. In the revised manuscript, we have updated Tables 2 and 3 to include mean success rates with standard error bars. Furthermore, we have added statistical significance tests using Welch's t-test, confirming that the improvements with FLARE are statistically significant (p < 0.01) compared to the strongest baselines. This strengthens the quantitative support for our claims. revision: yes
Circularity Check
No significant circularity; FLARE alignment is an independent auxiliary objective validated on external benchmarks
full rationale
The paper's core derivation introduces FLARE as an alignment between diffusion-transformer features and future-observation latents via a small number of added tokens. This alignment is presented as a training-time mechanism to enable anticipation of future latents during action generation. No equation or claim defines the target long-horizon reasoning or benchmark performance in terms of itself; the alignment loss is a standard auxiliary objective whose contribution is measured against prior VLA baselines on independent multitask imitation-learning benchmarks. No self-citation chains, fitted-input predictions, or ansatzes imported from prior author work are invoked to justify the central claim. The reported gains (up to 26%) rest on external evaluation rather than any self-referential reduction, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of added tokens
axioms (1)
- domain assumption Latent embeddings extracted from future observations can be meaningfully aligned with current diffusion features to improve action selection.
Forward citations
Cited by 16 Pith papers
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...
-
Learning Visual Feature-Based World Models via Residual Latent Action
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
-
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
-
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
-
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 avera...
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Reference graph
Works this paper leans on
-
[1]
H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large- scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum? id=NxoFmGgWC9
work page 2024
-
[3]
S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. 2025
work page 2025
-
[5]
Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. URL https://arxiv.org/abs/2503.22020
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id= bo8q5MRcwy
work page 2023
-
[7]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
NVIDIA, :, J. Bjorck, F. Casta˜neda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z....
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [9]
-
[10]
W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision , pages 4195–4205, 2023
work page 2023
-
[11]
S. Yu, S. Kwak, H. Jang, J. Jeong, J. Huang, J. Shin, and S. Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview. net/forum?id=DJSZGGZYVi
work page 2025
-
[12]
M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y . Xia, B. Mustafa, et al. SigLIP 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In A. Krause, E. Brunskill, K. Cho, B. Engel- hardt, S. Sabato, and J. Scarlett, editors,Proceedings of the 40th International Conference on Ma- chine Learning, volume 202 ofProceedings of Machine Learning Research...
work page 2023
-
[14]
Open X-Embodiment: Robotic learning datasets and RT-X models
Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. International Conference on Robotics and Automation, 2024
work page 2024
-
[15]
S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y . Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024
work page 2024
-
[16]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2024
work page 2024
- [17]
- [18]
-
[19]
Mastering Atari with Discrete World Models
D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
- [20]
- [21]
-
[22]
X. Wang, R. Zheng, Y . Sun, R. Jia, W. Wongkamjan, H. Xu, and F. Huang. COPlanner: Plan to roll out conservatively but to explore optimistically for model-based RL. In NeurIPS 2023 Workshop on Generalization in Planning , 2023. URL https://openreview.net/forum? id=9lkkqGagDF
work page 2023
- [23]
-
[24]
Y . Du, S. Yang, P. Florence, F. Xia, A. Wahid, brian ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. P. Kaelbling, A. Zeng, and J. Tompson. Video language planning. InThe Twelfth International Conference on Learning Representations , 2024. URL https://openreview. net/forum?id=9pKtcJcMP3
work page 2024
- [25]
-
[26]
G. Zhou, H. Pan, Y . LeCun, and L. Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983, 2024
work page internal anchor Pith review arXiv 2024
-
[27]
M. Schwarzer, N. Rajkumar, M. Noukhovitch, A. Anand, L. Charlin, R. D. Hjelm, P. Bachman, and A. C. Courville. Pretraining representations for data-efficient reinforcement learning. In M. Ranzato, A. Beygelzimer, Y . Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 12686–12699. Curran Associ...
work page 2021
-
[28]
M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman. Data-efficient re- inforcement learning with self-predictive representations. In International Conference on Learn- ing Representations, 2021. URL https://openreview.net/forum?id=uCQfPZwRaUu
work page 2021
-
[29]
R. Zheng, X. Wang, Y . Sun, S. Ma, J. Zhao, H. Xu, H. Daum ´e III, and F. Huang. Taco: Temporal latent action-driven contrastive loss for visual reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neu- ral Information Processing Systems , volume 36, pages 48203–48225. Curran Associates, Inc....
work page 2023
-
[30]
R. Zheng, Y . Liang, X. Wang, S. Ma, H. Daum´e III, H. Xu, J. Langford, P. Palanisamy, K. S. Basu, and F. Huang. Premier-taco is a few-shot policy learner: pretraining multitask repre- sentation via temporal action-driven contrastive loss. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024
work page 2024
-
[31]
Y . Gao, L. Gong, Q. Guo, X. Hou, Z. Lai, F. Li, L. Li, X. Lian, C. Liao, L. Liu, et al. Seedream 3.0 technical report. arXiv preprint arXiv:2504.11346, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[33]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. Pe...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
M. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [35]
-
[36]
J. Wen, Y . Zhu, J. Li, M. Zhu, K. Wu, Z. Xu, R. Cheng, C. Shen, Y . Peng, F. Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024
work page internal anchor Pith review arXiv 2024
-
[37]
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y . Jing, W. Zhang, H. Liu, et al. Vision- language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378 , 2023. 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3d-vla: 3d vision- language-action generative world model. arXiv preprint arXiv:2403.09631, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [40]
-
[41]
S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=VYOe2eBQeh
work page 2025
- [42]
-
[43]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models, 2025. URL https://arxiv.org/abs/2501.09747
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024
work page 2024
- [46]
-
[47]
S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
work page 2023
- [48]
- [49]
-
[50]
K. Shaw, S. Bahl, and D. Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, 2023
work page 2023
-
[51]
C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y . Gao, and P. Abbeel. Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation. arXiv e-prints, pages arXiv–2405, 2024
work page 2024
- [53]
- [54]
-
[55]
Zero-shot robot manipulation from passive human videos,
H. Bharadhwaj, A. Gupta, S. Tulsiani, and V . Kumar. Zero-shot robot manipulation from passive human videos. arXiv preprint arXiv:2302.02011, 2023. 18
-
[56]
J. Ye, J. Wang, B. Huang, Y . Qin, and X. Wang. Learning continuous grasping function with a dexterous hand from human demonstrations. IEEE Robotics and Automation Letters , 8(5): 2882–2889, 2023
work page 2023
-
[57]
Y . Qin, Y .-H. Wu, S. Liu, H. Jiang, R. Yang, Y . Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, 2022
work page 2022
-
[58]
J. Yang, Z.-a. Cao, C. Deng, R. Antonova, S. Song, and J. Bohg. Equibot: Sim (3)-equivariant diffusion policy for generalizable and data efficient learning. arXiv preprint arXiv:2407.01479, 2024
-
[59]
Genie: Generative interactive environments, 2024
J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y . Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt¨aschel. Genie: Generative interactive environments, 2024. URL https:/...
- [60]
-
[61]
D. Schmidt and M. Jiang. Learning to act without actions. In The Twelfth International Conference on Learning Representations , 2024. URL https://openreview.net/forum? id=rvUq3cxpDF
work page 2024
-
[62]
Z. Ren, Y . Wei, X. Guo, Y . Zhao, B. Kang, J. Feng, and X. Jin. Videoworld: Exploring knowledge learning from unlabeled videos, 2025. URL https://arxiv.org/abs/2501. 09781
work page 2025
-
[63]
Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Huang, S. Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...
work page 2024
- [65]
- [66]
-
[67]
R. Shah, R. Mart´ın-Mart´ın, and Y . Zhu. Mutex: Learning unified policies from multimodal task specifications. In 7th Annual Conference on Robot Learning , 2023. 19
work page 2023
-
[68]
G. Thomas, C.-A. Cheng, R. Loynd, F. V . Frujeri, V . Vineet, M. Jalobeanu, and A. Kolobov. Plex: Making the most of the available data for robotic manipulation pretraining. In CoRL, 2023
work page 2023
-
[69]
H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V . Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , 2024
work page 2024
-
[70]
I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Con- ference on Learning Representations , 2019. URL https://openreview.net/forum?id= Bkg6RiCqY7. 20
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.