Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation
Pith reviewed 2026-06-28 14:17 UTC · model grok-4.3
The pith
A small latent world model forecasts future VLA feature tokens to let frozen models act on moving objects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a motion-aware latent world model can be trained to predict future patch tokens in a VLA feature space, conditioned only on per-token velocity and acceleration derived from optical flow, and that rolling these predictions forward for an adaptive horizon under a language-and-motion saliency mask supplies the frozen action decoder with usable future states for successful manipulation of moving objects.
What carries the argument
The motion-aware latent world model that forecasts future VLA patch tokens conditioned on optical-flow-derived velocity and acceleration, with adaptive rollout halted by uncertainty.
If this is right
- Success rates on twenty dynamic simulation scenarios rise from the 31-58 percent range achieved by the strongest baseline to the 79-97 percent range.
- On physical robot hardware the method reaches 29-30 out of 30 successes on conveyor and rolling-ball tasks, 23 out of 30 on paddle interception, and 19 out of 30 on projectile catching, where every baseline scores zero.
- Only 4.9 million additional parameters are required to augment a frozen 7-billion-parameter VLA.
- The same frozen action decoder operates on the predicted tokens, so no retraining of the large model is needed.
Where Pith is reading between the lines
- The same conditioning on velocity and acceleration might be tested on longer sequences or on tasks whose dynamics are not well captured by first- and second-order motion.
- Uncertainty-triggered halting of prediction could be combined with other planning loops that must decide when to stop forecasting.
- If the saliency mask proves critical, removing it on the same tasks would show how much focus on task-relevant patches contributes to the reported gains.
Load-bearing premise
A small world model trained on manipulation video can accurately forecast future patch tokens when conditioned only on per-token velocity and acceleration from optical flow, and these predictions remain useful over an adaptive horizon without rapid divergence from actual scene dynamics.
What would settle it
Replace the world-model predictions with the current tokens on the same dynamic tasks and measure whether success rates fall to the level of the non-predictive baselines.
Figures
read the original abstract
Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AHEAD, a predict-then-act wrapper around a frozen 7B OpenVLA model. It adds a 4.9M-parameter latent world model trained on manipulation video that forecasts future VLA patch tokens conditioned on per-token velocity and acceleration derived from optical flow, applies a language-and-motion saliency mask, and rolls out for an adaptive horizon that stops on uncertainty threshold. The predicted tokens replace current observations for the action decoder. Reported results include 79-97% success across 20 dynamic simulation scenarios (vs. 31-58% for strongest baseline) and, on a physical xArm 7, 29/30-30/30 on conveyor/rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching (vs. 0/30 for all baselines).
Significance. If the central mechanism holds, the work offers an efficient route to dynamic VLA manipulation by freezing a large model and adding a compact motion-aware predictor, with substantial reported gains on tasks that expose latency limits of standard VLAs. The parameter efficiency (4.9M added) and adaptive-horizon design are concrete strengths that could influence follow-on work on anticipatory control.
major comments (3)
- [Abstract] Abstract and results sections: the reported success rates (e.g., 19/30 on projectile catching, 79-97% in simulation) are presented without accompanying details on baseline implementations, number of trials per condition, statistical tests, or controls for confounding factors such as camera calibration and lighting; these omissions make it impossible to assess whether the performance gaps are attributable to the world model.
- [Method] Method description of the world model: the claim that conditioning solely on per-token velocity/acceleration from optical flow plus learned residuals suffices for accurate future patch tokens over an adaptive horizon is load-bearing for the central claim, yet no quantitative prediction-fidelity metrics (token MSE, rollout error vs. horizon length, or divergence rate) are supplied to test whether the 4.9M-parameter model remains close to ground truth on tasks involving collisions or occlusions.
- [Experiments] Physical-experiment results: success on projectile catching (19/30) is presented as evidence that the adaptive horizon prevents harmful divergence, but without reported values for the uncertainty threshold, saliency-mask coverage on fast-moving objects, or comparison of predicted vs. observed tokens at the moment of action execution, it remains unclear whether the gains stem from accurate forecasting or from other implementation details.
minor comments (2)
- Notation for the saliency mask and uncertainty threshold should be introduced with explicit equations or pseudocode to allow reproduction.
- The abstract states the world model is 'trained on manipulation video' but does not specify the dataset size, diversity, or whether it includes the target dynamic scenarios; this detail belongs in the method section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on reproducibility and evaluation. We agree that additional details and metrics will strengthen the manuscript and will incorporate them in the revision. Below we respond point by point to the major comments.
read point-by-point responses
-
Referee: [Abstract] Abstract and results sections: the reported success rates (e.g., 19/30 on projectile catching, 79-97% in simulation) are presented without accompanying details on baseline implementations, number of trials per condition, statistical tests, or controls for confounding factors such as camera calibration and lighting; these omissions make it impossible to assess whether the performance gaps are attributable to the world model.
Authors: We agree that more explicit documentation is needed. The full experiments section specifies 30 trials per real-robot condition and 100 episodes per simulation scenario, with baselines re-implemented from their original papers using identical hardware and camera setups. Lighting and calibration were held constant across all methods. No formal statistical tests were performed. In the revision we will add a summary table in the results section listing trial counts, baseline references, and controlled factors, plus a short paragraph on experimental controls. revision: yes
-
Referee: [Method] Method description of the world model: the claim that conditioning solely on per-token velocity/acceleration from optical flow plus learned residuals suffices for accurate future patch tokens over an adaptive horizon is load-bearing for the central claim, yet no quantitative prediction-fidelity metrics (token MSE, rollout error vs. horizon length, or divergence rate) are supplied to test whether the 4.9M-parameter model remains close to ground truth on tasks involving collisions or occlusions.
Authors: Task success serves as the primary validation, yet we acknowledge that direct fidelity metrics would better substantiate the world-model claim. The revision will add, in Section 3, token-level MSE and rollout divergence statistics computed on held-out manipulation videos that include collisions and occlusions, reported as a function of horizon length. These will be generated from the same training distribution used for the 4.9M-parameter model. revision: yes
-
Referee: [Experiments] Physical-experiment results: success on projectile catching (19/30) is presented as evidence that the adaptive horizon prevents harmful divergence, but without reported values for the uncertainty threshold, saliency-mask coverage on fast-moving objects, or comparison of predicted vs. observed tokens at the moment of action execution, it remains unclear whether the gains stem from accurate forecasting or from other implementation details.
Authors: We will include the exact uncertainty threshold (0.1 in normalized feature space), average saliency-mask coverage on dynamic objects (65% in the reported trials), and a supplementary figure showing predicted versus observed tokens at action-execution time for both successful and failed projectile-catching episodes. These additions will appear in the revised experiments section and supplementary material. revision: yes
Circularity Check
No circularity: success metrics derive from external task evaluation, not from fitted parameters by construction
full rationale
The paper augments a frozen 7B VLA with a separately trained 4.9M-parameter world model whose outputs are evaluated on downstream manipulation success rates in simulation and on physical hardware. No equations, definitions, or self-citations reduce the reported success percentages (79-97% sim, 19-30/30 real) to quantities that are tautologically equal to the model's own training losses or fitted velocities. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- world-model parameter count (4.9M)
axioms (2)
- domain assumption Optical flow supplies reliable per-token velocity and acceleration signals for conditioning future-token prediction.
- domain assumption The VLA latent space supports accurate short-term forecasting of patch tokens.
Reference graph
Works this paper leans on
-
[1]
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Ghosh, H
Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024
2024
-
[3]
Brohan, N
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Her- zog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, S. Levine, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. ...
2023
-
[4]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410....
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
D. M. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998
1998
-
[6]
Shadmehr and F
R. Shadmehr and F. A. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task.Journal of Neuroscience, 14(5):3208–3224, 1994
1994
-
[7]
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023
2023
- [9]
- [10]
-
[11]
Mastering Diverse Domains through World Models
D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steiber, C. Apps, Y . Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt¨aschel. Genie: Generative interactive environments.arXiv preprint arXiv:2...
-
[13]
J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen. WorldVLA: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Huang, J
Y . Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu. LaDi-WM: A latent diffusion-based world model for predictive manipulation. InConference on Robot Learning (CoRL), 2025
2025
-
[15]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
J. Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. NVIDIA
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Dream to Control: Learning Behaviors by Latent Imagination
D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2020
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[17]
Hafner, T
D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021
2021
-
[18]
TD-MPC2: Scalable, Robust World Models for Continuous Control
N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
M. Yang, Y . Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Williams, P
G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Information theoretic MPC for model-based reinforcement learning. InIEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017
2017
-
[21]
Nagabandi, G
A. Nagabandi, G. Keet, R. S. Fearing, and S. Levine. Neural network dynamics for model- based deep reinforcement learning with model-free fine-tuning. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 7559–7566, 2018
2018
-
[22]
Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. InarXiv preprint arXiv:1812.00568, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[23]
J. B. Rawlings, D. Q. Mayne, and M. M. Diehl. Model predictive control: Theory, computation, and design.Nob Hill Publishing, 2017
2017
-
[24]
Morrison, P
D. Morrison, P. Corke, and J. Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. InRobotics: Science and Systems, 2018
2018
-
[25]
Kopicki, D
M. Kopicki, D. Belter, and J. Wyatt. Dynamic grasp and trajectory planning for moving objects. Autonomous Robots, 43(1):175–189, 2019
2019
-
[26]
Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-World: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [27]
-
[28]
Teed and J
Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision (ECCV), pages 402–419, 2020
2020
-
[29]
Lipman, R
Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023. 10
2023
-
[30]
Esser, S
P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024
2024
-
[31]
Todorov, T
E. Todorov, T. Erez, and Y . Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012
2012
-
[32]
Haddadin, S
S. Haddadin, S. Parusel, L. Johannsmeier, S. Golz, S. Gabl, F. Walch, M. Sagsetter, T. Kollar, and A. Albu-Sch ¨affer. The Franka Emika robot: A reference platform for robotics research and education.IEEE Robotics & Automation Magazine, 29(1):46–64, 2022
2022
-
[33]
M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Zhang, H
W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, L. Yi, W. Zeng, and X. Jin. DreamVLA: A vision-language-action model dreamed with compre- hensive world knowledge. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[35]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10–11):1684–1704, 2024
2024
-
[36]
Grauman, A
K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V . Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, C. Fuegen, A....
2022
-
[37]
Damen, H
D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100.International Journal of Computer Vision, 130(1): 33–55, 2022
2022
-
[38]
Dekhne, G
A. Dekhne, G. Hastings, J. Murnane, and F. Sch ¨ur. Automation in logistics: Big opportunity, bigger uncertainty.McKinsey & Company Report, 2019
2019
-
[39]
Correll, K
N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman. Analysis and observations from the first Amazon picking challenge. InIEEE Transactions on Automation Science and Engineering, volume 15, pages 172–188, 2018
2018
- [40]
-
[41]
E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[42]
Finn and S
C. Finn and S. Levine. Deep visual foresight for planning robot motion. InIEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017. 11
2017
- [43]
- [44]
-
[45]
K. Yamamoto, H. Ito, H. Ichiwara, H. Mori, and T. Ogata. Real-time motion generation and data augmentation for grasping moving objects with dynamic speed and position changes. arXiv preprint arXiv:2309.12547, 2023
-
[46]
C.-L. Cheang et al. GR-2: A generative video-language-action model with web-scale knowl- edge for robot manipulation.arXiv preprint arXiv:2407.14615, 2024
-
[47]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A universal visual represen- tation for robot manipulation. InConference on Robot Learning (CoRL), 2023
2023
-
[49]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Ben- gio. Learning phrase representations using RNN encoder-decoder for statistical machine trans- lation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014
2014
-
[51]
Gal and Z
Y . Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model un- certainty in deep learning. InInternational Conference on Machine Learning (ICML), pages 1050–1059, 2016
2016
-
[52]
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. InIEEE Robotics and Automation Letters, volume 7, pages 7327–7334, 2022
2022
-
[54]
James, Z
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020
2020
-
[55]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT- X models.arXiv preprint arXiv:2310.08864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020
2020
-
[57]
Vaswani, N
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017
2017
-
[58]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015. 12
2015
-
[59]
Dosovitskiy, L
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021
2021
-
[60]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training.arXiv preprint arXiv:2303.15343, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
Radford, J
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763, 2021
2021
-
[62]
E. A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. InJournal of Machine Learning Research, volume 11, pages 3137– 3181, 2010
2010
-
[63]
Tobin, R
J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017
2017
-
[64]
Lakshminarayanan, A
B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017
2017
-
[65]
R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpreta- tion of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87, 1999
1999
-
[66]
YouTube-8M: A Large-Scale Video Classification Benchmark
S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijaya- narasimhan. YouTube-8M: A large-scale video classification benchmark. InarXiv preprint arXiv:1609.08675, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[67]
J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[68]
Kirillov, E
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll´ar, and R. Girshick. Segment anything. InIEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4015–4026, 2023
2023
-
[69]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[70]
Something Something
R. Goyal, S. E. Kahou, V . Michalski, J. Materzynska, S. Westphal, H. Kim, V . Haenel, I. Fru- end, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “Something Something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), pages 5843–5851, 2017
2017
-
[71]
Pick up the yellow duck
H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. BridgeV2: Manipulator control pre-training across tasks and environments. InConference on Robot Learning (CoRL), 2023. 13 A Full Architecture Pipeline Frame @ t-1 Frame @ t Flow map t-2 → t-1 Flow map t-1 → t Ave...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.