Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

Chen Gao; Danze Chen; Mike Zheng Shou; Qiming Huang; Yanzhe Chen; Zhijun Cao

arxiv: 2606.24448 · v1 · pith:COHSFV4Vnew · submitted 2026-06-23 · 💻 cs.RO

Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

Danze Chen , Yanzhe Chen , Qiming Huang , Zhijun Cao , Chen Gao , Mike Zheng Shou This is my paper

Pith reviewed 2026-06-25 23:55 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language-Action modelssynthetic robot videosgeometry supervisionrepresentation alignment2D waypointsreal-robot adaptationAsymmetric Preservation Principle

0 comments

The pith

Synthetic robot videos can improve real-robot VLA policies by supervising only the vision backbone with extracted 2D geometry while training actions exclusively on real demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that human-to-robot video generation preserves visible spatial trajectories but loses the underlying motor commands, creating an asymmetric signal. It therefore routes the surviving geometry, recovered as future 2D end-effector waypoints, through an auxiliary head to the VLA vision backbone only. The action head remains trained solely on scarce real demonstrations, and the waypoint loss continues during fine-tuning as a spatial anchor that keeps the backbone geometrically grounded. Under matched data budgets this separation yields higher success on real-robot tasks than methods that attempt to recover pseudo-actions from the same synthetic videos, and it reduces the performance gap to policies trained on far more real data.

Core claim

Following the Asymmetric Preservation Principle, GRA extracts future 2D end-effector waypoints from synthetic videos through pose estimation, retargeting, simulation, and calibrated projection, then routes them to the VLA vision backbone via an auxiliary 2D head; the action head is trained on real demonstrations only, and the waypoint loss persists during fine-tuning as a spatial representation anchor.

What carries the argument

The Asymmetric Preservation Principle, which states that generated videos retain geometric trajectories (the where) while losing control signals (the how), allowing geometry to supervise perception separately from action learning.

If this is right

GRA outperforms pseudo-action baselines under matched data budgets on real-robot tasks.
The method narrows the gap between policies trained with limited real demonstrations and those trained with substantially more real data.
The persisting waypoint loss prevents the vision backbone from losing geometric grounding during fine-tuning.
Geometry supervision is applied only to perception; control learning remains isolated to real demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of geometry and control signals could be tested on other VLA architectures or with different video-generation pipelines to check whether the benefit generalizes.
If the waypoint extraction pipeline introduces systematic bias in certain environments, the auxiliary head might need an uncertainty-weighted loss to avoid harming the backbone.
Extending the anchor to 3D keypoints or scene affordances might further stabilize representation learning when 2D projections become ambiguous.

Load-bearing premise

The 2D waypoints recovered from synthetic videos supply a clean, domain-consistent supervision signal for the vision backbone that does not require the action head to reconcile conflicting information.

What would settle it

If, on the same real-robot tasks and data budgets, a policy trained with GRA waypoint supervision achieves success rates no higher than or lower than a pseudo-action baseline trained on the identical synthetic videos, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.24448 by Chen Gao, Danze Chen, Mike Zheng Shou, Qiming Huang, Yanzhe Chen, Zhijun Cao.

**Figure 1.** Figure 1: Illustration of Motivation. (a) Human-to-robot generation produces robot videos at scale, but the generated videos lack action labels. (b) Existing methods recover pseudo-actions from the generated pixels and feed them to the action head, inheriting distorted control signals. (c) GRA routes future 2D end-effector waypoints, extracted from the source human video, to the vision backbone, while the action hea… view at source ↗

**Figure 2.** Figure 2: Overview of GRA. Top: a human-to-robot video generator produces a robot frame I s t from a human demonstration; an independent geometric path extracts 2D end-effector waypoints y s t from the source human video. Stage 1: the vision backbone ϕv and a 2D auxiliary head h2D are trained on (I s t , ys t ) to predict future waypoints under L s 2D. Stage 2: the action head ϕa is added and the policy is fine-tune… view at source ↗

**Figure 3.** Figure 3: Illustration of (a) the real-robot experimental setup and (b) rollout examples of the three [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Information preservation gap measured with frozen vision features (mean over three tasks). [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Stage 1 spatial probe predictions on held-out episodes. Each task block shows three held-out episodes (rows) at five equi-spaced time steps (columns); the language instruction for the task is given above the block. At every panel the K=8 future EE waypoints are shown as eight dots connected by a polyline; red is the geometric ground truth from the source-to-waypoint pipeline, blue is the prediction of h2D.… view at source ↗

read the original abstract

Vision-Language-Action (VLA) models require large-scale video-action pairs, yet real teleoperation remains scarce. While generated robot videos offer a scalable alternative, existing methods treat them as real robot data by recovering pseudo-actions from synthesized pixels. We argue that deriving low-level control from generated visuals is a mismatched abstraction. A video captures only \emph{geometry}: the spatial trajectory representing the \emph{where} of a task. A real demonstration captures \emph{control}: the exact motor commands representing the \emph{how}. Human-to-robot video generation preserves these unequally: the visible geometry survives the generation process, while the underlying control signals are lost. This \textbf{Asymmetric Preservation Principle} dictates a clean rule: this surviving geometry should solely supervise visual perception, leaving control to real demonstrations. Following this principle, we propose \textbf{GRA} (\textbf{G}eometry-guided \textbf{R}epresentation \textbf{A}lignment), which extracts the geometric content as future 2D end-effector waypoints, computed from the source human video through pose estimation, retargeting, simulation, and calibrated projection, and routes them to the VLA vision backbone via an auxiliary 2D head. The action head is trained on real demonstrations only. During fine-tuning, the waypoint loss persists as a \textbf{spatial representation anchor} that prevents the backbone from losing its geometric grounding. On real-robot tasks, GRA outperforms pseudo-action baselines under matched data budgets and narrows the gap to policies trained with substantially more real demonstrations, suggesting that correctly routed geometry bridges generated videos to robot policies more reliably than recovered actions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The Asymmetric Preservation Principle and routing only 2D waypoints to the vision backbone is a sensible split, but the human-to-synthetic waypoint alignment is the part that needs the most scrutiny.

read the letter

The main point is that generated robot videos keep spatial geometry but lose the actual control signals, so the paper routes future 2D end-effector waypoints (extracted from the source human video through pose estimation, retargeting, simulation, and projection) only to an auxiliary head on the vision backbone, while the action head trains exclusively on real demonstrations. This is framed as the Asymmetric Preservation Principle and is presented as a cleaner way to use synthetic data than trying to recover pseudo-actions.

The approach is straightforward and avoids a clear mismatch in abstraction levels. Treating the surviving geometry as a spatial anchor during fine-tuning is a reasonable design choice, and the claim that it outperforms matched-budget pseudo-action baselines on real-robot tasks is the kind of result that would matter for scaling VLA training.

The soft spot is exactly the one flagged in the stress test. Waypoints come from human videos and go through multiple transformation steps before being applied to synthetic robot video inputs. Retargeting and calibrated projection can easily introduce viewpoint or morphology mismatches, so the auxiliary signal may not correspond to what the backbone actually sees in the frames. If that happens, any reported gains could come from regularization effects rather than reliable geometric grounding. The abstract does not show ablations on waypoint accuracy or robustness across tasks, which leaves the central mechanism under-supported.

The paper shows clear thinking about the data sources and is worth a serious referee to check the implementation details, the stability of the waypoint extraction, and whether the performance numbers hold up under closer inspection. It is not ready to cite yet but deserves the review process.

Referee Report

2 major / 2 minor

Summary. The paper claims that synthetic robot videos preserve geometry (spatial trajectories) but lose control signals, per the Asymmetric Preservation Principle. It proposes GRA, which extracts 2D end-effector waypoints from source human videos (via pose estimation, retargeting, simulation, and calibrated projection) to supervise only the VLA vision backbone via an auxiliary head, while the action head trains exclusively on real demonstrations. The waypoint loss acts as a spatial anchor during fine-tuning. Experiments reportedly show GRA outperforming pseudo-action baselines on real-robot tasks under matched data budgets and narrowing the gap to policies with more real data.

Significance. If the central results hold, the work provides a clean separation of supervision signals that could improve data efficiency for VLA adaptation from scalable synthetic videos. The explicit routing of geometry to the vision backbone and control to real actions is a principled contribution; the auxiliary 2D head as a persistent anchor is a concrete mechanism worth testing in other settings.

major comments (2)

[GRA method description (waypoint computation paragraph)] The waypoint extraction pipeline (pose estimation → retargeting → simulation → calibrated projection) is described only from the human-video source; no section verifies that the resulting 2D waypoints align with visible end-effector locations in the actual synthetic robot video frames fed to the vision backbone. This alignment is load-bearing for the claim that geometry supervision is beneficial rather than a source of domain-shift noise.
[Experiments and ablations] The experimental claims rest on the auxiliary 2D head delivering reliable spatial grounding without requiring the action head to reconcile mismatched signals. No ablation or analysis is referenced that isolates whether observed gains survive when waypoint accuracy is deliberately degraded or when retargeting errors are measured.

minor comments (2)

[Method] Notation for the auxiliary loss (waypoint loss) and its weighting relative to the main action loss should be made explicit with an equation.
[Introduction] The abstract states that generated videos 'preserve geometry but lose control'; a short related-work paragraph situating this against prior video-to-action or geometry-from-video papers would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation of major revision. The two major comments identify areas where additional verification and analysis would strengthen the presentation of GRA. We address each point below and commit to incorporating the requested elements in the revised manuscript.

read point-by-point responses

Referee: [GRA method description (waypoint computation paragraph)] The waypoint extraction pipeline (pose estimation → retargeting → simulation → calibrated projection) is described only from the human-video source; no section verifies that the resulting 2D waypoints align with visible end-effector locations in the actual synthetic robot video frames fed to the vision backbone. This alignment is load-bearing for the claim that geometry supervision is beneficial rather than a source of domain-shift noise.

Authors: We agree that explicit verification of alignment between the extracted 2D waypoints and the visible end-effector locations in the synthetic robot video frames is necessary to support the claim. The current manuscript describes the pipeline but does not include a dedicated verification step or quantitative comparison on the synthetic frames themselves. In the revision we will add a new subsection (and figure) that reports both qualitative overlays of projected waypoints on sample synthetic frames and quantitative alignment metrics (mean pixel error and percentage of waypoints within a 10-pixel threshold) computed on held-out synthetic robot videos. This addition will directly confirm that the geometry signal remains consistent rather than introducing domain-shift noise. revision: yes
Referee: [Experiments and ablations] The experimental claims rest on the auxiliary 2D head delivering reliable spatial grounding without requiring the action head to reconcile mismatched signals. No ablation or analysis is referenced that isolates whether observed gains survive when waypoint accuracy is deliberately degraded or when retargeting errors are measured.

Authors: We concur that an ablation isolating the contribution of waypoint accuracy would provide stronger evidence that the observed gains derive from reliable spatial grounding. The present experiments do not contain such a controlled degradation study or explicit reporting of retargeting error statistics. In the revised manuscript we will add an ablation that perturbs the waypoint targets with increasing levels of Gaussian noise (simulating retargeting inaccuracies) and reports the resulting policy success rates. We will also include measured retargeting error statistics from the pipeline and discuss their relationship to task performance, thereby addressing the concern about mismatched signals. revision: yes

Circularity Check

0 steps flagged

No circularity; principle is conceptual argument, method applies it without self-referential reduction

full rationale

The paper posits the Asymmetric Preservation Principle as a conceptual distinction between preserved geometry and lost control signals in generated videos, then defines GRA as extracting 2D waypoints from source human videos (via pose/retargeting/simulation/projection) to supervise only the vision backbone while training the action head on real demonstrations. No equations, fitted parameters renamed as predictions, or self-citations are present in the provided text that would make the claimed outperformance reduce to a definition or input by construction. The derivation chain consists of the posited principle plus empirical comparison to pseudo-action baselines, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the Asymmetric Preservation Principle stated in the abstract; no free parameters or invented entities are described.

axioms (1)

ad hoc to paper Generated videos preserve geometry (spatial trajectories) but lose control signals (motor commands).
Invoked as the Asymmetric Preservation Principle to justify routing only waypoints to the vision backbone.

pith-pipeline@v0.9.1-grok · 5844 in / 1125 out tokens · 23302 ms · 2026-06-25T23:55:49.884126+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 21 linked inside Pith

[1]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[2]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[3]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[4]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[5]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[6]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–
[7]

Y . Song, C. Liu, W. Mao, and M. Z. Shou. Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

arXiv 2025
[8]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025

Pith/arXiv arXiv 2025
[9]

H. Li, I. Zhang, R. Ouyang, X. Wang, Z. Zhu, Z. Yang, Z. Zhang, B. Wang, C. Ni, W. Qin, et al. Mimicdreamer: Aligning human and robot demonstrations for scalable vla training. arXiv preprint arXiv:2509.22199, 2025

arXiv 2025
[10]

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025

Pith/arXiv arXiv 2025
[11]

Bjorck, F

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[12]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

Pith/arXiv arXiv 2025
[13]

Pertsch, K

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025
[14]

Z. Bai, C. Gao, and M. Z. Shou. Evolve-vla: Test-time training from environment feedback for vision-language-action models.arXiv preprint arXiv:2512.14666, 2025

arXiv 2025
[15]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. Advances in Neural Information Processing Systems, 38:24195–24228, 2026. 9

2026
[16]

H. Yuan, Z. Liang, A. Chen, Y . Wang, H. Li, P. Lin, Y . Huang, Z. Lei, T. Zhang, J. Zhang, et al. Qwen-robotmanip technical report: Alignment unlocks scale for robotic manipulation foundation models.arXiv preprint arXiv:2606.17846, 2026

Pith/arXiv arXiv 2026
[17]

P. Yang, H. Ci, Y . Chen, Q. Lv, H. Cai, and M. Z. Shou. Actionmap: Robot policy learning via voxel action heatmap.arXiv preprint arXiv:2606.06904, 2026

Pith/arXiv arXiv 2026
[18]

H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V . My- ers, M. J. Kim, M. Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023
[19]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[20]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022
[21]

Y . Chen, K. Y . Ma, Q. Lv, Y . Lin, Z. Bai, C. Gao, and M. Z. Shou. Escaping the diversity trap in robotic manipulation via anchor-centric adaptation.arXiv preprint arXiv:2605.07381, 2026

Pith/arXiv arXiv 2026
[22]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

2017
[23]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023

Pith/arXiv arXiv 2023
[24]

A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 3(4):6, 2025

Pith/arXiv arXiv 2025
[25]

H. Ci, X. Liu, P. Yang, Y . Song, and M. Z. Shou. H2r-grounder: A paired-data-free paradigm for translating human interaction videos into physically grounded robot videos.arXiv preprint arXiv:2512.09406, 2025

arXiv 2025
[26]

Z. Chen, Y . Li, Z. Liang, X. Chen, et al. Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 2024

arXiv 2024
[27]

S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

2025
[28]

Liang, P

J. Liang, P. Tokmakov, R. Liu, S. Sudhakar, P. Shah, R. Ambrus, and C. V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Pith/arXiv arXiv 2025
[29]

Allshire, H

A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control.arXiv preprint arXiv:2505.03729, 2025

arXiv 2025
[30]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025

Pith/arXiv arXiv 2025
[31]

Y . Li, Y . Deng, J. Zhang, J. Jang, M. Memmel, C. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, et al. Hamster: Hierarchical action models for open-world robot manipulation. InInternational Conference on Learning Representations, volume 2025, pages 24040–24068, 2025. 10

2025
[32]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Pith/arXiv arXiv 2024
[33]

H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

arXiv 2025
[34]

X. Qiu, Y . Wang, J. Cai, Z. Chen, C. Lin, T.-H. Wang, and C. Gan. Lucibot: Automated robot policy learning from generated videos.arXiv preprint arXiv:2503.09871, 2025

arXiv 2025
[35]

Kirkpatrick, R

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017
[36]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025
[37]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022
[38]

Karamcheti, S

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language- driven representation learning for robotics.arXiv preprint arXiv:2302.12766, 2023

arXiv 2023
[39]

Radosavovic, T

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. InConference on Robot Learning, pages 416–426. PMLR, 2023

2023
[40]

X. Chen, Y . Chen, Y . Fu, N. Gao, J. Jia, W. Jin, H. Li, Y . Mu, J. Pang, Y . Qiao, et al. Internvla- m1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

Pith/arXiv arXiv 2025
[41]

Y . Chen, K. Q. Lin, and M. Z. Shou. Code2video: A code-centric paradigm for educational video generation.arXiv preprint arXiv:2510.01174, 2025

arXiv 2025
[42]

Kachaev, M

N. Kachaev, M. Kolosov, D. Zelezetsky, A. K. Kovalev, and A. I. Panov. Don’t blind your vla: Aligning visual representations for ood generalization.arXiv preprint arXiv:2510.25616, 2025

arXiv 2025
[43]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

2024
[44]

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, et al. Sam 2: Segment anything in images and videos. InInternational Confer- ence on Learning Representations, volume 2025, pages 28085–28128, 2025

2025
[45]

Pavlakos, D

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

2024
[46]

pick up X and place on Y

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. 11 A Implementation Details Table 4 consolidates the settings for the two stages ofGRAand the three baselines (Real-only, DreamGen-style, MimicDreamer-style). All five runs use the same ...

2012

[1] [1]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[2] [2]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[3] [3]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[4] [4]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[5] [5]

Khazatsky, K

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024

[6] [6]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–

[7] [7]

Y . Song, C. Liu, W. Mao, and M. Z. Shou. Mitty: Diffusion-based human-to-robot video generation.arXiv preprint arXiv:2512.17253, 2025

arXiv 2025

[8] [8]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025

Pith/arXiv arXiv 2025

[9] [9]

H. Li, I. Zhang, R. Ouyang, X. Wang, Z. Zhu, Z. Yang, Z. Zhang, B. Wang, C. Ni, W. Qin, et al. Mimicdreamer: Aligning human and robot demonstrations for scalable vla training. arXiv preprint arXiv:2509.22199, 2025

arXiv 2025

[10] [10]

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025

Pith/arXiv arXiv 2025

[11] [11]

Bjorck, F

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[12] [12]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakr- ishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

Pith/arXiv arXiv 2025

[13] [13]

Pertsch, K

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[14] [14]

Z. Bai, C. Gao, and M. Z. Shou. Evolve-vla: Test-time training from environment feedback for vision-language-action models.arXiv preprint arXiv:2512.14666, 2025

arXiv 2025

[15] [15]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. Advances in Neural Information Processing Systems, 38:24195–24228, 2026. 9

2026

[16] [16]

H. Yuan, Z. Liang, A. Chen, Y . Wang, H. Li, P. Lin, Y . Huang, Z. Lei, T. Zhang, J. Zhang, et al. Qwen-robotmanip technical report: Alignment unlocks scale for robotic manipulation foundation models.arXiv preprint arXiv:2606.17846, 2026

Pith/arXiv arXiv 2026

[17] [17]

P. Yang, H. Ci, Y . Chen, Q. Lv, H. Cai, and M. Z. Shou. Actionmap: Robot policy learning via voxel action heatmap.arXiv preprint arXiv:2606.06904, 2026

Pith/arXiv arXiv 2026

[18] [18]

H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V . My- ers, M. J. Kim, M. Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023

[19] [19]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025

[20] [20]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022

[21] [21]

Y . Chen, K. Y . Ma, Q. Lv, Y . Lin, Z. Bai, C. Gao, and M. Z. Shou. Escaping the diversity trap in robotic manipulation via anchor-centric adaptation.arXiv preprint arXiv:2605.07381, 2026

Pith/arXiv arXiv 2026

[22] [22]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

2017

[23] [23]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023

Pith/arXiv arXiv 2023

[24] [24]

A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 3(4):6, 2025

Pith/arXiv arXiv 2025

[25] [25]

H. Ci, X. Liu, P. Yang, Y . Song, and M. Z. Shou. H2r-grounder: A paired-data-free paradigm for translating human interaction videos into physically grounded robot videos.arXiv preprint arXiv:2512.09406, 2025

arXiv 2025

[26] [26]

Z. Chen, Y . Li, Z. Liang, X. Chen, et al. Moto: Latent motion token as the bridging language for robot manipulation.arXiv preprint arXiv:2412.04445, 2024

arXiv 2024

[27] [27]

S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. In Y . Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors,International Conference on Learning Representations, volume 2025, pages 28213–28239, 2025

2025

[28] [28]

Liang, P

J. Liang, P. Tokmakov, R. Liu, S. Sudhakar, P. Shah, R. Ambrus, and C. V ondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Pith/arXiv arXiv 2025

[29] [29]

Allshire, H

A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control.arXiv preprint arXiv:2505.03729, 2025

arXiv 2025

[30] [30]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Masquerade: Learning from in-the-wild human videos using data-editing.arXiv preprint arXiv:2508.09976, 2025

Pith/arXiv arXiv 2025

[31] [31]

Y . Li, Y . Deng, J. Zhang, J. Jang, M. Memmel, C. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, et al. Hamster: Hierarchical action models for open-world robot manipulation. InInternational Conference on Learning Representations, volume 2025, pages 24040–24068, 2025. 10

2025

[32] [32]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803, 2024

Pith/arXiv arXiv 2024

[33] [33]

H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu. Being-h0: vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

arXiv 2025

[34] [34]

X. Qiu, Y . Wang, J. Cai, Z. Chen, C. Lin, T.-H. Wang, and C. Gan. Lucibot: Automated robot policy learning from generated videos.arXiv preprint arXiv:2503.09871, 2025

arXiv 2025

[35] [35]

Kirkpatrick, R

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

2017

[36] [36]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025

[37] [37]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

Pith/arXiv arXiv 2022

[38] [38]

Karamcheti, S

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language- driven representation learning for robotics.arXiv preprint arXiv:2302.12766, 2023

arXiv 2023

[39] [39]

Radosavovic, T

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. InConference on Robot Learning, pages 416–426. PMLR, 2023

2023

[40] [40]

X. Chen, Y . Chen, Y . Fu, N. Gao, J. Jia, W. Jin, H. Li, Y . Mu, J. Pang, Y . Qiao, et al. Internvla- m1: A spatially guided vision-language-action framework for generalist robot policy.arXiv preprint arXiv:2510.13778, 2025

Pith/arXiv arXiv 2025

[41] [41]

Y . Chen, K. Q. Lin, and M. Z. Shou. Code2video: A code-centric paradigm for educational video generation.arXiv preprint arXiv:2510.01174, 2025

arXiv 2025

[42] [42]

Kachaev, M

N. Kachaev, M. Kolosov, D. Zelezetsky, A. K. Kovalev, and A. I. Panov. Don’t blind your vla: Aligning visual representations for ood generalization.arXiv preprint arXiv:2510.25616, 2025

arXiv 2025

[43] [43]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

2024

[44] [44]

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, et al. Sam 2: Segment anything in images and videos. InInternational Confer- ence on Learning Representations, volume 2025, pages 28085–28128, 2025

2025

[45] [45]

Pavlakos, D

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

2024

[46] [46]

pick up X and place on Y

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012. 11 A Implementation Details Table 4 consolidates the settings for the two stages ofGRAand the three baselines (Real-only, DreamGen-style, MimicDreamer-style). All five runs use the same ...

2012