pith. machine review for the scientific record.

arxiv: 2604.04974 · v1 · submitted 2026-04-04 · 💻 cs.RO

Recognition: 1 theorem link

· Lean Theorem

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords video learning · robot manipulation · control interfaces · temporal visual data · survey · taxonomy · robotics integration

The pith

Video-based robot manipulation methods are limited most by how predictions connect to reliable physical actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes methods that turn non-action-annotated video into robotic manipulation interfaces. It groups them into three families according to where the video-to-control interface is placed and what control properties result. The review then examines how each family closes the loop, what can be checked before motion begins, and where errors typically arise. The cross-family comparison concludes that the robotics integration layer—the mechanisms linking video predictions to dependable robot behavior—contains the most pressing unsolved problems.

Core claim

The paper defines an interface-centric taxonomy that places existing video-to-control methods into three families: direct video-action policies that keep the mapping implicit, latent-action methods that pass temporal structure through a compact learned representation, and explicit visual interfaces that output interpretable targets for separate controllers. Analysis of control-integration properties across families shows that the robotics integration layer remains the primary barrier to dependable execution.
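
To make the comparison concrete, here is a minimal sketch, assuming only the family definitions above, of how the taxonomy and the control-integration properties the survey compares could be written down as data. The field names and property strings are illustrative paraphrases, not the paper's formal criteria.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Interface(Enum):
    """Where the video-to-control interface is constructed."""
    IMPLICIT = auto()   # direct video-action policies
    LATENT = auto()     # compact learned intermediate (latent actions)
    EXPLICIT = auto()   # interpretable targets (subgoal images, trajectories, poses)

@dataclass(frozen=True)
class FamilyProfile:
    """Control-integration properties compared across families (paraphrased)."""
    name: str
    interface: Interface
    loop_closure: str      # where the control loop is closed
    pre_exec_check: str    # what can be verified before motion begins
    typical_failure: str   # where errors tend to enter

TAXONOMY = [
    FamilyProfile("direct video-action policy", Interface.IMPLICIT,
                  loop_closure="inside the learned policy at every step",
                  pre_exec_check="little beyond validating the observation",
                  typical_failure="silent distribution shift in the implicit mapping"),
    FamilyProfile("latent-action method", Interface.LATENT,
                  loop_closure="through a decoder that grounds latent actions",
                  pre_exec_check="consistency of the decoded latent rollout",
                  typical_failure="latent codes that do not transfer to the robot embodiment"),
    FamilyProfile("explicit visual interface", Interface.EXPLICIT,
                  loop_closure="by a separate downstream controller",
                  pre_exec_check="feasibility of the predicted, interpretable target",
                  typical_failure="errors in the predicted subgoal, trajectory, or pose"),
]
```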

What carries the argument

The interface-centric taxonomy, which classifies methods by where the video-to-control interface is constructed and what control properties result.

If this is right

  • Each family closes the control loop at a different stage and admits different forms of pre-execution verification (see the sketch after this list).
  • Failure modes enter at distinct points depending on whether the interface is implicit, latent, or explicit.
  • Further progress requires targeted work on the robotics integration layer that translates video predictions into safe robot commands.
  • The taxonomy supplies a common language for comparing how different methods handle embodiment gaps and missing action labels.
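
As a concrete illustration of the first bullet above, the sketch below shows one hypothetical pre-execution gate per family. The function names, the workspace-bound test, and the IK-feasibility call are assumptions made for illustration, not procedures taken from the survey.

```python
import numpy as np

def check_implicit(observation: np.ndarray) -> bool:
    """Implicit interface: about all that can be gated on is the input itself."""
    return bool(np.isfinite(observation).all())

def check_latent(z: np.ndarray, decode, observation: np.ndarray,
                 max_change: float = 1.0) -> bool:
    """Latent interface: the code z is opaque, but decoding it lets us test
    whether the implied change in the observation is plausible."""
    predicted_next = decode(observation, z)
    change = float(np.linalg.norm(predicted_next - observation))
    return np.isfinite(change) and change <= max_change

def check_explicit(target_xyz: np.ndarray, workspace_lo: np.ndarray,
                   workspace_hi: np.ndarray, ik_solver) -> bool:
    """Explicit interface: the target is interpretable, so workspace containment
    and IK feasibility can be tested before any motion is commanded."""
    in_workspace = bool(np.all(target_xyz >= workspace_lo) and
                        np.all(target_xyz <= workspace_hi))
    reachable = ik_solver(target_xyz) is not None
    return in_workspace and reachable
```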

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid methods that borrow elements from multiple families could address integration weaknesses that single-family approaches leave open.
  • Standardized testbeds focused on integration properties would make it easier to measure whether new techniques actually improve dependable behavior.
  • Deployment on physical robots with changing viewpoints and contact dynamics would expose integration shortcomings more clearly than simulation results alone.

Load-bearing premise

The proposed three-family taxonomy captures the essential differences among current video-to-control methods without missing major approaches or imposing artificial divisions.

What would settle it

A published video-based manipulation technique that cannot be assigned to any of the three families in the taxonomy.

Figures

Figures reproduced from arXiv: 2604.04974 by Chen Wang, Jia Pan, Linfang Zheng, Wei Zhang, Zikai Ouyang.

Figure 1. Video-based manipulation interfaces. This survey organizes the literature by how video-derived temporal structure is connected to robot control through three recurring interface families: direct video–action policies, latent-action intermediates, and explicit visual interfaces (e.g., subgoal images, trajectories, or poses). These families differ in how explicitly that structure is exposed to the robot’s co…
Figure 2. Three families of video-based manipulation interfaces.
Figure 3. Design space of video-based manipulation.
Figure 4. Control-loop integration across three interface families.
Figure 5. Taxonomy of direct video–action modeling approaches. Left: …
Figure 6. Generic latent-action pipeline. Most latent-action methods follow a two-stage process. Stage 1 (top): Using only action-free videos, an encoder (inverse dynamics module) infers a latent action z_t from observed transitions (o_t, o_{t+H}), a bottleneck strategy constrains capacity, and a decoder (forward dynamics module) predicts future observations or their representations. The reconstruction loss (dashed bidir…
Figure 7. Latent actions inducing coherent behavior in …
Figure 8. Explicit visual interface–based methods for robotic manipulation.
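
The two-stage recipe summarized in the Figure 6 caption can be sketched in code. This is a hedged reconstruction under assumed details: MLP encoder and decoder, a simple tanh bottleneck standing in for whatever capacity constraint a given method uses, and an MSE reconstruction loss on the future observation or its representation; none of these specifics come from the survey itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Stage 1 of the generic latent-action pipeline (illustrative sizes)."""

    def __init__(self, obs_dim: int = 128, latent_dim: int = 8):
        super().__init__()
        # Inverse-dynamics encoder: (o_t, o_{t+H}) -> latent action z_t.
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        # Forward-dynamics decoder: (o_t, z_t) -> prediction of o_{t+H}.
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, o_t: torch.Tensor, o_tH: torch.Tensor):
        z = torch.tanh(self.encoder(torch.cat([o_t, o_tH], dim=-1)))  # bounded bottleneck
        o_tH_hat = self.decoder(torch.cat([o_t, z], dim=-1))
        return z, o_tH_hat

def stage1_step(model, optimizer, o_t, o_tH):
    """One training step on action-free video: reconstruct the future observation
    (or its representation) from the current observation and the inferred latent."""
    z, o_tH_hat = model(o_t, o_tH)
    loss = F.mse_loss(o_tH_hat, o_tH)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (not sketched): a small policy maps robot observations to these latent
# actions, and a learned or calibrated grounding maps latents to real commands.
```
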
read the original abstract

Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an \emph{interface-centric taxonomy} organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video--action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the \emph{robotics integration layer} -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.
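
For the explicit-interface family named in the abstract, one recurring pattern is to turn tracked or predicted object keypoints into a rigid-body target that a downstream controller can execute. The sketch below is an illustrative implementation of the classic least-squares point-set fit (Arun, Huang and Blostein, 1987, which appears in this paper's reference list); using it as the bridge from video predictions to control here is an assumption for illustration, not the survey's prescription.

```python
import numpy as np

def rigid_transform(P: np.ndarray, Q: np.ndarray):
    """Least-squares R, t with R @ P_i + t ≈ Q_i (Arun, Huang & Blostein, 1987).
    P and Q are (N, 3) arrays of corresponding keypoints."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # correct for reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Hypothetical usage: keypoints tracked in the current frame versus a predicted
# subgoal frame yield a relative object motion the controller is asked to realize.
rng = np.random.default_rng(0)
current_kp = rng.random((8, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
goal_kp = current_kp @ R_true.T + np.array([0.10, 0.00, 0.05])
R, t = rigid_transform(current_kp, goal_kp)   # recovers R_true and the translation
```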

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. It introduces an interface-centric taxonomy organized by where the video-to-control interface is constructed, identifying three families: direct video-action policies (implicit interface), latent-action methods (compact learned intermediate), and explicit visual interfaces (interpretable targets for downstream control). For each family the paper analyzes control-integration properties including loop closure, pre-execution verifiability, and failure modes. A cross-family synthesis concludes that the robotics integration layer constitutes the dominant open challenge and outlines research directions to close the gap.

Significance. If the taxonomy and per-family analyses hold, the survey supplies a useful organizing lens that shifts emphasis from isolated algorithmic advances to the mechanisms needed to connect video-derived predictions to dependable robot behavior. This framing can help the community prioritize integration-layer research, which is essential for translating abundant video data into practical manipulation systems.

minor comments (2)
  1. [Abstract] The three-family taxonomy is presented as capturing essential distinctions, yet the boundary between latent-action and explicit visual interfaces is not illustrated with a borderline example; adding one concrete method that could plausibly fit either category would strengthen the taxonomy's clarity without altering the central claim.
  2. [Synthesis section] The cross-family synthesis identifies the robotics integration layer as the primary open challenge; a short table or bullet list enumerating the specific integration shortcomings observed in each family would make this claim more immediately verifiable for readers.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our survey, for highlighting the utility of the interface-centric taxonomy, and for recommending acceptance. We appreciate the recognition that the work shifts focus toward the robotics integration layer as the central open challenge.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a survey paper that reviews existing methods, proposes an interface-centric taxonomy as an organizing lens, and synthesizes open challenges from the literature. No equations, derivations, fitted parameters, or predictions appear anywhere in the manuscript. The central claim (robotics integration layer as dominant challenge) is a qualitative observation drawn from per-family analysis of external work, not a reduction to any internal definition or self-citation chain. The taxonomy is explicitly presented as a useful framing rather than a uniqueness theorem or ansatz. All citations are to independent prior literature; no load-bearing step collapses to the authors' own prior results by construction. The paper is therefore self-contained against external benchmarks with zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it only organizes methods already present in the cited literature.

pith-pipeline@v0.9.0 · 5511 in / 987 out tokens · 41967 ms · 2026-05-13T16:59:51.886561+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 16 internal anchors

  1. [1]

    Science Robotics 10(106): eadt1497

    Ai B, Tian S, Shi H, Wang Y, Pfaff T, Tan C, Christensen HI, Su H, Wu J and Li Y (2025) A review of learning-based dynamics models for robotic manipulation. Science Robotics 10(106): eadt1497. doi:10.1126/scirobotics.adt1497. ://www.science.org/doi/abs/10.1126/scirobotics.adt1497

  2. [2]

    IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5): 698--700

    Arun KS, Huang TS and Blostein SD (1987) Least-squares fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5): 698--700. doi:10.1109/TPAMI.1987.4767965

  3. [3]

    ://arxiv.org/abs/2304.08488

    Bahl S, Mendonca R, Chen L, Jain U and Pathak D (2023) Affordances from human videos as a versatile representation for robotics. ://arxiv.org/abs/2304.08488

  4. [4]

    Procedia Computer Science 217: 1887--1895

    Baratta A, Cimino A, Gnoni MG and Longo F (2023) Human robot collaboration in industry 4.0: a literature review. Procedia Computer Science 217: 1887--1895

  5. [5]

    arXiv preprint arXiv:2409.16283 (2024)

    Bharadhwaj H, Dwibedi D, Gupta A, Tulsiani S, Doersch C, Xiao T, Shah D, Xia F, Sadigh D and Kirmani S (2024a) Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. ://arxiv.org/abs/2409.16283

  6. [6]

    ://arxiv.org/abs/2405.01527

    Bharadhwaj H, Mottaghi R, Gupta A and Tulsiani S (2024b) Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. ://arxiv.org/abs/2405.01527

  7. [7]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black K, Brown N, Driess D, Esmail A, Equi M, Finn C, Fusai N, Groom L, Hausman K, Ichter B et al. (2024) π0: A vision-language-action flow model for general robot control. ://arxiv.org/abs/2410.24164

  8. [8]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Black K, Nakamoto M, Atreya P, Walke H, Finn C, Kumar A and Levine S (2023) Zero-shot robotic manipulation with pretrained image-editing diffusion models. ://arxiv.org/abs/2310.10639

  9. [9]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A et al. (2023) Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127

  10. [10]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan A, Brown N, Carbajal J, Chebotar Y, Chen X, Choromanski K, Ding T, Driess D, Dubey A, Finn C, Florence P, Fu C, Arenas MG, Gopalakrishnan K, Han K, Hausman K, Herzog A, Hsu J, Ichter B, Irpan A, Joshi N, Julian R, Kalashnikov D, Kuang Y, Leal I, Lee L, Lee TWE, Levine S, Lu Y, Michalewski H, Mordatch I, Pertsch K, Rao K, Reymann K, Ryoo M, Salazar ...

  11. [11]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan A, Brown N, Carbajal J, Chebotar Y, Dabis J, Finn C, Gopalakrishnan K, Hausman K, Herzog A, Hsu J, Ibarz J, Ichter B, Irpan A, Jackson T, Jesmonth S, Joshi N, Julian R, Kalashnikov D, Kuang Y, Leal I, Lee KH, Levine S, Lu Y, Malla U, Manjunath D, Mordatch I, Nachum O, Parada C, Peralta J, Perez E, Pertsch K, Quiambao J, Rao K, Ryoo M, Salazar G, Sa...

  12. [12]

    Instructpix2pix: Learning to follow image editing instructions

    Brooks T, Holynski A and Efros AA (2023) Instructpix2pix: Learning to follow image editing instructions. ://arxiv.org/abs/2211.09800

  13. [13]

    Genie: Generative interactive environments, 2024

    Bruce J, Dennis M, Edwards A, Parker-Holder J, Shi Y, Hughes E, Lai M, Mavalankar A, Steigerwald R, Apps C, Aytar Y, Bechtle S, Behbahani F, Chan S, Heess N, Gonzalez L, Osindero S, Ozair S, Reed S, Zhang J, Zolna K, Clune J, de Freitas N, Singh S and Rocktäschel T (2024) Genie: Generative interactive environments. ://arxiv.org/abs/2402.15391

  14. [14]

    In: Proceedings of Robotics: Science and Systems

    Bu Q, Yang Y, Cai J, Gao S, Ren G, Yao M, Luo P and Li H (2025) Learning to Act Anywhere with Task-centric Latent Actions. In: Proceedings of Robotics: Science and Systems. Los Angeles, CA, USA. doi:10.15607/RSS.2025.XXI.014

  15. [15]

    ://arxiv.org/abs/2409.09016

    Bu Q, Zeng J, Chen L, Yang Y, Zhou G, Yan J, Luo P, Cui H, Ma Y and Li H (2024) Closed-loop visuomotor control with generative expectation for robotic manipulation. ://arxiv.org/abs/2409.09016

  16. [16]

    A short note on the kinetics-700 human action dataset

    Carreira J, Noland E, Hillier C and Zisserman A (2022) A short note on the kinetics-700 human action dataset. ://arxiv.org/abs/1907.06987

  17. [17]

    ://arxiv.org/abs/2104.04631

    Chao YW, Yang W, Xiang Y, Molchanov P, Handa A, Tremblay J, Narang YS, Wyk KV, Iqbal U, Birchfield S, Kautz J and Fox D (2021) Dexycb: A benchmark for capturing hand grasping of objects. ://arxiv.org/abs/2104.04631

  18. [18]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Cheang CL, Chen G, Jing Y, Kong T, Li H, Li Y, Liu Y, Wu H, Xu J, Yang Y, Zhang H and Zhu M (2024) Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. ://arxiv.org/abs/2410.06158

  19. [19]

    arXiv:2501.12375 (2025)

    Chen S, Guo H, Zhu S, Zhang F, Huang Z, Feng J and Kang B (2025) Video depth anything: Consistent depth estimation for super-long videos. arXiv:2501.12375

  20. [20]

    The International Journal of Robotics Research 30(11): 1343--1377

    Chen S, Li Y and Kwok NM (2011) Active vision in robotic systems: A survey of recent developments. The International Journal of Robotics Research 30(11): 1343--1377

  21. [21]

    IEEE Transactions on Cybernetics 53(3): 1682--1698

    Cong Y, Chen R, Ma B, Liu H, Hou D and Yang C (2021) A comprehensive study of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics 53(3): 1682--1698

  22. [22]

    (2018) Scaling egocentric vision: The epic-kitchens dataset

    Damen D, Doughty H, Farinella GM, Fidler S, Furnari A, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W et al. (2018) Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720--736

  23. [23]

    International Journal of Computer Vision (IJCV) 130: 33–55

    Damen D, Doughty H, Farinella GM, Furnari A, Ma J, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W and Wray M (2022) Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV) 130: 33–55. ://doi.org/10.1007/s11263-021-01531-2

  24. [24]

    (2025) A survey on reinforcement learning of vision-language-action models for robotic manipulation

    Deng H, Wu Z, Liu H, Guo W, Xue Y, Shan Z, Zhang C, Jia B, Ling Y, Lu G et al. (2025) A survey on reinforcement learning of vision-language-action models for robotic manipulation. Authorea Preprints

  25. [25]

    Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow. arXiv preprint arXiv:2512.24766, 2025

    Dharmarajan K, Huang W, Wu J, Fei-Fei L and Zhang R (2025) Dream2flow: Bridging video generation and open-world manipulation with 3d object flow. ://arxiv.org/abs/2512.24766

  26. [26]

    arXiv preprint arXiv:2507.10672

    Din MU, Akram W, Saoud LS, Rosell J and Hussain I (2025) Vision language action models in robotic manipulation: A systematic review. arXiv preprint arXiv:2507.10672

  27. [27]

    In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds.) Advances in Neural Information Processing Systems, volume 36

    Du Y, Yang S, Dai B, Dai H, Nachum O, Tenenbaum J, Schuurmans D and Abbeel P (2023) Learning universal policies via text-guided video generation. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds.) Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., pp. 9156--9172. ://proceedings.neurips.cc/paper_file...

  28. [28]

    In: 2008 IEEE International Conference on Automation and Logistics

    Echelmeyer W, Kirchheim A and Wellbrock E (2008) Robotics-logistics: Challenges for automation of logistic processes. In: 2008 IEEE International Conference on Automation and Logistics. IEEE, pp. 2099--2103

  29. [29]

    ://arxiv.org/abs/1805.07914

    Edwards AD, Sahni H, Schroecker Y and Isbell CL (2019) Imitating latent policies from observation. ://arxiv.org/abs/1805.07914

  30. [30]

    arXiv preprint arXiv:2205.04382

    Eisner B, Zhang H and Held D (2022) Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. arXiv preprint arXiv:2205.04382

  31. [31]

    ://arxiv.org/abs/2402.07127

    Eze C and Crick C (2025) Learning by watching: A review of video-based learning approaches for robot manipulation. ://arxiv.org/abs/2402.07127

  32. [32]

    Remittances Review 9: 871--888

    Faheem MA, Zafar N, Kumar P, Melon M, Prince N and Al Mamun MA (2024) Ai and robotic: About the transformation of construction industry automation as well as labor productivity. Remittances Review 9: 871--888

  33. [33]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Fang HS, Fang H, Tang Z, Liu J, Wang C, Wang J, Zhu H and Lu C (2023) Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595

  34. [34]

    Autonomous robots 9(3): 227--235

    Fiorini P and Prassler E (2000) Cleaning and household robots: A technology survey. Autonomous robots 9(3): 227--235

  35. [35]

    Fischler MA and Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6): 381–395. doi:10.1145/358669.358692. ://doi.org/10.1145/358669.358692

  36. [36]

    something something

    Goyal R, Kahou SE, Michalski V, Materzyńska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Mueller-Freitag M, Hoppe F, Thurau C, Bax I and Memisevic R (2017) The "something something" video database for learning and evaluating visual common sense. ://arxiv.org/abs/1706.04261

  37. [37]

    (2022) Ego4d: Around the world in 3,000 hours of egocentric video

    Grauman K, Westbury A, Byrne E, Chavis Z, Furnari A, Girdhar R, Hamburger J, Jiang H, Liu M, Liu X et al. (2022) Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18995--19012

  38. [38]

    Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

    Gu J, Xiang F, Li X, Ling Z, Liu X, Mu T, Tang Y, Tao S, Wei X, Yao Y et al. (2023) Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659

  39. [39]

    ://arxiv.org/abs/2411.18179

    Guo Y, Hu Y, Zhang J, Wang YJ, Chen X, Lu C and Chen J (2024) Prediction with action: Visual policy learning via joint denoising process. ://arxiv.org/abs/2411.18179

  40. [40]

    Learning Latent Dynamics for Planning from Pixels

    Hafner D, Lillicrap T, Fischer I, Villegas R, Ha D, Lee H and Davidson J (2019) Learning latent dynamics for planning from pixels. ://arxiv.org/abs/1811.04551

  41. [41]

    Mastering Atari with Discrete World Models

    Hafner D, Lillicrap T, Norouzi M and Ba J (2022) Mastering atari with discrete world models. ://arxiv.org/abs/2010.02193

  42. [42]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Hu Y, Guo Y, Wang P, Chen X, Wang YJ, Zhang J, Sreenath K, Lu C and Chen J (2024) Video prediction policy: A generalist robot policy with predictive visual representations. ://arxiv.org/abs/2412.14803

  43. [43]

    PointWorld: Scaling 3D world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026

    Huang W, Chao YW, Mousavian A, Liu MY, Fox D, Mo K and Fei-Fei L (2026) Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. ://arxiv.org/abs/2601.03782

  44. [44]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

    Huang W, Wang C, Li Y, Zhang R and Fei-Fei L (2024) Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652

  45. [45]

    IEEE Robotics and Automation Letters 5(2): 3019--3026

    James S, Ma Z, Arrojo DR and Davison AJ (2020) Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2): 3019--3026

  46. [46]

    In: European conference on computer vision

    Karaev N, Rocco I, Graham B, Neverova N, Vedaldi A and Rupprecht C (2024) Cotracker: It is better to track together. In: European conference on computer vision. Springer, pp. 18--35

  47. [47]

    ://arxiv.org/abs/2409.18121

    Kerr J, Kim CM, Wu M, Yi B, Wang Q, Goldberg K and Kanazawa A (2024) Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction. ://arxiv.org/abs/2409.18121

  48. [48]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky A, Pertsch K, Nair S, Balakrishna A, Dasari S, Karamcheti S, Nasiriany S, Srirama MK, Chen LY, Ellis K et al. (2024) Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945

  49. [49]

    Foundations and Trends in Machine Learning 12(4): 307–392

    Kingma DP and Welling M (2019) An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12(4): 307–392. doi:10.1561/2200000056. ://dx.doi.org/10.1561/2200000056

  50. [50]

    Learning to act from actionless videos through dense correspondences

    Ko PC, Mao J, Du Y, Sun SH and Tenenbaum JB (2023) Learning to act from actionless videos through dense correspondences. ://arxiv.org/abs/2310.08576

  51. [51]

    Technical Report ISRN KTH/NA/P-02/01-SE, KTH Royal Institute of Technology

    Kragic D and Christensen HI (2002) Survey on visual servoing for manipulation. Technical Report ISRN KTH/NA/P-02/01-SE, KTH Royal Institute of Technology. CVAP259

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Kwon T, Tekin B, Stühmer J, Bogo F and Pollefeys M (2021) H2o: Two hands manipulating objects for first person interaction recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10138--10148

  53. [53]

    arXiv preprint arXiv:2212.06870 (2022)

    Labbé Y, Manuelli L, Mousavian A, Tyree S, Birchfield S, Tremblay J, Carpentier J, Aubry M, Fox D and Sivic J (2022) Megapose: 6d pose estimation of novel objects via render & compare. ://arxiv.org/abs/2212.06870

  54. [54]

    International Journal of Computer Vision (IJCV) 81

    Lepetit V, Moreno-Noguer F and Fua P (2009) EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV) 81. doi:10.1007/s11263-008-0152-6

  55. [55]

    (2024) What foundation models can bring for robot learning in manipulation: A survey

    Li D, Jin Y, Sun Y, A Y, Yu H, Shi J, Hao X, Hao P, Liu H, Li X et al. (2024) What foundation models can bring for robot learning in manipulation: A survey. The International Journal of Robotics Research: 02783649251390579

  56. [56]

    Unified Video Action Model

    Li S, Gao Y, Sadigh D and Song S (2025) Unified video action model. ://arxiv.org/abs/2503.00200

  57. [57]

    In: 8th Annual Conference on Robot Learning

    Liang J, Liu R, Ozguroglu E, Sudhakar S, Dave A, Tokmakov P, Song S and Vondrick C (2024) Dreamitate: Real-world visuomotor policy learning via video generation. In: 8th Annual Conference on Robot Learning. ://openreview.net/forum?id=InT87E5sr4

  58. [58]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu B, Zhu Y, Gao C, Feng Y, Liu Q, Zhu Y and Stone P (2023) Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310

  59. [59]

    arXiv preprint arXiv:2403.03174

    Liu F, Fang K, Abbeel P and Levine S (2024) Moka: Open-world robotic manipulation through mark-based visual prompting. arXiv preprint arXiv:2403.03174

  60. [60]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu Y, Liu Y, Jiang C, Lyu K, Wan W, Shen H, Liang B, Fu Z, Wang H and Yi L (2022) Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21013--21022

  61. [61]

    arXiv preprint arXiv:2411.07223 (2024)

    Luo Y and Du Y (2025) Grounding video models to actions through goal conditioned exploration. ://arxiv.org/abs/2411.07223

  62. [62]

    IEEE Robotics and Automation Letters

    Lynch C, Wahid A, Tompson J, Ding T, Betker J, Baruch R, Armstrong T and Florence P (2023) Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters

  63. [63]

    ://arxiv.org/abs/2306.00958

    Ma YJ, Liang W, Som V, Kumar V, Zhang A, Bastani O and Jayaraman D (2023a) Liv: Language-image representations and rewards for robotic control. ://arxiv.org/abs/2306.00958

  64. [64]

    VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    Ma YJ, Sodhani S, Jayaraman D, Bastani O, Kumar V and Zhang A (2023b) Vip: Towards universal visual reward and representation via value-implicit pre-training. ://arxiv.org/abs/2210.00030

  65. [65]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Mandlekar A, Xu D, Wong J, Nasiriany S, Wang C, Kulkarni R, Fei-Fei L, Savarese S, Zhu Y and Martín-Martín R (2021) What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298

  66. [66]

    Journal of Artificial Intelligence Research 83

    McCarthy R, Tan DCH, Schmidt D, Acero F, Herr N, Du Y, Thuruthel TG and Li Z (2025) Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research 83

  67. [67]

    ://arxiv.org/abs/2112.03227

    Mees O, Hermann L, Rosete-Beas E and Burgard W (2022) Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. ://arxiv.org/abs/2112.03227

  68. [68]

    Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023

    Mendonca R, Bahl S and Pathak D (2023) Structured world models from human videos. ://arxiv.org/abs/2308.10901

  69. [69]

    In: ICCV

    Miech A, Zhukov D, Alayrac JB, Tapaswi M, Laptev I and Sivic J (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV

  70. [70]

    (2025) Recipe for vision-language-action models in robotic manipulation: A survey

    Motoda T, Makihara K, Nakajo R, Oh H, Shirai K, Hanai R, Murooka M, Suzuki Y, Nishihara H, Takeda M et al. (2025) Recipe for vision-language-action models in robotic manipulation: A survey. Authorea Preprints

  71. [71]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Nasiriany S, Kirmani S, Ding T, Smith L, Zhu Y, Driess D, Sadigh D and Xiao T (2025) Rt-affordance: Affordances are versatile intermediate representations for robot manipulation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 8249--8257

  72. [72]

    (2024) Open x-embodiment: Robotic learning datasets and rt-x models

    Open X-Embodiment Collaboration, O’Neill A, Rehman A, Maddukuri A, Gupta A, Padalkar A, Lee A, Pooley A, Gupta A, Mandlekar A, Jain A et al. (2024) Open x-embodiment: Robotic learning datasets and rt-x models. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 6892--6903

  73. [73]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A et al. (2023) Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  74. [74]

    Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

    Patel S, Mohan S, Mai H, Jain U, Lazebnik S and Li Y (2025) Robotic manipulation by imitating generated videos without physical demonstrations. ://arxiv.org/abs/2507.00990

  75. [75]

    arXiv preprint arXiv:2402.08191 (2024) 14

    Pumacay W, Singh I, Duan J, Krishna R, Thomason J and Fox D (2024) The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191

  76. [76]

    In: 2020 IEEE International Conference on Robotics and Automation (ICRA)

    Qin Z, Fang K, Zhu Y, Fei-Fei L and Savarese S (2020) Keto: Learning keypoint representations for tool manipulation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 7278--7285

  77. [77]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Qu D, Song H, Chen Q, Yao Y, Ye X, Ding Y, Wang Z, Gu J, Zhao B, Wang D et al. (2025) Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830

  78. [78]

    ://arxiv.org/abs/1806.09655

    Rybkin O, Pertsch K, Derpanis KG, Daniilidis K and Jaegle A (2019) Learning what you can do before doing anything. ://arxiv.org/abs/1806.09655

  79. [79]

    Learning to act without actions. arXiv preprint arXiv:2312.10812, 2023

    Schmidt D and Jiang M (2024) Learning to act without actions. ://arxiv.org/abs/2312.10812

  80. [80]

    ://arxiv.org/abs/2203.13880

    Seo Y, Lee K, James S and Abbeel P (2022) Reinforcement learning with action-free pre-training from videos. ://arxiv.org/abs/2203.13880

Showing first 80 references.