pith. machine review for the scientific record.

arxiv: 2604.04974 · v1 · submitted 2026-04-04 · 💻 cs.RO

Recognition: 1 theorem link

· Lean Theorem

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:59 UTC · model grok-4.3

classification 💻 cs.RO
keywords video learning · robot manipulation · control interfaces · temporal visual data · survey · taxonomy · robotics integration

The pith

Video-based robot manipulation methods are limited most by how predictions connect to reliable physical actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey organizes methods that turn non-action-annotated video into robotic manipulation interfaces. It groups them into three families according to where the video-to-control interface is placed and what control properties result. The review then examines how each family closes the loop, what can be checked before motion begins, and where errors typically arise. The cross-family comparison concludes that the robotics integration layer—the mechanisms linking video predictions to dependable robot behavior—contains the most pressing unsolved problems.

Core claim

The paper defines an interface-centric taxonomy that places existing video-to-control methods into three families: direct video-action policies that keep the mapping implicit, latent-action methods that pass temporal structure through a compact learned representation, and explicit visual interfaces that output interpretable targets for separate controllers. Analysis of control-integration properties across families shows that the robotics integration layer remains the primary barrier to dependable execution.
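
To make the comparison concrete, here is a minimal sketch, assuming only the family definitions above, of how the taxonomy and the control-integration properties the survey compares could be written down as data. The field names and property strings are illustrative paraphrases, not the paper's formal criteria.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Interface(Enum):
    """Where the video-to-control interface is constructed."""
    IMPLICIT = auto()   # direct video-action policies
    LATENT = auto()     # compact learned intermediate (latent actions)
    EXPLICIT = auto()   # interpretable targets (subgoal images, trajectories, poses)

@dataclass(frozen=True)
class FamilyProfile:
    """Control-integration properties compared across families (paraphrased)."""
    name: str
    interface: Interface
    loop_closure: str      # where the control loop is closed
    pre_exec_check: str    # what can be verified before motion begins
    typical_failure: str   # where errors tend to enter

TAXONOMY = [
    FamilyProfile("direct video-action policy", Interface.IMPLICIT,
                  loop_closure="inside the learned policy at every step",
                  pre_exec_check="little beyond validating the observation",
                  typical_failure="silent distribution shift in the implicit mapping"),
    FamilyProfile("latent-action method", Interface.LATENT,
                  loop_closure="through a decoder that grounds latent actions",
                  pre_exec_check="consistency of the decoded latent rollout",
                  typical_failure="latent codes that do not transfer to the robot embodiment"),
    FamilyProfile("explicit visual interface", Interface.EXPLICIT,
                  loop_closure="by a separate downstream controller",
                  pre_exec_check="feasibility of the predicted, interpretable target",
                  typical_failure="errors in the predicted subgoal, trajectory, or pose"),
]
```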

What carries the argument

The interface-centric taxonomy, which classifies methods by where the video-to-control interface is constructed and what control properties result.

If this is right

  • Each family closes the control loop at a different stage and admits different forms of pre-execution verification (see the sketch after this list).
  • Failure modes enter at distinct points depending on whether the interface is implicit, latent, or explicit.
  • Further progress requires targeted work on the robotics integration layer that translates video predictions into safe robot commands.
  • The taxonomy supplies a common language for comparing how different methods handle embodiment gaps and missing action labels.
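
As a concrete illustration of the first bullet above, the sketch below shows one hypothetical pre-execution gate per family. The function names, the workspace-bound test, and the IK-feasibility call are assumptions made for illustration, not procedures taken from the survey.

```python
import numpy as np

def check_implicit(observation: np.ndarray) -> bool:
    """Implicit interface: about all that can be gated on is the input itself."""
    return bool(np.isfinite(observation).all())

def check_latent(z: np.ndarray, decode, observation: np.ndarray,
                 max_change: float = 1.0) -> bool:
    """Latent interface: the code z is opaque, but decoding it lets us test
    whether the implied change in the observation is plausible."""
    predicted_next = decode(observation, z)
    change = float(np.linalg.norm(predicted_next - observation))
    return np.isfinite(change) and change <= max_change

def check_explicit(target_xyz: np.ndarray, workspace_lo: np.ndarray,
                   workspace_hi: np.ndarray, ik_solver) -> bool:
    """Explicit interface: the target is interpretable, so workspace containment
    and IK feasibility can be tested before any motion is commanded."""
    in_workspace = bool(np.all(target_xyz >= workspace_lo) and
                        np.all(target_xyz <= workspace_hi))
    reachable = ik_solver(target_xyz) is not None
    return in_workspace and reachable
```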

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid methods that borrow elements from multiple families could address integration weaknesses that single-family approaches leave open.
  • Standardized testbeds focused on integration properties would make it easier to measure whether new techniques actually improve dependable behavior.
  • Deployment on physical robots with changing viewpoints and contact dynamics would expose integration shortcomings more clearly than simulation results alone.

Load-bearing premise

The proposed three-family taxonomy captures the essential differences among current video-to-control methods without missing major approaches or imposing artificial divisions.

What would settle it

A published video-based manipulation technique that cannot be assigned to any of the three families in the taxonomy.

Figures

Figures reproduced from arXiv: 2604.04974 by Chen Wang, Jia Pan, Linfang Zheng, Wei Zhang, Zikai Ouyang.

Figure 1. Video-based manipulation interfaces. This survey organizes the literature by how video-derived temporal structure is connected to robot control through three recurring interface families: direct video–action policies, latent-action intermediates, and explicit visual interfaces (e.g., subgoal images, trajectories, or poses). These families differ in how explicitly that structure is exposed to the robot’s co…
Figure 2. Three families of video-based manipulation interfaces.
Figure 3. Design space of video-based manipulation.
Figure 4. Control-loop integration across three interface families.
Figure 5. Taxonomy of direct video–action modeling approaches. Left: …
Figure 6. Generic latent-action pipeline. Most latent-action methods follow a two-stage process. Stage 1 (top): Using only action-free videos, an encoder (inverse dynamics module) infers a latent action z_t from observed transitions (o_t, o_{t+H}), a bottleneck strategy constrains capacity, and a decoder (forward dynamics module) predicts future observations or their representations. The reconstruction loss (dashed bidir…
Figure 7. Latent actions inducing coherent behavior in …
Figure 8. Explicit visual interface–based methods for robotic manipulation.
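
The two-stage recipe summarized in the Figure 6 caption can be sketched in code. This is a hedged reconstruction under assumed details: MLP encoder and decoder, a simple tanh bottleneck standing in for whatever capacity constraint a given method uses, and an MSE reconstruction loss on the future observation or its representation; none of these specifics come from the survey itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Stage 1 of the generic latent-action pipeline (illustrative sizes)."""

    def __init__(self, obs_dim: int = 128, latent_dim: int = 8):
        super().__init__()
        # Inverse-dynamics encoder: (o_t, o_{t+H}) -> latent action z_t.
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        # Forward-dynamics decoder: (o_t, z_t) -> prediction of o_{t+H}.
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim))

    def forward(self, o_t: torch.Tensor, o_tH: torch.Tensor):
        z = torch.tanh(self.encoder(torch.cat([o_t, o_tH], dim=-1)))  # bounded bottleneck
        o_tH_hat = self.decoder(torch.cat([o_t, z], dim=-1))
        return z, o_tH_hat

def stage1_step(model, optimizer, o_t, o_tH):
    """One training step on action-free video: reconstruct the future observation
    (or its representation) from the current observation and the inferred latent."""
    z, o_tH_hat = model(o_t, o_tH)
    loss = F.mse_loss(o_tH_hat, o_tH)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Stage 2 (not sketched): a small policy maps robot observations to these latent
# actions, and a learned or calibrated grounding maps latents to real commands.
```
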
read the original abstract

Video is a scalable observation of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. We introduce an \emph{interface-centric taxonomy} organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video--action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the \emph{robotics integration layer} -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.
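
For the explicit-interface family named in the abstract, one recurring pattern is to turn tracked or predicted object keypoints into a rigid-body target that a downstream controller can execute. The sketch below is an illustrative implementation of the classic least-squares point-set fit (Arun, Huang and Blostein, 1987, which appears in this paper's reference list); using it as the bridge from video predictions to control here is an assumption for illustration, not the survey's prescription.

```python
import numpy as np

def rigid_transform(P: np.ndarray, Q: np.ndarray):
    """Least-squares R, t with R @ P_i + t ≈ Q_i (Arun, Huang & Blostein, 1987).
    P and Q are (N, 3) arrays of corresponding keypoints."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # correct for reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

# Hypothetical usage: keypoints tracked in the current frame versus a predicted
# subgoal frame yield a relative object motion the controller is asked to realize.
rng = np.random.default_rng(0)
current_kp = rng.random((8, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
goal_kp = current_kp @ R_true.T + np.array([0.10, 0.00, 0.05])
R, t = rigid_transform(current_kp, goal_kp)   # recovers R_true and the translation
```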

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This survey reviews methods that exploit non-action-annotated temporal video to learn control interfaces for robotic manipulation. It introduces an interface-centric taxonomy organized by where the video-to-control interface is constructed, identifying three families: direct video-action policies (implicit interface), latent-action methods (compact learned intermediate), and explicit visual interfaces (interpretable targets for downstream control). For each family the paper analyzes control-integration properties including loop closure, pre-execution verifiability, and failure modes. A cross-family synthesis concludes that the robotics integration layer constitutes the dominant open challenge and outlines research directions to close the gap.

Significance. If the taxonomy and per-family analyses hold, the survey supplies a useful organizing lens that shifts emphasis from isolated algorithmic advances to the mechanisms needed to connect video-derived predictions to dependable robot behavior. This framing can help the community prioritize integration-layer research, which is essential for translating abundant video data into practical manipulation systems.

minor comments (2)
  1. [Abstract] The three-family taxonomy is presented as capturing essential distinctions, yet the boundary between latent-action and explicit visual interfaces is not illustrated with a borderline example; adding one concrete method that could plausibly fit either category would strengthen the taxonomy's clarity without altering the central claim.
  2. [Synthesis section] The cross-family synthesis identifies the robotics integration layer as the primary open challenge; a short table or bullet list enumerating the specific integration shortcomings observed in each family would make this claim more immediately verifiable for readers.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our survey, for highlighting the utility of the interface-centric taxonomy, and for recommending acceptance. We appreciate the recognition that the work shifts focus toward the robotics integration layer as the central open challenge.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a survey paper that reviews existing methods, proposes an interface-centric taxonomy as an organizing lens, and synthesizes open challenges from the literature. No equations, derivations, fitted parameters, or predictions appear anywhere in the manuscript. The central claim (robotics integration layer as dominant challenge) is a qualitative observation drawn from per-family analysis of external work, not a reduction to any internal definition or self-citation chain. The taxonomy is explicitly presented as a useful framing rather than a uniqueness theorem or ansatz. All citations are to independent prior literature; no load-bearing step collapses to the authors' own prior results by construction. The paper is therefore self-contained against external benchmarks with zero circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it only organizes methods already present in the cited literature.

pith-pipeline@v0.9.0 · 5511 in / 987 out tokens · 41967 ms · 2026-05-13T16:59:51.886561+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 16 internal anchors

  1. [1]

    Science Robotics 10(106): eadt1497

    Ai B, Tian S, Shi H, Wang Y, Pfaff T, Tan C, Christensen HI, Su H, Wu J and Li Y (2025) A review of learning-based dynamics models for robotic manipulation. Science Robotics 10(106): eadt1497. doi:10.1126/scirobotics.adt1497. ://www.science.org/doi/abs/10.1126/scirobotics.adt1497

  2. [2]

    IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5): 698--700

    Arun KS, Huang TS and Blostein SD (1987) Least-squares fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-9(5): 698--700. doi:10.1109/TPAMI.1987.4767965

  3. [3]

    ://arxiv.org/abs/2304.08488

    Bahl S, Mendonca R, Chen L, Jain U and Pathak D (2023) Affordances from human videos as a versatile representation for robotics. ://arxiv.org/abs/2304.08488

  4. [4]

    Procedia Computer Science 217: 1887--1895

    Baratta A, Cimino A, Gnoni MG and Longo F (2023) Human robot collaboration in industry 4.0: a literature review. Procedia Computer Science 217: 1887--1895

  5. [5]

    arXiv preprint arXiv:2409.16283 (2024)

    Bharadhwaj H, Dwibedi D, Gupta A, Tulsiani S, Doersch C, Xiao T, Shah D, Xia F, Sadigh D and Kirmani S (2024a) Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. ://arxiv.org/abs/2409.16283

  6. [6]

    ://arxiv.org/abs/2405.01527

    Bharadhwaj H, Mottaghi R, Gupta A and Tulsiani S (2024b) Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. ://arxiv.org/abs/2405.01527

  7. [7]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black K, Brown N, Driess D, Esmail A, Equi M, Finn C, Fusai N, Groom L, Hausman K, Ichter B et al. (2024) π0: A vision-language-action flow model for general robot control. ://arxiv.org/abs/2410.24164

  8. [8]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Black K, Nakamoto M, Atreya P, Walke H, Finn C, Kumar A and Levine S (2023) Zero-shot robotic manipulation with pretrained image-editing diffusion models. ://arxiv.org/abs/2310.10639

  9. [9]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann A, Dockhorn T, Kulal S, Mendelevitch D, Kilian M, Lorenz D, Levi Y, English Z, Voleti V, Letts A et al. (2023) Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127

  10. [10]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Brohan A, Brown N, Carbajal J, Chebotar Y, Chen X, Choromanski K, Ding T, Driess D, Dubey A, Finn C, Florence P, Fu C, Arenas MG, Gopalakrishnan K, Han K, Hausman K, Herzog A, Hsu J, Ichter B, Irpan A, Joshi N, Julian R, Kalashnikov D, Kuang Y, Leal I, Lee L, Lee TWE, Levine S, Lu Y, Michalewski H, Mordatch I, Pertsch K, Rao K, Reymann K, Ryoo M, Salazar ...

  11. [11]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan A, Brown N, Carbajal J, Chebotar Y, Dabis J, Finn C, Gopalakrishnan K, Hausman K, Herzog A, Hsu J, Ibarz J, Ichter B, Irpan A, Jackson T, Jesmonth S, Joshi N, Julian R, Kalashnikov D, Kuang Y, Leal I, Lee KH, Levine S, Lu Y, Malla U, Manjunath D, Mordatch I, Nachum O, Parada C, Peralta J, Perez E, Pertsch K, Quiambao J, Rao K, Ryoo M, Salazar G, Sa...

  12. [12]

    Instructpix2pix: Learning to follow image editing instructions

    Brooks T, Holynski A and Efros AA (2023) Instructpix2pix: Learning to follow image editing instructions. ://arxiv.org/abs/2211.09800

  13. [13]

    Genie: Generative interactive environments, 2024

    Bruce J, Dennis M, Edwards A, Parker-Holder J, Shi Y, Hughes E, Lai M, Mavalankar A, Steigerwald R, Apps C, Aytar Y, Bechtle S, Behbahani F, Chan S, Heess N, Gonzalez L, Osindero S, Ozair S, Reed S, Zhang J, Zolna K, Clune J, de Freitas N, Singh S and Rocktäschel T (2024) Genie: Generative interactive environments. ://arxiv.org/abs/2402.15391

  14. [14]

    In: Proceedings of Robotics: Science and Systems

    Bu Q, Yang Y, Cai J, Gao S, Ren G, Yao M, Luo P and Li H (2025) Learning to Act Anywhere with Task-centric Latent Actions. In: Proceedings of Robotics: Science and Systems. Los Angeles, CA, USA. doi:10.15607/RSS.2025.XXI.014

  15. [15]

    ://arxiv.org/abs/2409.09016

    Bu Q, Zeng J, Chen L, Yang Y, Zhou G, Yan J, Luo P, Cui H, Ma Y and Li H (2024) Closed-loop visuomotor control with generative expectation for robotic manipulation. ://arxiv.org/abs/2409.09016

  16. [16]

    A short note on the kinetics-700 human action dataset

    Carreira J, Noland E, Hillier C and Zisserman A (2022) A short note on the kinetics-700 human action dataset. ://arxiv.org/abs/1907.06987

  17. [17]

    ://arxiv.org/abs/2104.04631

    Chao YW, Yang W, Xiang Y, Molchanov P, Handa A, Tremblay J, Narang YS, Wyk KV, Iqbal U, Birchfield S, Kautz J and Fox D (2021) Dexycb: A benchmark for capturing hand grasping of objects. ://arxiv.org/abs/2104.04631

  18. [18]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Cheang CL, Chen G, Jing Y, Kong T, Li H, Li Y, Liu Y, Wu H, Xu J, Yang Y, Zhang H and Zhu M (2024) Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. ://arxiv.org/abs/2410.06158

  19. [19]

    arXiv:2501.12375 (2025)

    Chen S, Guo H, Zhu S, Zhang F, Huang Z, Feng J and Kang B (2025) Video depth anything: Consistent depth estimation for super-long videos. arXiv:2501.12375

  20. [20]

    The International Journal of Robotics Research 30(11): 1343--1377

    Chen S, Li Y and Kwok NM (2011) Active vision in robotic systems: A survey of recent developments. The International Journal of Robotics Research 30(11): 1343--1377

  21. [21]

    IEEE Transactions on Cybernetics 53(3): 1682--1698

    Cong Y, Chen R, Ma B, Liu H, Hou D and Yang C (2021) A comprehensive study of 3-d vision-based robot manipulation. IEEE Transactions on Cybernetics 53(3): 1682--1698

  22. [22]

    (2018) Scaling egocentric vision: The epic-kitchens dataset

    Damen D, Doughty H, Farinella GM, Fidler S, Furnari A, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W et al. (2018) Scaling egocentric vision: The epic-kitchens dataset. In: Proceedings of the European conference on computer vision (ECCV). pp. 720--736

  23. [23]

    International Journal of Computer Vision (IJCV) 130: 33–55

    Damen D, Doughty H, Farinella GM, Furnari A, Ma J, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W and Wray M (2022) Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV) 130: 33–55. ://doi.org/10.1007/s11263-021-01531-2

  24. [24]

    (2025) A survey on reinforcement learning of vision-language-action models for robotic manipulation

    Deng H, Wu Z, Liu H, Guo W, Xue Y, Shan Z, Zhang C, Jia B, Ling Y, Lu G et al. (2025) A survey on reinforcement learning of vision-language-action models for robotic manipulation. Authorea Preprints

  25. [25]

    Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow. arXiv preprint arXiv:2512.24766, 2025

    Dharmarajan K, Huang W, Wu J, Fei-Fei L and Zhang R (2025) Dream2flow: Bridging video generation and open-world manipulation with 3d object flow. ://arxiv.org/abs/2512.24766

  26. [26]

    arXiv preprint arXiv:2507.10672

    Din MU, Akram W, Saoud LS, Rosell J and Hussain I (2025) Vision language action models in robotic manipulation: A systematic review. arXiv preprint arXiv:2507.10672

  27. [27]

    In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds.) Advances in Neural Information Processing Systems, volume 36

    Du Y, Yang S, Dai B, Dai H, Nachum O, Tenenbaum J, Schuurmans D and Abbeel P (2023) Learning universal policies via text-guided video generation. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M and Levine S (eds.) Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., pp. 9156--9172. ://proceedings.neurips.cc/paper_file...

  28. [28]

    In: 2008 IEEE International Conference on Automation and Logistics

    Echelmeyer W, Kirchheim A and Wellbrock E (2008) Robotics-logistics: Challenges for automation of logistic processes. In: 2008 IEEE International Conference on Automation and Logistics. IEEE, pp. 2099--2103

  29. [29]

    ://arxiv.org/abs/1805.07914

    Edwards AD, Sahni H, Schroecker Y and Isbell CL (2019) Imitating latent policies from observation. ://arxiv.org/abs/1805.07914

  30. [30]

    arXiv preprint arXiv:2205.04382

    Eisner B, Zhang H and Held D (2022) Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. arXiv preprint arXiv:2205.04382

  31. [31]

    ://arxiv.org/abs/2402.07127

    Eze C and Crick C (2025) Learning by watching: A review of video-based learning approaches for robot manipulation. ://arxiv.org/abs/2402.07127

  32. [32]

    Remittances Review 9: 871--888

    Faheem MA, Zafar N, Kumar P, Melon M, Prince N and Al Mamun MA (2024) Ai and robotic: About the transformation of construction industry automation as well as labor productivity. Remittances Review 9: 871--888

  33. [33]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Fang HS, Fang H, Tang Z, Liu J, Wang C, Wang J, Zhu H and Lu C (2023) Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. arXiv preprint arXiv:2307.00595

  34. [34]

    Autonomous robots 9(3): 227--235

    Fiorini P and Prassler E (2000) Cleaning and household robots: A technology survey. Autonomous robots 9(3): 227--235

  35. [35]

    Fischler MA and Bolles RC (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6): 381–395. doi:10.1145/358669.358692. ://doi.org/10.1145/358669.358692

  36. [36]

    something something

    Goyal R, Kahou SE, Michalski V, Materzyńska J, Westphal S, Kim H, Haenel V, Fruend I, Yianilos P, Mueller-Freitag M, Hoppe F, Thurau C, Bax I and Memisevic R (2017) The "something something" video database for learning and evaluating visual common sense. ://arxiv.org/abs/1706.04261

  37. [37]

    (2022) Ego4d: Around the world in 3,000 hours of egocentric video

    Grauman K, Westbury A, Byrne E, Chavis Z, Furnari A, Girdhar R, Hamburger J, Jiang H, Liu M, Liu X et al. (2022) Ego4d: Around the world in 3,000 hours of egocentric video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18995--19012

  38. [38]

    Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659, 2023

    Gu J, Xiang F, Li X, Ling Z, Liu X, Mu T, Tang Y, Tao S, Wei X, Yao Y et al. (2023) Maniskill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659

  39. [39]

    ://arxiv.org/abs/2411.18179

    Guo Y, Hu Y, Zhang J, Wang YJ, Chen X, Lu C and Chen J (2024) Prediction with action: Visual policy learning via joint denoising process. ://arxiv.org/abs/2411.18179

  40. [40]

    Learning Latent Dynamics for Planning from Pixels

    Hafner D, Lillicrap T, Fischer I, Villegas R, Ha D, Lee H and Davidson J (2019) Learning latent dynamics for planning from pixels. ://arxiv.org/abs/1811.04551

  41. [41]

    Mastering Atari with Discrete World Models

    Hafner D, Lillicrap T, Norouzi M and Ba J (2022) Mastering atari with discrete world models. ://arxiv.org/abs/2010.02193

  42. [42]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Hu Y, Guo Y, Wang P, Chen X, Wang YJ, Zhang J, Sreenath K, Lu C and Chen J (2024) Video prediction policy: A generalist robot policy with predictive visual representations. ://arxiv.org/abs/2412.14803

  43. [43]

    PointWorld: Scaling 3D world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026

    Huang W, Chao YW, Mousavian A, Liu MY, Fox D, Mo K and Fei-Fei L (2026) Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. ://arxiv.org/abs/2601.03782

  44. [44]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024

    Huang W, Wang C, Li Y, Zhang R and Fei-Fei L (2024) Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652

  45. [45]

    IEEE Robotics and Automation Letters 5(2): 3019--3026

    James S, Ma Z, Arrojo DR and Davison AJ (2020) Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2): 3019--3026

  46. [46]

    In: European conference on computer vision

    Karaev N, Rocco I, Graham B, Neverova N, Vedaldi A and Rupprecht C (2024) Cotracker: It is better to track together. In: European conference on computer vision. Springer, pp. 18--35

  47. [47]

    ://arxiv.org/abs/2409.18121

    Kerr J, Kim CM, Wu M, Yi B, Wang Q, Goldberg K and Kanazawa A (2024) Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction. ://arxiv.org/abs/2409.18121

  48. [48]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Khazatsky A, Pertsch K, Nair S, Balakrishna A, Dasari S, Karamcheti S, Nasiriany S, Srirama MK, Chen LY, Ellis K et al. (2024) Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945

  49. [49]

    Foundations and Trends in Machine Learning 12(4): 307–392

    Kingma DP and Welling M (2019) An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12(4): 307–392. doi:10.1561/2200000056. ://dx.doi.org/10.1561/2200000056

  50. [50]

    Learning to act from actionless videos through dense correspondences

    Ko PC, Mao J, Du Y, Sun SH and Tenenbaum JB (2023) Learning to act from actionless videos through dense correspondences. ://arxiv.org/abs/2310.08576

  51. [51]

    Technical Report ISRN KTH/NA/P-02/01-SE, KTH Royal Institute of Technology

    Kragic D and Christensen HI (2002) Survey on visual servoing for manipulation. Technical Report ISRN KTH/NA/P-02/01-SE, KTH Royal Institute of Technology. CVAP259

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Kwon T, Tekin B, Stühmer J, Bogo F and Pollefeys M (2021) H2o: Two hands manipulating objects for first person interaction recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10138--10148

  53. [53]

    arXiv preprint arXiv:2212.06870 (2022)

    Labbé Y, Manuelli L, Mousavian A, Tyree S, Birchfield S, Tremblay J, Carpentier J, Aubry M, Fox D and Sivic J (2022) Megapose: 6d pose estimation of novel objects via render & compare. ://arxiv.org/abs/2212.06870

  54. [54]

    International Journal of Computer Vision (IJCV) 81

    Lepetit V, Moreno-Noguer F and Fua P (2009) EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV) 81. doi:10.1007/s11263-008-0152-6

  55. [55]

    (2024) What foundation models can bring for robot learning in manipulation: A survey

    Li D, Jin Y, Sun Y, A Y, Yu H, Shi J, Hao X, Hao P, Liu H, Li X et al. (2024) What foundation models can bring for robot learning in manipulation: A survey. The International Journal of Robotics Research: 02783649251390579

  56. [56]

    Unified Video Action Model

    Li S, Gao Y, Sadigh D and Song S (2025) Unified video action model. ://arxiv.org/abs/2503.00200

  57. [57]

    In: 8th Annual Conference on Robot Learning

    Liang J, Liu R, Ozguroglu E, Sudhakar S, Dave A, Tokmakov P, Song S and Vondrick C (2024) Dreamitate: Real-world visuomotor policy learning via video generation. In: 8th Annual Conference on Robot Learning. ://openreview.net/forum?id=InT87E5sr4

  58. [58]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu B, Zhu Y, Gao C, Feng Y, Liu Q, Zhu Y and Stone P (2023) Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310

  59. [59]

    arXiv preprint arXiv:2403.03174

    Liu F, Fang K, Abbeel P and Levine S (2024) Moka: Open-world robotic manipulation through mark-based visual prompting. arXiv preprint arXiv:2403.03174

  60. [60]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu Y, Liu Y, Jiang C, Lyu K, Wan W, Shen H, Liang B, Fu Z, Wang H and Yi L (2022) Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21013--21022

  61. [61]

    arXiv preprint arXiv:2411.07223 (2024)

    Luo Y and Du Y (2025) Grounding video models to actions through goal conditioned exploration. ://arxiv.org/abs/2411.07223

  62. [62]

    IEEE Robotics and Automation Letters

    Lynch C, Wahid A, Tompson J, Ding T, Betker J, Baruch R, Armstrong T and Florence P (2023) Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters

  63. [63]

    ://arxiv.org/abs/2306.00958

    Ma YJ, Liang W, Som V, Kumar V, Zhang A, Bastani O and Jayaraman D (2023a) Liv: Language-image representations and rewards for robotic control. ://arxiv.org/abs/2306.00958

  64. [64]

    VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    Ma YJ, Sodhani S, Jayaraman D, Bastani O, Kumar V and Zhang A (2023b) Vip: Towards universal visual reward and representation via value-implicit pre-training. ://arxiv.org/abs/2210.00030

  65. [65]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Mandlekar A, Xu D, Wong J, Nasiriany S, Wang C, Kulkarni R, Fei-Fei L, Savarese S, Zhu Y and Martín-Martín R (2021) What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298

  66. [66]

    Journal of Artificial Intelligence Research 83

    McCarthy R, Tan DCH, Schmidt D, Acero F, Herr N, Du Y, Thuruthel TG and Li Z (2025) Towards generalist robot learning from internet video: A survey. Journal of Artificial Intelligence Research 83

  67. [67]

    ://arxiv.org/abs/2112.03227

    Mees O, Hermann L, Rosete-Beas E and Burgard W (2022) Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. ://arxiv.org/abs/2112.03227

  68. [68]

    Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023

    Mendonca R, Bahl S and Pathak D (2023) Structured world models from human videos. ://arxiv.org/abs/2308.10901

  69. [69]

    In: ICCV

    Miech A, Zhukov D, Alayrac JB, Tapaswi M, Laptev I and Sivic J (2019) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In: ICCV

  70. [70]

    (2025) Recipe for vision-language-action models in robotic manipulation: A survey

    Motoda T, Makihara K, Nakajo R, Oh H, Shirai K, Hanai R, Murooka M, Suzuki Y, Nishihara H, Takeda M et al. (2025) Recipe for vision-language-action models in robotic manipulation: A survey. Authorea Preprints

  71. [71]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Nasiriany S, Kirmani S, Ding T, Smith L, Zhu Y, Driess D, Sadigh D and Xiao T (2025) Rt-affordance: Affordances are versatile intermediate representations for robot manipulation. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 8249--8257

  72. [72]

    (2024) Open x-embodiment: Robotic learning datasets and rt-x models

    Open X-Embodiment Collaboration, O’Neill A, Rehman A, Maddukuri A, Gupta A, Padalkar A, Lee A, Pooley A, Gupta A, Mandlekar A, Jain A et al. (2024) Open x-embodiment: Robotic learning datasets and rt-x models. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 6892--6903

  73. [73]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V, Fernandez P, Haziza D, Massa F, El-Nouby A et al. (2023) Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  74. [74]

    Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

    Patel S, Mohan S, Mai H, Jain U, Lazebnik S and Li Y (2025) Robotic manipulation by imitating generated videos without physical demonstrations. ://arxiv.org/abs/2507.00990

  75. [75]

    arXiv preprint arXiv:2402.08191 (2024) 14

    Pumacay W, Singh I, Duan J, Krishna R, Thomason J and Fox D (2024) The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191

  76. [76]

    In: 2020 IEEE International Conference on Robotics and Automation (ICRA)

    Qin Z, Fang K, Zhu Y, Fei-Fei L and Savarese S (2020) Keto: Learning keypoint representations for tool manipulation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 7278--7285

  77. [77]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Qu D, Song H, Chen Q, Yao Y, Ye X, Ding Y, Wang Z, Gu J, Zhao B, Wang D et al. (2025) Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830

  78. [78]

    ://arxiv.org/abs/1806.09655

    Rybkin O, Pertsch K, Derpanis KG, Daniilidis K and Jaegle A (2019) Learning what you can do before doing anything. ://arxiv.org/abs/1806.09655

  79. [79]

    Learning to act without actions. arXiv preprint arXiv:2312.10812, 2023

    Schmidt D and Jiang M (2024) Learning to act without actions. ://arxiv.org/abs/2312.10812

  80. [80]

    ://arxiv.org/abs/2203.13880

    Seo Y, Lee K, James S and Abbeel P (2022) Reinforcement learning with action-free pre-training from videos. ://arxiv.org/abs/2203.13880

Showing first 80 references.