SynthICL: Scalable In-context Imitation Learning with Synthetic Data

Cheng Qian; Edward Johns; Ruomeng Fan; Yifei Ren; Yilong Wang

arxiv: 2606.08154 · v1 · pith:LPDF3Z6Enew · submitted 2026-06-06 · 💻 cs.RO

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

Cheng Qian , Ruomeng Fan , Yifei Ren , Yilong Wang , Edward Johns This is my paper

Pith reviewed 2026-06-27 19:22 UTC · model grok-4.3

classification 💻 cs.RO

keywords in-context imitation learningsynthetic datarobot manipulationflow matchingtransformer policysubgoal predictionRGB inputzero-shot transfer

0 comments

The pith

A robot policy trained solely on synthetic RGB images can learn new manipulation tasks from one real demonstration at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SynthICL as a way to train in-context imitation policies using only synthetic RGB data generated through a dedicated pipeline. It trains a flow-matching transformer that conditions on example demonstrations and predicts subgoals to produce actions. If this holds, it removes the need for real-world training data, depth sensors, or camera calibration while still allowing the policy to handle unseen tasks. A reader would care because this makes scaling imitation learning far more practical by shifting data collection to simulation.

Core claim

SynthICL constructs a data generation pipeline that produces high-fidelity ICIL training examples from RGB-only synthetic scenes. A flow-matching transformer policy is trained on this dataset to perform in-context learning: at test time the policy receives one or more task demonstrations and outputs actions without any further training. The model is additionally trained to predict the next subgoal image, which grounds the control in visual predictions. When evaluated on 16 previously unseen real-world manipulation tasks, the resulting policy reaches an average success rate of 79 percent using only a single demonstration.

What carries the argument

The synthetic data generation pipeline together with a flow-matching transformer policy trained to predict both actions and next subgoal images.

If this is right

Robot policies for new tasks require no real-world data collection or domain randomization during training.
Only RGB images are needed, removing requirements for depth sensors and precise camera calibration.
A single demonstration at test time is sufficient to specify and execute a new task.
Subgoal image prediction improves precision by providing intermediate visual targets for the controller.
The same trained model outperforms earlier in-context imitation methods on the reported real-world benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Larger volumes of synthetic data could be generated to cover a wider range of object shapes and environments without additional real-robot effort.
The same pipeline might be adapted to produce training data for non-manipulation skills such as navigation or assembly if equivalent simulators exist.
Because the policy never sees real data, it could be deployed in entirely new physical settings by simply updating the synthetic scene generator.

Load-bearing premise

The distribution of synthetic RGB images is close enough to real camera images that a policy trained only on the former transfers directly to the latter.

What would settle it

Run the trained policy on the same 16 real tasks but replace the input images with real photographs that introduce lighting, texture, or viewpoint shifts not present in the synthetic data; success rate falling below 40 percent would indicate the transfer assumption does not hold.

Figures

Figures reproduced from arXiv: 2606.08154 by Cheng Qian, Edward Johns, Ruomeng Fan, Yifei Ren, Yilong Wang.

**Figure 2.** Figure 2: Overview of the proposed model architecture. The policy takes context demonstration and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The 16 tasks we use in our real-world evaluation. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of SynthICL training with different data generation pipelines on real-world tasks. Results [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Data scaling results. We evaluate how policy performance changes with the scale of dataset. Specifically, we train SynthICL on datasets ranging from 3K to 100K trajectories, evaluate them on the simulation benchmark, and report the average success rate across all tasks. Results. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of the pipeline of generating pseudo demonstrations. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Details of the model architecture. observation tokens are used as query tokens, while the context tokens produced by the context encoder are used as keys and values. The encoder is implemented as a cross-attention transformer. Through cross-attention, each current-observation token can retrieve relevant information from the demonstration, such as the corresponding object, target location, or task stage. T… view at source ↗

**Figure 8.** Figure 8: Each panel visualizes the predicted subgoal images. In each panel, the first row shows the [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

read the original abstract

In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SynthICL claims 79% real-robot success from purely synthetic RGB training for in-context imitation, but the sim-to-real transfer has almost no supporting detail.

read the letter

The paper's core move is to generate ICIL trajectories entirely in simulation as RGB images, train a flow-matching transformer policy on them, and add a subgoal-image prediction objective. It then reports 79% average success across 16 held-out real manipulation tasks using a single demonstration at test time, without any real-world training data, depth, or camera calibration.

This is new in the cited ICIL literature: prior methods apparently needed real data or depth, so the fully synthetic RGB pipeline plus subgoal prediction is a distinct setup. The approach directly targets the data-collection bottleneck, which is a practical concern.

The main weakness is the transfer story. The abstract calls the data "high-fidelity" but supplies no information on rendering parameters, lighting variation, texture changes, camera intrinsics, or any other mechanism that would close the domain gap. Without those controls or even a description of how coverage of real-world conditions was ensured, the 79% number is difficult to evaluate. Only the aggregate success rate appears; there are no per-task results, error bars, or baseline comparisons in the provided text.

The work is aimed at robot-learning groups that want to scale in-context methods without heavy real-robot data collection. A reader already working on synthetic data or flow-matching policies could extract the pipeline idea and the subgoal objective. The central claim is important enough, and the framing is coherent enough, that it should go to peer review so the full methods and controls can be checked.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SynthICL, a framework for scalable in-context imitation learning (ICIL) that trains policies exclusively on RGB synthetic data using a flow-matching transformer policy augmented with subgoal prediction. It claims to achieve an average success rate of 79% on 16 unseen real-world manipulation tasks using only one demonstration at test time, outperforming prior methods, while avoiding the need for depth sensing, camera calibration, or real-world training data.

Significance. If the synthetic-to-real transfer is robustly demonstrated, this work could significantly advance scalable robot learning by enabling training without real data collection. The approach of using synthetic data for ICIL and incorporating subgoal prediction represents a promising direction for visually grounded control. However, the current presentation leaves the generalization mechanism under-specified.

major comments (3)

[Abstract] Abstract: the central empirical result of 79% average success rate on 16 tasks is stated without error bars, task-specific breakdowns, or detailed baseline comparisons, which undermines the ability to evaluate the outperformance claim and the reliability of the zero-shot transfer.
[Abstract] Abstract: the description of the synthetic data generation pipeline as producing 'high-fidelity' ICIL data lacks specifics on rendering parameters, lighting and texture variations, camera intrinsics, or any mechanisms to bridge the sim-to-real gap, despite claiming no domain randomization or real data is needed. This is load-bearing for the reported real-world performance.
[Abstract] Abstract: no information is provided on how the 16 unseen tasks were selected or how the synthetic data distribution ensures coverage of real-world variations, making the generalization claim difficult to assess without additional controls or ablations.

minor comments (1)

[Abstract] The project page link is provided but no details on reproducibility (e.g., code or data release) are mentioned.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical result of 79% average success rate on 16 tasks is stated without error bars, task-specific breakdowns, or detailed baseline comparisons, which undermines the ability to evaluate the outperformance claim and the reliability of the zero-shot transfer.

Authors: We agree the abstract would benefit from additional qualifiers on the reported result. The full manuscript reports the 79% figure with standard deviation error bars across seeds, provides task-specific breakdowns in Table 1, and includes detailed baseline comparisons (e.g., against behavior cloning and prior ICIL methods) in Section 4.2. We will revise the abstract to include the error bar and a brief note directing readers to the quantitative results and comparisons in the main text. revision: yes
Referee: [Abstract] Abstract: the description of the synthetic data generation pipeline as producing 'high-fidelity' ICIL data lacks specifics on rendering parameters, lighting and texture variations, camera intrinsics, or any mechanisms to bridge the sim-to-real gap, despite claiming no domain randomization or real data is needed. This is load-bearing for the reported real-world performance.

Authors: The abstract summarizes the approach at a high level. Full details on the pipeline—including rendering parameters (e.g., resolution and PBR settings), lighting/texture variations across synthetic scenes, camera intrinsics, and the absence of domain randomization—are provided in Section 3.2, with the sim-to-real transfer relying on visual diversity and the subgoal prediction objective. We will revise the abstract to include a short clause referencing these pipeline characteristics and the lack of real data or explicit randomization. revision: partial
Referee: [Abstract] Abstract: no information is provided on how the 16 unseen tasks were selected or how the synthetic data distribution ensures coverage of real-world variations, making the generalization claim difficult to assess without additional controls or ablations.

Authors: Task selection criteria and data coverage are described in Section 4.1 (16 manipulation tasks chosen as unseen household skills) and Section 3 (synthetic data generated with variations in object pose, appearance, lighting, and background to promote generalization). Supporting ablations appear in Section 5. We will revise the abstract to add a clarifying phrase on task selection and the role of synthetic data diversity in supporting the generalization claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on external real-world benchmarks

full rationale

The paper presents SynthICL as an empirical framework: a synthetic data pipeline generates RGB-only ICIL trajectories, a flow-matching transformer is trained on that data, and success is measured directly on 16 held-out real manipulation tasks (79% average with one demo). No equations, fitted parameters, or self-citations are shown to reduce the reported success rate to the synthetic inputs by construction. The evaluation uses external real-world benchmarks, making the central claim falsifiable outside any internal fit. This is the standard non-circular outcome for an applied ML paper whose headline result is an observed transfer performance rather than a derived identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate specific free parameters, axioms, or invented entities; the central claim rests on an unstated assumption that synthetic data generalizes without further real-world fine-tuning.

pith-pipeline@v0.9.1-grok · 5722 in / 1137 out tokens · 20087 ms · 2026-06-27T19:22:12.512995+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 16 canonical work pages · 8 internal anchors

[1]

V osylius and E

V . V osylius and E. Johns. Instant policy: In-context imitation learning via graph diffusion. arXiv preprint arXiv:2411.12633, 2024

work page arXiv 2024
[2]

M. Fu, H. Huang, G. Datta, L. Y . Chen, W. Panitch, F. Liu, H. Li, and K. Goldberg. Icrt: In- context imitation learning via next-token prediction. In 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages 5937–5944. IEEE, 2025

2025
[3]

Di Palo and E

N. Di Palo and E. Johns. Keypoint action tokens enable in-context imitation learning in robotics. arXiv preprint arXiv:2403.19578, 2024

work page arXiv 2024
[4]

Intelligence

P. Intelligence. π0: A vision-language-action model for general robot control. Technical report, Physical Intelligence, 2024

2024
[5]

C. Wang, Y . Li, J. Wu, et al. Groot: Learning generalizable robot policies via transformer pretraining. arXiv preprint arXiv:2401.XXXXX, 2024

2024
[6]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Nguyen, W

T. Nguyen, W. Yuan, S. Wei, H. Li, D. Seita, and Y . Wang. Iclr: In-context imitation learning with visual reasoning. arXiv preprint arXiv:2603.07530, 2026

work page arXiv 2026
[9]

Isaac Sim

NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim
[10]

Flow Matching for Generative Modeling

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Q. Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Hussein, M

A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017

2017
[13]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[14]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

V osylius and E

V . V osylius and E. Johns. Few-shot in-context imitation learning via implicit graph alignment. arXiv preprint arXiv:2310.12238, 2023

work page arXiv 2023
[16]

V . Jain, M. Attarian, N. J. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, et al. Vid2robot: End-to-end video-conditioned policy learning with cross- attention transformers. arXiv preprint arXiv:2403.12943, 2024. 10

work page arXiv 2024
[17]

Wang and E

Y . Wang and E. Johns. One-shot dual-arm imitation learning, 2025. URL https://arxiv. org/abs/2503.06831

work page arXiv 2025
[18]

Sharma, D

P. Sharma, D. Pathak, and A. Gupta. Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems , 2019

2019
[19]

Y . Lee, E. S. Hu, Z. Yang, and J. J. Lim. To follow or not to follow: Selective imitation learning from observations. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 11–23. PMLR, 2020

2020
[20]

Pertsch, O

K. Pertsch, O. Rybkin, J. Yang, S. Zhou, K. G. Derpanis, K. Daniilidis, J. Lim, and A. Jaegle. Keyframing the future: Keyframe discovery for visual prediction and planning. InProceedings of the 2nd Conference on Learning for Dynamics and Control , volume 120 of Proceedings of Machine Learning Research, pages 969–979. PMLR, 2020

2020
[21]

C. Wen, J. Lin, J. Qian, Y . Gao, and D. Jayaraman. Keyframe-focused visual imitation learning. In Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings of Machine Learning Research, pages 11123–11133. PMLR, 2021

2021
[22]

Lynch, M

C. Lynch, M. Khansari, T. Xiao, V . Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 1113–1132. PMLR, 2020

2020
[23]

F. Ni, J. Hao, S. Wu, L. Kou, J. Liu, Y . Zheng, B. Wang, and Y . Zhuang. Generate subgoal im- ages before act: Unlocking the chain-of-thought reasoning in diffusion model for robot manip- ulation with multimodal prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13991–14000, 2024

2024
[24]

Kang and Y .-L

X. Kang and Y .-L. Kuo. Incorporating task progress knowledge for subgoal generation in robotic manipulation through image edits. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2025

2025
[25]

K. B. Hatch, A. Balakrishna, O. Mees, S. Nair, S. Park, B. Wulfe, M. Itkina, B. Eysenbach, S. Levine, T. Kollar, and B. Burchfiel. GHIL-Glue: Hierarchical control with filtered subgoal images. arXiv preprint arXiv:2410.20018, 2024

work page arXiv 2024
[26]

J. Zhao, W. Lu, D. Zhang, Y . Liu, Y . Liang, T. Zhang, Y . Cao, J. Xie, Y . Hu, S. Wang, et al. Do you need proprioceptive states in visuomotor policies? arXiv preprint arXiv:2509.18644, 2025

work page arXiv 2025
[27]

Y . Mu, T. Chen, S. Peng, Z. Chen, Z. Gao, Y . Zou, L. Lin, Z. Xie, and P. Luo. Robotwin: Dual- arm robot benchmark with generative digital twins (early version). In European Conference on Computer Vision, pages 264–273. Springer, 2024

2024
[28]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[30]

Cimpoi, S

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , 2014

2014
[31]

Coumans and Y

E. Coumans and Y . Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021. 11

2016
[32]

H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), pages 11444–11453, 2020

2020
[33]

H.-S. Fang, M. Gou, C. Wang, and C. Lu. Robust grasping across diverse sensor qualities: The graspnet-1billion dataset. The International Journal of Robotics Research , 2023

2023
[34]

X. Ma, Y . Wang, X. Chen, G. Jia, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Bertasius, H

G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video under- standing? In Icml, volume 2, page 4, 2021

2021
[36]

James, Z

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters , 5(2):3019–3026, 2020

2020
[37]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023
[38]

H. K. Cheng, S. W. Oh, B. Price, J.-Y . Lee, and A. Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024. 12 A Pseudo-Demonstration Generation Details We generate each pseudo-demonstration in three steps (See Fig 6). First, we sample one...

2024

[1] [1]

V osylius and E

V . V osylius and E. Johns. Instant policy: In-context imitation learning via graph diffusion. arXiv preprint arXiv:2411.12633, 2024

work page arXiv 2024

[2] [2]

M. Fu, H. Huang, G. Datta, L. Y . Chen, W. Panitch, F. Liu, H. Li, and K. Goldberg. Icrt: In- context imitation learning via next-token prediction. In 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages 5937–5944. IEEE, 2025

2025

[3] [3]

Di Palo and E

N. Di Palo and E. Johns. Keypoint action tokens enable in-context imitation learning in robotics. arXiv preprint arXiv:2403.19578, 2024

work page arXiv 2024

[4] [4]

Intelligence

P. Intelligence. π0: A vision-language-action model for general robot control. Technical report, Physical Intelligence, 2024

2024

[5] [5]

C. Wang, Y . Li, J. Wu, et al. Groot: Learning generalizable robot policies via transformer pretraining. arXiv preprint arXiv:2401.XXXXX, 2024

2024

[6] [6]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Nguyen, W

T. Nguyen, W. Yuan, S. Wei, H. Li, D. Seita, and Y . Wang. Iclr: In-context imitation learning with visual reasoning. arXiv preprint arXiv:2603.07530, 2026

work page arXiv 2026

[9] [9]

Isaac Sim

NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim

[10] [10]

Flow Matching for Generative Modeling

Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Q. Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Hussein, M

A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017

2017

[13] [13]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[14] [14]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

V osylius and E

V . V osylius and E. Johns. Few-shot in-context imitation learning via implicit graph alignment. arXiv preprint arXiv:2310.12238, 2023

work page arXiv 2023

[16] [16]

V . Jain, M. Attarian, N. J. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, et al. Vid2robot: End-to-end video-conditioned policy learning with cross- attention transformers. arXiv preprint arXiv:2403.12943, 2024. 10

work page arXiv 2024

[17] [17]

Wang and E

Y . Wang and E. Johns. One-shot dual-arm imitation learning, 2025. URL https://arxiv. org/abs/2503.06831

work page arXiv 2025

[18] [18]

Sharma, D

P. Sharma, D. Pathak, and A. Gupta. Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems , 2019

2019

[19] [19]

Y . Lee, E. S. Hu, Z. Yang, and J. J. Lim. To follow or not to follow: Selective imitation learning from observations. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 11–23. PMLR, 2020

2020

[20] [20]

Pertsch, O

K. Pertsch, O. Rybkin, J. Yang, S. Zhou, K. G. Derpanis, K. Daniilidis, J. Lim, and A. Jaegle. Keyframing the future: Keyframe discovery for visual prediction and planning. InProceedings of the 2nd Conference on Learning for Dynamics and Control , volume 120 of Proceedings of Machine Learning Research, pages 969–979. PMLR, 2020

2020

[21] [21]

C. Wen, J. Lin, J. Qian, Y . Gao, and D. Jayaraman. Keyframe-focused visual imitation learning. In Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings of Machine Learning Research, pages 11123–11133. PMLR, 2021

2021

[22] [22]

Lynch, M

C. Lynch, M. Khansari, T. Xiao, V . Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 1113–1132. PMLR, 2020

2020

[23] [23]

F. Ni, J. Hao, S. Wu, L. Kou, J. Liu, Y . Zheng, B. Wang, and Y . Zhuang. Generate subgoal im- ages before act: Unlocking the chain-of-thought reasoning in diffusion model for robot manip- ulation with multimodal prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13991–14000, 2024

2024

[24] [24]

Kang and Y .-L

X. Kang and Y .-L. Kuo. Incorporating task progress knowledge for subgoal generation in robotic manipulation through image edits. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2025

2025

[25] [25]

K. B. Hatch, A. Balakrishna, O. Mees, S. Nair, S. Park, B. Wulfe, M. Itkina, B. Eysenbach, S. Levine, T. Kollar, and B. Burchfiel. GHIL-Glue: Hierarchical control with filtered subgoal images. arXiv preprint arXiv:2410.20018, 2024

work page arXiv 2024

[26] [26]

J. Zhao, W. Lu, D. Zhang, Y . Liu, Y . Liang, T. Zhang, Y . Cao, J. Xie, Y . Hu, S. Wang, et al. Do you need proprioceptive states in visuomotor policies? arXiv preprint arXiv:2509.18644, 2025

work page arXiv 2025

[27] [27]

Y . Mu, T. Chen, S. Peng, Z. Chen, Z. Gao, Y . Zou, L. Lin, Z. Xie, and P. Luo. Robotwin: Dual- arm robot benchmark with generative digital twins (early version). In European Conference on Computer Vision, pages 264–273. Springer, 2024

2024

[28] [28]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[30] [30]

Cimpoi, S

M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , 2014

2014

[31] [31]

Coumans and Y

E. Coumans and Y . Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021. 11

2016

[32] [32]

H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), pages 11444–11453, 2020

2020

[33] [33]

H.-S. Fang, M. Gou, C. Wang, and C. Lu. Robust grasping across diverse sensor qualities: The graspnet-1billion dataset. The International Journal of Robotics Research , 2023

2023

[34] [34]

X. Ma, Y . Wang, X. Chen, G. Jia, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Bertasius, H

G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video under- standing? In Icml, volume 2, page 4, 2021

2021

[36] [36]

James, Z

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters , 5(2):3019–3026, 2020

2020

[37] [37]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

2023

[38] [38]

H. K. Cheng, S. W. Oh, B. Price, J.-Y . Lee, and A. Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024. 12 A Pseudo-Demonstration Generation Details We generate each pseudo-demonstration in three steps (See Fig 6). First, we sample one...

2024