SynthICL: Scalable In-context Imitation Learning with Synthetic Data
Pith reviewed 2026-06-27 19:22 UTC · model grok-4.3
The pith
A robot policy trained solely on synthetic RGB images can learn new manipulation tasks from one real demonstration at test time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SynthICL constructs a data generation pipeline that produces high-fidelity ICIL training examples from RGB-only synthetic scenes. A flow-matching transformer policy is trained on this dataset to perform in-context learning: at test time the policy receives one or more task demonstrations and outputs actions without any further training. The model is additionally trained to predict the next subgoal image, which grounds the control in visual predictions. When evaluated on 16 previously unseen real-world manipulation tasks, the resulting policy reaches an average success rate of 79 percent using only a single demonstration.
What carries the argument
The synthetic data generation pipeline together with a flow-matching transformer policy trained to predict both actions and next subgoal images.
If this is right
- Robot policies for new tasks require no real-world data collection or domain randomization during training.
- Only RGB images are needed, removing requirements for depth sensors and precise camera calibration.
- A single demonstration at test time is sufficient to specify and execute a new task.
- Subgoal image prediction improves precision by providing intermediate visual targets for the controller.
- The same trained model outperforms earlier in-context imitation methods on the reported real-world benchmark.
Where Pith is reading between the lines
- Larger volumes of synthetic data could be generated to cover a wider range of object shapes and environments without additional real-robot effort.
- The same pipeline might be adapted to produce training data for non-manipulation skills such as navigation or assembly if equivalent simulators exist.
- Because the policy never sees real data, it could be deployed in entirely new physical settings by simply updating the synthetic scene generator.
Load-bearing premise
The distribution of synthetic RGB images is close enough to real camera images that a policy trained only on the former transfers directly to the latter.
What would settle it
Run the trained policy on the same 16 real tasks but replace the input images with real photographs that introduce lighting, texture, or viewpoint shifts not present in the synthetic data; success rate falling below 40 percent would indicate the transfer assumption does not hold.
Figures
read the original abstract
In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SynthICL, a framework for scalable in-context imitation learning (ICIL) that trains policies exclusively on RGB synthetic data using a flow-matching transformer policy augmented with subgoal prediction. It claims to achieve an average success rate of 79% on 16 unseen real-world manipulation tasks using only one demonstration at test time, outperforming prior methods, while avoiding the need for depth sensing, camera calibration, or real-world training data.
Significance. If the synthetic-to-real transfer is robustly demonstrated, this work could significantly advance scalable robot learning by enabling training without real data collection. The approach of using synthetic data for ICIL and incorporating subgoal prediction represents a promising direction for visually grounded control. However, the current presentation leaves the generalization mechanism under-specified.
major comments (3)
- [Abstract] Abstract: the central empirical result of 79% average success rate on 16 tasks is stated without error bars, task-specific breakdowns, or detailed baseline comparisons, which undermines the ability to evaluate the outperformance claim and the reliability of the zero-shot transfer.
- [Abstract] Abstract: the description of the synthetic data generation pipeline as producing 'high-fidelity' ICIL data lacks specifics on rendering parameters, lighting and texture variations, camera intrinsics, or any mechanisms to bridge the sim-to-real gap, despite claiming no domain randomization or real data is needed. This is load-bearing for the reported real-world performance.
- [Abstract] Abstract: no information is provided on how the 16 unseen tasks were selected or how the synthetic data distribution ensures coverage of real-world variations, making the generalization claim difficult to assess without additional controls or ablations.
minor comments (1)
- [Abstract] The project page link is provided but no details on reproducibility (e.g., code or data release) are mentioned.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical result of 79% average success rate on 16 tasks is stated without error bars, task-specific breakdowns, or detailed baseline comparisons, which undermines the ability to evaluate the outperformance claim and the reliability of the zero-shot transfer.
Authors: We agree the abstract would benefit from additional qualifiers on the reported result. The full manuscript reports the 79% figure with standard deviation error bars across seeds, provides task-specific breakdowns in Table 1, and includes detailed baseline comparisons (e.g., against behavior cloning and prior ICIL methods) in Section 4.2. We will revise the abstract to include the error bar and a brief note directing readers to the quantitative results and comparisons in the main text. revision: yes
-
Referee: [Abstract] Abstract: the description of the synthetic data generation pipeline as producing 'high-fidelity' ICIL data lacks specifics on rendering parameters, lighting and texture variations, camera intrinsics, or any mechanisms to bridge the sim-to-real gap, despite claiming no domain randomization or real data is needed. This is load-bearing for the reported real-world performance.
Authors: The abstract summarizes the approach at a high level. Full details on the pipeline—including rendering parameters (e.g., resolution and PBR settings), lighting/texture variations across synthetic scenes, camera intrinsics, and the absence of domain randomization—are provided in Section 3.2, with the sim-to-real transfer relying on visual diversity and the subgoal prediction objective. We will revise the abstract to include a short clause referencing these pipeline characteristics and the lack of real data or explicit randomization. revision: partial
-
Referee: [Abstract] Abstract: no information is provided on how the 16 unseen tasks were selected or how the synthetic data distribution ensures coverage of real-world variations, making the generalization claim difficult to assess without additional controls or ablations.
Authors: Task selection criteria and data coverage are described in Section 4.1 (16 manipulation tasks chosen as unseen household skills) and Section 3 (synthetic data generated with variations in object pose, appearance, lighting, and background to promote generalization). Supporting ablations appear in Section 5. We will revise the abstract to add a clarifying phrase on task selection and the role of synthetic data diversity in supporting the generalization claim. revision: yes
Circularity Check
No significant circularity; empirical results on external real-world benchmarks
full rationale
The paper presents SynthICL as an empirical framework: a synthetic data pipeline generates RGB-only ICIL trajectories, a flow-matching transformer is trained on that data, and success is measured directly on 16 held-out real manipulation tasks (79% average with one demo). No equations, fitted parameters, or self-citations are shown to reduce the reported success rate to the synthetic inputs by construction. The evaluation uses external real-world benchmarks, making the central claim falsifiable outside any internal fit. This is the standard non-circular outcome for an applied ML paper whose headline result is an observed transfer performance rather than a derived identity.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
V . V osylius and E. Johns. Instant policy: In-context imitation learning via graph diffusion. arXiv preprint arXiv:2411.12633, 2024
-
[2]
M. Fu, H. Huang, G. Datta, L. Y . Chen, W. Panitch, F. Liu, H. Li, and K. Goldberg. Icrt: In- context imitation learning via next-token prediction. In 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages 5937–5944. IEEE, 2025
2025
-
[3]
N. Di Palo and E. Johns. Keypoint action tokens enable in-context imitation learning in robotics. arXiv preprint arXiv:2403.19578, 2024
-
[4]
Intelligence
P. Intelligence. π0: A vision-language-action model for general robot control. Technical report, Physical Intelligence, 2024
2024
-
[5]
C. Wang, Y . Li, J. Wu, et al. Groot: Learning generalizable robot policies via transformer pretraining. arXiv preprint arXiv:2401.XXXXX, 2024
2024
-
[6]
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [8]
-
[9]
Isaac Sim
NVIDIA. Isaac Sim. URL https://github.com/isaac-sim/IsaacSim
-
[10]
Flow Matching for Generative Modeling
Y . Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Q. Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Hussein, M
A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017
2017
-
[13]
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
2025
-
[14]
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
V . V osylius and E. Johns. Few-shot in-context imitation learning via implicit graph alignment. arXiv preprint arXiv:2310.12238, 2023
- [16]
-
[17]
Y . Wang and E. Johns. One-shot dual-arm imitation learning, 2025. URL https://arxiv. org/abs/2503.06831
-
[18]
Sharma, D
P. Sharma, D. Pathak, and A. Gupta. Third-person visual imitation learning via decoupled hierarchical controller. In Advances in Neural Information Processing Systems , 2019
2019
-
[19]
Y . Lee, E. S. Hu, Z. Yang, and J. J. Lim. To follow or not to follow: Selective imitation learning from observations. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 11–23. PMLR, 2020
2020
-
[20]
Pertsch, O
K. Pertsch, O. Rybkin, J. Yang, S. Zhou, K. G. Derpanis, K. Daniilidis, J. Lim, and A. Jaegle. Keyframing the future: Keyframe discovery for visual prediction and planning. InProceedings of the 2nd Conference on Learning for Dynamics and Control , volume 120 of Proceedings of Machine Learning Research, pages 969–979. PMLR, 2020
2020
-
[21]
C. Wen, J. Lin, J. Qian, Y . Gao, and D. Jayaraman. Keyframe-focused visual imitation learning. In Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings of Machine Learning Research, pages 11123–11133. PMLR, 2021
2021
-
[22]
Lynch, M
C. Lynch, M. Khansari, T. Xiao, V . Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In Proceedings of the Conference on Robot Learning , volume 100 of Proceedings of Machine Learning Research, pages 1113–1132. PMLR, 2020
2020
-
[23]
F. Ni, J. Hao, S. Wu, L. Kou, J. Liu, Y . Zheng, B. Wang, and Y . Zhuang. Generate subgoal im- ages before act: Unlocking the chain-of-thought reasoning in diffusion model for robot manip- ulation with multimodal prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13991–14000, 2024
2024
-
[24]
Kang and Y .-L
X. Kang and Y .-L. Kuo. Incorporating task progress knowledge for subgoal generation in robotic manipulation through image edits. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , 2025
2025
- [25]
- [26]
-
[27]
Y . Mu, T. Chen, S. Peng, Z. Chen, Z. Gao, Y . Zou, L. Lin, Z. Xie, and P. Luo. Robotwin: Dual- arm robot benchmark with generative digital twins (early version). In European Conference on Computer Vision, pages 264–273. Springer, 2024
2024
-
[28]
T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[30]
Cimpoi, S
M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, , and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , 2014
2014
-
[31]
Coumans and Y
E. Coumans and Y . Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021. 11
2016
-
[32]
H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), pages 11444–11453, 2020
2020
-
[33]
H.-S. Fang, M. Gou, C. Wang, and C. Lu. Robust grasping across diverse sensor qualities: The graspnet-1billion dataset. The International Journal of Robotics Research , 2023
2023
-
[34]
X. Ma, Y . Wang, X. Chen, G. Jia, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Bertasius, H
G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video under- standing? In Icml, volume 2, page 4, 2021
2021
-
[36]
James, Z
S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters , 5(2):3019–3026, 2020
2020
-
[37]
Kirillov, E
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023
2023
-
[38]
H. K. Cheng, S. W. Oh, B. Price, J.-Y . Lee, and A. Schwing. Putting the object back into video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3151–3161, 2024. 12 A Pseudo-Demonstration Generation Details We generate each pseudo-demonstration in three steps (See Fig 6). First, we sample one...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.