pith. machine review for the scientific record.

arxiv: 2605.02757 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.RO

Recognition: 3 theorem links

Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation

Chang Xu, Chenyu Hui, Fei Wang, Shan You, Siyu Xu, Tao Huang, Xiaodi Huang, Yunke Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:16 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords video augmentation · vision-language-action models · simulation-to-real transfer · robotic data synthesis · conditional video generation · diffusion model acceleration · VLA training

The pith

An efficient pipeline turns simulated robotic videos into realistic ones that preserve actions and semantics, raising VLA model performance on real tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the visual domain gap between abundant but unrealistic simulated data and scarce real-world videos for training vision-language-action models. It does so by pulling structured conditions from simulation through segmentation and captioning, rewriting those captions to vary environments, and feeding them into a conditional video generator that produces realistic footage while keeping task meaning and trajectories intact. Efficiency comes from reusing diffusion features across frames and picking a minimal non-redundant subset via coreset sampling. If the method works, training data can scale without extra real-robot collection, and models should generalize better to physical settings. The authors test this on multiple benchmarks and a real platform, reporting consistent gains for existing VLA architectures.
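The pipeline described above can be sketched as a single function. This is a minimal illustration assuming the stage interfaces; the function names and signatures are placeholders, not the paper's actual API.

```python
# Hypothetical sketch of the augmentation pipeline; every callable here is a
# placeholder standing in for a real component (segmenter, captioner, LLM
# rewriter, conditional video generator).

def augment_simulated_video(sim_video, seg_model, captioner, rewriter, generator):
    """Convert one simulated clip into a realistic clip with the same actions."""
    masks = seg_model(sim_video)        # per-frame semantic segmentation
    caption = captioner(sim_video)      # task-level description of the clip
    new_caption = rewriter(caption)     # vary background, lighting, textures
    # Conditional synthesis: the masks pin layout and motion, the rewritten
    # caption controls appearance, so actions and task semantics survive.
    return generator(masks=masks, prompt=new_caption)
```

The division of labor is the point: structural conditions (masks) are held fixed while only the appearance-controlling condition (the caption) is perturbed.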

Core claim

The central claim is that simulated VLA videos can be converted into realistic training videos through a pipeline of video semantic segmentation, caption extraction and rewriting, and conditional video synthesis, with added diffusion feature reuse and coreset sampling to keep the process fast and scalable, and that the resulting data measurably improves downstream VLA performance on both simulated and real robotic benchmarks.

What carries the argument

The conditional video transfer model that takes structured simulation conditions and generates realistic videos while holding action trajectories and task semantics fixed.

If this is right

  • Large volumes of cheap simulated data become usable for real-world VLA training once the visual gap is closed.
  • Models such as RDT-1B gain roughly 8 percent on Robotwin 2.0 and π0 gains 5.1 percent on the harder LIBERO-Plus set.
  • The same augmentation works across several simulated environments and transfers to physical robot platforms.
  • Feature reuse and coreset selection make the process fast enough to apply at the scale needed for modern VLA training.
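The "minimal non-redundant subset" idea behind coreset selection can be illustrated with a greedy k-center pass over feature embeddings. This is an assumption-laden sketch: the paper's actual criterion also appears to weigh sample difficulty (mean policy loss, per Figure 6), which is omitted here.

```python
import numpy as np

def greedy_coreset(features: np.ndarray, budget: int) -> list[int]:
    """Pick `budget` indices whose embeddings cover the dataset (k-center greedy)."""
    selected = [0]  # seed with an arbitrary point
    # Distance from every sample to its nearest selected center so far.
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dists))          # farthest-from-coverage sample
        selected.append(idx)
        new_d = np.linalg.norm(features - features[idx], axis=1)
        dists = np.minimum(dists, new_d)     # update nearest-center distances
    return selected
```

Because each new pick maximizes distance to the current set, near-duplicate simulated clips are skipped and the augmentation budget is spent on diverse samples.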

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could cut the cost of collecting real robotic demonstrations by an order of magnitude if the realism transfer generalizes to new tasks.
  • Similar condition-extraction plus rewriting steps might help close sim-to-real gaps in other video-heavy domains such as autonomous driving or human motion prediction.
  • One could test whether feeding the rewritten captions back into the simulator itself creates even more diverse training distributions without extra generation cost.

Load-bearing premise

The generated videos must stay faithful to the original simulation's task meaning and motion paths without adding artifacts that confuse the downstream learner.

What would settle it

Train the same VLA models on the augmented dataset versus the original simulated data alone and measure whether real-world task success rates stay flat or drop instead of rising.
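That settling experiment reduces to a controlled comparison: one model, two training sets, success rates measured on the same real-world tasks across seeds. A hedged sketch of the bookkeeping (the numbers and helper below are illustrative, not results from the paper):

```python
import statistics

def compare_success_rates(baseline_runs, augmented_runs):
    """Mean real-world success rates per condition, and their gap, across seeds."""
    b = statistics.mean(baseline_runs)   # trained on original simulated data
    a = statistics.mean(augmented_runs)  # trained on augmented realistic data
    return {"baseline": b, "augmented": a, "delta": a - b}
```

A flat or negative delta across seeds would undercut the augmentation claim; a consistent positive delta of the reported size (e.g. +8% for RDT-1B on Robotwin 2.0) would support it.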

Figures

Figures reproduced from arXiv: 2605.02757 by Chang Xu, Chenyu Hui, Fei Wang, Shan You, Siyu Xu, Tao Huang, Xiaodi Huang, Yunke Wang.

Figure 1
Figure 1: Overall framework of the proposed method. Given a large-scale simulation training set, we introduce a coreset sampling algorithm to select important and diverse samples, which are then augmented to realistic video strips and used for training. view at source ↗
Figure 2
Figure 2: Examples from LIBERO-Plus evaluation. The baseline VLA model fails under environment perturbations such as texture change (upper) and lighting change (lower), while the model trained with our augmented data performs the tasks correctly, showing stronger generalization. view at source ↗
Figure 3
Figure 3: Euclidean distance between adjacent velocity predictions. A stable phase with minimal changes enables caching and reuse. view at source ↗
Figure 5
Figure 5: Two manipulation tasks (Slot Pen and Stack Tape) under three test conditions: (a) In-Distribution, (b) Position Shift, and (c) Background Shift. view at source ↗
Figure 6
Figure 6: Visualization of coreset sampling on the LIBERO training dataset with a 10% sampling budget. Right: the global difficulty distribution, where the color spectrum represents the mean policy loss (redder colors indicate higher difficulty). Left: the selected coreset overlaid on the full dataset. view at source ↗
Figure 7
Figure 7: Performance comparison of different coreset sampling percentages on the LIBERO-Plus spatial suite. view at source ↗
Figure 8
Figure 8: Acceleration rates across 10 Robotwin 2.0 tasks. Our method reduces runtime by over 60% on average. Tasks 1-10: beat block hammer, adjust bottle, handover block, hanging mug, pick dual bottles, place a2b right, place burger fries, place dual shoes, stack blocks two, pick diverse bottles. view at source ↗
Figure 9
Figure 9: Visualizations of augmented videos and original videos from Robotwin 2.0. view at source ↗
Figure 10
Figure 10: Visualizations of augmented videos and original videos from LIBERO. view at source ↗
Figure 11
Figure 11: Visualizations of augmented videos and original videos from the real robot experiment. view at source ↗
Figure 12
Figure 12: Visualizations of augmented videos and original videos from the real robot experiment. view at source ↗
Figure 13
Figure 13: Comparison between using cache-based acceleration and not using it. view at source ↗
Figure 14
Figure 14: Comparison between using cache-based acceleration and not using it. view at source ↗
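The velocity-caching mechanism that Figure 3 motivates can be sketched directly: when adjacent diffusion steps predict nearly identical velocities, the denoiser call is skipped and the cached velocity reused. The threshold, Euler step, and skip schedule below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def denoise_with_cache(x, velocity_fn, steps, dt=0.05, tol=0.1):
    """Euler integration of a flow/diffusion ODE with velocity reuse.

    Skips the next model evaluation whenever two consecutive velocity
    predictions differ by less than `tol` (the "stable phase" of Figure 3).
    """
    prev_v, skip_next, evals = None, False, 0
    for t in range(steps):
        if skip_next and prev_v is not None:
            v = prev_v                  # reuse cached velocity, no model call
            skip_next = False
        else:
            v = velocity_fn(x, t)       # full denoiser evaluation
            evals += 1
            if prev_v is not None and np.linalg.norm(v - prev_v) < tol:
                skip_next = True        # stable phase: cache for next step
            prev_v = v
        x = x + dt * v
    return x, evals
```

With a perfectly stable velocity field this alternates compute/reuse after warm-up, cutting model calls roughly in half; the paper reports over 60% average runtime reduction with its own schedule.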
read the original abstract

Vision-language-action (VLA) models typically rely on large-scale real-world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real-world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature-reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non-redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT-1B by 8% on Robotwin 2.0, and boosts $\pi_0$ by 5.1% on the more challenging LIBERO-Plus benchmark. Code is available at: https://github.com/nanfangxiansheng/Seeing-Realism-from-Simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims an efficient video transfer pipeline that converts simulated VLA videos into realistic training data by extracting semantic segmentation and captions from simulation, rewriting captions for environmental diversity, and applying a conditional diffusion model. Efficiency is achieved via a diffusion feature-reuse mechanism across timesteps and coreset sampling; experiments on Robotwin 2.0, LIBERO, LIBERO-Plus, and a real robot platform report consistent gains such as +8% for RDT-1B on Robotwin 2.0 and +5.1% for π0 on LIBERO-Plus, with code released.

Significance. If the generated videos preserve action trajectories and task semantics, the framework could meaningfully expand the scale of VLA training by bridging the sim-to-real gap with inexpensive simulated data. The explicit code release at the cited GitHub repository is a clear strength that supports reproducibility and downstream use.

major comments (2)
  1. [§3] §3 (method pipeline): The claim that the conditional diffusion model 'preserves task semantics and action trajectories' rests on indirect conditioning via segmentation masks and rewritten captions alone; no explicit motion or optical-flow conditioning is described, and no quantitative verification metrics (e.g., trajectory MSE, flow consistency, or policy rollout divergence) are reported to confirm preservation. This is load-bearing for the downstream generalization improvements.
  2. [§4] §4 (experimental results): The reported gains (8% on RDT-1B, 5.1% on π0) are stated without accompanying details on the number of random seeds, statistical significance tests, full ablation controls, or exhaustive baseline comparisons; this weakens the ability to attribute improvements specifically to the video transfer rather than other factors.
minor comments (2)
  1. [Abstract] Abstract: The model name π0 should be expanded on first use for readers unfamiliar with the specific VLA architecture.
  2. [§3.3] The coreset sampling strategy is introduced for computational efficiency but lacks a precise algorithmic description or pseudocode that would allow exact reproduction from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of the framework along with the code release. We address each major comment below and will revise the manuscript to strengthen the claims with additional evidence and details.

read point-by-point responses
  1. Referee: [§3] §3 (method pipeline): The claim that the conditional diffusion model 'preserves task semantics and action trajectories' rests on indirect conditioning via segmentation masks and rewritten captions alone; no explicit motion or optical-flow conditioning is described, and no quantitative verification metrics (e.g., trajectory MSE, flow consistency, or policy rollout divergence) are reported to confirm preservation. This is load-bearing for the downstream generalization improvements.

    Authors: We agree that the conditioning relies on semantic segmentation masks and captions rather than explicit optical flow. The per-frame segmentation masks extracted from simulation explicitly encode spatial layout and change across timesteps to reflect the executed actions, providing implicit but strong motion guidance that the conditional diffusion model is trained to follow. Caption rewriting further ensures semantic consistency while allowing environmental diversity. This design choice prioritizes efficiency and scalability without requiring additional motion estimators. We acknowledge the value of direct quantitative verification; in the revised manuscript we will add optical-flow consistency metrics (e.g., endpoint error between input and generated videos) and policy-rollout divergence comparisons in §3 and the experiments section to directly support the preservation claim. revision: yes

  2. Referee: [§4] §4 (experimental results): The reported gains (8% on RDT-1B, 5.1% on π0) are stated without accompanying details on the number of random seeds, statistical significance tests, full ablation controls, or exhaustive baseline comparisons; this weakens the ability to attribute improvements specifically to the video transfer rather than other factors.

    Authors: The main results were obtained with 3 random seeds per setting, with mean performance reported; ablations isolating the feature-reuse mechanism and coreset sampling appear in the supplementary material. We did not include error bars, p-values, or an exhaustive set of baselines in the primary text. In the revision we will expand §4 to report standard deviations, statistical significance tests (paired t-tests), and additional baselines including raw simulation data, standard domain randomization, and alternative video translation methods. These additions will strengthen attribution of gains to the proposed video transfer pipeline. revision: yes
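The flow-consistency metric the first response proposes is simple to state: mean endpoint error between optical-flow fields of the source and generated videos. How the flows are estimated (e.g., with an off-the-shelf flow model) is left open here, and the array shapes are assumptions.

```python
import numpy as np

def endpoint_error(flow_src: np.ndarray, flow_gen: np.ndarray) -> float:
    """Mean endpoint error between two (H, W, 2) optical-flow fields."""
    # Per-pixel Euclidean distance between flow vectors, averaged over pixels.
    return float(np.linalg.norm(flow_src - flow_gen, axis=-1).mean())
```

A low EPE between the simulated input and its realistic counterpart would directly support the claim that the transfer preserves motion, independently of downstream policy performance.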

Circularity Check

0 steps flagged

No circularity: forward pipeline evaluated on external benchmarks

full rationale

The paper presents an engineering pipeline that extracts segmentation and captions from simulation, rewrites captions, and applies a conditional diffusion model with feature reuse and coreset sampling. All central claims rest on empirical results from independent public benchmarks (Robotwin 2.0, LIBERO, LIBERO-Plus, real robot) rather than any fitted parameter renamed as prediction, self-referential definition, or load-bearing self-citation chain. No equations or uniqueness theorems are invoked that reduce the reported gains to the method's own inputs by construction. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the approach rests on standard assumptions about existing vision models rather than new postulates; no invented entities or heavy free parameters are introduced in the high-level description.

axioms (2)
  • domain assumption Video semantic segmentation and captioning accurately extract structured conditions from simulated videos.
    Invoked as the first step of the pipeline to enable subsequent transfer.
  • domain assumption Conditional video synthesis models can alter visual realism while preserving action trajectories and task semantics.
    Central premise allowing the augmentation to remain useful for VLA training.

pith-pipeline@v0.9.0 · 5550 in / 1548 out tokens · 53041 ms · 2026-05-08T18:16:40.124710+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 25 canonical work pages · 11 internal anchors

  1. [1]

    Abbas, A., Tirumala, K., Simig, D., Ganguli, S., and Morcos, A. S. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540.

  2. [2]

    Cosmos-transfer1: Conditional world generation with adaptive multimodal control, 2025

    Alhaija, H. A., Alvarez, J., Bala, M., Cai, T., Cao, T., Cha, L., Chen, J., Chen, M., Ferroni, F., Fidler, S., et al. Cosmos-transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint arXiv:2503.14492.

  3. [3]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.-W., et al. World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062.

  4. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    doi: 10.48550/arXiv.2410.24164. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.

  5. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.

  6. [7]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088.

  7. [8]

    HS Fang, C Liu, et al

    Dong, Z., Wang, X., Zhu, Z., Wang, Y., Wang, Y., Zhou, Y., Wang, B., Ni, C., Ouyang, R., Qin, W., et al. Emma: Generalizing real-world robot manipulation via generative visual transfer. arXiv preprint arXiv:2509.22407.

  8. [9]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Ebert, F., Yang, Y., Schmeckpeper, K., Bucher, B., Georgakis, G., Daniilidis, K., Finn, C., and Levine, S. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396.

  9. [10]

    From intention to execution: Probing the generalization boundaries of vision-language-action models

    Fang, I., Zhang, J., Tong, S., and Feng, C. From intention to execution: Probing the generalization boundaries of vision-language-action models. arXiv preprint arXiv:2506.09930, 2025a. Fang, Y., Yang, Y., Zhu, X., Zheng, K., Bertasius, G., Szafir, D., and Ding, M. Rebot: Scaling robot learning with real-to-sim-to-real robotic video synthesis. arXiv...

  10. [11]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

  11. [12]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    URL https://arxiv.org/abs/2504.16054, 1(2):3. Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., Equi, M., Esmail, A., Fang, Y., Finn, C., Glossop, C., Godden, T., Goryachev, I., Groom...

  12. [13]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    URL https://arxiv.org/abs/2511.14759. Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., and Liu, Y. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598.

  13. [14]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

  14. [15]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA,

  15. [16]

    Flow Matching for Generative Modeling

    URL https://openreview.net/forum?id=jtrhwfgseW. Li, W., Su, X., You, S., Wang, F., Qian, C., and Xu, C. Diff-nas: Bootstrapping diffusion models by prompting for better architectures. In 2023 IEEE International Conference on Data Mining (ICDM), pp. 1121–1126. IEEE, 2023b. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for...

  16. [17]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Liu, F., Zhang, S., Wang, X., Wei, Y., Qiu, H., Zhao, Y., Zhang, Y., Ye, Q., and Wan, F. Timestep embedding tells: It's time to cache for video diffusion model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7353–7363, 2025a. Liu, L., Wang, X., Zhao, G., Li, K., Qin, W., Zhu, J., Qiu, J., Zhu, Z., Huang, G., and Su, Z. ...

  17. [18]

    D2 pruning: Message passing for balancing diversity and difficulty in data pruning

    Maharana, A., Yadav, P., and Bansal, M. D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931.

  18. [19]

    Action-aware dynamic pruning for efficient vision-language-action manipulation. arXiv preprint arXiv:2509.22093

    URL https://research.nvidia.com/labs/dir/cosmos-embed1. Pei, X., Chen, Y., Xu, S., Wang, Y., Shi, Y., and Xu, C. Action-aware dynamic pruning for efficient vision-language-action manipulation. arXiv preprint arXiv:2509.22093.

  19. [20]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.

  20. [21]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.

  21. [22]

    Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025

    Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Li, J., Zhu, J., Feng, L., et al. Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025a. Team, G., Ye, A., Wang, B., Ni, C., Huang, G., Zhao, G., Li, H., Zhu, J., Li, K., Xu, M., et al. Gigaworld-0: World models as data engine to empower embo...

  22. [23]

    Embodiedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling. arXiv preprint arXiv:2507.05198, 2025

    Wang, B., Meng, X., Wang, X., Zhu, Z., Ye, A., Wang, Y., Yang, Z., Ni, C., Huang, G., and Wang, X. Embodiedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling. arXiv preprint arXiv:2507.05198.

  23. [24]

    Videoclip-xl: Advancing long description understanding for video clip models, 2024

    Wang, J., Wang, C., Huang, K., Huang, J., and Jin, L. Videoclip-xl: Advancing long description understanding for video clip models. arXiv preprint arXiv:2410.00741.

  24. [25]

    Vla-cache: Towards efficient vision-language-action model via adaptive token caching in robotic manipulation. arXiv preprint arXiv:2502.02175, 2025

    Xu, S., Wang, Y., Xia, C., Zhu, D., Huang, T., and Xu, C. Vla-cache: Efficient vision-language-action manipulation via adaptive token caching. arXiv preprint arXiv:2502.02175, 2025a. Xu, S., Wang, Z., Wang, Y., Xia, C., Huang, T., and Xu, C. Affordance field intervention: Enabling vlas to escape memory traps in robotic manipulation. arXiv preprint arX...

  25. [26]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705.

  26. [27]

    arXiv preprint arXiv:2507.02860 (2025)

    Zhou, X., Liang, D., Chen, K., Feng, T., Chen, X., Lin, H., Ding, Y., Tan, F., Zhao, H., and Bai, X. Less is enough: Training-free video diffusion acceleration via runtime-adaptive caching. arXiv preprint arXiv:2507.02860, 2025a. Zhou, X., Xu, Y., Tie, G., Chen, Y., Zhang, G., Chu, D., Zhou, P., and Sun, L. Libero-pro: Towards robust and fair evaluati...