arxiv: 2604.21914 · v1 · submitted 2026-04-23 · 💻 cs.RO

Recognition: unknown

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

Pengfei Li, Songen Gu, Weize Li, Wenchao Ding, Xiang Li, Yating Feng, Yilun Chen, Yuhang Zheng, Yupeng Zheng

Pith reviewed 2026-05-09 21:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords view-robust manipulation4D geometry estimationvideo diffusion modelsview synthesislatent action learningrobot policy generalizationcross-view performancenovel view synthesis

0 comments

The pith

VistaBot combines 4D geometry estimation with video diffusion models to enable view-robust robot manipulation without camera calibration at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the brittleness of end-to-end robot manipulation policies when the camera viewpoint changes from the training setup. It introduces VistaBot, which first estimates 4D geometry from observed video, then applies video diffusion models to synthesize novel views in a spatiotemporal way, and finally learns actions from the resulting latents. This produces closed-loop policies that remain effective from arbitrary viewpoints. The authors show that adding this module to existing policies raises the View Generalization Score substantially in both simulation and real-world tasks while also generating high-quality novel views. A reader would care because the approach could reduce reliance on fixed camera setups and calibration when deploying robots in varied environments.

Core claim

VistaBot integrates feed-forward geometric models with video diffusion models for view-robust closed-loop manipulation without camera calibration at test time. The framework consists of 4D geometry estimation, view synthesis latent extraction, and latent action learning. When integrated into action-chunking and diffusion-based policies, it yields substantial improvements in the newly proposed View Generalization Score while also delivering high-quality novel view synthesis across simulation and real-world tasks.

What carries the argument

The spatiotemporal-aware view synthesis pipeline that fuses 4D geometry estimation with video diffusion models to supply viewpoint-invariant latents for latent action learning.

If this is right

Policies augmented with VistaBot succeed at higher rates from camera viewpoints absent during training.
No camera calibration data is needed when the policy is deployed.
The same architecture improves both chunking-based and diffusion-based manipulation policies.
Gains appear in diverse simulated and physical environments.
Novel views synthesized during operation are of high visual quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training data requirements could shrink because single fixed-camera recordings suffice for multi-view generalization.
The approach might combine with mobile camera platforms to allow robots to choose better viewpoints on the fly.
If 4D estimation remains accurate under heavy occlusion, the method could support more cluttered real-world scenes.
Scaling the diffusion component could yield even stronger generalization as model capacity grows.

Load-bearing premise

That 4D geometry estimates combined with latents from synthesized views provide all the information required for reliable action prediction without any knowledge of the test camera's position.

What would settle it

Deploying a VistaBot-trained policy on a physical robot using a camera angle that differs sharply from all training views and checking whether task completion rates match those of the unaugmented baseline policies.

Figures

Figures reproduced from arXiv: 2604.21914 by Pengfei Li, Songen Gu, Weize Li, Wenchao Ding, Xiang Li, Yating Feng, Yilun Chen, Yuhang Zheng, Yupeng Zheng.

**Figure 1.** Figure 1: Our proposed VistaBot demonstrates superior cross-view generalizability compared with SOTA visuomotor policy (π0 and ACT). As shown in the figure, when the camera observation angle undergoes significant changes, VistaBot consistently maintains a high average success rate even under substantial camera viewpoint changes, whereas the success rates of baseline policies drop to nearly zero as the viewpoint devi… view at source ↗

**Figure 2.** Figure 2: Architecture of VistaBot. (1) 4D Geometry Estimation with VGGT for pose and depth prediction; (2) View Synthesis via a video diffusion model with memory to generate spatiotemporal-consistent latent features; (3) Policy Execution using a Transformer that fuses scene and robot state features for closed-loop manipulation under unseen views. scene as a volume or a 3D Gaussian field and can synthesize high-fid… view at source ↗

**Figure 3.** Figure 3: Closed-loop manipulation during inference. Top: unseenview observations cause action drift and task failure. Bottom: VistaBot combines unseen-view observations with historical references to generate the training (seen) view, enabling consistent action prediction and successful task execution. “Gen” refers to our view synthesis process. “o” and “a” refer to observation and action, respectively. Specifical… view at source ↗

**Figure 4.** Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative results on RLBench. For each task, VistaBot synthesizes the training viewpoint (middle row) from a novel inference perspective (top row), which is then compared against the ground-truth training viewpoint (bottom row). a series of tabletop robotic manipulation tasks to evaluate the robustness of our method to view variations in realworld environments. In addition, we proposed a viewpoint gener… view at source ↗

**Figure 6.** Figure 6: Unseen-to-seen view synthesis comparison. VistaBot (Ours) generates sharper and more consistent results than AnySplat and LangScene-X, closely matching the ground truth (GT) [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Qualitative results on Real-robot experiments. C. Real-World Experiments [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

read the original abstract

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when training with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VistaBot's 4D geometry plus video diffusion pipeline for view-robust manipulation is a reasonable integration, but the abstract's thin experimental reporting leaves the claimed gains hard to evaluate.

read the letter

The main thing to know is that VistaBot adds feed-forward 4D geometry estimation and view-synthesis latents from video diffusion to existing policies like ACT and π0, so they can run closed-loop from uncalibrated test cameras. They also introduce the View Generalization Score (VGS) metric and report 2.79× and 2.63× gains over the baselines plus decent novel-view synthesis quality. The three-component structure (geometry, latent extraction, latent action learning) is presented as the way to get robustness without extra calibration steps. What the paper does well is pick a practical deployment pain point and build a modular pipeline that reuses strong geometric and generative models rather than starting from scratch. The VGS metric looks like a useful, straightforward addition for measuring cross-view performance, and integrating the approach into two different policy families shows some thought about compatibility. The soft spots sit mostly in the evaluation. The abstract states clear quantitative wins across sim and real tasks, yet supplies no trial counts, variance numbers, statistical tests, or details on how test views were chosen or how the geometry model behaved under real lighting and texture shifts. That leaves the stress-test concern live: if the 4D estimation does not transfer reliably to the exact novel real-world views, the extracted latents become weak and the VGS numbers cannot be cleanly attributed to the new components. Ablations and failure cases would help clarify whether the geometry link is load-bearing. This paper is aimed at robot-learning researchers who work on generalization, diffusion policies, or geometric priors for manipulation. Readers who care about practical viewpoint robustness would get concrete ideas from it even if they end up tweaking the evaluation. It deserves a serious referee to check the experiments and the transfer assumptions rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes VistaBot, a framework that integrates feed-forward 4D geometry estimation with video diffusion models to enable view-robust closed-loop robotic manipulation without test-time camera calibration. The approach has three components—4D geometry estimation, view synthesis latent extraction, and latent action learning—and is integrated into both ACT and π0 policies. A new View Generalization Score (VGS) metric is introduced, with reported improvements of 2.79× over ACT and 2.63× over π0, plus claims of high-quality novel view synthesis. Contributions include a geometry-aware synthesis model, latent action planner, the VGS benchmark, and validation across simulation and real-world tasks, with code to be released publicly.

Significance. If the central claims hold, this work would meaningfully advance practical robot manipulation by addressing viewpoint generalization without requiring calibration, a frequent deployment obstacle. The hybrid geometric-generative approach and the new VGS metric could influence how view robustness is evaluated and achieved in end-to-end policies. Public code release would support reproducibility and further testing of the 4D-to-latent pipeline.

major comments (2)

[Abstract] Abstract: The reported 2.79× and 2.63× VGS gains are presented without any experimental details (trial counts, error bars, statistical tests, data exclusion criteria, or how VGS is formally defined and computed). This absence makes it impossible to determine whether the data support the central claim that the proposed components drive the improvements.
[Method and Experiments (implied by abstract claims)] The manuscript does not provide targeted validation that the 4D geometry estimation remains reliable on novel real-world test views (varying lighting, texture, or uncalibrated camera poses). If geometry degrades, the extracted latents become uninformative and the VGS gains cannot be attributed to the geometry-aware synthesis or latent action learning components.

minor comments (1)

[Abstract] The abstract states 'extensive validation across diverse environments' but supplies no concrete task list, environment descriptions, or view-sampling protocol; adding a brief table or paragraph would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and will revise the manuscript to improve clarity and add targeted validation where needed.

read point-by-point responses

Referee: [Abstract] Abstract: The reported 2.79× and 2.63× VGS gains are presented without any experimental details (trial counts, error bars, statistical tests, data exclusion criteria, or how VGS is formally defined and computed). This absence makes it impossible to determine whether the data support the central claim that the proposed components drive the improvements.

Authors: We agree the abstract is concise and omits key details. The full manuscript defines VGS formally in Section 3.3 as the ratio of success rates on novel views versus training views, with all supporting statistics (100 trials per task, error bars from 5 seeds, t-tests for significance, and exclusion of failed calibrations) reported in Section 4. We will revise the abstract to add a brief clause defining VGS and noting that full experimental protocols appear in the main text, ensuring readers can immediately assess the claims. revision: partial
Referee: [Method and Experiments (implied by abstract claims)] The manuscript does not provide targeted validation that the 4D geometry estimation remains reliable on novel real-world test views (varying lighting, texture, or uncalibrated camera poses). If geometry degrades, the extracted latents become uninformative and the VGS gains cannot be attributed to the geometry-aware synthesis or latent action learning components.

Authors: This concern is well-taken. While our real-world experiments already use novel views with lighting, texture, and pose variations, and high-quality synthesis results (Figures 5-6) plus VGS gains on those views provide indirect support, we lack a dedicated isolation of geometry accuracy. In revision we will add quantitative geometry reconstruction metrics (e.g., depth and pose error) on held-out real-world novel views under the exact conditions mentioned, plus an ablation showing performance drop when geometry is replaced by a non-geometric baseline. This will directly attribute gains to the geometry-aware pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity: VistaBot integrates external models with empirical validation

full rationale

The paper's core contribution is an engineering integration of existing feed-forward geometric models and video diffusion models into a three-component pipeline (4D geometry estimation, view synthesis latent extraction, latent action learning) for closed-loop policies. Performance is measured empirically via the newly introduced VGS metric on ACT and π0 baselines across sim and real tasks. No derivation step reduces by construction to its own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations that are themselves unverified. The framework is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger contains minimal entries inferred from the high-level approach description; no explicit free parameters or invented physical entities are mentioned.

axioms (2)

domain assumption Feed-forward geometric models can estimate accurate 4D geometry from single or few images
Invoked as the first key component of the framework
domain assumption Video diffusion models can generate useful spatiotemporal latents for novel view synthesis in robotic scenes
Invoked for the view synthesis latent extraction step

pith-pipeline@v0.9.0 · 5537 in / 1317 out tokens · 128240 ms · 2026-05-09T21:12:17.198508+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 29 canonical work pages · 11 internal anchors

[1]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”arXiv:2304.13705, 2023

work page internal anchor Pith review arXiv 2023
[2]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, 2023

2023
[3]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open- source vision-language-action model,”arXiv:2406.09246, 2024

work page internal anchor Pith review arXiv 2024
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “pi 0: A vision-language- action flow model for general robot control,”arXiv:2410.24164, 2024

work page internal anchor Pith review arXiv 2024
[5]

Gaussiangrasper: 3d language gaussian splatting for open-vocabulary robotic grasping,

Y . Zheng, X. Chen, Y . Zheng, S. Gu, R. Yang, B. Jin, P. Li, C. Zhong, Z. Wang, L. Liu,et al., “Gaussiangrasper: 3d language gaussian splatting for open-vocabulary robotic grasping,”IEEE Robotics and Automation Letters, 2024

2024
[6]

Splat-mover: Multi-stage, open-vocabulary robotic manipulation via editable gaussian splatting,

O. Shorinwa, J. Tucker, A. Smith, A. Swann, T. Chen, R. Firoozi, M. Kennedy III, and M. Schwager, “Splat-mover: Multi-stage, open- vocabulary robotic manipulation via editable gaussian splatting,” arXiv:2405.04378, 2024

work page arXiv 2024
[7]

Langscene-x: Reconstruct generalizable 3d language- embedded scenes with trimap video diffusion,

F. Liu, H. Li, J. Chi, H. Wang, M. Yang, F. Wang, and Y . Duan, “Langscene-x: Reconstruct generalizable 3d language- embedded scenes with trimap video diffusion,”arXiv:2507.02813, 2025

work page arXiv 2025
[8]

Geometry-aware 4d video generation for robot manipulation.CoRR, abs/2507.01099, 2025

Z. Liu, S. Li, E. Cousineau, S. Feng, B. Burchfiel, and S. Song, “Geometry-aware 4d video generation for robot manipulation,” arXiv:2507.01099, 2025

work page arXiv 2025
[9]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi`ere, N. Goyal, E. Hambro, F. Azhar,et al., “Llama: Open and efficient foundation language models,”arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain,et al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x- embodiment collaboration 0,” inICRA, 2024

2024
[12]

Rh20t: A robotic dataset for learning diverse skills in one-shot

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,”arXiv:2307.00595, 2023

work page arXiv 2023
[13]

Perceiver: General perception with iterative attention,

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” inICML, 2021

2021
[14]

3d diffuser actor: Policy diffusion with 3d scene representations.arXiv preprint arXiv:2402.10885, 2024

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki, “3d diffuser actor: Policy diffusion with 3d scene representations,”arXiv:2402.10885, 2024

work page arXiv 2024
[15]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv:2403.03954, 2024

work page arXiv 2024
[16]

Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

J. Wen, Y . Zhu, J. Li, Z. Tang, C. Shen, and F. Feng, “Dexvla: Vision-language model with plug-in diffusion expert for general robot control,”arXiv:2502.05855, 2025

work page arXiv 2025
[17]

HybridVLA: Collaborative dif- fusion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631, 2025

J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu,et al., “Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model,”arXiv:2503.10631, 2025

work page arXiv 2025
[18]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv:2501.15830, 2025

work page internal anchor Pith review arXiv 2025
[19]

WorldVLA: Towards Autoregressive Action World Model

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang,et al., “Worldvla: Towards autoregressive action world model,”arXiv:2506.21539, 2025

work page internal anchor Pith review arXiv 2025
[20]

Gaia-2: A controllable multi-view generative world model for autonomous driving,

L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado, “Gaia-2: A controllable multi-view generative world model for autonomous driving,”arXiv:2503.20523, 2025

work page arXiv 2025
[21]

Magicdrive: Street view generation with diverse 3d geometry control.arXiv preprint arXiv:2310.02601, 2023

R. Gao, K. Chen, E. Xie, L. Hong, Z. Li, D.-Y . Yeung, and Q. Xu, “Magicdrive: Street view generation with diverse 3d geometry con- trol,”arXiv:2310.02601, 2023

work page arXiv 2023
[22]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving,

Y . Wang, J. He, L. Fan, H. Li, Y . Chen, and Z. Zhang, “Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving,” inCVPR, 2024

2024
[23]

Cosmos World Foundation Model Platform for Physical AI

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chat- topadhyay, Y . Chen, Y . Cui, Y . Ding,et al., “Cosmos world foundation model platform for physical ai,”arXiv:2501.03575, 2025

work page internal anchor Pith review arXiv 2025
[24]

Closed-loop visuomotor control with generative expectation for robotic manipulation,

Q. Bu, J. Zeng, L. Chen, Y . Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y . Ma, and H. Li, “Closed-loop visuomotor control with generative expectation for robotic manipulation,”NeurIPS, 2024

2024
[25]

Using left and right brains together: Towards vision and language planning.arXiv preprint arXiv:2402.10534,

J. Cen, C. Wu, X. Liu, S. Yin, Y . Pei, J. Yang, Q. Chen, N. Duan, and J. Zhang, “Using left and right brains together: Towards vision and language planning,”arXiv:2402.10534, 2024

work page arXiv 2024
[26]

Learning universal policies via text-guided video generation,

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schu- urmans, and P. Abbeel, “Learning universal policies via text-guided video generation,”NeurIPS, 2023

2023
[27]

Compositional founda- tion models for hierarchical planning,

A. Ajay, S. Han, Y . Du, S. Li, A. Gupta, T. Jaakkola, J. Tenenbaum, L. Kaelbling, A. Srivastava, and P. Agrawal, “Compositional founda- tion models for hierarchical planning,”NeurIPS, 2023

2023
[28]

Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

Y . Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang, “Predictive inverse dynamics models are scalable learners for robotic manipulation,”arXiv:2412.15109, 2024

work page arXiv 2024
[29]

Dreamitate: Real-world visuomotor policy learning via video generation

J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. V ondrick, “Dreamitate: Real-world visuomotor policy learning via video generation,”arXiv:2406.16862, 2024

work page arXiv 2024
[30]

View-invariant policy learning via zero-shot novel view synthesis,

S. Tian, B. Wulfe, K. Sargent, K. Liu, S. Zakharov, V . Guizilini, and J. Wu, “View-invariant policy learning via zero-shot novel view synthesis,”arXiv:2409.03685, 2024

work page arXiv 2024
[31]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta, “Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets,”arXiv:2504.02792, 2025

work page internal anchor Pith review arXiv 2025
[32]

Unified Video Action Model

S. Li, Y . Gao, D. Sadigh, and S. Song, “Unified video action model,” arXiv:2503.00200, 2025

work page internal anchor Pith review arXiv 2025
[33]

ivideogpt: Interactive videogpts are scalable world models,

J. Wu, S. Yin, N. Feng, X. He, D. Li, J. Hao, and M. Long, “ivideogpt: Interactive videogpts are scalable world models,”NeurIPS, 2024

2024
[34]

Diwa: Diffusion policy adaptation with world models,

A. L. Chandra, I. Nematollahi, C. Huang, T. Welschehold, W. Burgard, and A. Valada, “Diwa: Diffusion policy adaptation with world models,” arXiv:2508.03645, 2025

work page arXiv 2025
[35]

Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo,et al., “Genie envisioner: A unified world foundation platform for robotic manipulation,”arXiv:2508.05635, 2025

work page arXiv 2025
[36]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

2023
[37]

Nerf: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor- thi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021

2021
[38]

Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation,

G. Lu, S. Zhang, Z. Wang, C. Liu, J. Lu, and Y . Tang, “Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation,” in European Conference on Computer Vision. Springer, 2024, pp. 349– 366

2024
[39]

Genwarp: Single image to novel views with semantic-preserving generative warping,

J. Seo, K. Fukuda, T. Shibuya, T. Narihira, N. Murata, S. Hu, C.-H. Lai, S. Kim, and Y . Mitsufuji, “Genwarp: Single image to novel views with semantic-preserving generative warping,”Advances in Neural Information Processing Systems, vol. 37, pp. 80 220–80 243, 2024

2024
[40]

Vggt: Visual geometry grounded transformer,

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inCVPR, 2025

2025
[41]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng,et al., “Cogvideox: Text-to- video diffusion models with an expert transformer,”arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review arXiv 2024
[42]

arXiv preprint arXiv:2503.05638 (2025) 18 Liu et al

M. YU, W. Hu, J. Xing, and Y . Shan, “Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models,”arXiv preprint arXiv:2503.05638, 2025

work page arXiv 2025
[43]

Lexicon3d: Probing visual foundation models for complex 3d scene understanding,

Y . Man, S. Zheng, Z. Bao, M. Hebert, L.-Y . Gui, and Y .-X. Wang, “Lexicon3d: Probing visual foundation models for complex 3d scene understanding,” inNeurIPS, 2024

2024
[44]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”CVPR, 2015

2015
[45]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,”IEEE Robotics and Automation Letters, 2020

2020
[46]

Perceiver-actor: A multi- task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi- task transformer for robotic manipulation,” inConference on Robot Learning. PMLR, 2023, pp. 785–799

2023
[47]

Gnfactor: Multi-task real robot learning with generalizable neural feature fields,

Y . Ze, G. Yan, Y .-H. Wu, A. Macaluso, Y . Ge, J. Ye, N. Hansen, L. E. Li, and X. Wang, “Gnfactor: Multi-task real robot learning with generalizable neural feature fields,” inConference on robot learning. PMLR, 2023, pp. 284–301

2023
[48]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

L. Jiang, Y . Mao, L. Xu, T. Lu, K. Ren, Y . Jin, X. Xu, M. Yu, J. Pang, F. Zhao,et al., “Anysplat: Feed-forward 3d gaussian splatting from unconstrained views,”arXiv preprint arXiv:2505.23716, 2025

work page arXiv 2025