arxiv: 2506.14135 · v5 · pith:OXMVVENDnew · submitted 2025-06-17 · 💻 cs.RO · cs.CV

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

Pith reviewed 2026-05-25 07:45 UTC · model grok-4.3

keywords: Gaussian Action Field · 4D scene representation · robotic manipulation · dynamic world modeling · 3D Gaussian Splatting · action estimation · vision-based control

The pith

A 4D Gaussian Action Field lets robots derive actions from a single motion-aware scene model instead of separate vision or 3D steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces vision-to-action or vision-to-3D-to-action pipelines with a vision-to-4D-to-action approach. It adds learnable motion attributes to 3D Gaussian Splatting so the same field can reconstruct the present scene, forecast future frames, and supply an initial action estimate. A denoising network then refines the action by aligning it with the combined scene and motion output. Experiments report higher visual fidelity and a 7.3 percent average gain in manipulation success. Readers would care because many robot failures trace to poor handling of scene change during contact.

Core claim

GAF extends 3D Gaussian Splatting by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. It provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of init action via Gaussian motion. An action-vision-aligned denoising framework conditioned on a unified representation that combines the init action and the Gaussian perception then yields more precise actions.

What carries the argument

Gaussian Action Field (GAF), the 4D extension of 3D Gaussian Splatting that equips each Gaussian with learnable motion attributes to jointly model time-varying geometry and robot actions.
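As a minimal sketch of what attaching a motion attribute to each Gaussian could look like, the snippet below stores a predicted displacement next to each Gaussian center and advances the center by adding it. The class name `MotionGaussian`, the `on_gripper` flag, and the mean-displacement action estimate are illustrative assumptions, not the paper's implementation; the paper's own estimator may work differently and also has to handle rotation, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MotionGaussian:
    """Hypothetical container: one Gaussian center plus a learned displacement."""
    mu: Vec3          # current center
    delta_mu: Vec3    # predicted displacement over one step (the motion attribute)
    on_gripper: bool  # assumed label marking Gaussians attached to the end-effector


def future_center(g: MotionGaussian) -> Vec3:
    """Future center = current center + predicted displacement."""
    return tuple(m + d for m, d in zip(g.mu, g.delta_mu))


def init_translation(gaussians: List[MotionGaussian]) -> Vec3:
    """Stand-in initial action: mean displacement of gripper Gaussians (translation only)."""
    moving = [g.delta_mu for g in gaussians if g.on_gripper]
    if not moving:
        return (0.0, 0.0, 0.0)
    n = len(moving)
    return tuple(sum(d[i] for d in moving) / n for i in range(3))


# Toy scene with made-up numbers: two gripper Gaussians and one static table point.
scene = [
    MotionGaussian((0.0, 0.0, 0.5), (0.02, 0.0, 0.0), True),
    MotionGaussian((0.0, 0.1, 0.5), (0.04, 0.0, 0.0), True),
    MotionGaussian((0.3, 0.3, 0.0), (0.0, 0.0, 0.0), False),
]
```

The point of the sketch is only that one data structure can answer both a "where will the scene be" query and a coarse "how should the gripper move" query; any refinement step would sit downstream of `init_translation`.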

If this is right

  • Reconstruction, future-frame prediction, and initial action estimation become outputs of one shared representation rather than separate modules.
  • Action accuracy improves when the denoising step receives both the perceived scene and the motion-derived action estimate together.
  • The V-4D-A structure avoids the information loss that occurs when actions are predicted from static 3D reconstructions alone.
  • Quantitative gains appear in both visual metrics (PSNR, SSIM, LPIPS) and downstream task success rates.
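For readers weighing the visual metrics in the last bullet, here is a small sketch of how PSNR is conventionally computed from mean squared error. The function name `psnr` and the flat-list pixel format are assumptions made for illustration; the paper's evaluation code is not shown on this page.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the error, a 10 dB gain on a single image pair corresponds to a tenfold reduction in mean squared error. Averages over many images do not translate this cleanly.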

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If motion attributes prove stable across longer time horizons, the same field could support multi-step planning without re-rendering intermediate states.
  • The approach might transfer to non-rigid objects or deformable materials if the motion attributes can be regularized differently.
  • Replacing the current denoising network with a learned policy that consumes the full 4D field could close the loop from perception to control more tightly.

Load-bearing premise

Adding learnable motion attributes to 3D Gaussians is sufficient to capture the dynamics of manipulation scenes and to support accurate action estimation from the resulting field.

What would settle it

A test sequence in which the learned motion attributes produce future-frame predictions that deviate systematically from observed camera or depth changes during a real manipulation trial.
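One way to make "deviate systematically" concrete is to separate the mean signed residual (bias) from the overall residual magnitude (RMS): a large bias relative to RMS suggests a consistent drift rather than noise. The sketch below assumes predicted and observed positions are already matched point-for-point; the function name `bias_and_rms` is hypothetical, and nothing here is taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def bias_and_rms(
    predicted: Sequence[Sequence[float]],
    observed: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-axis mean signed residual (bias) and RMS residual for matched points."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(predicted)
    dims = len(predicted[0])
    bias = [
        sum(p[i] - o[i] for p, o in zip(predicted, observed)) / n
        for i in range(dims)
    ]
    rms = [
        math.sqrt(sum((p[i] - o[i]) ** 2 for p, o in zip(predicted, observed)) / n)
        for i in range(dims)
    ]
    return bias, rms
```

If the residuals on an axis are +0.5 and -0.5, the bias is 0 and the RMS is 0.5 (noise); if both are +0.5, bias and RMS are both 0.5 (drift).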

Figures

Figures reproduced from arXiv: 2506.14135 by Hongwen Zhang, Jiajun Zhang, Kangchen Lv, Liangjun Xing, Litao Deng, Ruizhi Shao, Xiang Li, Yebin Liu, Ying Chai.

Figure 1: Gaussian Action Field. We present Gaussian Action Field (GAF), a dynamic world model where each 3D Gaussian is extended with motion attributes. This enables current scene rendering, future prediction, and action-aware motion learning, providing initial action hypotheses and serving as actionable guidance for robotic manipulation.
Figure 2: Comparisons between the previous V-A, V-3D-A solutions and the proposed V-4D-A.
Figure 3: Overview of GAF reconstruction. Given sparse multi-view images, a Vision Transformer extracts hybrid scene features, which are decoded by three heads to predict Gaussian positions, motions, and appearance parameters, forming the GAF representation.
Figure 4: Manipulation pipeline. The GAF current and action queries provide current multi-view observations and an initial action estimate (left). These are then used as conditions for a refinement network to generate executable motion (right). The process repeats iteratively until the task completes.
Figure 5: Comparison of current scene reconstruction and future scene prediction from novel views.
Figure 6: Ablation action. The upper image shows a failed experiment without action refinement, while the lower image depicts a successful experiment after action refinement.
Figure 7: Spatial Generalization. Outcome of GAF and baseline trained on 20 demonstrations (purple stars). The heat maps represent Gaussian kernel density estimations for relative likelihood polarity over the workspace, with red and blue colours representing successes and failures, respectively.
Figure 8: Data Efficiency. The success rate of our method and the baseline R&D in three tasks (Lift Lid, Close Microwave, Phone On Base) varies with different demonstrations.
Figure 9: GAF Query Result. Multiview images rendered from current and future Gaussian point clouds, along with the predicted initial actions visualization.
Original abstract

Accurate scene perception is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often struggle with action inaccuracies due to the complexity and dynamic nature of manipulation scenes. In this paper, we adopt a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing 4D modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF provides three interrelated outputs: reconstruction of the current scene, prediction of future frames, and estimation of init action via Gaussian motion. Furthermore, we employ an action-vision-aligned denoising framework, conditioned on a unified representation that combines the init action and the Gaussian perception, both generated by the GAF, to further obtain more precise actions. Extensive experiments demonstrate significant improvements, with GAF achieving +11.5385 dB PSNR, +0.3864 SSIM and -0.5574 LPIPS improvements in reconstruction quality, while boosting the average +7.3% success rate in robotic manipulation tasks over state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Vision-to-4D-to-Action (V-4D-A) framework using Gaussian Action Field (GAF), which extends 3D Gaussian Splatting with learnable motion attributes to model dynamic manipulation scenes in 4D. GAF produces scene reconstruction, future-frame prediction, and initial action estimates from Gaussian motion; these are fed into an action-vision-aligned denoising network to refine actions. Experiments report reconstruction gains of +11.5385 dB PSNR, +0.3864 SSIM, and -0.5574 LPIPS, plus a +7.3% average success-rate improvement over prior methods in robotic tasks.

Significance. If validated, the approach would offer a unified 4D Gaussian representation that jointly handles perception, prediction, and action initialization, potentially improving robustness in dynamic manipulation. The explicit linkage of motion attributes to both reconstruction and action estimation is a clear conceptual step beyond V-3D-A pipelines, though its practical value hinges on demonstrating that the added motion parameters actually capture time-varying robot/object dynamics rather than merely enhancing static geometry.

major comments (2)
  1. [Abstract] Abstract: the central claim that learnable motion attributes produce faithful 4D representations sufficient for action estimation rests on reconstruction metrics alone (+11.5385 dB PSNR etc.). No motion-specific metrics (trajectory error, velocity consistency, or end-effector pose accuracy) or ablations isolating the motion-attribute contribution are reported, so it remains possible that the gains derive from better static 3D modeling rather than the 4D extension.
  2. [Experiments] Experiments (implied by the quantitative claims): the +7.3% success-rate improvement is presented without baselines, task details, or controls that separate the effect of the Gaussian-motion init action from the subsequent denoising step. This leaves the V-4D-A advantage over V-3D-A unproven.
minor comments (1)
  1. [Abstract] The reported metric deltas are given to four or five decimal places without accompanying standard deviations or number of runs, which would aid reproducibility assessment.
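To illustrate the reporting the minor comment asks for, a mean with sample standard deviation and run count can be produced with the Python standard library. The function name `summarize_runs` and the example values are hypothetical and are not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(success_rates: Sequence[float]) -> Tuple[float, float, int]:
    """Return (mean, sample standard deviation, number of runs)."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return (
        statistics.mean(success_rates),
        statistics.stdev(success_rates),  # sample (n - 1) standard deviation
        len(success_rates),
    )


# Made-up success rates from three hypothetical seeds.
mean, std, n = summarize_runs([0.70, 0.74, 0.72])
print(f"{mean:.3f} +/- {std:.3f} over {n} runs")
```

Reporting in this form makes it possible to judge whether a gain of a few percentage points exceeds run-to-run variation.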

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments. We address each major point below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that learnable motion attributes produce faithful 4D representations sufficient for action estimation rests on reconstruction metrics alone (+11.5385 dB PSNR etc.). No motion-specific metrics (trajectory error, velocity consistency, or end-effector pose accuracy) or ablations isolating the motion-attribute contribution are reported, so it remains possible that the gains derive from better static 3D modeling rather than the 4D extension.

    Authors: We acknowledge that the abstract and main results emphasize reconstruction quality and task success. The +7.3% success-rate gain is obtained when the action-vision-aligned denoiser is conditioned on the motion-derived initial action from GAF; this provides indirect evidence that the motion attributes contribute beyond static geometry. Nevertheless, we agree that explicit motion metrics and isolating ablations would strengthen the 4D claim. In the revised manuscript we will add (i) an ablation that disables the learnable motion attributes while keeping all other components fixed and (ii) quantitative motion-consistency metrics (e.g., end-effector trajectory error on the manipulation sequences) computed from the same data. revision: yes

  2. Referee: [Experiments] Experiments (implied by the quantitative claims): the +7.3% success-rate improvement is presented without baselines, task details, or controls that separate the effect of the Gaussian-motion init action from the subsequent denoising step. This leaves the V-4D-A advantage over V-3D-A unproven.

    Authors: The manuscript already reports comparisons against multiple published V-A and V-3D-A baselines on the same robotic manipulation benchmarks, with task definitions and evaluation protocols given in Section 4. To directly isolate the contribution of the GAF motion-based initialization, we will add a controlled ablation in the revision that runs the denoising network with a neutral (zero-motion) initial action while keeping the perception representation identical. This will quantify the incremental benefit of the 4D motion attributes over a pure V-3D-A pipeline. revision: yes
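The end-effector trajectory error mentioned in the first response could be computed along the following lines. This is a sketch under stated assumptions: trajectories are time-aligned lists of positions, orientation is ignored, and the name `trajectory_errors` is hypothetical rather than taken from the manuscript.

```python
import math
from typing import Sequence, Tuple

Point = Sequence[float]


def trajectory_errors(predicted: Sequence[Point], reference: Sequence[Point]) -> Tuple[float, float]:
    """Return (mean per-step Euclidean error, final-step Euclidean error)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and the same length")
    # Euclidean distance between matched waypoints at each timestep.
    dists = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(dists) / len(dists), dists[-1]
```

Running the same function on actions produced with and without the motion-derived initialization would give the kind of isolating comparison both responses promise.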

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper introduces GAF as an extension of 3D Gaussian Splatting with learnable motion attributes to enable 4D scene modeling, reconstruction, future prediction, and initial action estimation, followed by an action-vision denoising step. No equations or steps in the provided abstract reduce a claimed prediction or result to a fitted input by construction, nor do they rely on self-citations for uniqueness or load-bearing premises. The reported gains (+11.5 dB PSNR, +7.3% success rate) are presented as empirical outcomes rather than tautological derivations. The derivation chain remains self-contained against external benchmarks with no self-definitional, fitted-prediction, or self-citation circularity patterns exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only; no specific free parameters or axioms detailed.

invented entities (1)
  • Gaussian Action Field (GAF) · no independent evidence
    purpose: To model 4D dynamic scenes and actions by extending 3DGS with motion attributes
    Introduced as the core new representation in the paper.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DSSP: Diffusion State Space Policy with Full-History Encoding

    cs.RO 2026-05 conditional novelty 7.0

    DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size...

  2. Learning Visual Feature-Based World Models via Residual Latent Action

    cs.CV 2026-05 unverdicted novelty 7.0

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  3. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  4. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  5. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

  6. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 2.0

    The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 4 Pith papers · 11 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkow...

  2. [2]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. ArXiv, abs/2310.10639, 2023

  3. [3]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnik...

  4. [4]

    Rt-1: Robotics transformer for real-world control at scale, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, ...

  5. [5]

    Scanrefer: 3d object localization in rgb-d scans using natural language

    Dave Zhenyu Chen, Angel X. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. ArXiv, abs/1912.08830, 2019

  6. [6]

    Polarnet: 3d point clouds for language-guided robotic manipulation

    Shizhe Chen, Ricardo Garcia Pinel, Cordelia Schmid, and Ivan Laptev. Polarnet: 3d point clouds for language-guided robotic manipulation. ArXiv, abs/2309.15596, 2023

  7. [7]

    G3flow: Generative 3d semantic flow for pose-aware and generalizable object manipulation

    Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Min Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, and Ping Luo. G3flow: Generative 3d semantic flow for pose-aware and generalizable object manipulation. ArXiv, abs/2411.18369, 2024

  8. [8]

    Igor: Image-goal representations are the atomic control units for foundation models in embodied ai

    Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. Igor: Image-goal representations are the atomic control units for foundation models in embodied ai. arXiv preprint arXiv:2411.00785, 2024

  9. [9]

    Uniter: Universal image-text representation learning, 2020

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale, 2021

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

  12. [12]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in neural information processing systems, 36:9156–9172, 2023

  13. [13]

    Riemann: Near real-time se (3)-equivariant robot manipulation without point cloud segmentation

    Chongkai Gao, Zhengrong Xue, Shuying Deng, Tianhai Liang, Siqi Yang, Lin Shao, and Huazhe Xu. Riemann: Near real-time se (3)-equivariant robot manipulation without point cloud segmentation. arXiv preprint arXiv:2403.19460, 2024

  14. [14]

    Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model

    Zeyu Gao, Yao Mu, Chen Chen, Jingliang Duan, Ping Luo, Yanfeng Lu, and Shengbo Eben Li. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. IEEE Transactions on Intelligent Transportation Systems, 2024

  15. [15]

    Act3d: Infinite resolution action detection transformer for robotic manipulation

    Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. arXiv preprint arXiv:2306.17817, 1(3), 2023

  16. [16]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in neural information processing systems, 31, 2018

  17. [17]

    Deep hierarchical planning from pixels

    Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems, 35:26091–26104, 2022

  18. [18]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019

  19. [19]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  20. [21]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  21. [22]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023

  22. [23]

    Model-based imitation learning for urban driving

    Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, and Jamie Shotton. Model-based imitation learning for urban driving. Advances in Neural Information Processing Systems, 35:20703–20716, 2022

  23. [24]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver: General perception with iterative attention. ArXiv, abs/2103.03206, 2021

  24. [25]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  25. [26]

    Stephen James, Kentaro Wada, Tristan Laidlow, and Andrew J. Davison. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13729–13738, 2021

  26. [27]

    Graspsplats: Efficient manipulation with 3d feature splatting

    Mazeyu Ji, Ri-Zhao Qiu, Xueyan Zou, and Xiaolong Wang. Graspsplats: Efficient manipulation with 3d feature splatting. arXiv preprint arXiv:2409.02084, 2024

  27. [28]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  28. [29]

    Rethinking bimanual robotic manipulation: Learning with decoupled interaction framework

    Jian-Jian Jiang, Xiao-Ming Wu, Yi-Xiang He, Ling an Zeng, Yi-Lin Wei, Dandan Zhang, and Wei-Shi Zheng. Rethinking bimanual robotic manipulation: Learning with decoupled interaction framework. ArXiv, abs/2503.09186, 2025

  29. [30]

    3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. ArXiv, abs/2402.10885, 2024

  30. [31]

    3d gaussian splatting for real-time radiance field rendering, 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering, 2023

  31. [32]

    Segment anything, 2023

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023

  32. [33]

    Accurate vision-based manipulation through contact reasoning

    Alina Kloss, Maria Bauza, Jiajun Wu, Joshua B Tenenbaum, Alberto Rodriguez, and Jeannette Bohg. Accurate vision-based manipulation through contact reasoning. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 6738–6744. IEEE, 2020

  33. [34]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024

  34. [35]

    Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training

    Gen Li, Nan Duan, Yuejian Fang, Ming Gong, and Daxin Jiang. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11336–11344, 2020

  35. [36]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  36. [37]

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019

  37. [39]

    Dreamitate: Real-world visuomotor policy learning via video generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. ArXiv, abs/2406.16862, 2024

  38. [40]

    Learning to model the world with language

    Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, and Anca Dragan. Learning to model the world with language. arXiv preprint arXiv:2308.01399, 2023

  39. [41]

    Voxact-b: Voxel-based acting and stabilizing policy for bimanual manipulation

    I-Chun Arthur Liu, Sicheng He, Daniel Seita, and Gaurav Sukhatme. Voxact-b: Voxel-based acting and stabilizing policy for bimanual manipulation. In Conference on Robot Learning, 2024

  40. [42]

    Geneoh diffusion: Towards generalizable hand-object interaction denoising via denoising diffusion, 2024

    Xueyi Liu and Li Yi. Geneoh diffusion: Towards generalizable hand-object interaction denoising via denoising diffusion, 2024

  41. [43]

    Thinkbot: Embodied instruction following with thought chain reasoning

    Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, and Yansong Tang. Thinkbot: Embodied instruction following with thought chain reasoning. arXiv preprint arXiv:2312.07062, 2023

  42. [44]

    Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation

    Guanxing Lu, Shiyi Zhang, Ziwei Wang, Changliu Liu, Jiwen Lu, and Yansong Tang. Manigaussian: Dynamic gaussian splatting for multi-task robotic manipulation. arXiv preprint arXiv:2403.08321, 2024

  43. [45]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019

  44. [46]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. arXiv preprint arXiv:2308.10901, 2023

  45. [47]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

  46. [48]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. ArXiv preprint, 2021

  47. [49]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2021

  48. [50]

    Mastering atari, go, chess and shogi by planning with a learned model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020

  49. [51]

    Generalized-icp

    Aleksandr V. Segal, Dirk Hähnel, and Sebastian Thrun. Generalized-icp. In Robotics: Science and Systems, 2009

  50. [52]

    Masked world models for visual control

    Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning, pages 1332–1344. PMLR, 2023

  51. [53]

    Generative image as action models

    Mohit Shridhar, Yat Long Lo, and Stephen James. Generative image as action models. arXiv preprint arXiv:2407.07875, 2024

  52. [54]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. ArXiv, abs/2209.05451, 2022

  53. [55]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  54. [56]

    Instant policy: In-context imitation learning via graph diffusion

    Vitalis Vosylius and Edward Johns. Instant policy: In-context imitation learning via graph diffusion. ArXiv, abs/2411.12633, 2024

  55. [57]

    Render and diffuse: Aligning image and action spaces for diffusion-based behaviour cloning

    Vitalis Vosylius, Younggyo Seo, Jafar Uruç, and Stephen James. Render and diffuse: Aligning image and action spaces for diffusion-based behaviour cloning. arXiv preprint arXiv:2405.18196, 2024

  56. [58]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024

  57. [59]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024

  58. [60]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Li Xueting, Marc Pollefeys, Ming-Hsuan Yang, and Peng Songyou. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024

  59. [61]

    Gnfactor: Multi-task real robot learning with generalizable neural feature fields

    Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pages 284–301. PMLR, 2023

  60. [62]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations, 2024

  61. [63]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.

  62. [64]

    A universal semantic-geometric representation for robotic manipulation

    Tong Zhang, Yingdong Hu, Hanchen Cui, Hang Zhao, and Yang Gao. A universal semantic-geometric representation for robotic manipulation. arXiv preprint arXiv:2306.10474, 2023

  63. [65]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  64. [66]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. ArXiv, abs/2403.09631, 2024.

    A Additional Experiments

    In this section, we designed additional experiments to demonstrate the performance of GAF. We mainly evaluate its ability in spatial generaliz...

  65. [67]

    In the action refinement process, we use 50 diffusion iterations based on DDIM [55]

    Training on a single NVIDIA RTX A800 GPU takes approximately 24 hours. In the action refinement process, we use 50 diffusion iterations based on DDIM [55]. To obtain more precise local observations, we incorporate the GT wrist camera data as an auxiliary resource in this section. We use the last 2 observations as input and...