pith. machine review for the scientific record.

arxiv: 2604.05544 · v1 · submitted 2026-04-07 · 💻 cs.RO · cs.CV

Recognition: no theorem link

Referring-Aware Visuomotor Policy Learning for Closed-Loop Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:53 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords visuomotor policy learning · robotic manipulation · diffusion models · closed-loop control · referring points · trajectory steering · perturbation training

The pith

ReV lets robots replan trajectories in real time by steering diffusion policies with sparse referring points trained on perturbed demos alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets a basic failure mode of visuomotor policies: because they are trained only on clean expert demonstrations, they break down under unexpected execution errors or when a trajectory must be re-routed mid-task. ReV adds the ability to accept sparse referring points from a human or a planner and to adjust actions on the fly in a closed loop. It does this with coupled diffusion heads: a global head first produces sparse action anchors and locates the referring point in time, then a local head interpolates between anchors for precise movements, and the whole process repeats at every execution step. Training uses only targeted perturbations of the original demonstrations, with no new data or fine-tuning required. If the approach holds, robots could recover from dynamic changes while keeping their original task performance.

Core claim

ReV is a closed-loop visuomotor policy that preserves standard execution patterns while integrating sparse referring points through a trajectory-steering strategy. The global diffusion head generates a sequence of globally consistent yet temporally sparse action anchors and identifies the temporal position of the referring point, after which the local diffusion head adaptively interpolates adjacent anchors. This process repeats at every execution step to enable real-time replanning, and the model is trained solely by applying targeted perturbations to expert demonstrations, yielding higher success rates on challenging simulated and real-world tasks.

What carries the argument

Coupled diffusion heads, where the global head creates temporally sparse action anchors and locates referring points temporally while the local head performs adaptive interpolation, combined with a trajectory-steering strategy that incorporates sparse referring at each step.
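The mechanism the review describes can be sketched as a hypothetical closed-loop step. Everything below (`global_head`, `local_head`, `closed_loop_step`) is illustrative, not the paper's API: the diffusion heads are replaced with toy stand-ins, and the local head uses plain linear interpolation as a placeholder for learned interpolation.

```python
# Hypothetical sketch of ReV's closed-loop step as described in the review:
# a global head proposes sparse action anchors and a temporal position for
# the referring point; a local head interpolates between adjacent anchors.
import numpy as np

def global_head(obs, history, referring_point, n_anchors=8, dim=3):
    """Stand-in for the global diffusion head: returns temporally sparse
    action anchors and the index k where the referring point is placed."""
    rng = np.random.default_rng(0)
    anchors = np.cumsum(rng.normal(scale=0.05, size=(n_anchors, dim)), axis=0)
    k = n_anchors // 2                 # predicted temporal position
    anchors[k] = referring_point       # steer the anchor at position k
    return anchors, k

def local_head(anchor_a, anchor_b, n_steps=5):
    """Stand-in for the local diffusion head: dense actions between two
    adjacent anchors (linear interpolation as a placeholder)."""
    ts = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1 - ts) * anchor_a + ts * anchor_b

def closed_loop_step(obs, history, referring_point):
    anchors, k = global_head(obs, history, referring_point)
    # Only the current segment is densified; the whole loop re-runs at the
    # next step, which is what makes the policy closed-loop.
    segment = local_head(anchors[0], anchors[1])
    return segment, anchors, k

actions, anchors, k = closed_loop_step(obs=None, history=None,
                                       referring_point=np.array([0.2, 0.1, 0.3]))
print(actions.shape, anchors.shape, k)   # (5, 3) (8, 3) 4
```

The split matters: anchors carry global consistency and the referring constraint, while the interpolation stage carries precision.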

If this is right

  • Real-time trajectory replanning becomes possible at every execution step when the scene changes dynamically.
  • Standard task execution patterns remain intact without added degradation from the referring capability.
  • Higher success rates are achieved on challenging tasks in simulation and the real world using only perturbations to existing expert data.
  • Sparse referring from humans or high-level planners can be incorporated without requiring additional data collection or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the amount of data needed to train robust policies for robots operating in unstructured settings.
  • It could pair with high-level planners that output referring points based on scene understanding to enable more adaptive behaviors.
  • Extensions might explore using continuous streams of referring points or language-derived guidance for longer tasks.

Load-bearing premise

That applying targeted perturbations to expert demonstrations alone is enough for the model to handle unforeseen out-of-distribution errors and dynamic re-routing without degrading standard performance.
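The premise can be made concrete with a toy construction: take an expert trajectory, inject an offset at a random timestep, and treat the remainder of the expert path as the recovery target. The perturbation type and magnitude here are assumptions, not the paper's recipe.

```python
# Illustrative sketch (not the paper's procedure) of building training pairs
# by perturbing expert demonstrations: offset the state at a random step and
# keep the rest of the expert trajectory as the recovery target.
import numpy as np

def perturb_demonstration(traj, magnitude=0.05, rng=None):
    """traj: (T, dim) expert trajectory. Returns the perturbed trajectory
    and the index t where the perturbation was injected (1 <= t <= T-2)."""
    rng = rng or np.random.default_rng()
    T = len(traj)
    t = int(rng.integers(1, T - 1))
    perturbed = traj.copy()
    perturbed[t] += rng.normal(scale=magnitude, size=traj.shape[1])
    return perturbed, t

rng = np.random.default_rng(42)
expert = np.linspace([0, 0, 0], [1, 1, 0.5], num=20)   # toy straight-line demo
noisy, t = perturb_demonstration(expert, rng=rng)
# training pair: perturbed prefix as context, expert suffix as recovery target
context, target = noisy[: t + 1], expert[t + 1 :]
print(len(context) + len(target))   # 20
```

The open question the premise leaves is exactly the referee's: whether this synthetic perturbation distribution covers the errors the robot actually meets.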

What would settle it

Running the trained policy on both standard tasks and on tasks with injected execution errors or dynamic changes that require rerouting; if success rates fall on standard tasks or the policy fails to recover from the new errors, the central claim would not hold.
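The settling experiment above can be expressed as a minimal evaluation harness. `run_episode` is a placeholder for a real simulator rollout, and the toy success probabilities are arbitrary; only the comparison structure is the point.

```python
# Minimal sketch of the settling experiment: evaluate one policy under both
# standard and error-injected conditions and compare success rates.
import random

def run_episode(policy, inject_error):
    """Placeholder rollout: a real version would step a simulator, optionally
    inject an execution error, and check task success. Here, a fixed toy
    success probability per condition stands in for the policy."""
    p = 0.9 if not inject_error else 0.8
    return random.random() < p

def success_rate(policy, inject_error, n=1000):
    return sum(run_episode(policy, inject_error) for _ in range(n)) / n

random.seed(0)
standard = success_rate(policy=None, inject_error=False)
recovery = success_rate(policy=None, inject_error=True)
# The central claim fails if `standard` degrades relative to a baseline
# policy, or if `recovery` collapses under the injected errors.
print(standard, recovery)
```

A real run would add a non-referring baseline policy to both conditions, so that preservation of standard performance and recovery gain are measured against the same reference.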

Figures

Figures reproduced from arXiv: 2604.05544 by Jiahua Ma, Liang Lin, Ruimao Zhang, Xin Wen, Yiran Qin, Yixiong Li, Yulan Guo, Yuyu Sun.

Figure 1
Figure 1: The role of the proposed Referring-Aware Visuomotor Policy (ReV) in mitigating out-of-distribution failures. In executing the task “Grabbing the steak from the kitchen table”, several challenges can be encountered. Case A: a small covariate shift at t_k quickly leads to a compounding misalignment with expert demonstrations, driving the robot into unseen states and resulting in task failure. Case B: an unfor… view at source ↗
Figure 2
Figure 2: Overview of our Referring-Aware Visuomotor Policy. Given the observation O and robot proprioception history A⁻, the Temporal-Position Prediction Module on the left assigns the optimal temporal position k for the referring point P. This temporally-positioned referring point P is then transformed into a temporally-positioned partially-constrained referring action P^a (detailed in Sec. 3.3) and injected into t… view at source ↗
Figure 3
Figure 3: Temporal-Position Prediction. The slots buffer S covers both retained history and extrapolated future steps. Specifically, at step i, we load the historical action anchors {a′_j}, j = 1..i, to initialize the history part, and the future part is padded with copies of the last action anchor a′_i. Then, the whole slots buffer is augmented with monotonically increasing temporal-position embeddings. while the padded… view at source ↗
Figure 4
Figure 4: Visualization of the trajectories generated by ReV on Pick-Meat-via and Lift-Barrier-via. Here, we use green bounding boxes to mark the frames in which the end-effector passes through the designated via-point (green ball). generative model. As for the final method, we formulate a guidance cost based on the referring point P, which steers the denoising process via classifier-guided sampling. Benchmarks. We … view at source ↗
Figure 5
Figure 5: Demonstration Construction Strategy for GDH and LDH. The orange blocks represent the sparse action anchors. Among them, the lighter orange exemplifies the historical context, while the darker orange exemplifies the future targets. The green blocks denote the fine-grained interpolation actions. A.2. Binary Mask Used in Trajectory-Steering Strategy In this section, we formally define the three binary masks u… view at source ↗
Figure 6
Figure 6: Via-Points Generation. Via-points are generated via Gaussian sampling around a centroid derived from the robot’s initial configuration. As discussed previously, we evaluate the referring-awareness capability of ReV by augmenting the involved simulation and real-world benchmarks with randomized via-points. These via-points are constrained to the robot’s operational workspace to ensure reachability. The deta… view at source ↗
Figure 7
Figure 7: Visualization of ablation studies on key hyperparameters. (a) Total trajectory length N: performance peaks when N matches task complexity, enabling the model to capture task-specific execution patterns. (b) Trajectory-length allocation N1:N2: balanced ratios (1:2 ∼ 2:1) allow our policy model to learn the entire trajectory most effectively. (c) Update frequencies of T-P prediction: excessive updates cau… view at source ↗
Figure 8
Figure 8: OOD-yet-Feasible Referring Points Generation. 1. Baseline Definition: Compute the midpoint p_m of the line segment connecting the robot end-effector’s position p_e and the geometric center p_o of the target object: p_m = (p_e + p_o)/2. 2. Horizontal Sampling: On the plane parallel to the table surface, define a direction d⊥ that is perpendicular to the vector p_o − p_e (i.e., d⊥ ⊥ (p_o − p_e)). Then, generate a set of … view at source ↗
Figure 9
Figure 9: Visualization of the trajectories generated by ReV on Place-Food-via and Camera-Alignment-via. Here, we use green bounding boxes to mark the frames in which the end-effector passes through the designated via-point (green ball). D. Real-world Experiments Details D.1. Platform view at source ↗
Figure 10
Figure 10: Real-world Experimental Platform. The setup comprises a dual-robot arm system, a top-down RGB-D camera, and a black table serving as the workspace. view at source ↗
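Figures 6 and 8 describe two referring-point generators: Gaussian sampling around a centroid for via-points, and, for OOD-yet-feasible points, horizontal offsets from the end-effector-to-object midpoint. A sketch under those descriptions, with illustrative workspace bounds and scales (the actual parameters are not given here):

```python
# Sketch of the referring-point generators described in Figures 6 and 8.
# Workspace bounds, scales, and offsets are illustrative assumptions.
import numpy as np

def sample_via_point(centroid, scale=0.05, bounds=((-0.5, 0.5),) * 3, rng=None):
    """Figure 6: Gaussian sample around a centroid, clipped to the workspace."""
    rng = rng or np.random.default_rng()
    p = centroid + rng.normal(scale=scale, size=3)
    lo, hi = np.array(bounds).T
    return np.clip(p, lo, hi)

def sample_ood_referring_point(p_e, p_o, offset=0.1):
    """Figure 8: midpoint of end-effector p_e and object center p_o, offset
    along a horizontal direction perpendicular to (p_o - p_e). Assumes the
    end-effector-to-object vector has a horizontal component."""
    p_m = (p_e + p_o) / 2.0
    v = p_o - p_e
    d_perp = np.array([-v[1], v[0], 0.0])   # horizontal, perpendicular to v
    d_perp /= np.linalg.norm(d_perp)
    return p_m + offset * d_perp

via = sample_via_point(np.array([0.2, 0.0, 0.3]), rng=np.random.default_rng(1))
p = sample_ood_referring_point(np.array([0.0, 0.0, 0.2]),
                               np.array([0.4, 0.0, 0.2]))
print(np.round(p, 3))   # [0.2 0.1 0.2]
```

The perpendicular-offset construction guarantees the sampled point lies off the nominal approach line, which is what makes it out-of-distribution while staying reachable.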
read the original abstract

This paper addresses a fundamental problem of visuomotor policy learning for robotic manipulation: how to enhance robustness to out-of-distribution execution errors or dynamically re-routed trajectories when the model relies solely on the original expert demonstrations for training. We introduce the Referring-Aware Visuomotor Policy (ReV), a closed-loop framework that can adapt to unforeseen circumstances by instantly incorporating sparse referring points provided by a human or a high-level reasoning planner. Specifically, ReV leverages coupled diffusion heads to preserve standard task execution patterns while seamlessly integrating sparse referring via a trajectory-steering strategy. Upon receiving a specific referring point, the global diffusion head first generates a sequence of globally consistent yet temporally sparse action anchors while identifying the precise temporal position for the referring point within this sequence. Subsequently, the local diffusion head adaptively interpolates adjacent anchors based on the current temporal position for specific tasks. This closed-loop process repeats at every execution step, enabling real-time trajectory replanning in response to dynamic changes in the scene. In practice, rather than relying on elaborate annotations, ReV is trained only by applying targeted perturbations to expert demonstrations. Without any additional data or fine-tuning scheme, ReV achieves higher success rates across challenging simulated and real-world tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Referring-Aware Visuomotor Policy (ReV), a closed-loop visuomotor framework for robotic manipulation tasks. ReV employs coupled global and local diffusion heads that preserve standard execution patterns from expert demonstrations while incorporating sparse referring points (from humans or planners) via a trajectory-steering strategy for real-time replanning. The global head produces temporally sparse action anchors and identifies the referring point's position; the local head then interpolates. Training uses only targeted perturbations applied to expert demonstrations, with no additional data or fine-tuning, and the abstract claims higher success rates on challenging simulated and real-world tasks.

Significance. If the empirical claims hold, the work would be significant for data-efficient robustness in visuomotor policies, as it offers a mechanism to handle OOD execution errors and dynamic re-routing without new data collection or retraining. The coupled diffusion approach for global anchoring plus local interpolation, combined with perturbation-based training for referring awareness, represents a practical integration that could influence closed-loop control designs in robotics.

major comments (3)
  1. [Abstract] Abstract: The central claim that ReV achieves higher success rates across tasks without additional data or fine-tuning is stated without any quantitative results, baselines, error bars, task descriptions, or metrics. This absence is load-bearing because the contribution rests on demonstrating that the perturbation-trained coupled heads improve performance while preserving standard execution.
  2. [Method] Method description (high-level overview in abstract and §3): No derivation, pseudocode, or loss formulation is supplied for how the global diffusion head generates anchors and identifies temporal position, or how the local head performs adaptive interpolation conditioned on the referring point. Without these details, it is impossible to verify that the trajectory-steering strategy is independent of the target result rather than implicitly fitting to the perturbation distribution.
  3. [Training] Training procedure (abstract): The assumption that targeted perturbations to expert demonstrations alone suffice to cover the distribution of real OOD errors (sensor noise, object dynamics, planner shifts) is not supported by any analysis, overlap metrics, or ablation showing that standard (non-referring) performance is preserved outside the perturbed regime. This assumption is load-bearing for the generalization claim.
minor comments (2)
  1. [Abstract] The abstract and method overview use several invented terms (e.g., 'coupled diffusion heads', 'trajectory-steering strategy', 'action anchors') without prior definition or reference to standard diffusion policy literature, which reduces readability.
  2. [Method] No mention of implementation details such as diffusion timestep schedules, network architectures for the two heads, or how referring points are encoded as input.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments in detail below, providing clarifications and indicating the revisions we plan to make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim that ReV achieves higher success rates across tasks without additional data or fine-tuning is stated without any quantitative results, baselines, error bars, task descriptions, or metrics. This absence is load-bearing because the contribution rests on demonstrating that the perturbation-trained coupled heads improve performance while preserving standard execution.

    Authors: We agree that including quantitative results in the abstract would better support the central claims. In the revised manuscript, we will update the abstract to incorporate key experimental outcomes, such as success rates with error bars, comparisons to baselines, and brief task descriptions, while keeping it concise. revision: yes

  2. Referee: [Method] No derivation, pseudocode, or loss formulation is supplied for how the global diffusion head generates anchors and identifies temporal position, or how the local head performs adaptive interpolation conditioned on the referring point. Without these details, it is impossible to verify that the trajectory-steering strategy is independent of the target result rather than implicitly fitting to the perturbation distribution.

    Authors: We will provide additional technical details in the revised version. Specifically, we will include pseudocode for the global and local diffusion heads, the loss formulations for training, and a derivation explaining how the referring point conditions the temporal position identification and interpolation. This will demonstrate that the strategy relies on the input referring point rather than the perturbation distribution. revision: yes

  3. Referee: [Training] The assumption that targeted perturbations to expert demonstrations alone suffice to cover the distribution of real OOD errors (sensor noise, object dynamics, planner shifts) is not supported by any analysis, overlap metrics, or ablation showing that standard (non-referring) performance is preserved outside the perturbed regime. This assumption is load-bearing for the generalization claim.

    Authors: We recognize the importance of validating this assumption empirically. We will add an ablation study to the experiments section showing that ReV maintains standard performance on unperturbed expert trajectories. We will also provide a discussion and qualitative analysis of how the targeted perturbations align with common OOD errors encountered in practice. A quantitative overlap metric between perturbation and real error distributions would require further data collection and is beyond the current scope, but the added ablation addresses the core concern. revision: partial

Circularity Check

0 steps flagged

No significant circularity; the method description is self-contained and the training procedure is independent of the evaluation.

full rationale

The paper introduces ReV as a closed-loop visuomotor policy trained solely by applying targeted perturbations to expert demonstrations, with no equations, derivations, or load-bearing self-citations present in the provided text. The central claims, that coupled diffusion heads and trajectory steering preserve task patterns while integrating sparse referring points, are presented as empirical outcomes of this training regime rather than reductions to fitted inputs or prior author results by construction. Performance assertions rest on simulated and real-world task success rates, which are not definitionally equivalent to the perturbation strategy itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that diffusion models can generate consistent action sequences and that perturbations suffice for robustness training; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (2)
  • domain assumption Diffusion-based policies can generate temporally consistent action sequences for robotic tasks
    Invoked in the description of global and local diffusion heads for action anchor generation.
  • ad hoc to paper Targeted perturbations to expert demonstrations are representative of out-of-distribution errors
    Used to justify training without additional data.
invented entities (1)
  • Coupled diffusion heads with trajectory-steering strategy no independent evidence
    purpose: To integrate sparse referring points into visuomotor policy for closed-loop replanning
    New construct introduced to enable the adaptation mechanism.

pith-pipeline@v0.9.0 · 5538 in / 1372 out tokens · 34345 ms · 2026-05-10T19:53:48.974692+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 32 canonical work pages · 8 internal anchors

  1. [1]

    Dexterous functional grasping,

    Agarwal, A., Uppal, S., Shaw, K., and Pathak, D. Dexterous functional grasping. arXiv preprint arXiv:2312.02975, 2023

  2. [2]

    Speedfolding: Learning efficient bimanual folding of garments

    Avigal, Y., Berscheid, L., Asfour, T., Kröger, T., and Goldberg, K. Speedfolding: Learning efficient bimanual folding of garments. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 1--8, 2022. doi:10.1109/IROS47612.2022.9981402

  3. [3]

    Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

    Bao, C., Xu, H., Qin, Y., and Wang, X. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21190--21200, 2023

  4. [4]

    Motion planning diffusion: Learning and adapting robot motion planning with diffusion models

    Carvalho, J., Le, A. T., Kicki, P., Koert, D., and Peters, J. Motion planning diffusion: Learning and adapting robot motion planning with diffusion models. IEEE Transactions on Robotics, 2025

  5. [5]

    Metafold: Language-guided multi-category garment folding framework via trajectory generation and foundation model

    Chen, H., Li, J., Wu, R., Liu, Y., Hou, Y., Xu, Z., Guo, J., Gao, C., Wei, Z., Xu, S., Huang, J., and Shao, L. Metafold: Language-guided multi-category garment folding framework via trajectory generation and foundation model, 2025. URL https://arxiv.org/abs/2503.08372

  6. [6]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  7. [7]

    From play to policy: Conditional behavior generation from uncurated robot data

    Cui, Z. J., Wang, Y., Shafiullah, N. M. M., and Pinto, L. From play to policy: Conditional behavior generation from uncurated robot data, 2022. URL https://arxiv.org/abs/2210.10047

  8. [8]

    Self-supervised correspondence in visuomotor policy learning

    Florence, P., Manuelli, L., and Tedrake, R. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 5(2): 492--499, 2020. doi:10.1109/LRA.2019.2956365

  9. [9]

    Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic

    Gammell, J. D., Srinivasa, S. S., and Barfoot, T. D. Informed rrt*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic. In 2014 IEEE/RSJ international conference on intelligent robots and systems, pp.\ 2997--3004. IEEE, 2014

  10. [10]

    Batch informed trees (BIT*): Informed asymptotically optimal anytime search

    Gammell, J. D., Barfoot, T. D., and Srinivasa, S. S. Batch informed trees (bit*): Informed asymptotically optimal anytime search. The International Journal of Robotics Research, 39(5): 543--567, 2020

  11. [11]

    Carp: Visuomotor policy learning via coarse-to-fine autoregressive prediction

    Gong, Z., Ding, P., Lyu, S., Huang, S., Sun, M., Zhao, W., Fan, Z., and Wang, D. Carp: Visuomotor policy learning via coarse-to-fine autoregressive prediction. arXiv preprint arXiv:2412.06782, 2024

  12. [12]

    Teach a robot to fish: Versatile imitation from one minute of demonstrations

    Haldar, S., Pari, J., Rai, A., and Pinto, L. Teach a robot to fish: Versatile imitation from one minute of demonstrations. arXiv preprint arXiv:2303.01497, 2023

  13. [13]

    Denoising diffusion probabilistic models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840--6851, 2020

  14. [14]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., and Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models, 2023. URL https://arxiv.org/abs/2307.05973

  15. [15]

    Planning with Diffusion for Flexible Behavior Synthesis

    Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  16. [16]

    Sampling-based algorithms for optimal motion planning

    Karaman, S. and Frazzoli, E. Sampling-based algorithms for optimal motion planning. The international journal of robotics research, 30(7): 846--894, 2011

  17. [17]

    Flame: Free-form language-based motion synthesis & editing

    Kim, J., Kim, J., and Choi, S. Flame: Free-form language-based motion synthesis & editing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.\ 8255--8263, 2023

  18. [18]

    Accelerating motion planning via optimal transport

    Le, A. T., Chalvatzaki, G., Biess, A., and Peters, J. R. Accelerating motion planning via optimal transport. Advances in Neural Information Processing Systems, 36: 78453--78482, 2023

  19. [19]

    Behavior generation with latent actions

    Lee, S., Wang, Y., Etukuru, H., Kim, H. J., Shafiullah, N. M. M., and Pinto, L. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181

  20. [20]

    Cdp: Towards robust autoregressive visuomotor policy learning via causal diffusion

    Ma, J., Qin, Y., Li, Y., Liao, X., Guo, Y., and Zhang, R. Cdp: Towards robust autoregressive visuomotor policy learning via causal diffusion. arXiv preprint arXiv:2506.14769, 2025

  21. [21]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., and Martín-Martín, R. What matters in learning from offline human demonstrations for robot manipulation, 2021. URL https://arxiv.org/abs/2108.03298

  22. [22]

    Mimicgen: A data generation system for scalable robot learning using human demonstrations

    Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., and Fox, D. Mimicgen: A data generation system for scalable robot learning using human demonstrations, 2023. URL https://arxiv.org/abs/2310.17596

  23. [23]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., and Zhu, Y. Robocasa: Large-scale simulation of everyday tasks for generalist robots, 2024. URL https://arxiv.org/abs/2406.02523

  24. [24]

    Octo: An open-source generalist robot policy

    Octo Model Team , Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Xu, C., Luo, J., Kreiman, T., Tan, Y., Chen, L. Y., Sanketi, P., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  25. [25]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    O’Neill, A., Rehman, A., Maddukuri, A., Gupta, A., Padalkar, A., Lee, A., Pooley, A., Gupta, A., Mandlekar, A., Jain, A., Tung, A., Bewley, A., Herzog, A., Irpan, A., Khazatsky, A., Rai, A., Gupta, A., Wang, A., Singh, A., Garg, A., Kembhavi, A., Xie, A., Brohan, A., Raffin, A., Sharma, A., Yavary, A., Jain, A., Balakrishna, A., Wahid, A., Burgess-Limeric...

  26. [26]

    Learning agile robotic locomotion skills by imitating animals

    Peng, X. B., Coumans, E., Zhang, T., Lee, T.-W., Tan, J., and Levine, S. Learning agile robotic locomotion skills by imitating animals. arXiv preprint arXiv:2004.00784, 2020

  27. [27]

    Mixtures of gaussian processes for robot motion planning using stochastic trajectory optimization

    Petrović, L., Marković, I., and Petrović, I. Mixtures of gaussian processes for robot motion planning using stochastic trajectory optimization. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 52(12): 7378--7390, 2022

  28. [28]

    Robofactory: Exploring embodied agent collaboration with compositional constraints

    Qin, Y., Kang, L., Song, X., Yin, Z., Liu, X., Liu, X., Zhang, R., and Bai, L. Robofactory: Exploring embodied agent collaboration with compositional constraints. arXiv preprint arXiv:2503.16408, 2025

  29. [29]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017

  30. [30]

    Edmp: Ensemble-of-costs-guided diffusion for motion planning

    Saha, K., Mandadi, V., Reddy, J., Srikanth, A., Agarwal, A., Sen, B., Singh, A., and Krishna, M. Edmp: Ensemble-of-costs-guided diffusion for motion planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 10351--10358. IEEE, 2024

  31. [31]

    Behavior transformers: Cloning k modes with one stone

    Shafiullah, N. M. M., Cui, Z. J., Altanzaya, A., and Pinto, L. Behavior transformers: Cloning k modes with one stone, 2022. URL https://arxiv.org/abs/2206.11251

  32. [32]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Shridhar, M., Manuelli, L., and Fox, D. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pp.\ 785--799. PMLR, 2023

  33. [33]

    Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands

    Singh, R., Allshire, A., Handa, A., Ratliff, N., and Wyk, K. V. Dextrah-rgb: Visuomotor policies to grasp anything with dexterous hands, 2025. URL https://arxiv.org/abs/2412.01791

  34. [34]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  35. [35]

    Adaptively informed trees (AIT*): Fast asymptotically optimal path planning through adaptive heuristics

    Strub, M. P. and Gammell, J. D. Adaptively informed trees (ait*): Fast asymptotically optimal path planning through adaptive heuristics. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 3191--3198. IEEE, 2020

  36. [36]

    Dense policy: Bidirectional autoregressive learning of actions

    Su, Y., Zhan, X., Fang, H., Xue, H., Fang, H.-S., Li, Y.-L., Lu, C., and Yang, L. Dense policy: Bidirectional autoregressive learning of actions. arXiv preprint arXiv:2503.13217, 2025

  37. [37]

    Edge: Editable dance generation from music

    Tseng, J., Castellon, R., and Liu, K. Edge: Editable dance generation from music. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 448--458, 2023

  38. [38]

    Learning implicit priors for motion optimization

    Urain, J., Le, A. T., Lambert, A., Chalvatzaki, G., Boots, B., and Peters, J. Learning implicit priors for motion optimization. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 7672--7679. IEEE, 2022

  39. [39]

    Bridgedata v2: A dataset for robot learning at scale

    Walke, H. R., Black, K., Zhao, T. Z., Vuong, Q., Zheng, C., Hansen-Estruch, P., He, A. W., Myers, V., Kim, M. J., Du, M., Lee, A., Fang, K., Finn, C., and Levine, S. Bridgedata v2: A dataset for robot learning at scale. In Tan, J., Toussaint, M., and Darvish, K. (eds.), Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Mach...

  40. [40]

    Mimicplay: Long-horizon imitation learning by watching human play

    Wang, C., Fan, L., Sun, J., Zhang, R., Fei-Fei, L., Xu, D., Zhu, Y., and Anandkumar, A. Mimicplay: Long-horizon imitation learning by watching human play. arXiv preprint arXiv:2302.12422, 2023

  41. [41]

    Gaudp: Reinventing multi-agent collaboration through gaussian-image synergy in diffusion policies

    Wang, Z., Kang, L., Qin, Y., Ma, J., Peng, Z., Bai, L., and Zhang, R. Gaudp: Reinventing multi-agent collaboration through gaussian-image synergy in diffusion policies, 2025. URL https://arxiv.org/abs/2511.00998

  42. [42]

    Ensuring force safety in vision-guided robotic manipulation via implicit tactile calibration

    Wei, L., Ma, J., Hu, Y., and Zhang, R. Ensuring force safety in vision-guided robotic manipulation via implicit tactile calibration. arXiv preprint arXiv:2412.10349, 2024

  43. [43]

    Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation

    Xian, Z., Gkanatsios, N., Gervet, T., Ke, T.-W., and Fragkiadaki, K. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In 7th Annual Conference on Robot Learning, 2023

  44. [44]

    Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning

    Xue, Z., Deng, S., Chen, Z., Wang, Y., Yuan, Z., and Xu, H. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning, 2025. URL https://arxiv.org/abs/2502.16932

  45. [45]

    Equivact: Sim(3)-equivariant visuomotor policies beyond rigid object manipulation

    Yang, J., Deng, C., Wu, J., Antonova, R., Guibas, L., and Bohg, J. Equivact: Sim(3)-equivariant visuomotor policies beyond rigid object manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 9249--9255, 2024. doi:10.1109/ICRA57147.2024.10611491

  46. [46]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pp.\ 1094--1100. PMLR, 2020

  47. [47]

    Gnfactor: Multi-task real robot learning with generalizable neural feature fields

    Ze, Y., Yan, G., Wu, Y.-H., Macaluso, A., Ge, Y., Ye, J., Hansen, N., Li, L. E., and Wang, X. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pp.\ 284--301. PMLR, 2023

  48. [48]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

    Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024

  49. [49]

    Autoregressive action sequence learning for robotic manipulation

    Zhang, X., Liu, Y., Chang, H., Schramm, L., and Boularias, A. Autoregressive action sequence learning for robotic manipulation. IEEE Robotics and Automation Letters, 2025

  50. [50]

    TouchGuide: Inference-Time Steering of Visuomotor Policies via Touch Guidance

    Zhang, Z., Ma, J., Yang, X., Wen, X., Zhang, Y., Li, B., Qin, Y., Liu, J., Zhao, C., Kang, L., Hong, H., Yin, Z., Torr, P., Su, H., Zhang, R., and Ma, D. Touchguide: Inference-time steering of visuomotor policies via touch guidance, 2026. URL https://arxiv.org/abs/2601.20239

  51. [51]

    Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control

    Zhao, K., Li, G., and Tang, S. Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control. arXiv preprint arXiv:2410.05260, 2024

  52. [52]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

  53. [53]

    Roborefer: Towards spatial referring with reasoning in vision-language models for robotics

    Zhou, E., An, J., Chi, C., Han, Y., Rong, S., Zhang, C., Wang, P., Wang, Z., Huang, T., Sheng, L., et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308, 2025
