pith. machine review for the scientific record.

arxiv: 2603.05117 · v3 · submitted 2026-03-05 · 💻 cs.RO

Recognition: no theorem link

SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords imitation learning · diffusion policy · robot manipulation · gated attention · long-horizon tasks · temporal modeling · SeedPolicy · SEGA

The pith

Integrating self-evolving gated attention into diffusion policies resolves temporal bottlenecks and scales effective horizons for long-horizon robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SeedPolicy, which embeds a Self-Evolving Gated Attention (SEGA) module into the standard Diffusion Policy framework. This module maintains a compact, time-evolving latent state that accumulates relevant long-term context from observations while discarding irrelevant details. Standard diffusion policies degrade when observation horizons are extended, but SEGA enables efficient recurrent updates with only moderate added cost. On the RoboTwin 2.0 benchmark covering 50 manipulation tasks, this yields 36.8 percent relative gains in clean environments and 169 percent in randomized challenging ones, averaged over CNN and Transformer backbones. The approach also beats much larger vision-language-action models while using one to two orders of magnitude fewer parameters.

Core claim

SeedPolicy resolves the temporal modeling bottleneck in Diffusion Policy by integrating Self-Evolving Gated Attention (SEGA). SEGA maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that accumulate long-term context into a compact representation while filtering out irrelevant temporal information. This extends the effective temporal horizon at moderate overhead and yields superior performance on long-horizon imitation learning tasks for robot manipulation.

What carries the argument

Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention to enable efficient recurrent updates accumulating long-term context.
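The gating equations are not reproduced on this page, so the following is only a minimal sketch of the general pattern the paper describes: a compact latent state reads from current observation features via attention, and a learned gate controls how much of that read-out is written back into the state. All class and variable names here are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedRecurrentState(nn.Module):
    """Hypothetical sketch of a SEGA-style update: the latent state attends
    over current observation features, and a learned sigmoid gate decides
    how much of the attended update to absorb into the state."""

    def __init__(self, state_dim: int, obs_dim: int, n_heads: int = 4):
        super().__init__()
        # state_dim must be divisible by n_heads.
        self.attn = nn.MultiheadAttention(
            state_dim, n_heads, kdim=obs_dim, vdim=obs_dim, batch_first=True
        )
        # Gate conditioned on the previous state and the attended update.
        self.gate = nn.Sequential(
            nn.Linear(2 * state_dim, state_dim), nn.Sigmoid()
        )

    def forward(self, state: torch.Tensor, obs_feats: torch.Tensor):
        # state: (B, 1, state_dim); obs_feats: (B, N_tokens, obs_dim)
        update, _ = self.attn(query=state, key=obs_feats, value=obs_feats)
        g = self.gate(torch.cat([state, update], dim=-1))
        # Convex combination: retain accumulated context vs. absorb new input.
        return (1.0 - g) * state + g * update
```

Because the state is updated recurrently, one step per incoming observation, cost grows linearly with horizon length rather than with the number of stacked frames, which is the efficiency argument the abstract makes.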

If this is right

  • Diffusion policies can scale observation horizons without the usual performance drop.
  • Large relative gains appear on 50-task benchmarks, with the biggest lifts under randomized conditions.
  • A strong efficiency advantage holds against vision-language-action models using 10-100 times more parameters.
  • The method sets a new state-of-the-art baseline for imitation learning on long-horizon robotic manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The gated recurrent update pattern could transfer to other policy architectures that currently rely on stacked frames.
  • Efficient latent-state memory may reduce the need for ever-larger context windows in real-time robot control.
  • Similar mechanisms might help sequential decision tasks outside manipulation, such as navigation or assembly planning.
  • The approach points toward hybrid recurrent-diffusion designs that keep compute costs low while extending temporal reach.

Load-bearing premise

The gated attention mechanism reliably accumulates relevant long-term context while filtering irrelevant temporal information across diverse manipulation tasks without introducing new failure modes or losing critical details.

What would settle it

An experiment where increasing the observation horizon in SeedPolicy causes performance to degrade at the same rate as in standard Diffusion Policy, or where SEGA fails to filter distractors in a task requiring precise recall of distant events.
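Operationally, the first test is a single sweep: evaluate both policies at increasing observation horizons and compare degradation curves. A minimal sketch of the harness shape, where the rollout callables are hypothetical stand-ins for full RoboTwin evaluations of trained policies:

```python
from typing import Callable, Dict, Iterable

def horizon_scaling_probe(
    policies: Dict[str, Callable[[int], float]],
    horizons: Iterable[int] = (1, 2, 4, 8, 16),
) -> Dict[str, Dict[int, float]]:
    """Sweep observation horizons and record benchmark success rates.

    Each policy is represented by a callable mapping an observation
    horizon to a success rate; in practice that callable would wrap
    full benchmark rollouts at that horizon."""
    return {name: {h: rollout(h) for h in horizons}
            for name, rollout in policies.items()}
```

If SeedPolicy's curve tracks Diffusion Policy's downward as the horizon grows, the central claim fails; if it stays flat or rises, the claim survives this probe.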

Figures

Figures reproduced from arXiv: 2603.05117 by Haoqiang Fan, Peng Cheng, Shen Cheng, Shuaicheng Liu, Xinyang Yuan, Youqiang Gui, Yuxuan Zhou.

Figure 1
Figure 1. Horizon scaling analysis. (a) DP shows a counter-intuitive performance drop as the observation horizon grows, dropping to 0% at large horizons (data omitted). (b) In contrast, our approach enables robust horizon scaling, utilizing long observation horizons to improve task success rates. view at source ↗
Figure 2
Figure 2. Overview of the SeedPolicy framework. The system takes current RGB images and joint poses as input, encoding them via a ResNet Encoder. The core Self-Evolving Gated Attention (SEGA) module (blue box) recursively updates a time-evolving latent state (State_t) to capture long-term spatiotemporal dependencies while generating enhanced observation features (EObs_t). These context-rich features are then fed into… view at source ↗
Figure 3
Figure 3. (a) SEGA employs a dual-stream design … view at source ↗
Figure 4
Figure 4. Performance comparison across varying task length. A consistent trend emerges in both architectures: as the task length increases, the performance gap between SeedPolicy and the baseline progressively widens. This validates the architecture-agnostic effectiveness of our approach, demonstrating that the advantage of our explicit temporal modeling becomes increasingly significant in long-horizon scenarios c… view at source ↗
Figure 5
Figure 5. Qualitative visualization of failure cases in simulation. We compare the successful execution of SeedPolicy (top row) against representative failure modes of the DP across three tasks: (a) Put Bottles Dustbin ("clean" setting), (b) Handover Mic ("hard" setting), and (c) Grab Roller ("hard" setting). Red circles highlight critical errors, including execution stagnation (getting stuck) and spatial positionin… view at source ↗
Figure 6
Figure 6. Real-world quantitative results. Success rate comparison across three challenging tasks: Looping Place-Retrieval, Sequential Picking, and Bottle Handover. SeedPolicy demonstrates superior robustness, significantly outperforming the baseline (DP) in all scenarios. view at source ↗
Figure 7
Figure 7. Qualitative failure analysis in real-world scenarios. We visualize the successful execution of SeedPolicy (top rows) compared to common baseline failures in (a) Looping Place-Retrieval and (b) Bottle Handover. Red circles highlight critical errors. Failure Case 1 illustrates execution stagnation caused by Perceptual Aliasing (misinterpreting the returned block as the initial state in (a)) or overfitting to… view at source ↗
Figure 8
Figure 8. The DOS-W1 Mobile Manipulation Platform. As illustrated, the system integrates dual 7-DoF robotic arms with a differential drive mobile chassis and a vertical lift mechanism. A front-view RGB camera is mounted on the mast for visual perception. The dual arms offer a payload capacity of 1.5 kg each with a repeatability of ±0.1 … view at source ↗
Figure 9
Figure 9. More qualitative visualization of failure cases in simulation. We compare the successful execution of SeedPolicy (top row) against representative failure modes of the DP across three tasks: (a) Stack Bowls Three ("clean" setting), (b) Stack Blocks Two ("clean" setting), and (c) Shake Bottle ("hard" setting). Red circles highlight critical errors, including execution stagnation (getting stuck) and spatial p… view at source ↗
Figure 10
Figure 10. More qualitative failure analysis in real-world scenarios. We visualize the successful execution of SeedPolicy (top rows) compared to common baseline failures in Sequential Picking. Red circles highlight critical errors. Failure Case 1 illustrates execution stagnation caused by Perceptual Aliasing (misinterpreting the placed block as the initial state). Failure Case 2 demonstrates spatial precision errors… view at source ↗
Figure 11
Figure 11. Open-loop trajectory reconstruction for Sequential Picking. The model accurately reconstructs the complex, over-1000-step trajectory. view at source ↗
Figure 12
Figure 12. Open-loop trajectory reconstruction for Bottle Handover. Note the precise alignment in gripper channels, demonstrating the model's ability to capture sharp discrete transitions. view at source ↗
Figure 13
Figure 13. Open-loop trajectory reconstruction for Looping Place-Retrieval. SeedPolicy maintains high tracking accuracy over long horizons. view at source ↗
read the original abstract

Imitation Learning (IL) enables robots to acquire manipulation skills from expert demonstrations. Diffusion Policy (DP) models multi-modal expert behaviors but degrades when naively increasing stacked observation horizons, limiting long-horizon manipulation. We propose Self-Evolving Gated Attention (SEGA), a temporal module that maintains a time-evolving latent state via gated attention, enabling efficient recurrent updates that accumulate long-term context into a compact latent representation while filtering irrelevant temporal information. Integrating SEGA into DP yields Self-Evolving Diffusion Policy (SeedPolicy), which resolves the temporal modeling bottleneck and extends the effective temporal horizon with moderate overhead. On the RoboTwin 2.0 benchmark with 50 manipulation tasks, SeedPolicy outperforms DP and other IL baselines. Averaged across both CNN and Transformer backbones, SeedPolicy achieves 36.8% relative improvement in clean settings and 169% relative improvement in randomized challenging settings over the DP. Compared to vision-language-action models such as RDT with 1.2B parameters, SeedPolicy achieves stronger performance in the clean setting with one to two orders of magnitude fewer parameters, demonstrating strong efficiency. These results establish SeedPolicy as a state-of-the-art imitation learning method for long-horizon robotic manipulation. Code is available at: https://anonymous.4open.science/r/SeedPolicy-64F0/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes Self-Evolving Diffusion Policy (SeedPolicy) by augmenting Diffusion Policy with a Self-Evolving Gated Attention (SEGA) temporal module. SEGA maintains a compact time-evolving latent state through gated attention to enable recurrent updates that accumulate long-horizon context while filtering irrelevant information, addressing DP's degradation with stacked observations. On the RoboTwin 2.0 benchmark across 50 manipulation tasks, SeedPolicy reports averaged relative gains of 36.8% (clean) and 169% (randomized challenging) over DP for both CNN and Transformer backbones, plus stronger clean-setting performance than the 1.2B-parameter RDT model despite using one to two orders of magnitude fewer parameters. Code is released.

Significance. If the results hold under rigorous validation, the contribution is meaningful for imitation learning: it offers a lightweight, recurrent-style fix to the known horizon-scaling bottleneck in diffusion policies without the parameter cost of large VLA models. The efficiency claims and public code are clear strengths that could influence practical long-horizon manipulation pipelines.

major comments (1)
  1. [Abstract and §4 (Experiments)] The central performance claims (36.8% / 169% relative gains) are attributed to SEGA's gated updates preserving critical details across horizons, yet no horizon-scaling curves, attention-weight trajectories, or ablation replacing the gate with plain recurrence are reported. Without these, it remains possible that gains arise from added capacity or training variance rather than the claimed temporal mechanism.
minor comments (3)
  1. The anonymous code link should be replaced with a permanent repository or supplemented with full training hyperparameters, random seeds, and statistical significance tests (error bars, p-values) to support reproducibility.
  2. [Abstract] Clarify the exact parameter counts for SeedPolicy variants versus the cited RDT baseline and confirm whether the reported averages are macro- or micro-averaged across the 50 tasks.
  3. [§3.2] Figure captions and §3.2 should explicitly define the gating equations and any learned parameters in SEGA to allow readers to verify the 'parameter-free' or 'moderate overhead' claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The request for additional analyses to more directly attribute gains to the SEGA mechanism is reasonable, and we will incorporate the suggested elements into the revised manuscript to strengthen the experimental validation.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central performance claims (36.8% / 169% relative gains) are attributed to SEGA's gated updates preserving critical details across horizons, yet no horizon-scaling curves, attention-weight trajectories, or ablation replacing the gate with plain recurrence are reported. Without these, it remains possible that gains arise from added capacity or training variance rather than the claimed temporal mechanism.

    Authors: We acknowledge that the current experiments, while showing consistent gains across backbones and settings, do not include the specific diagnostics requested. In the revision we will add: (1) horizon-scaling curves that plot task success rate versus observation horizon length (from 1 to 16 frames) for both baseline DP and SeedPolicy, directly illustrating the improved scaling behavior; (2) representative attention-weight trajectories over time that visualize how the gated mechanism selectively retains or discards information from past observations; and (3) an ablation that replaces the gated attention with a plain recurrent module (e.g., a standard GRU-style update without the learned gate) while keeping parameter count matched, to isolate the contribution of the gating operation. These additions will help distinguish the claimed temporal mechanism from capacity or variance effects. We believe the existing multi-backbone, multi-setting results already provide supporting evidence, but the new analyses will make the attribution more rigorous. revision: yes
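    The third promised analysis, a parameter-matched plain-recurrent baseline, is simple to state concretely. A hedged sketch, assuming the gated variant looks like the earlier GatedRecurrentState sketch; all names are hypothetical, not the authors' ablation code:

```python
import torch
import torch.nn as nn

class PlainRecurrentState(nn.Module):
    """Hypothetical ablation baseline: identical attention read-out, but the
    state update uses a fixed mixing coefficient instead of a learned,
    input-dependent gate. The linear layer keeps the parameter count
    comparable to the gated variant's gate MLP."""

    def __init__(self, state_dim: int, obs_dim: int, n_heads: int = 4,
                 alpha: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            state_dim, n_heads, kdim=obs_dim, vdim=obs_dim, batch_first=True
        )
        self.proj = nn.Linear(2 * state_dim, state_dim)  # parameter matching
        self.alpha = alpha  # fixed, not learned: the gate is what is removed

    def forward(self, state: torch.Tensor, obs_feats: torch.Tensor):
        update, _ = self.attn(query=state, key=obs_feats, value=obs_feats)
        mixed = self.proj(torch.cat([state, update], dim=-1))
        # Ungated exponential-moving-average update of the latent state.
        return (1.0 - self.alpha) * state + self.alpha * mixed
```

If the gated and plain variants perform identically at matched parameter count, the gains would be attributable to capacity rather than gating, which is exactly the confound the referee raises.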

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmark

full rationale

The paper introduces SEGA as a new gated-attention temporal module and integrates it into Diffusion Policy to form SeedPolicy. All reported results (36.8% and 169% relative gains) are obtained by direct evaluation on the independent RoboTwin 2.0 benchmark against standard baselines. No equations, fitted parameters, or self-citations are presented that reduce the performance numbers to quantities defined inside the paper itself. The architecture is proposed rather than derived from prior self-referential results, so the central claims remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of a newly introduced architectural component whose benefits are demonstrated empirically rather than derived from first principles or grounded in prior external evidence.

axioms (1)
  • domain assumption: Gated attention can maintain a compact time-evolving latent state that accumulates relevant context while discarding irrelevant information
    This premise is invoked to justify the SEGA design for recurrent temporal modeling in diffusion policies.
invented entities (1)
  • Self-Evolving Gated Attention (SEGA) — no independent evidence
    purpose: To enable efficient long-horizon context accumulation inside diffusion policies for robot manipulation
    New module proposed and integrated in this work.

pith-pipeline@v0.9.0 · 5556 in / 1339 out tokens · 37084 ms · 2026-05-15T16:38:35.751149+00:00 · methodology

