pith. machine review for the scientific record.

arxiv: 2602.09580 · v3 · submitted 2026-02-10 · 💻 cs.RO · cs.LG

Recognition: no theorem link

SERNF: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 03:28 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords: policy · action · dexterous · fine-tuning · multimodal · normalizing · policies · real-world

The pith

SERNF achieves sample-efficient real-world fine-tuning of multimodal dexterous policies by pairing exact-likelihood normalizing flow policies with action-chunked value critics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Teaching robots to perform dexterous tasks in the real world is hard because each attempt costs time and the robot must choose from many possible hand movements. Standard approaches either cannot calculate how likely a chosen movement is when there are multiple good options, or they evaluate movements one step at a time even when the robot executes several steps together. SERNF solves this by using normalizing flows, a type of model that can exactly compute the probability of any full sequence of actions, and by training a critic that scores whole chunks of actions together. This alignment lets the system update the policy conservatively and assign credit correctly over longer tasks. The authors test the method on two real-robot tasks that require careful control.
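To make the first ingredient concrete, below is a minimal sketch of a conditional normalizing-flow policy over a flattened action chunk, in the RealNVP coupling style the paper builds on (Real NVP is item [13] in the reference graph below). The class names, layer sizes, and four-block depth are illustrative assumptions, not the authors' architecture; the point is only that log π(a_chunk | obs) comes out exactly, via the change-of-variables formula.

```python
# Minimal sketch, not the authors' code: a conditional RealNVP-style flow over a
# flattened action chunk in R^(H*A), conditioned on an observation embedding.
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling: transform half the dims, conditioned on the other half + obs."""
    def __init__(self, dim, obs_dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x, obs):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, obs], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                 # bounded scales for numerical stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)           # exact log|det J| of this layer
        return torch.cat([x1, y2], dim=-1), log_det

class ChunkFlowPolicy(nn.Module):
    """pi(a_chunk | obs) with an exact, tractable log-likelihood."""
    def __init__(self, chunk_dim, obs_dim, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            ConditionalCoupling(chunk_dim, obs_dim) for _ in range(n_blocks))
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, a_chunk, obs):
        z, log_det = a_chunk, 0.0
        for blk in self.blocks:
            z = torch.flip(z, dims=[-1])  # alternate which half gets transformed
            z, ld = blk(z, obs)
            log_det = log_det + ld
        # change of variables: log pi(a|s) = log p(z) + sum log|det J|
        return self.base.log_prob(z).sum(dim=-1) + log_det
```

Sampling would run the coupling blocks in reverse; the exact log_prob is what makes the conservative, likelihood-regularized updates described below possible at all.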

Core claim

To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware.

Load-bearing premise

That normalizing flows can be trained to produce stable, exact likelihoods for multimodal action chunks under real-world noise and limited samples, and that the action-chunked critic will produce value estimates that align with the policy's temporal execution without introducing bias.
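A hedged sketch of the second half of that premise: a critic that scores a whole H-step chunk and bootstraps H steps ahead, so value estimation matches how the policy actually executes. Shapes, the discount γ, and the helper name chunk_td_target are illustrative assumptions; the abstract only specifies that value estimation aligns with chunked execution.

```python
# Sketch of an action-chunked critic as the premise describes it (assumed form).
import torch
import torch.nn as nn

class ChunkCritic(nn.Module):
    """Q(s, a_1..a_H): the critic sees the state and the entire action chunk."""
    def __init__(self, obs_dim, chunk_dim, hidden=256):
        super().__init__()
        self.q = nn.Sequential(
            nn.Linear(obs_dim + chunk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, a_chunk):
        return self.q(torch.cat([obs, a_chunk], dim=-1)).squeeze(-1)

def chunk_td_target(rewards, next_obs, next_chunk, target_critic, gamma=0.99):
    """rewards: (B, H) per-step rewards collected while the chunk executed.
    The bootstrap jumps gamma**H ahead, matching the policy's temporal structure."""
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)
    returns = (rewards * discounts).sum(dim=1)
    with torch.no_grad():
        boot = target_critic(next_obs, next_chunk)
    return returns + (gamma ** H) * boot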

Figures

Figures reproduced from arXiv: 2602.09580 by Chenyu Yang, Davide Liconti, Denis Tarasov, Hehui Zheng, Robert K. Katzschmann.

Figure 1
Figure 1. Overview of SERNF (Sample-Efficient Reinforcement Learning with Normalizing Flows). SERNF models action chunks using a conditional normalizing flow policy, enabling expressive multimodal action distributions with exact likelihoods and therefore allowing direct, conservative off-policy fine-tuning. An action-chunked critic evaluates entire action sequences, aligning value estimation with chunked execution a… view at source ↗
Figure 2
Figure 2. Actor–critic architecture of SERNF. The actor is a conditional … view at source ↗
Figure 3
Figure 3. Real-world experimental setup. Left: scissors retrieval and tape … view at source ↗
Figure 4
Figure 4. Performance evolution on the scissors task with respect to the number … view at source ↗
Figure 5
Figure 5. Performance evolution of the cube rotation task with regards to real … view at source ↗
Figure 6
Figure 6. Qualitative rollouts of SERNF on real hardware. Top: scissors retrieval and tape cutting task, showing grasp acquisition, lifting, and successful cutting. Bottom: in-hand cube rotation task, … view at source ↗
Figure 7
Figure 7. Observation and real-time action-chunking structure for both tasks. At … view at source ↗
Figure 8
Figure 8. Test configurations for the scissor retrieval and tape cutting task. The … view at source ↗
Figure 9
Figure 9. Examples of synthetic training samples. Images are rendered during parallelized IsaacLab training with randomized camera poses, … view at source ↗
Figure 10
Figure 10. Effect of action chunk length H on imitation learning performance across RoboMimic Lift, Can, and Square tasks with 4 random seeds … view at source ↗
Figure 11
Figure 11. Effect of normalizing-flow depth (number of coupling blocks) on … view at source ↗
Figure 12
Figure 12. Effect of λ on offline RL performance across RoboMimic tasks with 4 random seeds. view at source ↗
read the original abstract

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SERNF, a sample-efficient off-policy fine-tuning framework with normalizing flows (NF) to address these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SERNF on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SERNF achieves stable, sample-efficient adaptation where standard methods struggle.
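One plausible reading of "conservative, stable policy updates through likelihood regularization", sketched under the assumption of a TD3+BC-style objective (the reference graph cites that line of work, items [17] and [49]); the paper's exact objective may differ. The sample method and the λ weight are assumptions, and the λ ablated in Figure 12 may or may not be this coefficient.

```python
# Hedged sketch of a likelihood-regularized actor update (assumed form, not the
# paper's verbatim objective). Uses the ChunkFlowPolicy / ChunkCritic sketched
# above; `sample` is an assumed method that runs the flow in reverse.
import torch

def actor_loss(policy, critic, obs, dataset_chunks, lam=1.0):
    # Push chunks sampled from the flow toward high critic value...
    sampled = policy.sample(obs)
    q_term = critic(obs, sampled).mean()
    # ...while anchoring to the data via the flow's *exact* log-likelihood,
    # the step that is intractable for diffusion policies.
    bc_term = policy.log_prob(dataset_chunks, obs).mean()
    return -(q_term + lam * bc_term)
```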

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level framework description.

axioms (2)
  • domain assumption: Normalizing flows yield exact likelihoods for multimodal action distributions in robotic control settings.
    Central to allowing conservative likelihood-based updates; a toy illustration of the failure this avoids follows below.
  • domain assumption: Action-chunked critics provide better credit assignment than per-step critics for chunked execution.
    Assumed to align value estimation with the policy's temporal structure.
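Where the first axiom bites is easiest to see in a toy case. The sketch below is illustrative and not from the paper: a unimodal Gaussian fit to two equally good action modes puts its mean on an action neither mode contains, which is the collapse the abstract attributes to conventional Gaussian policies.

```python
# Toy illustration (not from the paper): mean-fitting a unimodal Gaussian
# to a bimodal action set lands between the modes.
import torch

# two equally good one-dimensional actions, e.g. approach left or approach right
actions = torch.tensor([[-1.0], [1.0]])

mu = actions.mean(dim=0)   # maximum-likelihood Gaussian mean
print(mu)                  # tensor([0.]) -- an action neither mode ever took
# A flow policy can instead put mass on both modes and still report an exact
# log-likelihood for each, which is what the first axiom asserts.
```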

pith-pipeline@v0.9.0 · 5560 in / 1273 out tokens · 156967 ms · 2026-05-16T03:28:59.251908+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 16 internal anchors

  1. [1]

    Let offline RL flow: Training conservative agents in the latent space of normalizing flows

    Dmitriy Akimov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov, and Sergey Kolesnikov. Let offline RL flow: Training conservative agents in the latent space of normalizing flows. arXiv preprint arXiv:2211.11096, 2022.

  2. [2]

    PolicyFlow: Policy optimization with continuous normalizing flow in reinforcement learning

    Anonymous. PolicyFlow: Policy optimization with continuous normalizing flow in reinforcement learning.

  3. [3]

    Efficient online reinforcement learning with offline data

    Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.

  4. [4]

    Improving TD3-BC: Relaxed policy constraint for offline learning and stable online fine-tuning

    Alex Beeson and Giovanni Montana. Improving TD3-BC: Relaxed policy constraint for offline learning and stable online fine-tuning. arXiv preprint arXiv:2211.11802, 2022.

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang ...

  6. [6]

    Real-Time Execution of Action Chunking Flow Policies

    Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025.

  7. [7]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

  8. [8]

    Training-time action conditioning for efficient real-time chunking

    Kevin Black, Allen Z Ren, Michael Equi, and Sergey Levine. Training-time action conditioning for efficient real-time chunking. arXiv preprint arXiv:2512.05964, 2025.

  9. [9]

    Maximum entropy reinforcement learning via energy-based normalizing flow

    Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, and Chun-Yi Lee. Maximum entropy reinforcement learning via energy-based normalizing flow. Advances in Neural Information Processing Systems, 37:56136–56165, 2024.

  10. [11]

    ORCA: An open-source, reliable, cost-effective, anthropomorphic robotic hand for uninterrupted dexterous task learning

    Clemens C. Christoph, Maximilian Eberlein, Filippos Katsimalis, Arturo Roberti, Aristotelis Sympetheros, Michel R. Vogt, Davide Liconti, Chenyu Yang, Barnabas Gavin Cangan, Ronan J. Hinchet, and Robert K. Katzschmann. ORCA: An open-source, reliable, cost-effective, anthropomorphic robotic hand for uninterrupted dexterous task learning, 2025. URL htt...

  11. [12]

    The ingredients for robotic diffusion transformers

    Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, and Sergey Levine. The ingredients for robotic diffusion transformers. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15617–15625. IEEE, 2025.

  12. [13]

    Density estimation using Real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

  13. [14]

    Hyperparameters in reinforcement learning and how to tune them

    Theresa Eimer, Marius Lindauer, and Roberta Raileanu. Hyperparameters in reinforcement learning and how to tune them. In International Conference on Machine Learning, pages 9104–9149. PMLR, 2023.

  14. [15]

    Stop regressing: Training value functions via classification for scalable deep RL

    Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024.

  15. [16]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

  16. [17]

    A minimalist approach to offline reinforcement learning

    Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.

  17. [18]

    Normalizing flows are capable models for RL

    Raj Ghugare and Benjamin Eysenbach. Normalizing flows are capable models for RL. CoRR, abs/2505.23527, 2025.

  18. [20]

    Dextreme: Transfer of agile in-hand manipulation from simulation to reality

    Ankur Handa, Arthur Allshire, Viktor Makoviychuk, Aleksei Petrenko, Ritvik Singh, Jingzhou Liu, Denys Makoviichuk, Karl Van Wyk, Alexander Zhurkevich, Balakumar Sundaralingam, et al. DeXtreme: Transfer of agile in-hand manipulation from simulation to reality. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5977–5984. IEEE, 2023.

  19. [21]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

  20. [22]

    Imitation bootstrapped reinforcement learning

    Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation bootstrapped reinforcement learning. arXiv preprint arXiv:2311.02198, 2023.

  21. [23]

    Improving regression performance with distributional losses

    Ehsan Imani and Martha White. Improving regression performance with distributional losses. In International Conference on Machine Learning, pages 2157–

  22. [24]

    Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π∗0.6: A VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025.

  23. [25]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pert...

  24. [26]

    Learning stable normalizing-flow control for robotic manipulation

    Shahbaz Abdul Khader, Hang Yin, Pietro Falco, and Danica Kragic. Learning stable normalizing-flow control for robotic manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1644–1650. IEEE, 2021.

  25. [27]

    Jet: A modern transformer-based normalizing flow

    Alexander Kolesnikov, André Susano Pinto, and Michael Tschannen. Jet: A modern transformer-based normalizing flow. arXiv preprint arXiv:2412.15129, 2024.

  26. [28]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  27. [30]

    Normalizing flows are capable visuomotor policy learning models

    Simon Kristoffersson Lind, Jialong Li, Maj Stenmark, and Volker Krüger. Normalizing flows are capable visuomotor policy learning models. CoRR, abs/2509.21073, 2025.

  28. [32]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  29. [33]

    SERL: A software suite for sample-efficient robotic reinforcement learning

    Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. SERL: A software suite for sample-efficient robotic reinforcement learning. In IEEE International Conference on Robotics and Automation, ICRA 2024, Yokohama, Japan, May 13-17, 2024, pages 16961–16969. IEEE, 2024. ...

  30. [34]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.

  31. [35]

    Leveraging exploration in off-policy algorithms via normalizing flows

    Bogdan Mazoure, Thang Doan, Audrey Durand, Joelle Pineau, and R Devon Hjelm. Leveraging exploration in off-policy algorithms via normalizing flows. In Conference on Robot Learning, pages 430–444. PMLR, 2020.

  32. [37]

    Learning robust perceptive locomotion for quadrupedal robots in the wild

    Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.

  33. [38]

    Mayank Mittal, Pascal Roth, James Tigue, et al.

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie...

  34. [39]

    URL https://arxiv.org/abs/2511.04831

  35. [40]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

  36. [41]

    Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning

    Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning. Advances in Neural Information Processing Systems, 36:62244–62269, 2023.

  37. [42]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  38. [43]

    OGBench: Benchmarking offline goal-conditioned RL

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. arXiv preprint arXiv:2410.20092, 2024.

  39. [44]

    Diffusion Policy Policy Optimization

    Allen Z. Ren, Justin Lidard, Anthony Simeonov, Lars Lien Ankile, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/for...

  40. [45]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.

  41. [46]

    RSL-RL: A learning library for robotics research

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. RSL-RL: A learning library for robotics research. arXiv preprint arXiv:2509.10771, 2025.

  42. [47]

    Robotic telekinesis: Learning a robotic hand imitator by watching humans on YouTube

    Aravind Sivakumar, Kenneth Shaw, and Deepak Pathak. Robotic telekinesis: Learning a robotic hand imitator by watching humans on YouTube. arXiv preprint arXiv:2202.10448, 2022.

  43. [48]

    A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning

    Laura Smith, Ilya Kostrikov, and Sergey Levine. A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning. arXiv preprint arXiv:2208.07860, 2022.

  44. [49]

    Revisiting the minimalist approach to offline reinforcement learning

    Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36:11592–11620, 2023.

  45. [50]

    CORL: Research-oriented deep offline reinforcement learning library

    Denis Tarasov, Alexander Nikulin, Dmitry Akimov, Vladislav Kurenkov, and Sergey Kolesnikov. CORL: Research-oriented deep offline reinforcement learning library. Advances in Neural Information Processing Systems, 36:30997–31020, 2023.

  46. [51]

    Is value functions estimation with classification plug-and-play for offline reinforcement learning?

    Denis Tarasov, Kirill Brilliantov, and Dmitrii Kharlapenko. Is value functions estimation with classification plug-and-play for offline reinforcement learning? arXiv preprint arXiv:2406.06309, 2024.

  47. [52]

    The role of deep learning regularizations on actors in offline RL

    Denis Tarasov, Anja Surina, and Caglar Gulcehre. The role of deep learning regularizations on actors in offline RL. arXiv preprint arXiv:2409.07606, 2024.

  48. [53]

    NINA: Normalizing flows in action

    Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, and Vladislav Kurenkov. NINA: Normalizing flows in action. Training VLA models with normalizing flows. arXiv preprint arXiv:2508.16845, 2025.

  49. [54]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3D diffusion policy. CoRR, abs/2403.03954, 2024. doi: 10.48550/ARXIV.2403.03954. URL https://doi.org/10.48550/arXiv.2403.03954.

  50. [55]

    ReinFlow: Fine-tuning flow matching policy with online reinforcement learning

    Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. ReinFlow: Fine-tuning flow matching policy with online reinforcement learning. arXiv preprint arXiv:2505.22094, 2025.

  51. [57]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

  52. [58]

    Places: A 10 million image database for scene recognition

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

  53. [59]

    Experimental Setup Details: We mount the Orca Hand

  54. [60]

    on a Franka Emika Panda robot using a custom 3D-printed mount with a 60-degree tilt. Two OAK-1 Lite cameras are attached to the mount: one positioned underneath the hand to enable accurate finger placement, and the other facing left to provide visual guidance during scissor manipulation. The third camera is an OAK-D Lite that provides a front view. We us...

  55. [61]

    Human finger postures are retargeted to the robotic hand joint angles using an energy-based retargeting method [44]

    Teleoperation and Data Collection: Expert demonstrations are collected using Rokoko Smart Gloves in combination with the Rokoko Coil Pro, which together provide finger motion capture and wrist pose estimation in 3D space. Human finger postures are retargeted to the robotic hand joint angles using an energy-based retargeting method [44]. All demonstration...

  56. [62]

    7, the policy uses 1 step of observation and 3 steps of prefix actions following the observation and predicts a subsequent chunk of 10 steps of actions

    Action Chunking and Real-Time Inference Implementation: As shown in Fig. 7, the policy uses 1 step of observation and 3 steps of prefix actions following the observation and predicts a subsequent chunk of 10 steps of actions. The observations contain the RGB images, the joint positions of the hand, and the end effector pose of Franka. The actions are rep...

  57. [63]

    Specifically, we define 2 distinct test positions for the scissors and 5 distinct test positions for the tape holder

    Test Configurations: We evaluate all policies under a fixed set of predefined test configurations. Specifically, we define 2 distinct test positions for the scissors and 5 distinct test positions for the tape holder. A total of 10 combinations can be seen in Fig. 8. For each evaluation episode, a test configuration is sampled from the predefined set. The l...

  58. [64]

    Both training and inference run on a desktop with an NVIDIA RTX 4090 GPU

    Experimental Setup Details: We mount the Orca Hand [11] horizontally with an OAK-D Lite camera providing the view from the bottom. Both training and inference run on a desktop with an NVIDIA RTX 4090 GPU.

  59. [65]

    To estimate the cube pose, which is provided as input to the policy, we follow the same approach as in [19]

    Cube Pose Estimation: We use an OAK-D Lite camera mounted approximately 30 cm below the hand holding the cube. To estimate the cube pose, which is provided as input to the policy, we follow the same approach as in [19]. ...

  60. [66]

    Action Chunking and Real-Time Inference Implementation: As shown in Fig. 7, at each decision step, the policy receives as input a history of four timesteps of joint positions, cube position, and cube orientation (represented as a quaternion), as well as one timestep of the previous action and the previous joint position command. The goal command is als...

  61. [67]

    a) Simulation setup: Each environment instance contains an Orca hand and a rigid cube object placed above a small kinematic platform

    Teacher Policy Training: We train the teacher policy fully in simulation using IsaacLab, in an environment that models single-axis spinning of a 45 mm cube with the Orca hand. a) Simulation setup: Each environment instance contains an Orca hand and a rigid cube object placed above a small kinematic platform. The platform provides support for the first 6 ...

  62. [68]

    We follow the principle as in [35], with adaptation to chunked actions

    Policy Distillation Procedure: We distill a PPO-trained teacher policy into the SERNF model using IsaacLab. We follow the principle as in [35], with adaptation to chunked actions. During distillation, we applied strong observation noise to the student to improve the robustness. This noise consists of additive Gaussian noise and a random offset that is rese...

  63. [69]

    Each encoder follows the standard torchvision ResNet-18 up to the last convolutional stage (no global average pooling and no FC classifier)

    Network Architectures: Visual Encoders • ResNet-18 (Robomimic). Two separate ResNet-18 backbones are used for img0 and img1. Each encoder follows the standard torchvision ResNet-18 up to the last convolutional stage (no global average pooling and no FC classifier). Activation: ReLU. Normalization: BatchNorm2d. Image normalization uses ImageNet mean/std ins...

  64. [70]

    Additional hyperparameter values: We apply dropout for π_θ regularization, using a rate of 0.5 during policy initialization (reduced to 0.2 for real-world experiments to accelerate convergence under limited compute). During reinforcement learning, the dropout rate is reduced to 0.1, as higher values were found to degrade offline RL performance, while a s...

  65. [71]

    To ensure a fair comparison, the dataset, optimizer settings, data augmentation strategies, and image encoders are kept identical across all methods

    Baseline Implementation Details: We compare SERNF against strong baseline methods to evaluate the expressiveness and accuracy of our approach in the imitation learning setting. To ensure a fair comparison, the dataset, optimizer settings, data augmentation strategies, and image encoders are kept identical across all methods. a) Action Chunking Transformer:...

  66. [72]

    The ablations in this subsection are performed on simulated RoboMimic environments (Lift, Can, and Square)

    Ablation Studies: We conduct ablation studies to analyze the sensitivity of SERNF to key architectural and algorithmic design choices. The ablations in this subsection are performed on simulated RoboMimic environments (Lift, Can, and Square). a) Effect of action chunk length: Fig. 10 shows the effect of varying the action chunk length H ∈ {6, 8, 10, 12, 14}. A...

  67. [73]

    Achieving a stable grasp is particularly challenging due to the absence of tactile sensing and frequent occlusion of the index finger by the thumb in wrist-mounted camera views

    Failure Mode Analysis: The primary failure modes observed in the scissors task include grasp failure, scissor dropping, and task timeout. Achieving a stable grasp is particularly challenging due to the absence of tactile sensing and frequent occlusion of the index finger by the thumb in wrist-mounted camera views. In both teleoperation and autonomous ex...

  68. [74]

    Simulation experiments were carried out on NVIDIA TITAN RTX GPUs. GPUs were utilized at full capacity to accelerate training and inference; however, all stages can be executed on less powerful hardware by reducing batch sizes or the number of sampled candidate actions during policy evaluation. For the scissors task, imitation learning pretraining required...