Behavior Uncloning: Distilling Mode Redirection into Policy Weights without Inference-Time Steering

Bangya Liu; Dayou Li; Hao Wang; Jiuzhou Lei; Manling Li; Minghui Zheng; Ruohan Zhang; Zhiwen Fan

arxiv: 2606.29201 · v1 · pith:EBSZ3Y4Qnew · submitted 2026-06-28 · 💻 cs.RO · cs.AI

Hao Wang, Jiuzhou Lei, Dayou Li, Bangya Liu, Minghui Zheng, Manling Li, Ruohan Zhang, Zhiwen Fan

Pith reviewed 2026-06-30 07:46 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords: behavior cloning · mode redirection · policy distillation · robot learning · inference efficiency · mode suppression · safety in robotics

The pith

MoRE distills redirection signals from a temporary mode classifier into policy weights so the standalone policy suppresses undesired modes without inference-time steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Behavior-cloned policies frequently acquire multiple modes from demonstration data, some of which are unsafe or unwanted at deployment. MoRE performs a short uncloning step that distills the redirection signal produced by a temporary mode classifier directly into the policy weights. A retain loss is applied at the same time to keep the policy competent on the desired modes. The edited policy then steers its own rollouts toward safe behavior with zero added inference cost. On eight simulated and real-world tasks the method raises average deployment success rate by 44 percentage points over the original mixed-mode policy and matches or exceeds other adaptation baselines while preserving speed.
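To make the interplay of the two terms concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a categorical "policy" over three discrete behaviors, a hypothetical classifier score `p_undesired` per behavior, a KL-based retain term, and finite-difference gradient descent; every name and number below is illustrative. The actual method operates on Diffusion Policy and Pi0.5 weights, which this toy does not attempt to model.

```python
import math

def softmax(logits):
    """Convert logits into a categorical distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def redirect_loss(probs, p_undesired):
    """Expected classifier score: how much mass lands on undesired behavior."""
    return sum(p * c for p, c in zip(probs, p_undesired))

def retain_loss(probs, ref_probs, desired):
    """KL(reference || current), both renormalized over the desired behaviors."""
    ref_mass = sum(ref_probs[i] for i in desired)
    cur_mass = sum(probs[i] for i in desired)
    kl = 0.0
    for i in desired:
        r = ref_probs[i] / ref_mass
        q = probs[i] / cur_mass
        if r > 0.0:
            kl += r * math.log(r / q)
    return kl

def total_loss(logits, ref_probs, p_undesired, desired, lam):
    """Redirect term plus lam times the retain term (assumed form)."""
    probs = softmax(logits)
    return (redirect_loss(probs, p_undesired)
            + lam * retain_loss(probs, ref_probs, desired))

def unclone(logits, p_undesired, desired, lam=2.0, lr=0.5, steps=500, eps=1e-5):
    """Edit the logits by gradient descent; the classifier is not needed afterwards."""
    ref_probs = softmax(logits)  # frozen copy of the original mixed-mode policy
    theta = list(logits)
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up = list(theta)
            up[i] += eps
            down = list(theta)
            down[i] -= eps
            # Central finite difference keeps the toy dependency-free.
            diff = (total_loss(up, ref_probs, p_undesired, desired, lam)
                    - total_loss(down, ref_probs, p_undesired, desired, lam))
            grad.append(diff / (2 * eps))
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta
```

In this toy, `unclone([0.0, 0.0, 0.0], [0.05, 0.10, 0.95], desired=[0, 1])` drives the probability of the third behavior toward zero while keeping the split between the first two close to the original. With `lam=0.0` the split drifts toward whichever desired behavior the classifier scores lowest, which is the kind of side effect a retain term is meant to limit.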

Core claim

MoRE redirects policy rollouts toward desired behavior modes through a short uncloning step that distills the redirection signal from a temporary mode classifier into the policy weights. A retain loss balances this edit by preserving desired-mode competence, allowing the standalone policy to suppress unwanted modes with zero inference-time overhead. Across eight tasks it improves average deployment success rate by 44 percentage points, achieves the strongest success rate among compared baselines, and approaches the filtered-data retraining reference while preserving competence and inference speed. The approach generalizes across Diffusion Policy, Pi0.5 VLA, diverse tasks, and real-world settings.

What carries the argument

MoRE (Mode Redirection), which distills the redirection signal from a temporary mode classifier into policy weights while using a retain loss to preserve desired-mode competence.
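Schematically, this combination can be read as a two-term objective. The notation is an assumption introduced here for illustration and is not taken from the paper: \theta denotes the policy weights being edited, \theta_0 the original mixed-mode weights, c_\phi the temporary mode classifier, and \lambda a retain-loss coefficient.

```latex
\[
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{redirect}}(\theta;\, c_\phi)
  + \lambda\, \mathcal{L}_{\text{retain}}(\theta;\, \theta_0),
\qquad \lambda \ge 0 .
\]
```

Under this reading, \lambda = 0 leaves only the redirection term, while a very large \lambda pins the policy to its original behavior. The abstract does not give the functional form of either term or say how the balance is chosen.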

If this is right

The edited policy achieves the strongest success rate among all compared adaptation and steering baselines.
Performance approaches that of a policy retrained on filtered demonstrations.
Task competence and inference speed remain unchanged after the edit.
The method works across Diffusion Policy and Pi0.5 VLA backbones.
Gains hold on both simulated and real-world deployments across diverse task categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce reliance on expensive data curation or filtering steps before training.
Policies trained on broad, multi-mode datasets might become deployable in safety-critical settings with only a short post-training edit.
Similar distillation of mode signals might apply to other multimodal policies outside robotics, such as language or vision models.

Load-bearing premise

The temporary mode classifier produces an accurate redirection signal that the distillation step can embed into the policy without degrading performance on desired modes.

What would settle it

An experiment in which the MoRE-edited policy exhibits lower success on desired modes than the original mixed policy or requires additional inference-time steering to reach the reported success rates.
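One way such a check might be scored is a one-sided two-proportion z-test on desired-mode success before and after the edit. This is a sketch under stated assumptions: the function name, the returned fields, and the 5% one-sided threshold (`z_crit=1.645`) are choices made here, and the counts in the usage note are made up, not results from the paper.

```python
import math

def desired_mode_drop(successes_before, trials_before,
                      successes_after, trials_after, z_crit=1.645):
    """Test whether desired-mode success fell after the edit (one-sided)."""
    p_before = successes_before / trials_before
    p_after = successes_after / trials_after
    # Pooled success rate under the null hypothesis of no change.
    pooled = (successes_before + successes_after) / (trials_before + trials_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_before + 1 / trials_after))
    z = 0.0 if se == 0 else (p_before - p_after) / se
    return {
        "drop_pp": 100 * (p_before - p_after),  # drop in percentage points
        "z": z,
        "regressed": z > z_crit,
    }
```

For example, with hypothetical counts of 45/50 successes before and 30/50 after, the function flags a regression; with 45/50 before and 44/50 after, it does not.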

Original abstract

Behavior-cloned policies often learn multiple behavior modes from demonstration datasets, including modes that are unsafe or otherwise undesired at deployment. For example, a policy trained on diverse handover demonstrations may learn to pass a knife blade-first. Standard remedies such as data curation and inference-time steering either require access to the original demonstrations for full retraining or add substantial inference-time overhead. To address this gap, we propose MoRE (Mode Redirection), which redirects policy rollouts toward desired behavior modes through a short "uncloning" step. Specifically, MoRE distills the redirection signal from a temporary mode classifier into the policy weights to steer behavior. A retain loss balances this edit by preserving desired-mode competence, allowing the standalone policy to suppress unwanted modes with zero inference-time overhead. Across eight simulated and real-world tasks, MoRE improves the average deployment success rate (SR) by 44 percentage points over the original mixed-mode policy. Among all compared adaptation and steering baselines, MoRE achieves the strongest SR and approaches the filtered-data retraining reference, while preserving task competence and inference speed. MoRE also generalizes across robot policy backbones, including Diffusion Policy and the Pi0.5 VLA, diverse task categories, and real-world deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MoRE distills a temporary classifier's redirection signal into policy weights via a short uncloning step plus retain loss, delivering claimed 44pp SR gains with no inference overhead, but the balance of that retain loss is the unproven hinge.

MoRE takes a mixed-mode imitation policy and runs a brief edit that pulls in redirection from a mode classifier, then uses a retain loss to keep the good modes intact so the final weights run standalone. That is the actual move: turning an inference-time steering trick into a one-time weight change.

The approach is practical for the robotics setting where you have a behavior-cloned policy but cannot afford runtime classifiers or full retraining from curated data. Reporting results on eight tasks that mix simulation and real hardware, plus two different backbones (Diffusion Policy and Pi0.5), gives the claim some breadth. Approaching the performance of filtered-data retraining while keeping inference speed is the useful part if the numbers hold.

The soft spot is exactly the retain loss. The abstract states it preserves desired-mode competence, yet supplies no pre/post numbers on good-mode success, no coefficient ablations, and no check on whether unwanted modes still leak in some tasks. If that term is under-weighted or interacts badly with the distillation objective, either the bad modes return or the good-mode performance drops; either breaks the zero-overhead guarantee. The 44pp average lift is stated without trial counts, variance, or statistical tests, so it is hard to judge how stable the result is.

This paper is for groups shipping imitation policies on real robots who already have a working but multi-mode policy and want a lightweight fix. It deserves a serious referee because the problem is concrete and the method is simple enough that the experiments can be checked directly.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes MoRE (Mode Redirection), which performs a short 'uncloning' step to distill redirection signals from a temporary mode classifier into the weights of a behavior-cloned policy. A retain loss is used to balance the edit and preserve competence on desired modes, yielding a standalone policy that suppresses undesired modes at deployment with zero inference-time overhead. Across eight simulated and real-world tasks the method is reported to raise average success rate by 44 percentage points relative to the original mixed-mode policy, approaching the performance of filtered-data retraining while preserving task competence and inference speed; results are shown for Diffusion Policy and Pi0.5 VLA backbones.

Significance. If the quantitative claims are supported by properly controlled experiments, the work would provide a practical, low-overhead route to post-training behavior editing in robot policies. The zero-inference-cost guarantee distinguishes it from steering baselines and could be valuable wherever runtime compute or latency is constrained.

major comments (1)

[Abstract] The headline 44 pp SR gain and the zero-overhead claim both rest on the assertion that the retain loss successfully preserves desired-mode competence after the redirection distillation. No pre-/post-uncloning success rates on desired modes, no ablation on the retain-loss coefficient, and no statistical details (number of trials, variance, significance tests) are supplied, so it is impossible to verify that the reported improvement is not accompanied by degradation on the modes the policy is supposed to retain.
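As an illustration of the kind of statistical detail requested, a per-task success rate can be reported with a Wilson score interval computed from the trial count. The sketch below is a generic implementation of that standard interval; the function name and the default 95% level (`z=1.96`) are choices made here, and the counts in the usage note are hypothetical, not the paper's.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point spill at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)
```

For a hypothetical 44 successes in 50 trials, the interval spans roughly 0.76 to 0.94. That width shows why a success rate stated without a trial count is hard to interpret.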

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The headline 44 pp SR gain and the zero-overhead claim both rest on the assertion that the retain loss successfully preserves desired-mode competence after the redirection distillation. No pre-/post-uncloning success rates on desired modes, no ablation on the retain-loss coefficient, and no statistical details (number of trials, variance, significance tests) are supplied, so it is impossible to verify that the reported improvement is not accompanied by degradation on the modes the policy is supposed to retain.

Authors: We agree that the abstract would be strengthened by explicitly reporting pre-/post-uncloning success rates on desired modes, an ablation of the retain-loss coefficient, and basic statistical details. In the revised version we will add a concise sentence to the abstract summarizing that desired-mode success rates are preserved (with a pointer to the relevant table), note the outcome of the retain-loss ablation, and state the number of evaluation trials together with observed variance. These changes directly address the verifiability concern without altering the core claims.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with independent validation

The paper describes an empirical adaptation technique (MoRE) that applies a short uncloning step to distill redirection from a temporary classifier into policy weights, balanced by a retain loss. All reported outcomes (44pp SR gain, comparison to baselines and filtered retraining) are presented as experimental results on eight tasks rather than as a first-principles derivation or fitted prediction that reduces to its own inputs by construction. No equations, self-citations, or uniqueness theorems are invoked in the provided text to force the central claim; the retain loss is a tunable design element whose balance is asserted to be validated by the deployment metrics, not presupposed. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

