WARP-RM: A Warp-Augmented Relative Progress Reward Model for Data Curation

Andrew Goldberg; Ethan Ransing; Fred Shentu; Justin Yu; Karim El-Refai; Kavish Kondap; Ken Goldberg; Mac Schwager; Philipp Wu; Qianzhong Chen

arxiv: 2606.28320 · v1 · pith:UKI32EAQnew · submitted 2026-06-26 · 💻 cs.RO

Pith reviewed 2026-06-29 03:52 UTC · model grok-4.3

classification 💻 cs.RO

keywords: imitation learning, behavior cloning, reward model, data curation, robot manipulation, self-supervised learning, T-shirt folding, relative progress

The pith

A self-supervised reward model using time-warp augmentations on demonstrations lets behavior cloning maintain 19/20 success on T-shirt folding even as training data grows more inefficient.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WARP-RM to learn dense frame-level progress signals without human annotations by applying time-warp augmentations to successful demonstrations. These signals are aggregated into chunk-level advantage estimates that upweight high-quality action segments during behavior cloning. On a physical bimanual robot folding T-shirts from crumpled starts, the resulting WARP-BC policy sustains a 19/20 success rate as the training set is widened to include more inefficient episodes, while vanilla behavior cloning falls to 2/20. WARP-BC also improves task throughput by up to 18 times.

Core claim

WARP generates per-frame progress targets via time-warp augmentations of demonstrations (variable playback speeds and reversals) and trains WARP-RM to predict the normalized elapsed time between input frames; aggregating these predictions across overlapping windows produces a dense signed progress signal that is then used to compute chunk-level advantage for upweighting actions in behavior cloning.
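The time-warp step in the claim above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_warped_window`, the speed range, the log-uniform speed distribution, the per-window (rather than per-step) reversal, and the normalization constant are all assumptions. Only the 32-frame window and the idea of using the offset from the starting frame as the target come from the Figure 2 caption.

```python
import math
import random


def sample_warped_window(num_frames, window=32, min_speed=0.25, max_speed=4.0,
                         p_reverse=0.5, rng=None):
    """Sample one warped window from an episode with `num_frames` frames.

    Returns (indices, targets): source-frame indices to load, and the signed
    offset of each frame from the first one, scaled into roughly [-1, 1].
    All default values are illustrative assumptions.
    """
    rng = rng or random.Random()
    # Assumption: one playback direction per window; -1 means reversed playback.
    direction = -1.0 if rng.random() < p_reverse else 1.0
    log_lo, log_hi = math.log(min_speed), math.log(max_speed)
    # Accumulate per-step playback speeds into offsets from the starting frame.
    offsets = [0.0]
    for _ in range(window - 1):
        speed = math.exp(rng.uniform(log_lo, log_hi))  # slow-motion to fast-forward
        offsets.append(offsets[-1] + direction * speed)
    # Pick a start position that keeps every sampled frame inside the episode.
    start_min = -min(offsets)
    start_max = (num_frames - 1) - max(offsets)
    if start_max < start_min:
        raise ValueError("episode too short for this warp")
    start = rng.uniform(start_min, start_max)
    indices = [min(max(int(round(start + o)), 0), num_frames - 1) for o in offsets]
    # Assumed normalization: divide by the largest offset the sampler can produce.
    norm = max_speed * (window - 1)
    targets = [o / norm for o in offsets]
    return indices, targets


indices, targets = sample_warped_window(num_frames=500, rng=random.Random(0))
```

Under these assumptions a reversed window yields strictly decreasing targets, which is how the model would see negative progress without any human label.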

What carries the argument

WARP (Warp-Augmented Relative Progress) algorithm that creates signed relative progress targets from time-warp augmentations to train a model predicting normalized elapsed time between frames.

Load-bearing premise

Episode length is a sufficient proxy for teleoperation sub-optimality when constructing training datasets of varying quality for the T-shirt folding task.
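The premise amounts to a simple length filter. A minimal sketch, assuming the cutoffs reported in the Figure 4 caption (60 s, 90 s, 120 s) and nested tiers; the function name `build_length_tiers` and the example lengths are hypothetical and not taken from the paper's data.

```python
def build_length_tiers(episode_lengths_s, cutoffs_s=(60.0, 90.0, 120.0)):
    """Map tier name (D1, D2, ...) to indices of episodes no longer than the cutoff.

    Tiers are nested: every episode in D1 is also in D2 and D3, so widening
    the tier only ever adds longer (assumed less efficient) episodes.
    """
    tiers = {}
    for k, cutoff in enumerate(cutoffs_s, start=1):
        tiers[f"D{k}"] = [i for i, length in enumerate(episode_lengths_s)
                          if length <= cutoff]
    return tiers


# Hypothetical episode lengths in seconds, for illustration only.
lengths = [34.0, 58.5, 75.0, 97.0, 105.0, 130.0]
tiers = build_length_tiers(lengths)
```

Nothing in this filter inspects what happens inside an episode, which is exactly why the premise is load-bearing: a slow but clean demonstration and a fast one with a fumble and recovery can land in the same tier.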

What would settle it

A controlled experiment that varies dataset quality using a different proxy such as counted hesitations or recovery motions and finds that WARP-BC success rates drop to match those of vanilla behavior cloning would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.28320 by Andrew Goldberg, Ethan Ransing, Fred Shentu, Justin Yu, Karim El-Refai, Kavish Kondap, Ken Goldberg, Mac Schwager, Philipp Wu, Qianzhong Chen.

**Figure 1.** WARP-RM signed progress measure $\hat{v}_t$ on an unseen mixed-quality teleoperated T-shirt-folding demonstration. Large negative magnitudes occur when the right gripper drops the shirt in (b), and near-zero magnitude during stagnation between (g) and (h). These values are used to filter and weight downstream policy training. Predictions on more examples in Appendix F.

**Figure 2.** Time-Warp Sampler. WARP resamples trajectories using a warped playback schedule. (1): Playback speed varies to span slow-motion to fast-forward. Playback direction is randomly inverted to expose the model to negative progress (regression). (2): Accumulating these playback speeds yields a window of 32 source frames. The relative offset of each frame from the starting frame serves as the self-supervised prog…

**Figure 3.** WARP-RM Architecture. A 32-frame demonstration window (left) is encoded by a frozen DINOv3 backbone ϕ and aggregated by a bidirectional-attention transformer that emits a distribution over 30 cumulative-progress bins at each input frame. The yellow shaded region (bottom-left) illustrates one such sliding prediction window applied to the continuous episode. Their per-frame expectations form the window’s pre…

**Figure 4.** Time-to-completion distribution for successes. Performance is evaluated across three datasets tiered by increasing demonstration sub-optimality: D1 (≤ 60 s, efficient demonstrations), D2 (≤ 90 s, moderate inefficiencies), and D3 (≤ 120 s, demonstrations with more operator hesitations and recoveries). Policy rollouts which exceed 240 seconds are considered failures and are not shown.

**Figure 5.** Per-bottle placement-time distribution for the bottle-in-bin task. Each point is the time to place a single bottle (interval between consecutive drops); gray points are vanilla BC (59 bottles placed) and blue points are WARP-BC (74 bottles placed), with the total placed out of 80 shown under each label. Black bars denote the mean (15.9 s vs. 11.3 s). WARP-BC places bottles faster and with a tighter distrib…

**Figure 6.** Episode-length distribution of D1–D3 (blue) with the SARM-annotated supplement DA overlaid (orange). The base distribution exhibits a dominant mode near 50–60 s with a broader tail beyond ∼85 s containing episodes with more hesitations, fumbles, and recoveries. Dashed vertical lines mark the three length filters used in Section 4; unioning DA with D1 and D2 yields the matched datasets D4 = D1 ∪ DA and D5 …

**Figure 7.** Episode-length distribution of the bottle-in-bin dataset. WARP-RM is trained on the shortest demonstrations (orange, ≤ 74.6 s); the dashed line marks the cutoff. As in the T-shirt setting (…

**Figure 8.** Time-to-completion distribution for success across baseline comparisons. Evaluated on D4 = D1 ∪ DA and D5 = D2 ∪ DA. Policy rollouts which exceed 240 seconds are considered failures and are not shown. SCIZOR [34] successfully folds a T-shirt right before the 240 second timeout boundary on D5.

**Figure 9.** WARP-RM output on a near-unit average progress-velocity T-shirt-folding demonstration. Predicted magnitude varies around 1.0 for most of the demonstration. (34 second demonstration)

**Figure 10.** WARP-RM output on a T-shirt-folding demonstration with fluctuating progress-velocity. (97 second demonstration)

**Figure 11.** WARP-RM output on a T-shirt-folding demonstration with fluctuating progress-velocity. (98 second demonstration)

**Figure 12.** WARP-RM output on a T-shirt-folding demonstration with fluctuating progress-velocity. (105 second demonstration)

**Figure 13.** Randomly sampled frames from the T-shirt-folding dataset (D3), demonstrating a representative sample of the visual diversity present in the training data, including varied garment colors, workspace surfaces, and arm configurations.

**Figure 14.** Randomly sampled frames from the bottle-in-bin dataset, drawn from demonstrations across distinct collection sessions. The data spans varied bin types and placements, bottle colors and counts, and workspace surfaces.

Original abstract

Scaling imitation learning requires large datasets, yet human teleoperation inevitably produces mixed-quality demonstrations containing hesitations and recoveries. Prior frame-level progress reward models supervise on absolute temporal progress proxies that suffer from label noise, or require costly human annotations to define subtask boundaries. We present WARP (Warp-Augmented Relative Progress), a novel fully self-supervised algorithm for learning dense, signed relative progress magnitudes directly from successful demonstrations. WARP generates per-frame progress targets via time-warp augmentations of demonstrations (variable playback speeds and reversals) and we train WARP-RM to predict the normalized elapsed time between input frames. Aggregating these predictions across overlapping windows yields a dense frame-level progress signal. We then introduce WARP-BC, which leverages these scalar reward estimates to upweight high-advantage action chunks during behavior cloning, where chunk-level advantage is obtained by aggregating per-frame rewards. We evaluate our approach on a physical bimanual robot system performing a long-horizon deformable object manipulation task: folding T-shirts from a random crumpled start. To evaluate policy robustness against suboptimal data, we construct training datasets of varying quality using episode length as a proxy for teleoperation sub-optimality. As the dataset is widened to admit more inefficiencies, WARP-BC maintains a 19/20 success rate compared to vanilla BC's collapse to 2/20, improving throughput by up to 18x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

WARP-RM gives a clean self-supervised route to dense signed progress signals via time-warps, but the headline robustness result rests on an unvalidated episode-length proxy for demonstration quality.

The main takeaway is that WARP generates relative progress targets by applying speed and reversal warps to successful demonstrations, trains a model to predict normalized elapsed time between frames, and then uses the resulting dense signal to weight action chunks in behavior cloning. This removes the need for subtask annotations or absolute progress labels.
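The weighting step described in this paragraph could look roughly like the sketch below. It is one plausible reading, not the paper's formula: the mean baseline, the exponential weighting (in the style of advantage-weighted regression), the clipping value, and all function names are assumptions introduced here for illustration.

```python
import math


def chunk_advantages(progress, chunk_len):
    """Sum signed per-frame progress over each action chunk, minus the mean.

    progress[t] is the predicted signed progress at frame t. The mean over
    chunks is an assumed baseline, so positive values are above average.
    """
    returns = [sum(progress[t:t + chunk_len])
               for t in range(len(progress) - chunk_len + 1)]
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]


def chunk_weights(advantages, temperature=1.0, max_weight=5.0):
    """Turn advantages into positive loss weights with mean 1 (assumed scheme)."""
    raw = [min(math.exp(a / temperature), max_weight) for a in advantages]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_bc_loss(per_chunk_losses, weights):
    """Weighted average of per-chunk behavior cloning losses."""
    return sum(w * l for w, l in zip(weights, per_chunk_losses)) / len(weights)


# Illustrative progress trace: steady progress, a stall, a regression, a recovery.
progress = [1.0, 1.0, 0.0, 0.0, -1.0, 1.0]
weights = chunk_weights(chunk_advantages(progress, chunk_len=2))
```

With this scheme, chunks that overlap the stall or the regression get weights below 1 and contribute less to the cloning loss, while no chunk is discarded outright.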

The method is new in its specific use of time-warp augmentations to produce signed per-frame targets that aggregate into chunk-level advantage estimates. The T-shirt folding experiments on a physical bimanual system show the approach maintaining high success rates as training data widens to include longer episodes, while vanilla BC drops sharply. That practical focus on messy teleop data is useful.

The soft spot is the dataset construction. Episode length is treated as a proxy for teleoperation sub-optimality such as hesitations and recoveries, yet the abstract gives no direct evidence that longer episodes actually contain more of those behaviors rather than slower execution, different initial states, or task-intrinsic variation in the crumpled T-shirt setup. If the proxy is weak, the controlled comparison does not isolate the effect of data quality. The reported 19/20 success and 18x throughput numbers are striking but cannot be assessed without the full methods, ablations, and statistical details.
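One statistical check that can be run from the headline counts alone is a one-sided Fisher exact test on 19/20 versus 2/20. The sketch below is not an analysis reported by the paper; it assumes the 20 rollouts per policy are independent trials, and the function name `fisher_exact_one_sided` is introduced here.

```python
from math import comb


def fisher_exact_one_sided(success_a, n_a, success_b, n_b):
    """One-sided Fisher exact p-value.

    Probability that group A shows at least `success_a` successes if both
    groups shared a single success rate, conditioning on the total number
    of successes (hypergeometric tail).
    """
    total = n_a + n_b
    total_success = success_a + success_b
    numerator = 0
    for k in range(success_a, min(n_a, total_success) + 1):
        # k successes and n_a - k failures assigned to group A.
        numerator += comb(total_success, k) * comb(total - total_success, n_a - k)
    return numerator / comb(total, n_a)


# Headline counts from the abstract: WARP-BC 19/20, vanilla BC 2/20.
p_value = fisher_exact_one_sided(19, 20, 2, 20)
print(f"one-sided p = {p_value:.2e}")
```

A small p-value here only rules out sampling noise in the rollouts. It says nothing about whether episode length is a valid proxy for demonstration quality, which is the concern raised above.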

The work shows clear engagement with the imitation learning and reward modeling literature and avoids obvious circularity. It is aimed at researchers scaling BC to physical deformable manipulation tasks where data quality varies. A reader working on self-supervised dense rewards or data curation would find concrete value. The paper deserves a serious referee to examine the experimental protocol and verify whether the proxy holds up.

Referee Report

2 major / 1 minor

Summary. The paper presents WARP-RM, a self-supervised algorithm that learns dense signed relative progress rewards from successful demonstrations via time-warp augmentations (variable speeds and reversals), training a model to predict normalized elapsed time between frames. These rewards are aggregated to produce chunk-level advantages for upweighting actions in behavior cloning (WARP-BC). On a physical bimanual T-shirt folding task, datasets of varying quality are constructed by widening to include longer episodes (proxy for sub-optimality); WARP-BC maintains 19/20 success while vanilla BC drops to 2/20, with up to 18x throughput gains.

Significance. If the central robustness result holds under controlled conditions, the method offers a fully self-supervised route to dense progress signals that could improve data curation and policy performance in imitation learning for long-horizon deformable manipulation without requiring subtask annotations or external labels.

major comments (2)

[Evaluation] Evaluation section: the headline robustness claim (19/20 vs 2/20 success as datasets widen) rests on episode length as a proxy for teleoperation sub-optimality (hesitations/recoveries); no validation is reported that length correlates with those behaviors rather than initial-state variation, execution speed, or task-intrinsic factors in the crumpled T-shirt setup, leaving the controlled comparison between WARP-BC and BC open to confounding.
[Methods] Methods: the aggregation of per-frame progress predictions into chunk-level advantage (used for upweighting in BC) is described at a high level; without explicit equations or pseudocode showing the windowing, normalization, and advantage computation, it is difficult to verify that the signal isolates progress magnitude independently of the time-warp training objective.

minor comments (1)

[Abstract] Abstract and introduction: the phrase 'normalized elapsed time between input frames' could be clarified with respect to the sign and range of the learned targets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of a fully self-supervised approach to dense progress signals in imitation learning. Below we respond point-by-point to the major comments.

Referee: [Evaluation] Evaluation section: the headline robustness claim (19/20 vs 2/20 success as datasets widen) rests on episode length as a proxy for teleoperation sub-optimality (hesitations/recoveries); no validation is reported that length correlates with those behaviors rather than initial-state variation, execution speed, or task-intrinsic factors in the crumpled T-shirt setup, leaving the controlled comparison between WARP-BC and BC open to confounding.

Authors: We acknowledge the concern that episode length serves as an indirect proxy and that explicit validation of its correlation with hesitations and recoveries (versus other factors) is not provided in the current manuscript. In the T-shirt folding setup, initial states are drawn from the same randomized distribution for all dataset widths, and the task geometry and physics remain fixed; thus longer episodes predominantly reflect additional recovery actions rather than changes in start configuration or intrinsic task difficulty. Nevertheless, to address the potential for confounding, we will add a supplementary analysis in the revision that includes (i) qualitative trajectory inspection showing increased hesitation segments in longer episodes and (ii) a simple correlation between episode length and the number of recovery actions manually annotated on a subset of demonstrations. This will strengthen the controlled comparison.

Revision: yes

Referee: [Methods] Methods: the aggregation of per-frame progress predictions into chunk-level advantage (used for upweighting in BC) is described at a high level; without explicit equations or pseudocode showing the windowing, normalization, and advantage computation, it is difficult to verify that the signal isolates progress magnitude independently of the time-warp training objective.

Authors: We agree that the aggregation procedure is currently described at a high level and would benefit from greater formality. In the revised manuscript we will insert explicit equations for (a) the sliding-window aggregation of per-frame normalized elapsed-time predictions, (b) the normalization step that converts raw predictions into signed relative progress, and (c) the subsequent computation of chunk-level advantage used for action upweighting. We will also include pseudocode that makes clear the separation between the time-warp training objective and the downstream advantage signal derived from the learned progress magnitudes.

Revision: yes
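Since the promised equations are not available on this page, the following is only a guess at what sliding-window aggregation in item (a) might look like. It assumes each window outputs cumulative progress relative to its own first frame (as the Figure 3 caption suggests), differences consecutive predictions to get a per-frame rate, and averages over all windows covering a frame. The function name and the averaging rule are assumptions.

```python
def aggregate_windows(window_preds, starts, num_frames):
    """Average per-frame progress increments across overlapping windows.

    window_preds[j][i] is the predicted cumulative progress of frame
    starts[j] + i relative to frame starts[j]. Returns one signed progress
    value per frame; frames covered by no window get 0.0.
    """
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    for preds, start in zip(window_preds, starts):
        for i in range(1, len(preds)):
            t = start + i
            if t >= num_frames:
                break
            # Increment between consecutive cumulative predictions.
            sums[t] += preds[i] - preds[i - 1]
            counts[t] += 1
    return [sums[t] / counts[t] if counts[t] else 0.0 for t in range(num_frames)]


# Two overlapping 4-frame windows over a 6-frame episode (made-up predictions).
signal = aggregate_windows([[0, 1, 2, 3], [0, 1, 0, 1]], starts=[0, 2], num_frames=6)
```

Differencing before averaging keeps the result independent of where each window starts, which is one way the downstream signal could stay separate from the window-relative training target.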

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a self-supervised method where time-warp augmentations explicitly generate per-frame progress targets (normalized elapsed time) for training WARP-RM, which are then aggregated into rewards for weighting in BC. This construction is independent and does not reduce by definition or fit to its own outputs. Dataset construction via episode length as proxy is an explicit evaluation assumption rather than a load-bearing derivation step. No self-citations, uniqueness theorems, or renamings of known results are invoked as the central justification. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility into parameters and assumptions; the method rests on the domain assumption that temporal warps of successful trajectories preserve recoverable progress information.

axioms (1)

Domain assumption: Successful demonstrations contain recoverable progress signals that can be extracted via temporal augmentations without external labels.
This underpins the self-supervised target generation step described in the abstract.

invented entities (1)

WARP-RM (no independent evidence)
Purpose: Model trained to predict normalized elapsed time between frames from warped demonstration pairs.
The reward model itself is the central new artifact introduced by the paper.

Reference graph

Works this paper leans on

41 extracted references · 3 canonical work pages

[1] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[2] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. In Proceedings of Robotics: ... (2025)

[3] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[4] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[5] O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.

[6] H. Huang, F. Liu, L. Fu, T. Wu, M. Mukadam, J. Malik, K. Goldberg, and P. Abbeel. Otter: A vision-language-action model with text-aware feature extraction. arXiv preprint arXiv:2503.03734, 2025.

[7] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[8] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.

[9] L. Fu, H. Huang, G. Datta, L. Y. Chen, W. C.-H. Panitch, F. Liu, H. Li, and K. Goldberg. In-context imitation learning via next-token prediction. arXiv preprint arXiv:2408.15980, 2024.

[10] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), volume 164 of Proceedings of Machine Learning Research. PMLR, 2021.

[11] M. Beliaev, A. Shih, S. Ermon, D. Sadigh, and R. Pedarsani. Imitation learning by estimating expertise of demonstrators. In International Conference on Machine Learning, pages 1732–1748. PMLR, 2022.

[12] D. S. Brown, W. Goo, and S. Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on robot learning, pages 330–359. PMLR, 2020.

[13] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[14] H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y. Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023.

[15] P. Wu, Y. Shentu, Q. Liao, D. Jin, M. Guo, K. Sreenath, X. Lin, and P. Abbeel. Robocopilot: Human-in-the-loop interactive imitation learning for robot manipulation, 2025.

[16] Q. Li, Z. Peng, and B. Zhou. Efficient learning of safe driving policy via human-ai copilot optimization. arXiv preprint arXiv:2202.10341, 2022.

[17] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.

[18] C. Agia, R. Sinha, J. Yang, R. Antonova, M. Pavone, H. Nishimura, M. Itkina, and J. Bohg. Cupid: Curating data your robot loves with influence functions. In Conference on Robot Learning (CoRL), volume 305 of Proceedings of Machine Learning Research, pages 2907–2932. PMLR, 2025.

[19] J. Hejna, S. Mirchandani, A. Balakrishna, A. Xie, A. Wahid, J. Tompson, P. Sanketi, D. Shah, C. Devin, and D. Sadigh. Robot data curation with mutual information estimators. In Proceedings of Robotics: Science and Systems (RSS), 2025.

[20] H. Lee, T. Min, J. Kim, S. Kang, F. Liu, L. Pinto, and K. Lee. Quality over quantity: Demonstration curation via influence functions for data-centric robot learning. arXiv preprint arXiv:2603.09056, 2026.

[21] J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Bıyık, and J. Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. In Conference on Robot Learning, 2025.

[22] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In International Conference on Learning Representations, 2023.

[23] Y. J. Ma, W. Liang, V. Somani, B. Stadie, O. Bastani, D. Jayaraman, A. Zhang, S. Sodhani, and V. Kumar. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning. PMLR, 2023.

[24] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1801–1810, 2019.

[25] Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y. Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation. In International Conference on Learning Representations (ICLR), 2026.

[26] Y. Mao, Z. Yu, W. Mao, Y. Li, Q. Hu, Z. Lan, M. Zhu, and H. Chen. Arm: Advantage reward modeling for long-horizon manipulation. arXiv preprint arXiv:2604.03037, 2026.

[27] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[28] J. Wang, J. Jiao, and Y. Liu. Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, 2020.

[29] P. Chen, D. Huang, D. He, X. Long, R. Zeng, S. Wen, M. Tan, and C. Gan. Rspnet: Relative speed perception for unsupervised video representation learning. In The AAAI Conference on Artificial Intelligence (AAAI), 2021.

[30] D. Huang, W. Hu, X. Liu, D. He, Z. Wu, X. Wu, M. Tan, and E. Ding. Ascnet: Self-supervised video representation learning with appearance-speed consistency. In The IEEE/CVF International Conference on Computer Vision (ICCV), pages 8076–8085, 10 2021. doi: 10.1109/ICCV48922.2021.00799.

[31] I. R. Dave, S. Jenni, and M. Shah. No more shortcuts: Realizing the potential of temporal self-supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2):1481–1491, Mar. 2024. doi: 10.1609/aaai.v38i2.27913. URL https://ojs.aaai.org/index.php/AAAI/article/view/27913

[32] S. Jenni, M. Woodson, and F. C. Heilbron. Video-retime: Learning temporally varying speediness for time remapping, 2022. URL https://arxiv.org/abs/2205.05609

[33] J. Hejna, C. Bhateja, Y. Jiang, K. Pertsch, and D. Sadigh. Re-mix: Optimizing data mixtures for large scale imitation learning. In Conference on Robot Learning (CoRL), volume 270 of Proceedings of Machine Learning Research, pages 145–164. PMLR, 2024.

[34] Y. Zhang, Y. Xie, H. Liu, R. Shah, M. Wan, L. Fan, and Y. Zhu. Scizor: A self-supervised approach to data curation for large-scale imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2026.

[35] A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Proceedings of Robotics: Science and Systems (RSS), 2026.

[36] S. Chen, C. Harrison, Y.-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox, and R. Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics. arXiv preprint arXiv:2602.19313, 2026.

[37] H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

[38] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. G... (2023)

[39] J. Farebrother, J. Orbay, Q. Vuong, A. A. Taïga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, A. Kumar, and R. Agarwal. Stop regressing: Training value functions via classification for scalable deep rl. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa... (2024) doi: 10.48550/arxiv.2403.03950

[40] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski. Dinov3, 2025. URL https://arxiv.org/abs...

[41] J. Grigsby and Y. Qi. A closer look at advantage-filtered behavioral cloning in high-noise datasets, 2023. URL https://arxiv.org/abs/2110.04698.

A Dataset Statistics

Table 5 reports per-tier statistics for the three policy training datasets used in Section 4, as well as the fixed reference subset on which WARP is trained. All tiers are length-filtered...

[1] [1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[2] [2]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision-language- action flow model for general robot control. InProceedings of Robotics: ...

2025

[3] [3]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[4] [4]

O’Neill, A

A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[5] [5]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024

[6] [6]

Huang, F

H. Huang, F. Liu, L. Fu, T. Wu, M. Mukadam, J. Malik, K. Goldberg, and P. Abbeel. Otter: A vision-language-action model with text-aware feature extraciton.arXiv preprint arXiv:2503.03734, 2025

arXiv 2025

[7] [7]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[8] [8]

H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V . Myers, M. J. Kim, M. Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023

[9] [9]

L. Fu, H. Huang, G. Datta, L. Y . Chen, W. C.-H. Panitch, F. Liu, H. Li, and K. Goldberg. In-context imitation learning via next-token prediction.arXiv preprint arXiv:2408.15980, 2024

arXiv 2024

[10] [10]

Mandlekar, D

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), volume 164 ofProceedings of Machine Learning Research. PMLR, 2021. 10

2021

[11] M. Beliaev, A. Shih, S. Ermon, D. Sadigh, and R. Pedarsani. Imitation learning by estimating expertise of demonstrators. In International Conference on Machine Learning, pages 1732–1748. PMLR, 2022.

[12] D. S. Brown, W. Goo, and S. Niekum. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on robot learning, pages 330–359. PMLR, 2020.

[13] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[14] H. Liu, S. Nasiriany, L. Zhang, Z. Bao, and Y. Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS), 2023.

[15] P. Wu, Y. Shentu, Q. Liao, D. Jin, M. Guo, K. Sreenath, X. Lin, and P. Abbeel. Robocopilot: Human-in-the-loop interactive imitation learning for robot manipulation, 2025.

[16] Q. Li, Z. Peng, and B. Zhou. Efficient learning of safe driving policy via human-ai copilot optimization. arXiv preprint arXiv:2202.10341, 2022.

[17] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.

[18] C. Agia, R. Sinha, J. Yang, R. Antonova, M. Pavone, H. Nishimura, M. Itkina, and J. Bohg. Cupid: Curating data your robot loves with influence functions. In Conference on Robot Learning (CoRL), volume 305 of Proceedings of Machine Learning Research, pages 2907–2932. PMLR, 2025.

[19] J. Hejna, S. Mirchandani, A. Balakrishna, A. Xie, A. Wahid, J. Tompson, P. Sanketi, D. Shah, C. Devin, and D. Sadigh. Robot data curation with mutual information estimators. In Proceedings of Robotics: Science and Systems (RSS), 2025.

[20] H. Lee, T. Min, J. Kim, S. Kang, F. Liu, L. Pinto, and K. Lee. Quality over quantity: Demonstration curation via influence functions for data-centric robot learning. arXiv preprint arXiv:2603.09056, 2026.

[21] J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Bıyık, and J. Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. In Conference on Robot Learning, 2025.

[22] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. In International Conference on Learning Representations, 2023.

[23] Y. J. Ma, W. Liang, V. Somani, B. Stadie, O. Bastani, D. Jayaraman, A. Zhang, S. Sodhani, and V. Kumar. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning. PMLR, 2023.

[24] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1801–1810, 2019.

[25] Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y. Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation. In International Conference on Learning Representations (ICLR), 2026.

[26] Y. Mao, Z. Yu, W. Mao, Y. Li, Q. Hu, Z. Lan, M. Zhu, and H. Chen. Arm: Advantage reward modeling for long-horizon manipulation. arXiv preprint arXiv:2604.03037, 2026.

[27] Y. Yao, C. Liu, D. Luo, Y. Zhou, and Q. Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[28] J. Wang, J. Jiao, and Y. Liu. Self-supervised video representation learning by pace prediction. In European Conference on Computer Vision, 2020.

[29] P. Chen, D. Huang, D. He, X. Long, R. Zeng, S. Wen, M. Tan, and C. Gan. Rspnet: Relative speed perception for unsupervised video representation learning. In The AAAI Conference on Artificial Intelligence (AAAI), 2021.

[30] D. Huang, W. Hu, X. Liu, D. He, Z. Wu, X. Wu, M. Tan, and E. Ding. Ascnet: Self-supervised video representation learning with appearance-speed consistency. In The IEEE/CVF International Conference on Computer Vision (ICCV), pages 8076–8085, Oct. 2021. doi:10.1109/ICCV48922.2021.00799.

[31] I. R. Dave, S. Jenni, and M. Shah. No more shortcuts: Realizing the potential of temporal self-supervision. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2):1481–1491, Mar. 2024. doi:10.1609/aaai.v38i2.27913. URL https://ojs.aaai.org/index.php/AAAI/article/view/27913.

[32] S. Jenni, M. Woodson, and F. C. Heilbron. Video-retime: Learning temporally varying speediness for time remapping, 2022. URL https://arxiv.org/abs/2205.05609.

[33] J. Hejna, C. Bhateja, Y. Jiang, K. Pertsch, and D. Sadigh. Re-mix: Optimizing data mixtures for large scale imitation learning. In Conference on Robot Learning (CoRL), volume 270 of Proceedings of Machine Learning Research, pages 145–164. PMLR, 2024.

[34] Y. Zhang, Y. Xie, H. Liu, R. Shah, M. Wan, L. Fan, and Y. Zhu. Scizor: A self-supervised approach to data curation for large-scale imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2026.

[35] A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Proceedings of Robotics: Science and Systems (RSS), 2026.

[36] S. Chen, C. Harrison, Y.-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox, and R. Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics. arXiv preprint arXiv:2602.19313, 2026.

[37] H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

[38] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. A. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. WANG, K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. G... 2023

[39] J. Farebrother, J. Orbay, Q. Vuong, A. A. Taïga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, A. Kumar, and R. Agarwal. Stop regressing: Training value functions via classification for scalable deep rl. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pa... doi:10.48550/arxiv.2403.03950, 2024

[40] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, and P. Bojanowski. Dinov3, 2025. URL https://arxiv.org/abs...

[41] J. Grigsby and Y. Qi. A closer look at advantage-filtered behavioral cloning in high-noise datasets, 2023. URL https://arxiv.org/abs/2110.04698.

A Dataset Statistics

Table 5 reports per-tier statistics for the three policy training datasets used in Section 4, as well as the fixed reference subset on which WARP is trained. All tiers are length-filtered...