Beyond Monotonic Progress: Retry-Supervised Value Learning for Robot Imitation

Bin Liang; Chuheng Zhang; Junjie Lu; Jun Yang; Kaixin Wang; Kimin Lee; Li Zhao; Min Xu; Sinjae Kang; Xinyao Qin

arxiv: 2606.24633 · v1 · pith:5MWKWIUXnew · submitted 2026-06-23 · 💻 cs.RO


Pith reviewed 2026-06-25 23:37 UTC · model grok-4.3

classification 💻 cs.RO

keywords imitation learning · value function learning · retry supervision · robot manipulation · imperfect demonstrations · behavior cloning · preference learning · mistake detection


The pith

Retry events in demonstrations supply sparse supervision for value functions that detect local mistakes and improve imitation from imperfect robot data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that retry events mark local degradation and recovery in mixed-quality demonstrations, supplying a signal that lets value functions learn pairwise preferences around those points instead of assuming steady monotonic progress. If this holds, the resulting values can reweight demonstration chunks during behavior cloning so that harmful errors lose influence while useful corrections remain, making imitation learning more robust when only realistic, error-containing data is available. This matters for robot manipulation because perfect demonstrations are expensive to collect and existing progress-based methods miss fine-grained execution quality. Experiments on real-robot tasks indicate that the learned values are more detailed than baseline ones and yield better downstream imitation performance.

Core claim

ReTVL learns mistake-sensitive value functions by combining global progress calibration with local pairwise preference learning induced by sparsely annotated retry keypoints, then applies the values to reweight demonstration chunks for behavior cloning so that execution errors are down-weighted while corrective behaviors are preserved.
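As a rough illustration of how the two training signals named in the core claim could be combined, the sketch below adds a progress-calibration term to a Bradley-Terry style pairwise preference term. Everything in it is an assumption made for illustration: the function names, the squared-error form of the calibration term, and the `pref_weight` coefficient are hypothetical and are not taken from the paper, which is described as predicting values through a discrete value head.

```python
import math


def progress_loss(values, progress_targets):
    """Mean squared error between predicted values and progress targets in [0, 1]."""
    return sum((v - p) ** 2 for v, p in zip(values, progress_targets)) / len(values)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(preferred_values, rejected_values):
    """Bradley-Terry negative log-likelihood: -log sigmoid(v_preferred - v_rejected)."""
    pairs = list(zip(preferred_values, rejected_values))
    return sum(softplus(r - p) for p, r in pairs) / len(pairs)


def combined_loss(values, progress_targets, preferred_values, rejected_values,
                  pref_weight=1.0):
    """Global calibration term plus a weighted local preference term (hypothetical form)."""
    return (progress_loss(values, progress_targets)
            + pref_weight * preference_loss(preferred_values, rejected_values))
```

The preference term is smallest when the preferred state already has the higher predicted value, so it pushes values down near a mistake and up after recovery without fixing their absolute scale; the calibration term supplies that scale.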

What carries the argument

Retry-supervised value learning that uses retry keypoints to induce local pairwise preferences for modeling degradation-and-recovery structure around mistakes.
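One way to picture how a single retry keypoint could induce local preference pairs is sketched below. The labeling rule, the `window` parameter, and the `retry_start` / `recovery_end` indices are hypothetical; this extract does not state how the paper actually constructs its pairs.

```python
def retry_preference_pairs(retry_start, recovery_end, window, length):
    """Return (preferred_t, rejected_t) timestep pairs around one retry event.

    Hypothetical rule:
      - degradation: states up to `window` steps before the retry starts are
        preferred over the retry-start state (value should drop into the mistake);
      - recovery: states up to `window` steps after the recovery ends are
        preferred over the retry-start state (value should rebound).
    """
    if not 0 <= retry_start <= recovery_end < length:
        raise ValueError("expected 0 <= retry_start <= recovery_end < length")
    pairs = []
    # Pre-mistake states versus the retry-start state.
    for t in range(max(0, retry_start - window), retry_start):
        pairs.append((t, retry_start))
    # Post-recovery states versus the retry-start state.
    for t in range(recovery_end + 1, min(length, recovery_end + 1 + window)):
        pairs.append((t, retry_start))
    return pairs
```

Restricting pairs to a short window keeps the supervision local, which matches the stated goal of modeling degradation and recovery around mistakes rather than global task structure.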

If this is right

Value estimates become more fine-grained than those produced by monotonic progress baselines.
Imitation learning from imperfect demonstrations improves on real-robot manipulation tasks.
Harmful execution errors receive reduced weight while useful corrective segments are retained during behavior cloning.
The approach operates directly on mixed-quality demonstration data without requiring additional dense rewards or perfect trajectories.
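The reweighting idea in the list above can be sketched as follows, under stated assumptions: each chunk is weighted by the change in predicted value across it, with an exponential penalty for value drops and full weight for non-negative gains. The weighting rule, `chunk_len`, and `temperature` are hypothetical choices for illustration, not the paper's formula.

```python
import math


def chunk_weights(values, chunk_len, temperature=0.1):
    """Weight each action chunk by the value change it produces (hypothetical rule).

    values[t] is the predicted value at timestep t. A chunk that raises the
    value keeps weight 1.0; a chunk that lowers it is down-weighted
    exponentially in the size of the drop.
    """
    n_chunks = (len(values) - 1) // chunk_len
    weights = []
    for k in range(n_chunks):
        start, end = k * chunk_len, (k + 1) * chunk_len
        gain = values[end] - values[start]
        weights.append(1.0 if gain >= 0 else math.exp(gain / temperature))
    return weights


def weighted_bc_loss(chunk_losses, weights):
    """Weighted average of per-chunk behavior cloning losses."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all chunk weights are zero")
    return sum(w * l for w, l in zip(weights, chunk_losses)) / total
```

Under this rule a corrective chunk that follows a mistake keeps full weight because the value rebounds across it, while the chunk containing the mistake itself contributes little to the cloning loss.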

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retry-signal idea could be tested in other sequential domains where correction events occur naturally, such as language-model fine-tuning from human edits.
It offers a route to extract preference pairs from existing demonstration logs without new human annotation.
The framework might combine with online data collection to continually refine values from observed retries during deployment.

Load-bearing premise

Retry events in the demonstrations reliably mark unbiased local degradation-and-recovery points that can serve as supervision without task-specific bias or annotation artifacts.

What would settle it

A controlled comparison on the same real-robot manipulation tasks in which behavior cloning reweighted by ReTVL values shows no improvement or worse performance than reweighting by progress-based value estimates.
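A controlled comparison of this kind comes down to comparing success counts for two policies on the same task. The sketch below uses a one-sided Fisher exact test for that purpose; the choice of test and any trial counts passed to it are assumptions for illustration, since this extract does not report the number of trials per task.

```python
import math


def fisher_one_sided(succ_a, fail_a, succ_b, fail_b):
    """One-sided Fisher exact p-value for 'policy A succeeds more often than B'.

    Sums hypergeometric probabilities of tables at least as favorable to A
    as the observed one, with all row and column totals held fixed.
    """
    n_a = succ_a + fail_a              # trials for policy A
    total = n_a + succ_b + fail_b      # all trials
    successes = succ_a + succ_b        # all successes
    denom = math.comb(total, n_a)
    p_value = 0.0
    for x in range(succ_a, min(successes, n_a) + 1):
        p_value += (math.comb(successes, x)
                    * math.comb(total - successes, n_a - x)) / denom
    return p_value
```

A large p-value when A is the ReTVL-reweighted policy and B is the progress-reweighted one would be the "no improvement" outcome described above.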

Figures

Figures reproduced from arXiv: 2606.24633 by Bin Liang, Chuheng Zhang, Junjie Lu, Jun Yang, Kaixin Wang, Kimin Lee, Li Zhao, Min Xu, Sinjae Kang, Xinyao Qin.

**Figure 1.** ReTVL turns retry events into pairwise value supervision. Progress-based value models may overlook subtle execution errors and assign overly smooth increasing values. ReTVL uses retry keypoints to learn local value drops before correction and rebounds after recovery, enabling better identification of harmful and corrective trajectory segments.

**Figure 2.** ReTVL learns retry-sensitive value estimates from sparse retry annotations. The model takes an observation history and language instruction as input, and predicts a scalar value through a VLM backbone and discrete value head. Training combines absolute progress calibration with retry-induced preference supervision, where values drop near retry states and rebound after recovery. r_{i,j} marks the start of the …

**Figure 3.** Visualization of value evaluation. We show value predictions on three other tasks beyond stack blocks. ReTVL captures local value drops around retry keypoints and rebounds after correction more clearly than progress-based baselines.

Results table extracted alongside this caption:

| Task | Standard BC | RECAP-BC | ReTVL-BC |
| --- | --- | --- | --- |
| Pick up Spoon | 60 | 65 | 85 |
| Stack Blocks | 45 | 80 | 95 |
| Fold Towel | 50 | 65 | 80 |
| Open Drawer | 10 | 40 | 60 |
| Average | 41 | 63 | 80 |
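The averages reported with the per-task numbers next to Figure 3 can be re-derived with a few lines of arithmetic. The sketch below assumes the three numbers per task are in the column order Standard BC, RECAP-BC, ReTVL-BC, and that the reported averages were rounded half up; both are inferences from the extract, not statements from the paper.

```python
import math

# Per-task numbers as they appear in the extract, in the assumed column order.
METHODS = ("Standard BC", "RECAP-BC", "ReTVL-BC")
RESULTS = {
    "Pick up Spoon": (60, 65, 85),
    "Stack Blocks": (45, 80, 95),
    "Fold Towel": (50, 65, 80),
    "Open Drawer": (10, 40, 60),
}
REPORTED_AVERAGE = {"Standard BC": 41, "RECAP-BC": 63, "ReTVL-BC": 80}

# Unrounded mean over the four tasks for each method.
means = {
    method: sum(row[i] for row in RESULTS.values()) / len(RESULTS)
    for i, method in enumerate(METHODS)
}

# Round half up; Python's built-in round() would turn 62.5 into 62.
rounded = {method: math.floor(mean + 0.5) for method, mean in means.items()}
```

The unrounded means are 41.25, 62.5, and 80.0, so the reported row is consistent with the per-task entries under half-up rounding.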

**Figure 4.** Average training weight assigned to annotated bad-action chunks in recovery trajectories. Lower is better.

Body text captured with this caption: "… and distinguish successful executions from failures. The main advantage of ReTVL lies in local retry-centered metrics. It achieves consistent improvements across all four local metrics, indicating that it more reliably captures retry-related local value changes. The improvement is especially large on…"

**Figure 5.** Real-world manipulation tasks used for policy evaluation.

**Figure 6.** Representative value-curve visualizations for ablation variants on held-out trajectories.

Abstract

Human demonstrations for robot imitation learning often contain mistakes and corrective behaviors, such as imprecise grasps, object misalignment, unstable contact, and repeated attempts. While these segments are commonly treated as noisy or suboptimal data, they provide valuable evidence about when execution deviates from a desirable path and how task feasibility can be restored. However, existing reward and value models often rely on monotonic progress assumptions, which capture coarse task advancement but may overlook local execution errors and corrective behaviors in imperfect demonstrations. In this work, we propose ReTVL (ReTry-Supervised Value Learning), a framework for learning mistake-sensitive value functions from mixed-quality robot demonstrations by leveraging retry events as sparse supervision. ReTVL captures the local degradation-and-recovery structure around mistakes by combining global progress calibration with local pairwise preference learning induced by sparsely annotated retry keypoints. The learned value model is then used to reweight demonstration chunks for downstream behavior cloning, reducing the influence of harmful execution errors while preserving useful corrective behaviors. Experiments on real-robot manipulation tasks show that ReTVL produces more fine-grained value estimates than progress-based baselines and improves imitation learning from imperfect demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReTVL frames retry events as sparse supervision for local value learning in imitation from imperfect demos, but the abstract gives almost no evidence on whether it works.


The main thing here is a method that treats retry keypoints in human demonstrations as markers of local degradation and recovery, then combines that with global progress to train a value function for reweighting behavior cloning data. This is positioned against monotonic progress assumptions in prior reward models.

The paper does a clean job of identifying a practical gap: real demos often have mistakes and corrections that standard progress signals miss, and the retry framing gives a way to extract pairwise preferences without needing dense labels. Using those for local value learning and then reweighting chunks is a straightforward extension that could preserve useful corrective behavior while downweighting errors.

The soft spots are mostly around missing substance. The abstract claims better fine-grained value estimates and improved imitation on real-robot tasks, but supplies no numbers, baselines, or controls, so there is no way to tell if the gains are real or just noise. The central assumption—that sparsely annotated retry events give unbiased local preferences—also sits untested in what is visible; if retry locations correlate with task geometry or demonstrator habits, the value function could just learn those heuristics instead of general mistake sensitivity. That concern from the stress-test note lands directly on the abstract.

This is for researchers doing imitation learning on physical robots who already work with noisy human data. A reader already thinking about value functions or preference learning from demonstrations would find the retry angle worth seeing, but anyone needing quantitative grounding will have to wait for the full experiments.

I would send it to peer review. The idea is distinct enough and the problem is real enough that referees should see the full paper and the data, even if the current write-up is thin.

Referee Report

2 major / 0 minor

Summary. The paper proposes ReTVL (ReTry-Supervised Value Learning), a framework that learns mistake-sensitive value functions from mixed-quality robot demonstrations by treating sparsely annotated retry events as markers of local degradation-and-recovery. It combines global progress calibration with local pairwise preference learning from these keypoints, then reweights demonstration chunks using the learned value model for downstream behavior cloning. The central claim is that this produces more fine-grained value estimates than monotonic progress baselines and improves imitation learning performance on real-robot manipulation tasks with imperfect demonstrations.

Significance. If the central construction holds without bias, the work offers a practical advance in imitation learning by extracting supervisory signal from corrective behaviors that are typically discarded as noise. It directly targets a common real-world data issue (mistakes and recoveries) without requiring dense rewards or perfect demonstrations, and the reweighting step for behavior cloning is a clear downstream application. The approach is grounded in observable retry events rather than invented dense labels.

major comments (2)

[Abstract and method section] The assumption that retry keypoints provide unbiased sparse supervision for pairwise preferences is load-bearing for both the fine-grained value claim and the reweighting benefit. The manuscript provides no annotation protocol, inter-annotator reliability metrics, or controls demonstrating that retry locations are independent of task geometry, object properties, or demonstrator idiosyncrasies (Abstract; method description). If retry events correlate with these factors, the learned value function may rediscover task-specific heuristics rather than general mistake sensitivity.
[Abstract and experiments section] The abstract asserts that experiments show ReTVL produces more fine-grained value estimates and improves imitation learning, yet supplies no quantitative results, baseline comparisons, error bars, or statistical tests. This prevents assessment of whether the data support the central claims about value granularity and downstream BC improvement (Abstract; experiments section).
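The request for error bars in the second comment can be made concrete with a Wilson score interval on a success rate. This is a minimal sketch: the interval type is a swapped-in standard choice, and any trial count supplied to it is hypothetical because the extract does not say how many rollouts were run per task.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width
```

For example, `wilson_interval(16, 20)` describes the uncertainty around 16 successes in a hypothetical 20 rollouts; with so few trials the interval is wide, which is why the referee's request matters for per-task comparisons.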

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.


Referee: [Abstract and method section] The assumption that retry keypoints provide unbiased sparse supervision for pairwise preferences is load-bearing for both the fine-grained value claim and the reweighting benefit. The manuscript provides no annotation protocol, inter-annotator reliability metrics, or controls demonstrating that retry locations are independent of task geometry, object properties, or demonstrator idiosyncrasies (Abstract; method description). If retry events correlate with these factors, the learned value function may rediscover task-specific heuristics rather than general mistake sensitivity.

Authors: We agree that an explicit annotation protocol would improve clarity. Retry events are identified directly from trajectory data as repeated attempts following observable failures (e.g., grasp slips or misalignments), which are task-agnostic markers of local degradation. The local pairwise preference learning is restricted to short windows around these keypoints to focus on recovery dynamics rather than global task structure. We will add a dedicated subsection describing the annotation procedure and acknowledge the absence of inter-annotator metrics and explicit independence controls as a limitation. New experiments to demonstrate full independence are outside the scope of the current study. revision: partial
Referee: [Abstract and experiments section] The abstract asserts that experiments show ReTVL produces more fine-grained value estimates and improves imitation learning, yet supplies no quantitative results, baseline comparisons, error bars, or statistical tests. This prevents assessment of whether the data support the central claims about value granularity and downstream BC improvement (Abstract; experiments section).

Authors: The experiments section reports baseline comparisons on real-robot tasks and states that ReTVL yields finer value estimates and better BC performance. However, we acknowledge that the abstract contains no numerical values and that error bars plus statistical tests are not presented. We will revise the abstract to include key quantitative metrics and augment the experiments section with error bars and significance tests. revision: yes

standing simulated objections not resolved

Providing inter-annotator reliability metrics or new controls proving retry locations are independent of task geometry, object properties, and demonstrator idiosyncrasies, as these were not collected in the original study.
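For reference, an inter-annotator reliability metric of the kind left unresolved here could be as simple as Cohen's kappa over per-frame retry labels from two annotators. The sketch below assumes binary per-frame labels (1 inside a retry segment, 0 otherwise); that labeling scheme is hypothetical and is not described in the extract.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of frames on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both labeled at random with their own label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 would indicate that retry keypoints are placed consistently across annotators; a value near 0 would mean agreement is no better than chance.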

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The provided abstract and description contain no equations, fitting procedures, or derivation steps that reduce any prediction or result to its own inputs by construction. No self-citations, ansatzes, or uniqueness claims are referenced. The method is described at a high level using retry events for supervision, but without visible load-bearing reductions or self-referential definitions, the central claims remain independent of the inputs in the given text. This is the expected outcome for papers without explicit mathematical derivations shown.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, datasets, or implementation details; cannot identify specific free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5751 in / 967 out tokens · 21999 ms · 2026-06-25T23:37:16.262955+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 2 canonical work pages

[1]

and Ng, A

Abbeel, P . and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. InProceedings of the twenty-first international conference on Machine learning, pp. 1, 2004

2004
[2]

Video- language critic: Transferable reward functions for language-conditioned robotics.arXiv preprint arXiv:2405.19988, 2024

Alakuijala, M., McLean, R., Woungang, I., Farsad, N., Kaski, S., Marttinen, P ., and Yuan, K. Video- language critic: Transferable reward functions for language-conditioned robotics.arXiv preprint arXiv:2405.19988, 2024

arXiv 2024
[3]

Z., Sharma, C., Shi, L

Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., Equi, M., Esmail, A., Fang, Y., Finn, C., Glossop, C., Godden, T., Goryachev, I., Groom, L., Hancock, H., Hausman, K., Hussein, G., Ichter, B., Jakubczak, S., Jen, R., Jones, T., Katz, B., Ke, L., Kuchi, C., Lamb, M., LeBlanc, ...

Pith/arXiv arXiv 2025
[4]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[5]

X., Tanner, J., Vuong, Q., Walling, A., Wang, H., and Zhilinsky, U

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X., Tanner, J., Vuong, Q., Walling, A., Wang, H., and Zhilinsky, U. π0: A vision-language- action flow model for general robot control, 202...

Pith/arXiv arXiv 2026
[6]

Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

1952
[7]

G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P ., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., Irpan, A., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, L., Lee, T.-W. E., Levine, S., Lu, Y., Michale...

Pith/arXiv arXiv 2023
[8]

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N. J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K.-H., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J...

Pith/arXiv arXiv 2023
[9]

S., Goo, W., Nagarajan, P ., and Niekum, S

Brown, D. S., Goo, W., Nagarajan, P ., and Niekum, S. Extrapolating beyond suboptimal demonstra- tions via inverse reinforcement learning from observations. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pp. 783–792. PMLR, 2019. 9 Beyond Monotonic Progress

2019
[10]

S., Goo, W., and Niekum, S

Brown, D. S., Goo, W., and Niekum, S. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. InProceedings of the Conference on Robot Learning, volume 100 ofProceedings of Machine Learning Research, pp. 330–359. PMLR, 2020

2020
[11]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Hu, X., Huang, X., et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025
[12]

in-the-wild

Chen, A. S., Nair, S., and Finn, C. Learning generalizable robotic reward functions from “in-the-wild” human videos. InProceedings of Robotics: Science and Systems (RSS), 2021

2021
[13]

Sarm: Stage-aware reward modeling for long horizon robot manipulation.arXiv preprint arXiv:2509.25358, 2025

Chen, Q., Yu, J., Schwager, M., Abbeel, P ., Shentu, F., and Wu, P . Sarm: Stage-aware reward modeling for long horizon robot manipulation.arXiv preprint arXiv:2509.25358, 2025

Pith/arXiv arXiv 2025
[14]

J., Ren, Z., Ratliff, L

Chen, S., Harrison, C., Lee, Y.-C., Yang, A. J., Ren, Z., Ratliff, L. J., Duan, J., Fox, D., and Kr- ishna, R. Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026

arXiv 2026
[15]

villa-x: Enhancing latent action modeling in vision-language-action models,

Chen, X., Wei, H., Zhang, P ., Zhang, C., Wang, K., Guo, Y., Yang, R., Wang, Y., Xiao, X., Zhao, L., Chen, J., and Bian, J. villa-x: Enhancing latent action modeling in vision-language-action models,
[16]

URLhttps://arxiv.org/abs/2507.23682

Pith/arXiv arXiv
[17]

F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D

Christiano, P . F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

2017
[18]

F., Leike, J., Brown, T

Christiano, P . F., Leike, J., Brown, T. B., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017
[19]

Guided cost learning: Deep inverse optimal control via policy optimization

Finn, C., Levine, S., and Abbeel, P . Guided cost learning: Deep inverse optimal control via policy optimization. InInternational conference on machine learning, pp. 49–58. PMLR, 2016

2016
[20]

Awr: Adaptive weighting regression for 3d hand pose estimation

Huang, W., Ren, P ., Wang, J., Qi, Q., and Sun, H. Awr: Adaptive weighting regression for 3d hand pose estimation. InProceedings of the AAAI Conference on Artificial Intelligence, pp. 11061–11068, 2020

2020
[21]

Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A

Intelligence, P ., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A. Z., Shi, L. X., Smith, L., Springenberg, J. T., St...

Pith/arXiv arXiv 2025
[22]

Vima: General robot manipulation with multimodal prompts, 2023

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., and Fan, L. Vima: General robot manipulation with multimodal prompts, 2023. URL https: //arxiv.org/abs/2210.03094

arXiv 2023
[23]

Kelly, M., Sidrane, C., Driggs-Campbell, K., and Kochenderfer, M. J. HG-DAgger: Interactive imitation learning with human experts. InProceedings of the IEEE International Conference on Robotics and Automation, pp. 8077–8083, 2019. doi: 10.1109/ICRA.2019.8793698

work page doi:10.1109/icra.2019.8793698 2019
[24]

Demodice: Offline imitation learning with supplementary imperfect demonstrations

Kim, G.-H., Seo, S., Lee, J., Jeon, W., Hwang, H., Yang, H., and Kim, K.-E. Demodice: Offline imitation learning with supplementary imperfect demonstrations. InInternational Conference on Learning Representations, 2022

2022
[25]

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P ., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P ., and Finn, C. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246

Pith/arXiv arXiv 2024
[26]

Dart: Noise injection for robust imitation learning, 2017

Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning, 2017. URLhttps://arxiv.org/abs/1703.09327

Pith/arXiv arXiv 2017
[27]

Roboreward: General- purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

Lee, T., Wagenmaker, A., Pertsch, K., Liang, P ., Levine, S., and Finn, C. Roboreward: General- purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026. 10 Beyond Monotonic Progress

arXiv 2026
[28]

Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models, 2025

Li, P ., Chen, Y., Wu, H., Ma, X., Wu, X., Huang, Y., Wang, L., Kong, T., and Tan, T. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models, 2025. URLhttps://arxiv.org/abs/2506.07961

arXiv 2025
[29]

Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

Li, Y., Ma, X., Xu, J., Cui, Y., Cui, Z., Han, Z., Huang, L., Kong, T., Liu, Y., Niu, H., et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

arXiv 2025
[30]

S., Zettlemoyer, L., Fox, D., Xiang, Y., Li, A., Bobu, A., Gupta, A., Tu, S., Biyik, E., and Zhang, J

Liang, A., Korkmaz, Y., Zhang, J., Hwang, M., Anwar, A., Kaushik, S., Shah, A., Huang, A. S., Zettlemoyer, L., Fox, D., Xiang, Y., Li, A., Bobu, A., Gupta, A., Tu, S., Biyik, E., and Zhang, J. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons.arXiv preprint arXiv:2603.02115, 2026

Pith/arXiv arXiv 2026
[31]

Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024

Pith/arXiv arXiv 2024
[32]

J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V ., and Zhang, A

Ma, Y. J., Sodhani, S., Jayaraman, D., Bastani, O., Kumar, V ., and Zhang, A. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

Pith/arXiv arXiv 2022
[33]

J., Liang, W., Som, V ., Kumar, V ., Zhang, A., Bastani, O., and Jayaraman, D

Ma, Y. J., Liang, W., Som, V ., Kumar, V ., Zhang, A., Bastani, O., and Jayaraman, D. Liv: Language- image representations and rewards for robotic control. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

2023
[34]

Arm: Advantage reward modeling for long-horizon manipulation.arXiv preprint arXiv:2604.03037, 2026

Mao, Y., Yu, Z., Mao, W., Li, Y., Hu, Q., Lan, Z., Zhu, M., and Chen, H. Arm: Advantage reward modeling for long-horizon manipulation.arXiv preprint arXiv:2604.03037, 2026

Pith/arXiv arXiv 2026
[35]

Awac: Accelerating online reinforcement learning with offline datasets, 2021

Nair, A., Gupta, A., Dalal, M., and Levine, S. Awac: Accelerating online reinforcement learning with offline datasets, 2021. URLhttps://arxiv.org/abs/2006.09359

Pith/arXiv arXiv 2021
[36]

Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. InProceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pp. 663–670, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607072

2000
[37]

C., Shevchuk, G., and Sadigh, D

Palan, M., Landolfi, N. C., Shevchuk, G., and Sadigh, D. Learning reward functions by integrating human demonstrations and preferences. InProceedings of Robotics: Science and Systems, 2019

2019
[38]

B., Kumar, A., Zhang, G., and Levine, S

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019. URLhttps://arxiv.org/abs/1910.00177

Pith/arXiv arXiv 2019
[39]

D., Sastry, S

Sadigh, D., Dragan, A. D., Sastry, S. S., and Seshia, S. A. Active preference-based learning of reward functions. InProceedings of Robotics: Science and Systems, 2017. doi: 10.15607/RSS.2017.XIII.053

work page doi:10.15607/rss.2017.xiii.053 2017
[40]

Smolvla: A vision-language-action model for affordable and efficient robotics, 2025

Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P ., Palma, S., Zouitine, A., Aractingi, M., Pascal, C., Russi, M., Marafioti, A., Alibert, S., Cord, M., Wolf, T., and Cadene, R. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025. URL https://arxiv.org/ abs/2506.01844

Pith/arXiv arXiv 2025
[41]

Robo- dopamine: General process reward modeling for high-precision robotic manipulation.arXiv preprint arXiv:2512.23703, 2025

Tan, H., Chen, S., Xu, Y., Wang, Z., Ji, Y., Chi, C., Lyu, Y., Zhao, Z., Chen, X., Co, P ., et al. Robo- dopamine: General process reward modeling for high-precision robotic manipulation.arXiv preprint arXiv:2512.23703, 2025

arXiv 2025
[42]

M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., Luo, J., Tan, Y

Team, O. M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., Luo, J., Tan, Y. L., Chen, L. Y., Sanketi, P ., Vuong, Q., Xiao, T., Sadigh, D., Finn, C., and Levine, S. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

Pith/arXiv arXiv 2024
[43]

Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., and Feng, F. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

Pith/arXiv arXiv 2025
[44]

Imitation learning from imperfect demonstration, 2019

Wu, Y.-H., Charoenphakdee, N., Bao, H., Tangkaratt, V ., and Sugiyama, M. Imitation learning from imperfect demonstration, 2019. URLhttps://arxiv.org/abs/1901.09387

Pith/arXiv arXiv 2019
[45]

Imitation learning from imperfect demonstration

Wu, Y.-H., Charoenphakdee, N., Bao, H., Tangkaratt, V ., and Sugiyama, M. Imitation learning from imperfect demonstration. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pp. 6818–6827. PMLR, 2019. 11 Beyond Monotonic Progress

2019
[46]

Discriminator-weighted offline imitation learning from suboptimal demonstrations

Xu, H., Zhan, X., Yin, H., and Qin, H. Discriminator-weighted offline imitation learning from suboptimal demonstrations. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pp. 24725–24742. PMLR, 2022

2022
[47]

Compliant residual dagger: Improving real-world contact- rich manipulation with human corrections.Advances in Neural Information Processing Systems, 38: 139559–139581, 2026

Xu, X., Hou, Y., Liu, Z., and Song, S. Compliant residual dagger: Improving real-world contact- rich manipulation with human corrections.Advances in Neural Information Processing Systems, 38: 139559–139581, 2026

2026
[48]

Rise: Self-improving robot policy with compositional world model.arXiv preprint arXiv:2602.11075, 2026

Yang, J., Lin, K., Li, J., Zhang, W., Lin, T., Wu, L., Su, Z., Zhao, H., Zhang, Y.-Q., Chen, L., et al. Rise: Self-improving robot policy with compositional world model.arXiv preprint arXiv:2602.11075, 2026

Pith/arXiv arXiv 2026
[49]

Aloe: Action-level off-policy evaluation for vision-language-action model post-training.arXiv preprint arXiv:2602.12691, 2026

Yang, R., Wang, H., Liu, C., Yan, X., Wang, Y., Du, X., Yue, S., Liu, Y., Zhang, C., Qi, L., et al. Aloe: Action-level off-policy evaluation for vision-language-action model post-training.arXiv preprint arXiv:2602.12691, 2026

Pith/arXiv arXiv 2026
[50]

Confidence-aware imitation learning from demonstrations with varying optimality

Zhang, S., Cao, Z., Sadigh, D., and Sui, Y. Confidence-aware imitation learning from demonstrations with varying optimality. InAdvances in Neural Information Processing Systems, volume 34, 2021

[51]

Zhao, W., Ding, P., Zhang, M., Gong, Z., Bai, S., Zhao, H., and Wang, D. Vlas: Vision-language-action model with speech instructions for customized robot manipulation, 2025. URL https://arxiv.org/abs/2502.13508

[52]

Ziebart, B. D., Maas, A. L., Bagnell, J. A., Dey, A. K., et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

A Implementation Details

A.1 Value Model Training

Data preprocessing. All value models are trained and evaluated using the same local 5 Hz data protocol. The ra...

