UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Jingmin Wang; Leo Kaixuan Cheng; Mohamed Nabail; Nicholas Rhinehart

arxiv: 2606.19328 · v2 · pith:CPZKZYOOnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI · cs.RO

Pith reviewed 2026-06-26 21:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO

keywords preference-based reinforcement learning · model-based planning · epistemic uncertainty · sample efficiency · regret bounds · Meta-World benchmark

The pith

UBP2 improves sample efficiency in preference-based RL by planning trajectories scored on expected reward, terminal value, and epistemic uncertainty from three model ensembles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a model-based method for preference-based reinforcement learning that actively selects trajectories by jointly considering uncertainty in the reward model, the dynamics model, and the value function. It replaces passive data collection with planning under a single score that trades off known performance against information gain. A sympathetic reader would care because preference-based methods let humans guide learning without hand-crafted rewards, yet they often waste many early comparisons on uninformative data; an explicit balance could reduce that cost. The approach also supplies sublinear regret bounds for finite-horizon and infinite-horizon cases under standard assumptions.

Core claim

UBP2 maintains ensembles of reward, dynamics, and value-function models. Candidate trajectories are scored by the sum of expected reward under the reward ensemble, terminal value under the value ensemble, and an epistemic uncertainty term. Planning with this objective yields sublinear regret in both finite-horizon and infinite-horizon settings and produces substantially higher sample efficiency than model-free preference methods and non-optimistic model-based baselines on the Meta-World benchmark.

What carries the argument

The unified trajectory score that adds expected reward, terminal value, and epistemic uncertainty drawn from three separate ensembles.
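
As a rough illustration of this score, the Python sketch below sums the three terms for one candidate trajectory. It is not the authors' implementation: the function name `trajectory_score`, the weight `beta`, the array shapes, and the use of ensemble standard deviation as a proxy for epistemic uncertainty are all assumptions, since this page does not give the exact functional form.

```python
import numpy as np


def trajectory_score(reward_preds, next_state_preds, value_preds, beta=1.0):
    """Score one candidate trajectory (hypothetical form, not from the paper).

    reward_preds:     shape (E_r, H), per-step reward from each reward model.
    next_state_preds: shape (E_d, H, D), next states from each dynamics model.
    value_preds:      shape (E_v,), terminal value from each value model.
    beta:             assumed weight on the uncertainty term.
    """
    reward_preds = np.asarray(reward_preds, dtype=float)
    next_state_preds = np.asarray(next_state_preds, dtype=float)
    value_preds = np.asarray(value_preds, dtype=float)

    # Expected reward: ensemble mean per step, summed over the horizon.
    expected_reward = reward_preds.mean(axis=0).sum()
    # Terminal value: ensemble mean at the final state.
    terminal_value = value_preds.mean()

    # Assumed uncertainty proxy: disagreement (std) across ensemble members.
    reward_unc = reward_preds.std(axis=0).sum()
    dynamics_unc = next_state_preds.std(axis=0).mean(axis=-1).sum()
    value_unc = value_preds.std()
    uncertainty = reward_unc + dynamics_unc + value_unc

    return float(expected_reward + terminal_value + beta * uncertainty)


# Two reward models that disagree, two dynamics models and two value models that agree.
rewards = [[0.0, 0.0], [2.0, 2.0]]
next_states = np.zeros((2, 2, 3))
values = [2.0, 2.0]

score_no_bonus = trajectory_score(rewards, next_states, values, beta=0.0)
score_with_bonus = trajectory_score(rewards, next_states, values, beta=0.5)
```

With `beta=0.0` only the expected reward and terminal value count; raising `beta` rewards trajectories on which the ensemble members disagree.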

If this is right

Sublinear regret holds for finite-horizon and infinite-horizon preference-based settings under the stated assumptions.
Active planning over the three uncertainties reduces the number of preference queries needed relative to passive collection.
The method outperforms both model-free preference-based algorithms and non-optimistic model-based baselines in sample efficiency on Meta-World.
The single planning objective supplies an explicit exploitation-information tradeoff without separate heuristics.
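
The exploitation-information tradeoff in the last point can be made concrete with a toy selection rule. The sketch below is only an assumed simplification: the helper `select_candidate`, the weight `beta`, and the numbers are hypothetical, and each candidate is summarized by a single mean return and a single uncertainty value.

```python
import numpy as np


def select_candidate(mean_returns, uncertainties, beta):
    """Return the index of the candidate with the highest combined score."""
    scores = np.asarray(mean_returns, dtype=float) + beta * np.asarray(
        uncertainties, dtype=float
    )
    return int(np.argmax(scores))


# Candidate 0: well understood and slightly better on average.
# Candidate 1: slightly worse on average but far more uncertain.
mean_returns = [10.0, 9.0]
uncertainties = [0.1, 3.0]

exploit_choice = select_candidate(mean_returns, uncertainties, beta=0.0)
explore_choice = select_candidate(mean_returns, uncertainties, beta=1.0)
```

With `beta=0.0` the rule picks the known-good candidate; with `beta=1.0` the uncertainty bonus outweighs the one-unit gap in mean return and the less-explored candidate is chosen.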

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ensemble-based scoring could be tested in preference settings with noisy or inconsistent human labels to check robustness.
If the planning step can be approximated cheaply, the same balance might apply to other human-in-the-loop sequential tasks beyond robotics.
Extending the approach to continuous or high-dimensional preference data would reveal whether the uncertainty terms remain informative at scale.

Load-bearing premise

The sublinear regret guarantees rest on unspecified standard regularity assumptions holding for both finite-horizon and infinite-horizon preference-based settings.

What would settle it

An experiment in which UBP2 fails to show higher sample efficiency than the listed baselines on Meta-World tasks would falsify the empirical claim; a finite-horizon preference-based task that produces linear rather than sublinear regret would falsify the theoretical claim.
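
For readers who want the distinction between linear and sublinear regret spelled out, one common formalization is given below. It assumes an episodic setting with $T$ episodes, an optimal policy $\pi^\star$, and the policy $\pi_t$ played in episode $t$; the paper's own definition is not shown on this page and may differ.

```latex
R_T \;=\; \sum_{t=1}^{T} \left( V^{\pi^\star} - V^{\pi_t} \right),
\qquad
\text{regret is sublinear if } \lim_{T \to \infty} \frac{R_T}{T} = 0 .
```

Under this definition a bound such as $R_T = O(\sqrt{T})$ is sublinear, whereas $R_T = \Theta(T)$ is linear and would mean the average per-episode gap to the optimal policy never vanishes.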

Original abstract

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
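
To illustrate how a reward model can be learned from pairwise comparisons, the sketch below uses the Bradley-Terry preference model, a common choice in preference-based RL. Whether UBP2 uses exactly this model is an assumption, as the abstract does not say; the function names and the example rewards are hypothetical.

```python
import numpy as np


def preference_probability(rewards_a, rewards_b):
    """P(segment a is preferred over segment b) under a Bradley-Terry model.

    Each argument holds the per-step rewards a reward model assigns to a segment.
    """
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    return float(1.0 / (1.0 + np.exp(-diff)))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if a was preferred, 0.0 if b was."""
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return float(-(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps)))


# Identical predicted returns: the model has no opinion.
p_equal = preference_probability([1.0, 1.0], [1.0, 1.0])
# Segment a has a higher predicted return than segment b.
p_better = preference_probability([1.0, 1.0], [0.0, 0.0])
# The loss is smaller when the label agrees with the model's ranking.
loss_agree = preference_loss([1.0, 1.0], [0.0, 0.0], label=1.0)
loss_disagree = preference_loss([1.0, 1.0], [0.0, 0.0], label=0.0)
```

Minimizing this loss over a buffer of labeled pairs pushes the reward model to assign a higher summed reward to the preferred segment in each pair.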

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UBP2's main contribution is a joint planning score over reward, dynamics, and value ensembles that trades off performance and information gain in preference-based RL. Its Meta-World results are reported as substantially more sample-efficient than the baselines, but the regret claims rest on assumptions the abstract does not enumerate.

The paper's core move is to run model-based planning in preference RL using three separate ensembles and a single score that folds in expected reward, terminal value, and epistemic uncertainty. This produces an explicit exploration-exploitation tradeoff without extra heuristics.

What stands out as new is the specific combination: planning directly on the mixture of those three uncertainty sources rather than treating them separately or adding ad-hoc bonuses. The empirical section reports substantially better sample efficiency than model-free preference methods and non-optimistic model-based baselines on Meta-World.

The experiments appear to be the strongest part. If the implementation details and controls are solid, the gains address a genuine bottleneck in human-aligned RL.

The main weakness is the theory. Sublinear regret is claimed for both finite- and infinite-horizon cases under "standard regularity assumptions," but the abstract gives no list of those assumptions and no argument that they survive when the reward model is itself learned from pairwise preferences. The extra epistemic uncertainty and possible non-stationarity from the preference ensemble could easily invalidate the usual Lipschitz or boundedness conditions. Without seeing the full derivations, it's unclear whether the analysis accounts for this.

The paper is aimed at researchers working on sample-efficient preference RL and model-based exploration. Anyone already using ensembles in RL planning will recognize the setup and can judge whether the unified score is worth trying.

It should go to peer review. The empirical direction is concrete and the problem matters; the theory needs tightening but does not look circular or invented.

Referee Report

2 major / 2 minor

Summary. The paper introduces Uncertainty-Balanced Preference Planning (UBP2), a model-based preference-based RL algorithm that maintains ensembles over reward, dynamics, and value functions. Candidate trajectories are scored by a unified objective that trades off expected reward, terminal value, and epistemic uncertainty; planning under this score is claimed to yield an explicit exploration-exploitation tradeoff. Sublinear regret bounds are stated for both finite-horizon and infinite-horizon MDPs under standard regularity assumptions, and experiments on Meta-World are reported to show substantially higher sample efficiency than model-free preference baselines and non-optimistic model-based methods.

Significance. If the regret analysis is shown to apply to the preference-learned ensemble setting and the empirical gains prove robust, the work would provide a principled, uncertainty-aware planning mechanism for preference-based RL that avoids ad-hoc exploration bonuses. The explicit use of three separate ensembles and the unified planning score constitute a concrete technical contribution that could be adopted or extended by subsequent model-based preference methods.

major comments (2)

[Theoretical analysis] Theoretical analysis section: the sublinear regret guarantees for finite- and infinite-horizon settings are asserted under 'standard regularity assumptions,' yet the manuscript does not enumerate these assumptions nor demonstrate that they continue to hold once the reward is replaced by an ensemble trained on pairwise preferences. In particular, it is unclear whether the UBP2 planning score (which mixes expected reward, terminal value, and uncertainty) preserves the Lipschitz or boundedness conditions required by the regret proof when the reward model itself carries epistemic uncertainty.
[Section 4] Section 4 (planning algorithm): the claim that the unified score yields an 'explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics' is load-bearing for the method's novelty, but the precise functional form of the score (weights on reward, value, and uncertainty terms) and the planning procedure (e.g., how trajectories are generated and selected) are not shown to be free of implicit tuning parameters that could affect the reported sample-efficiency gains.

minor comments (2)

[Abstract] The abstract states that experiments use 'the Meta-World benchmark' without specifying which tasks, number of runs, or preference-query budget; these details should be stated explicitly in the experimental protocol.
[Preliminaries] Notation for the three ensembles (reward, dynamics, value) and the unified planning score should be introduced with consistent symbols early in the paper to improve readability of the subsequent analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below, indicating where revisions will be made to improve clarity and rigor.

Referee: [Theoretical analysis] Theoretical analysis section: the sublinear regret guarantees for finite- and infinite-horizon settings are asserted under 'standard regularity assumptions,' yet the manuscript does not enumerate these assumptions nor demonstrate that they continue to hold once the reward is replaced by an ensemble trained on pairwise preferences. In particular, it is unclear whether the UBP2 planning score (which mixes expected reward, terminal value, and uncertainty) preserves the Lipschitz or boundedness conditions required by the regret proof when the reward model itself carries epistemic uncertainty.

Authors: We agree that the assumptions should be stated explicitly rather than referenced as 'standard.' In the revised manuscript we will add a subsection that enumerates the regularity conditions (bounded rewards and values, Lipschitz continuity of dynamics and reward functions, and controlled ensemble variance) and supply a short argument that these conditions are inherited by the preference-trained reward ensemble: the ensemble mean is used for the reward estimate and remains Lipschitz under the same function class assumptions used in prior model-based RL analyses, while the separate uncertainty penalty is bounded by construction and does not violate the overall Lipschitz or boundedness requirements needed for the regret proof. revision: yes
Referee: [Section 4] Section 4 (planning algorithm): the claim that the unified score yields an 'explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics' is load-bearing for the method's novelty, but the precise functional form of the score (weights on reward, value, and uncertainty terms) and the planning procedure (e.g., how trajectories are generated and selected) are not shown to be free of implicit tuning parameters that could affect the reported sample-efficiency gains.

Authors: The manuscript presents the planning score as a single objective that explicitly incorporates the uncertainty term to drive information acquisition. We will revise Section 4 to restate the exact functional form of the score and the trajectory generation/selection procedure, and we will add a sentence clarifying that any scalar weights are fixed by the theoretical analysis and are not adjusted on a per-task or per-run basis. This makes the tradeoff principled rather than heuristic; if the referee still finds the description insufficiently precise, we are prepared to include pseudocode for the planning loop. revision: partial

Circularity Check

0 steps flagged

No circularity; regret bounds and method defined independently of inputs

full rationale

The provided abstract and description define UBP2 via ensembles of reward/dynamics/value models and a planning score mixing expected reward, terminal value, and epistemic uncertainty. Sublinear regret is asserted under external 'standard regularity assumptions' for finite/infinite horizons without any equations shown that reduce the bounds to the method's own fitted quantities or definitions. No self-citation load-bearing steps, self-definitional relations, or fitted-input-as-prediction patterns appear. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; no free parameters, invented entities, or additional axioms are visible beyond the single domain assumption required for the regret statement.

axioms (1)

domain assumption: Standard regularity assumptions
Invoked to establish sublinear regret guarantees for finite-horizon and infinite-horizon settings.

[1] [1]

Ho, Michael L

David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael L. Littman, Doina Precup, and Satinder Singh. On the expressivity of markov reward, 2022. URLhttps://arxiv.org/abs/2111.00876

work page arXiv 2022

[2] [2]

Deep Reinforcement Learning at the Edge of the Statistical Precipice,

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice, 2022. URLhttps://arxiv.org/abs/2108.13264

work page arXiv 2022

[3] [3]

Model-based offline planning, 2021

Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning, 2021. URLhttps://arxiv.org/abs/2008. 05556

2021

[4] [4]

Evaluating model- based planning and planner amortization for continuous control, 2021

Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, and Martin Riedmiller. Evaluating model- based planning and planner amortization for continuous control, 2021. URLhttps://arxiv.org/abs/2110.03363

work page arXiv 2021

[5] [5]

On Kernelized Multi-armed Bandits

Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits, 2017. URLhttps://arxiv.org/abs/ 1704.00445

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

2017

[7] [7]

Efficient model-based reinforcement learning through optimistic policy search and planning, 2020

Sebastian Curi, Felix Berkenkamp, and Andreas Krause. Efficient model-based reinforcement learning through optimistic policy search and planning, 2020. URLhttps://arxiv.org/abs/2006.08684

work page arXiv 2020

[8] [8]

Kroese, Shie Mannor, and Reuven Y

Pieter-Tjerk de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the cross-entropy method. Annals of Operations Research, 2005. URLhttps://people.smp.uq.edu.au/DirkKroese/ps/aortut.pdf

2005

[9] [9]

Finetuning offline world models in the real world, 2023

Yunhai Feng, Nicklas Hansen, Ziyan Xiong, Chandramouli Rajagopalan, and Xiaolong Wang. Finetuning offline world models in the real world, 2023. URLhttps://arxiv.org/abs/2310.16029. 11

work page arXiv 2023

[10] [10]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming,

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming,

[11] [11]

URLhttps://arxiv.org/abs/1309.5549

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

TD-MPC2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. In The Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id= Oxh5CstDJU

2024

[13] [13]

Few-shot preference learning for human-in-the-loop rl, 2022

Joey Hejna and Dorsa Sadigh. Few-shot preference learning for human-in-the-loop rl, 2022. URLhttps://arxiv.org/ abs/2212.03363

work page arXiv 2022

[14] [14]

Burt, and Javier González

David Janz, David R. Burt, and Javier González. Bandit optimisation of functions in the matérn kernel rkhs, 2023. URLhttps://arxiv.org/abs/2001.10396

work page arXiv 2023

[15] [15]

Reflect-then-plan: Offline model-based planning through a doubly bayesian lens, 2025

Jihwan Jeong, Xiaoyu Wang, Jingmin Wang, Scott Sanner, and Pascal Poupart. Reflect-then-plan: Offline model-based planning through a doubly bayesian lens, 2025. URLhttps://arxiv.org/abs/2506.06261

work page arXiv 2025

[16] [16]

Information theoretic regret bounds for online nonlinear control, 2020

Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control, 2020. URLhttps://arxiv.org/abs/2006.12466

work page arXiv 2020

[17] [17]

Morel : Model-based offline reinforcement learning, 2021

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel : Model-based offline reinforcement learning, 2021. URLhttps://arxiv.org/abs/2005.05951

work page arXiv 2021

[18] [18]

B-pref: Benchmarking preference-based reinforcement learning

Kimin Lee, Laura Smith, Anca Dragan, and Pieter Abbeel. B-pref: Benchmarking preference-based reinforcement learning. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URLhttps://openreview.net/forum?id=ps95-mkHF_

2021

[19] [19]

Reward uncertainty for exploration in preference-based reinforcement learning, 2022

Xinran Liang, Katherine Shu, Kimin Lee, and Pieter Abbeel. Reward uncertainty for exploration in preference-based reinforcement learning, 2022. URLhttps://arxiv.org/abs/2205.12401

work page arXiv 2022

[20] Runze Liu, Fengshuo Bai, Yali Du, and Yaodong Yang. Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=OZKBReUF-wX

[21] Yi Liu, Gaurav Datta, Ellen Novoseller, and Daniel S. Brown. Efficient preference-based reinforcement learning using learned dynamics models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2921–2928. IEEE, 2023.

[22] Katherine Metcalf, Miguel Sarabia, and Barry-John Theobald. Rewards encoding environment dynamics improves preference-based reinforcement learning. arXiv preprint arXiv:2211.06527, 2022.

[23] Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. Model-based reinforcement learning: A survey, 2022. URL https://arxiv.org/abs/2006.16712

[24] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, 2017. URL https://arxiv.org/abs/1708.02596

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Lab... DINOv2: Learning Robust Visual Features without Supervision. arXiv, 2024.

[26] A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562. University of California Press, 1961.

[27] Jonas Rothfuss, Bhavya Sukhija, Tobias Birchler, Parnian Kassraie, and Andreas Krause. Hallucinated adversarial control for conservative offline policy evaluation, 2023. URL https://arxiv.org/abs/2303.01076

[28] Rajiv Sambharya, Georgina Hall, Brandon Amos, and Bartolomeo Stellato. End-to-end learning to warm-start for real-time quadratic optimization, 2022. URL https://arxiv.org/abs/2212.08260

[29] Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures, 2025. URL https://arxiv.org/abs/2505.00779

[30] Harshit Sikchi, Wenxuan Zhou, and David Held. Learning off-policy with online planning, 2021. URL https://arxiv.org/abs/2008.10066

[31] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606. Cognitive Science Society, 2009.

[32] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265. doi: 10.1109/TIT.2011.2182033

[34] Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, and Andreas Krause. Optimistic active exploration of dynamical systems, 2023. URL https://arxiv.org/abs/2306.12371

[35] Bhavya Sukhija, Stelian Coros, Andreas Krause, Pieter Abbeel, and Carmelo Sferrazza. MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization, 2025. URL https://arxiv.org/abs/2412.12098

[36] Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Florian Dörfler, Pieter Abbeel, and Andreas Krause. Sombrl: Scalable and optimistic model-based RL, 2025. URL https://arxiv.org/abs/2511.20066

[37] Scott Sussex, Anastasiia Makarova, and Andreas Krause. Model-based causal Bayesian optimization, 2023. URL https://arxiv.org/abs/2211.10257

[38] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.

[39] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization, 2020. URL https://arxiv.org/abs/2005.13239

[40] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shively, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning, 2021. URL https://arxiv.org/abs/1910.10897

[41] Wenhao Zhan, Masatoshi Uehara, Wen Sun, and Jason D. Lee. Provable reward-agnostic preference-based reinforcement learning, 2024. URL https://arxiv.org/abs/2305.18505

A Algorithms

Algorithm 2 Optimistic Preference Pair Selection

Input: replay buffer $B$, reward ensemble $\{r_\theta^{(m)}\}_{m=1}^{E}$, pref. buffer $P$

Hyperparameters: segment length $L$, pairs to add $K$

Candidate...


a high-probability confidence relationship between the reward/dynamics model error and their uncertainty radii, and 2) a cumulative uncertainty bound on these uncertainty radii (in our case, via GP information gain bounds). GP posterior standard deviations provide these quantities in closed form, which is why they are used in Lemma 5.7 and Theorem 5.8. He...