pith. machine review for the scientific record.

arxiv: 2605.08946 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords: multi-objective reinforcement learning · Pareto front coverage · preference-conditioned policy · Tchebycheff scalarization · policy iteration · occupancy measures · actor-critic

The pith

Under mild conditions each preference maps to a unique Lipschitz-continuous Pareto-optimal return vector, enabling one policy to cover the front.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that smooth Tchebycheff scalarization in tabular multi-objective MDPs produces a one-to-one Lipschitz-continuous correspondence between preferences and Pareto-optimal return vectors when the preference set satisfies mild interior conditions. This correspondence supplies a rigorous basis for sweeping preferences to obtain dense Pareto coverage with a single policy. The authors introduce Concave Mirror Descent Policy Iteration, which attains an O(1/k) suboptimality rate and reduces each step to a KL-regularized MDP using the previous policy as reference. They implement the iteration as a deep actor-critic algorithm that preserves the regularization and report top average hypervolume rank on eight MO-Gymnasium tasks together with gains in continuous control.
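For concreteness: the smooth Tchebycheff scalarization named here is, in the common form of Lin et al. 2024 (which this line of work builds on; the exact sign and normalization conventions below are assumptions, not quoted from the paper), a log-sum-exp smoothing of the weighted Chebyshev gap to an ideal point $z^*$, written for maximization as

$$ u_\mu(J, \omega) = -\,\mu \log \sum_{i=1}^{m} \exp\!\left( \frac{\omega_i (z_i^* - J_i)}{\mu} \right), \qquad u_\mu(J, \omega) \to -\max_{1 \le i \le m} \omega_i (z_i^* - J_i) \ \text{ as } \mu \to 0^+. $$

The smoothing parameter $\mu > 0$ makes the utility differentiable and strictly increasing in every $J_i$ with $\omega_i > 0$, which is what lets a gradient-based policy update target one point on the front per preference $\omega$.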

Core claim

Under mild interior conditions on the preference set, smooth Tchebycheff scalarization induces a unique Pareto-optimal return vector for each preference that depends Lipschitz-continuously on it. The problem is formulated over occupancy measures and solved by Concave Mirror Descent Policy Iteration, which achieves O(1/k) objective-suboptimality and is equivalent at each step to solving a Kullback-Leibler-regularized MDP with the prior policy as reference; the resulting deep actor-critic instantiation covers the Pareto set on MO-Gymnasium benchmarks.
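Read as a sketch (the paper's exact Bregman geometry and step sizes are not reproduced on this page): the occupancy-measure formulation maximizes $u(J(d), \omega)$ over the polytope $\mathcal{D}$ of discounted state-action occupancy measures, where $J_i(d) = \sum_{s,a} d(s,a)\, r_i(s,a)$ is linear in $d$ and $u$ is concave. One mirror-descent step linearizes the utility at the current iterate and penalizes movement with a KL-type divergence,

$$ d_{k+1} \in \arg\max_{d \in \mathcal{D}} \; \sum_{s,a} d(s,a) \left[ \nabla_J u(J(d_k), \omega)^\top r(s,a) \right] - \frac{1}{\eta_k} D\big(d \,\big\|\, d_k\big). $$

When $D$ is the conditional relative entropy between the policies the occupancies induce, this subproblem is exactly a KL-regularized MDP whose scalar reward is the gradient-weighted mix of the objective rewards and whose reference policy is $\pi_k$, which is the policy-iteration reading the core claim describes.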

What carries the argument

Concave Mirror Descent Policy Iteration (CMDPI) over occupancy measures, which equates each update to a KL-regularized MDP with the previous policy as reference.
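A minimal tabular sketch of the update that equivalence describes, assuming a small MOMDP given as a transition tensor P[s, a, s'], vector rewards r_vec[s, a, i], an initial distribution rho0, a preference w, and an ideal point z_star; the step size eta, the smoothing mu, and the soft-value-iteration inner solver are illustrative choices, not the paper's implementation:

    import numpy as np

    def stch_grad(J, w, z_star, mu=0.1):
        # Gradient of u(J) = -mu * log sum_i exp(w_i (z*_i - J_i) / mu):
        # a softmax over weighted gaps, re-weighted by w.
        g = w * (z_star - J) / mu
        p = np.exp(g - g.max()); p /= p.sum()
        return p * w

    def solve_kl_regularized_mdp(P, r, pi_ref, gamma=0.99, eta=5.0, sweeps=300):
        # Soft value iteration for max_pi E[sum_t gamma^t (r - (1/eta) KL(pi || pi_ref))].
        S, A = r.shape
        V = np.zeros(S)
        for _ in range(sweeps):
            Q = r + gamma * np.einsum('sat,t->sa', P, V)
            logits = np.log(pi_ref + 1e-12) + eta * Q
            m = logits.max(axis=1)
            V = (m + np.log(np.exp(logits - m[:, None]).sum(axis=1))) / eta
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        return pi / pi.sum(axis=1, keepdims=True)

    def returns(P, r_vec, pi, rho0, gamma=0.99):
        # Per-objective discounted returns J_i = <d_pi, r_i>, via a linear
        # solve for the (unnormalized) discounted state occupancy.
        S = P.shape[0]
        P_pi = np.einsum('sat,sa->st', P, pi)
        occ_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
        d = occ_s[:, None] * pi
        return np.einsum('sa,sam->m', d, r_vec)

    def cmdpi_step(P, r_vec, pi_k, rho0, w, z_star):
        # One CMDPI-style step: linearize the smooth Tchebycheff utility at
        # the current returns, then solve the induced KL-regularized MDP
        # with the previous policy pi_k as the reference.
        g = stch_grad(returns(P, r_vec, pi_k, rho0), w, z_star)
        r_scalar = np.einsum('sam,m->sa', r_vec, g)
        return solve_kl_regularized_mdp(P, r_scalar, pi_k)

Iterating cmdpi_step for a fixed w drives the policy toward that preference's target return vector; sweeping w over the interior of the simplex is what the coverage claim is about.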

If this is right

  • CMDPI attains an O(1/k) rate of objective-suboptimality.
  • Each policy update is exactly equivalent to solving a KL-regularized MDP.
  • The learned policy is continuous in the preference parameter across finite iterations.
  • The deep instantiation achieves the best average hypervolume rank among recent baselines on eight MO-Gymnasium tasks (the hypervolume metric itself is sketched just after this list).
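Since the headline metric above is hypervolume, here is a minimal sketch of its two-objective maximization form, assuming a fixed reference point ref that every solution dominates; benchmark suites compute exact or approximate variants of the same quantity in higher dimensions:

    import numpy as np

    def hypervolume_2d(points, ref):
        # Area jointly dominated by a 2-D solution set (maximization),
        # measured against the reference point ref.
        pts = np.asarray([p for p in points if p[0] > ref[0] and p[1] > ref[1]])
        if len(pts) == 0:
            return 0.0
        pts = pts[np.argsort(-pts[:, 0])]   # sweep from largest first objective
        hv, best_y = 0.0, ref[1]
        for x, y in pts:
            if y > best_y:                  # point is non-dominated so far
                hv += (x - ref[0]) * (y - best_y)
                best_y = y
        return hv

    # Three mutually non-dominated points, reference at the origin:
    assert hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(0.0, 0.0)) == 6.0

A denser, more uniform Pareto front covers more of this area, which is why hypervolume rank serves as the paper's summary statistic across tasks.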

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The continuity result may allow nearby preferences to share policy parameters without full retraining.
  • The same occupancy-measure formulation could be applied to other monotone scalarizations that satisfy analogous interior conditions.
  • Gains observed in continuous-control experiments suggest the method scales beyond discrete actions when the actor-critic approximation remains faithful to the KL-regularized update (a minimal sketch of such an update follows this list).
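To make the last bullet concrete, a minimal sketch of what "preserving previous-policy regularization" can look like in a deep, preference-conditioned actor update; the network shapes, the coefficient beta, and the categorical action space are hypothetical choices, not the paper's architecture:

    import torch
    import torch.nn as nn

    class PreferenceConditionedActor(nn.Module):
        # pi(a | s, w): the preference vector w is an input, so one network
        # can represent the whole preference-indexed family of policies.
        def __init__(self, state_dim, pref_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + pref_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state, pref):
            logits = self.net(torch.cat([state, pref], dim=-1))
            return torch.distributions.Categorical(logits=logits)

    def actor_loss(actor, old_actor, state, pref, action, advantage, beta=0.1):
        # Policy-gradient term plus a KL penalty toward the *previous*
        # policy, mirroring the KL-regularized-MDP view of each CMDPI update.
        dist = actor(state, pref)
        with torch.no_grad():
            old_dist = old_actor(state, pref)
        pg = -(dist.log_prob(action) * advantage).mean()
        kl = torch.distributions.kl_divergence(dist, old_dist).mean()
        return pg + beta * kl

If the KL term were dropped or swapped for entropy alone, the update would no longer match the mirror-descent step the theory analyzes, which is exactly the fidelity caveat in the bullet above.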

Load-bearing premise

Mild interior conditions on the preference set are needed to guarantee that each preference produces a unique Pareto-optimal return vector that changes continuously with the preference.
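The page never pins the condition down (the referee flags exactly this), but the natural reading, consistent with the simulated rebuttal below, is that preferences live in the relative interior of the probability simplex:

$$ \omega \in \Delta_m^{\circ} = \left\{ \omega \in \mathbb{R}^m : \sum_{i=1}^m \omega_i = 1, \;\; \omega_i > 0 \text{ for all } i \right\}, $$

so every objective keeps strictly positive weight. The counter-example sketch after the rebuttal shows how uniqueness can fail on the boundary where some $\omega_i = 0$.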

What would settle it

An explicit preference vector inside the interior region for which two distinct Pareto-optimal return vectors yield the same scalarized value, or a sequence of preferences converging to a limit preference whose optimal return vectors fail to converge.

Figures

Figures reproduced from arXiv: 2605.08946 by Akihiro Kubo, Kosuke Nakanishi, Shin Ishii.

Figure 1. Illustration of a key limitation of existing methods: linear scalarization recovers only vertex solutions of the Pareto front (black curve), while CAPQL exhibits biased coverage. CMDPI (ours) achieves denser and more uniform Pareto-optimal coverage. To solve the resulting nonlinear scalarized problem without altering the objective, we derive mirror descent in the occupation-measure space. Using the Bregm…
Figure 2. Scatter plots of the converged objective vectors in a two-objective MOMDP, obtained …
Figure 3. Final-step return scatter plots for five two- or three-objective tasks.
Figure 4. Hypervolume of model-free methods over environment steps on eight continuous action …
Figure 5. Expected Utility Metric of model-free methods over environment steps on eight continuous …
Figure 6. Hypervolume of model-based methods over environment steps on eight continuous action …
Figure 7. Expected Utility Metric of model-based methods over environment steps on eight continuous …
Figure 8. Final-step (2M steps) sum-of-reward-vectors scatter plots for model-free methods on …
Figure 9. Final-step (500K steps) sum-of-reward-vectors scatter plots for model-based methods on …
read the original abstract

Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies preference-conditioned multi-objective RL in tabular MDPs using smooth Tchebycheff scalarization. Under mild interior conditions on the preference set, it proves that each preference maps to a unique Pareto-optimal return vector with Lipschitz continuity. It introduces Concave Mirror Descent Policy Iteration (CMDPI) achieving O(1/k) suboptimality, shows each update equals solving a KL-regularized MDP with the prior policy as reference, and implements this as a deep actor-critic algorithm that ranks best in average hypervolume on eight MO-Gymnasium tasks.

Significance. If the mapping theorem and convergence hold, the work supplies a principled basis for learning a single policy that densely covers Pareto fronts via preference sweeping, which is useful for applications requiring explicit trade-offs. The KL-regularized MDP equivalence offers a practical policy-iteration view and finite-iterate continuity, while the empirical results on discrete and continuous control tasks demonstrate competitive performance against recent baselines.

major comments (2)
  1. [Abstract and theoretical analysis section] The 'mild interior conditions' guaranteeing uniqueness and Lipschitz continuity of the preference-to-Pareto mapping are invoked but never stated precisely (e.g., whether they require all preference components to be strictly positive or the vector to lie in the relative interior of the simplex). This assumption is load-bearing for the CMDPI rate and the policy-iteration interpretation, yet its necessity is not demonstrated by counter-example or boundary case.
  2. [CMDPI derivation and equivalence claim] The O(1/k) objective-suboptimality rate and the statement that each update is equivalent to a KL-regularized MDP with the previous policy as reference are presented as following from the occupancy-measure formulation, but the manuscript provides no explicit derivation steps, error bounds, or verification that the smooth Tchebycheff scalarization preserves the required monotonicity and concavity for the mirror-descent analysis to go through.
minor comments (2)
  1. [Experiments] The experimental section would benefit from an explicit statement of how preferences are sampled during training and evaluation to support reproducibility of the hypervolume results.
  2. [Notation] Notation for occupancy measures and the scalarization function should be introduced once and used consistently; occasional re-use of symbols for different quantities appears in the background and method sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical results. We address each major comment below and will revise the manuscript accordingly to improve precision and completeness.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis section] The 'mild interior conditions' guaranteeing uniqueness and Lipschitz continuity of the preference-to-Pareto mapping are invoked but never stated precisely (e.g., whether they require all preference components to be strictly positive or the vector to lie in the relative interior of the simplex). This assumption is load-bearing for the CMDPI rate and the policy-iteration interpretation, yet its necessity is not demonstrated by counter-example or boundary case.

    Authors: We agree that the precise statement of the interior conditions is missing and should be made explicit. These conditions require that the preference vector lies in the relative interior of the probability simplex (i.e., all components strictly positive). This ensures the smooth Tchebycheff scalarization yields a strictly concave objective, enabling uniqueness and Lipschitz continuity of the preference-to-Pareto mapping. In the revision we will add the exact definition to the abstract and theory section, explain its role in the CMDPI analysis, and include a brief discussion of boundary cases (with a simple counter-example sketch, of the kind shown after these responses) where uniqueness can fail when a component is zero. revision: yes

  2. Referee: [CMDPI derivation and equivalence claim] The O(1/k) objective-suboptimality rate and the statement that each update is equivalent to a KL-regularized MDP with the previous policy as reference are presented as following from the occupancy-measure formulation, but the manuscript provides no explicit derivation steps, error bounds, or verification that the smooth Tchebycheff scalarization preserves the required monotonicity and concavity for the mirror-descent analysis to go through.

    Authors: We acknowledge that the current manuscript omits the full step-by-step derivation. The O(1/k) rate follows from applying concave mirror descent to the occupancy-measure formulation of the smooth Tchebycheff objective; each CMDPI update is exactly equivalent to solving a KL-regularized MDP whose reference policy is the previous iterate. The smooth Tchebycheff scalarization preserves monotonicity and concavity under the interior conditions. In the revision we will insert the explicit derivation (including the key lemmas on concavity preservation and the error-bound analysis) into the main text or a dedicated appendix subsection, thereby making the policy-iteration view and finite-iterate continuity fully rigorous. revision: yes
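A minimal sketch of the boundary failure gestured at in response 1, for the two-objective case with one zero weight (the construction is illustrative; the paper's own counter-example, if it has one, is not shown on this page). Take $\omega = (1, 0)$, ideal point $z^* = (1, 2)$, and two achievable return vectors $J = (1, 0)$ and $J' = (1, 2)$. Then

$$ u_\mu(J, \omega) = -\mu \log\!\left( e^{(1-1)/\mu} + e^{0 \cdot (2-0)/\mu} \right) = -\mu \log 2 = u_\mu(J', \omega), $$

because the zero-weighted term is constant in the second coordinate, and since $z^*$ is the ideal point no achievable $J_1$ exceeds $1$. Both vectors are therefore optimal for the scalarization even though $J'$ Pareto-dominates $J$, so the argmax is not unique; strict positivity of every $\omega_i$ rules this degeneracy out.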

Circularity Check

0 steps flagged

No significant circularity; central claims rest on standard MDP occupancy-measure and mirror-descent analysis

full rationale

The paper proves uniqueness and Lipschitz continuity of the preference-to-Pareto mapping under mild interior conditions on the preference set, using smooth Tchebycheff scalarization as a monotone utility. CMDPI and its O(1/k) rate are derived from concave mirror descent over occupancy measures, a standard technique independent of any fitted parameters or self-referential predictions. The KL-regularized MDP equivalence follows directly from the update rule without reducing to prior self-citations or ansatzes. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of mild interior conditions for the uniqueness and continuity proofs and on standard mathematical properties of mirror descent and occupancy measures; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Mild interior conditions on the preference set ensure uniqueness and Lipschitz continuity of the preference-to-Pareto mapping.
    Invoked to guarantee that each preference induces a unique Pareto-optimal return vector.

pith-pipeline@v0.9.0 · 5523 in / 1426 out tokens · 54461 ms · 2026-05-12T02:17:46.071971+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
