pith. machine review for the scientific record.

arxiv: 2605.08946 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords: multi-objective reinforcement learning · Pareto front coverage · preference-conditioned policy · Tchebycheff scalarization · policy iteration · occupancy measures · actor-critic

The pith

Under mild conditions each preference maps to a unique Lipschitz-continuous Pareto-optimal return vector, enabling one policy to cover the front.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that smooth Tchebycheff scalarization in tabular multi-objective MDPs produces a one-to-one Lipschitz-continuous correspondence between preferences and Pareto-optimal return vectors when the preference set satisfies mild interior conditions. This correspondence supplies a rigorous basis for sweeping preferences to obtain dense Pareto coverage with a single policy. The authors introduce Concave Mirror Descent Policy Iteration, which attains an O(1/k) suboptimality rate and reduces each step to a KL-regularized MDP using the previous policy as reference. They implement the iteration as a deep actor-critic algorithm that preserves the regularization and report top average hypervolume rank on eight MO-Gymnasium tasks together with gains in continuous control.
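For concreteness: the smooth Tchebycheff scalarization named here is, in the common form of Lin et al. 2024 (which this line of work builds on; the exact sign and normalization conventions below are assumptions, not quoted from the paper), a log-sum-exp smoothing of the weighted Chebyshev gap to an ideal point $z^*$, written for maximization as

$$ u_\mu(J, \omega) = -\,\mu \log \sum_{i=1}^{m} \exp\!\left( \frac{\omega_i (z_i^* - J_i)}{\mu} \right), \qquad u_\mu(J, \omega) \to -\max_{1 \le i \le m} \omega_i (z_i^* - J_i) \ \text{ as } \mu \to 0^+. $$

The smoothing parameter $\mu > 0$ makes the utility differentiable and strictly increasing in every $J_i$ with $\omega_i > 0$, which is what lets a gradient-based policy update target one point on the front per preference $\omega$.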

Core claim

Under mild interior conditions on the preference set, smooth Tchebycheff scalarization induces a unique Pareto-optimal return vector for each preference that depends Lipschitz-continuously on it. The problem is formulated over occupancy measures and solved by Concave Mirror Descent Policy Iteration, which achieves O(1/k) objective-suboptimality and is equivalent at each step to solving a Kullback-Leibler-regularized MDP with the prior policy as reference; the resulting deep actor-critic instantiation covers the Pareto set on MO-Gymnasium benchmarks.
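Read as a sketch (the paper's exact Bregman geometry and step sizes are not reproduced on this page): the occupancy-measure formulation maximizes $u(J(d), \omega)$ over the polytope $\mathcal{D}$ of discounted state-action occupancy measures, where $J_i(d) = \sum_{s,a} d(s,a)\, r_i(s,a)$ is linear in $d$ and $u$ is concave. One mirror-descent step linearizes the utility at the current iterate and penalizes movement with a KL-type divergence,

$$ d_{k+1} \in \arg\max_{d \in \mathcal{D}} \; \sum_{s,a} d(s,a) \left[ \nabla_J u(J(d_k), \omega)^\top r(s,a) \right] - \frac{1}{\eta_k} D\big(d \,\big\|\, d_k\big). $$

When $D$ is the conditional relative entropy between the policies the occupancies induce, this subproblem is exactly a KL-regularized MDP whose scalar reward is the gradient-weighted mix of the objective rewards and whose reference policy is $\pi_k$, which is the policy-iteration reading the core claim describes.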

What carries the argument

Concave Mirror Descent Policy Iteration (CMDPI) over occupancy measures, which equates each update to a KL-regularized MDP with the previous policy as reference.
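A minimal tabular sketch of the update that equivalence describes, assuming a small MOMDP given as a transition tensor P[s, a, s'], vector rewards r_vec[s, a, i], an initial distribution rho0, a preference w, and an ideal point z_star; the step size eta, the smoothing mu, and the soft-value-iteration inner solver are illustrative choices, not the paper's implementation:

    import numpy as np

    def stch_grad(J, w, z_star, mu=0.1):
        # Gradient of u(J) = -mu * log sum_i exp(w_i (z*_i - J_i) / mu):
        # a softmax over weighted gaps, re-weighted by w.
        g = w * (z_star - J) / mu
        p = np.exp(g - g.max()); p /= p.sum()
        return p * w

    def solve_kl_regularized_mdp(P, r, pi_ref, gamma=0.99, eta=5.0, sweeps=300):
        # Soft value iteration for max_pi E[sum_t gamma^t (r - (1/eta) KL(pi || pi_ref))].
        S, A = r.shape
        V = np.zeros(S)
        for _ in range(sweeps):
            Q = r + gamma * np.einsum('sat,t->sa', P, V)
            logits = np.log(pi_ref + 1e-12) + eta * Q
            m = logits.max(axis=1)
            V = (m + np.log(np.exp(logits - m[:, None]).sum(axis=1))) / eta
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        return pi / pi.sum(axis=1, keepdims=True)

    def returns(P, r_vec, pi, rho0, gamma=0.99):
        # Per-objective discounted returns J_i = <d_pi, r_i>, via a linear
        # solve for the (unnormalized) discounted state occupancy.
        S = P.shape[0]
        P_pi = np.einsum('sat,sa->st', P, pi)
        occ_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
        d = occ_s[:, None] * pi
        return np.einsum('sa,sam->m', d, r_vec)

    def cmdpi_step(P, r_vec, pi_k, rho0, w, z_star):
        # One CMDPI-style step: linearize the smooth Tchebycheff utility at
        # the current returns, then solve the induced KL-regularized MDP
        # with the previous policy pi_k as the reference.
        g = stch_grad(returns(P, r_vec, pi_k, rho0), w, z_star)
        r_scalar = np.einsum('sam,m->sa', r_vec, g)
        return solve_kl_regularized_mdp(P, r_scalar, pi_k)

Iterating cmdpi_step for a fixed w drives the policy toward that preference's target return vector; sweeping w over the interior of the simplex is what the coverage claim is about.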

If this is right

  • CMDPI attains an O(1/k) rate of objective-suboptimality.
  • Each policy update is exactly equivalent to solving a KL-regularized MDP.
  • The learned policy is continuous in the preference parameter across finite iterations.
  • The deep instantiation achieves the best average hypervolume rank among recent baselines on eight MO-Gymnasium tasks (the hypervolume metric itself is sketched just after this list).
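Since the headline metric above is hypervolume, here is a minimal sketch of its two-objective maximization form, assuming a fixed reference point ref that every solution dominates; benchmark suites compute exact or approximate variants of the same quantity in higher dimensions:

    import numpy as np

    def hypervolume_2d(points, ref):
        # Area jointly dominated by a 2-D solution set (maximization),
        # measured against the reference point ref.
        pts = np.asarray([p for p in points if p[0] > ref[0] and p[1] > ref[1]])
        if len(pts) == 0:
            return 0.0
        pts = pts[np.argsort(-pts[:, 0])]   # sweep from largest first objective
        hv, best_y = 0.0, ref[1]
        for x, y in pts:
            if y > best_y:                  # point is non-dominated so far
                hv += (x - ref[0]) * (y - best_y)
                best_y = y
        return hv

    # Three mutually non-dominated points, reference at the origin:
    assert hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], ref=(0.0, 0.0)) == 6.0

A denser, more uniform Pareto front covers more of this area, which is why hypervolume rank serves as the paper's summary statistic across tasks.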

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The continuity result may allow nearby preferences to share policy parameters without full retraining.
  • The same occupancy-measure formulation could be applied to other monotone scalarizations that satisfy analogous interior conditions.
  • Gains observed in continuous-control experiments suggest the method scales beyond discrete actions when the actor-critic approximation remains faithful to the KL-regularized update (a minimal sketch of such an update follows this list).
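To make the last bullet concrete, a minimal sketch of what "preserving previous-policy regularization" can look like in a deep, preference-conditioned actor update; the network shapes, the coefficient beta, and the categorical action space are hypothetical choices, not the paper's architecture:

    import torch
    import torch.nn as nn

    class PreferenceConditionedActor(nn.Module):
        # pi(a | s, w): the preference vector w is an input, so one network
        # can represent the whole preference-indexed family of policies.
        def __init__(self, state_dim, pref_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + pref_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state, pref):
            logits = self.net(torch.cat([state, pref], dim=-1))
            return torch.distributions.Categorical(logits=logits)

    def actor_loss(actor, old_actor, state, pref, action, advantage, beta=0.1):
        # Policy-gradient term plus a KL penalty toward the *previous*
        # policy, mirroring the KL-regularized-MDP view of each CMDPI update.
        dist = actor(state, pref)
        with torch.no_grad():
            old_dist = old_actor(state, pref)
        pg = -(dist.log_prob(action) * advantage).mean()
        kl = torch.distributions.kl_divergence(dist, old_dist).mean()
        return pg + beta * kl

If the KL term were dropped or swapped for entropy alone, the update would no longer match the mirror-descent step the theory analyzes, which is exactly the fidelity caveat in the bullet above.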

Load-bearing premise

Mild interior conditions on the preference set are needed to guarantee that each preference produces a unique Pareto-optimal return vector that changes continuously with the preference.
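The page never pins the condition down (the referee flags exactly this), but the natural reading, consistent with the simulated rebuttal below, is that preferences live in the relative interior of the probability simplex:

$$ \omega \in \Delta_m^{\circ} = \left\{ \omega \in \mathbb{R}^m : \sum_{i=1}^m \omega_i = 1, \;\; \omega_i > 0 \text{ for all } i \right\}, $$

so every objective keeps strictly positive weight. The counter-example sketch after the rebuttal shows how uniqueness can fail on the boundary where some $\omega_i = 0$.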

What would settle it

An explicit preference vector inside the interior region for which two distinct Pareto-optimal return vectors yield the same scalarized value, or a sequence of preferences converging to a limit preference whose optimal return vectors fail to converge.

Figures

Figures reproduced from arXiv: 2605.08946 by Akihiro Kubo, Kosuke Nakanishi, Shin Ishii.

Figure 1. Illustration of a key limitation of existing methods: linear scalarization recovers only vertex solutions of the Pareto front (black curve), while CAPQL exhibits biased coverage. CMDPI (ours) achieves denser and more uniform Pareto-optimal coverage. To solve the resulting nonlinear scalarized problem without altering the objective, we derive mirror descent in the occupation-measure space. Using the Bregm…
Figure 2. Scatter plots of the converged objective vectors in a two-objective MOMDP, obtained …
Figure 3. Final-step return scatter plots for five two- or three-objective tasks.
Figure 4. Hypervolume of model-free methods over environment steps on eight continuous action …
Figure 5. Expected Utility Metric of model-free methods over environment steps on eight continuous …
Figure 6. Hypervolume of model-based methods over environment steps on eight continuous action …
Figure 7. Expected Utility Metric of model-based methods over environment steps on eight continuous …
Figure 8. Final-step (2M steps) sum-of-reward-vectors scatter plots for model-free methods on …
Figure 9. Final-step (500K steps) sum-of-reward-vectors scatter plots for model-based methods on …
read the original abstract

Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We further show that each update is equivalent to solving a Kullback-Leibler-regularized MDP with the previous policy as reference, yielding a policy-iteration interpretation and finite-iterate policy continuity across preferences. We instantiate the update as a deep actor-critic algorithm preserving previous-policy regularization. On eight MO-Gymnasium tasks, it achieves the best average hypervolume rank among recent baselines and strong expected-utility performance. Continuous-control experiments indicate gains beyond the discrete-action setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies preference-conditioned multi-objective RL in tabular MDPs using smooth Tchebycheff scalarization. Under mild interior conditions on the preference set, it proves that each preference maps to a unique Pareto-optimal return vector with Lipschitz continuity. It introduces Concave Mirror Descent Policy Iteration (CMDPI) achieving O(1/k) suboptimality, shows each update equals solving a KL-regularized MDP with the prior policy as reference, and implements this as a deep actor-critic algorithm that ranks best in average hypervolume on eight MO-Gymnasium tasks.

Significance. If the mapping theorem and convergence hold, the work supplies a principled basis for learning a single policy that densely covers Pareto fronts via preference sweeping, which is useful for applications requiring explicit trade-offs. The KL-regularized MDP equivalence offers a practical policy-iteration view and finite-iterate continuity, while the empirical results on discrete and continuous control tasks demonstrate competitive performance against recent baselines.

major comments (2)
  1. [Abstract and theoretical analysis section] The 'mild interior conditions' guaranteeing uniqueness and Lipschitz continuity of the preference-to-Pareto mapping are invoked but never stated precisely (e.g., whether they require all preference components to be strictly positive or the vector to lie in the relative interior of the simplex). This assumption is load-bearing for the CMDPI rate and the policy-iteration interpretation, yet its necessity is not demonstrated by counter-example or boundary case.
  2. [CMDPI derivation and equivalence claim] The O(1/k) objective-suboptimality rate and the statement that each update is equivalent to a KL-regularized MDP with the previous policy as reference are presented as following from the occupancy-measure formulation, but the manuscript provides no explicit derivation steps, error bounds, or verification that the smooth Tchebycheff scalarization preserves the required monotonicity and concavity for the mirror-descent analysis to go through.
minor comments (2)
  1. [Experiments] The experimental section would benefit from an explicit statement of how preferences are sampled during training and evaluation to support reproducibility of the hypervolume results.
  2. [Notation] Notation for occupancy measures and the scalarization function should be introduced once and used consistently; occasional re-use of symbols for different quantities appears in the background and method sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical results. We address each major comment below and will revise the manuscript accordingly to improve precision and completeness.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis section] The 'mild interior conditions' guaranteeing uniqueness and Lipschitz continuity of the preference-to-Pareto mapping are invoked but never stated precisely (e.g., whether they require all preference components to be strictly positive or the vector to lie in the relative interior of the simplex). This assumption is load-bearing for the CMDPI rate and the policy-iteration interpretation, yet its necessity is not demonstrated by counter-example or boundary case.

    Authors: We agree that the precise statement of the interior conditions is missing and should be made explicit. These conditions require that the preference vector lies in the relative interior of the probability simplex (i.e., all components strictly positive). This ensures the smooth Tchebycheff scalarization yields a strictly concave objective, enabling uniqueness and Lipschitz continuity of the preference-to-Pareto mapping. In the revision we will add the exact definition to the abstract and theory section, explain its role in the CMDPI analysis, and include a brief discussion of boundary cases (with a simple counter-example sketch, of the kind shown after these responses) where uniqueness can fail when a component is zero. revision: yes

  2. Referee: [CMDPI derivation and equivalence claim] The O(1/k) objective-suboptimality rate and the statement that each update is equivalent to a KL-regularized MDP with the previous policy as reference are presented as following from the occupancy-measure formulation, but the manuscript provides no explicit derivation steps, error bounds, or verification that the smooth Tchebycheff scalarization preserves the required monotonicity and concavity for the mirror-descent analysis to go through.

    Authors: We acknowledge that the current manuscript omits the full step-by-step derivation. The O(1/k) rate follows from applying concave mirror descent to the occupancy-measure formulation of the smooth Tchebycheff objective; each CMDPI update is exactly equivalent to solving a KL-regularized MDP whose reference policy is the previous iterate. The smooth Tchebycheff scalarization preserves monotonicity and concavity under the interior conditions. In the revision we will insert the explicit derivation (including the key lemmas on concavity preservation and the error-bound analysis) into the main text or a dedicated appendix subsection, thereby making the policy-iteration view and finite-iterate continuity fully rigorous. revision: yes
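A minimal sketch of the boundary failure gestured at in response 1, for the two-objective case with one zero weight (the construction is illustrative; the paper's own counter-example, if it has one, is not shown on this page). Take $\omega = (1, 0)$, ideal point $z^* = (1, 2)$, and two achievable return vectors $J = (1, 0)$ and $J' = (1, 2)$. Then

$$ u_\mu(J, \omega) = -\mu \log\!\left( e^{(1-1)/\mu} + e^{0 \cdot (2-0)/\mu} \right) = -\mu \log 2 = u_\mu(J', \omega), $$

because the zero-weighted term is constant in the second coordinate, and since $z^*$ is the ideal point no achievable $J_1$ exceeds $1$. Both vectors are therefore optimal for the scalarization even though $J'$ Pareto-dominates $J$, so the argmax is not unique; strict positivity of every $\omega_i$ rules this degeneracy out.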

Circularity Check

0 steps flagged

No significant circularity; central claims rest on standard MDP occupancy-measure and mirror-descent analysis

full rationale

The paper proves uniqueness and Lipschitz continuity of the preference-to-Pareto mapping under mild interior conditions on the preference set, using smooth Tchebycheff scalarization as a monotone utility. CMDPI and its O(1/k) rate are derived from concave mirror descent over occupancy measures, a standard technique independent of any fitted parameters or self-referential predictions. The KL-regularized MDP equivalence follows directly from the update rule without reducing to prior self-citations or ansatzes. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of mild interior conditions for the uniqueness and continuity proofs and on standard mathematical properties of mirror descent and occupancy measures; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Mild interior conditions on the preference set ensure uniqueness and Lipschitz continuity of the preference-to-Pareto mapping.
    Invoked to guarantee that each preference induces a unique Pareto-optimal return vector.

pith-pipeline@v0.9.0 · 5523 in / 1426 out tokens · 54461 ms · 2026-05-12T02:17:46.071971+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
