pith. machine review for the scientific record.

arxiv: 2604.15757 · v1 · submitted 2026-04-17 · 💻 cs.LG


Multi-objective Reinforcement Learning With Augmented States Requires Rewards After Deployment


Pith reviewed 2026-05-10 09:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-objective reinforcement learning · augmented states · non-linear utility functions · reward signal access · policy conditioning · deployment requirements

The pith

Multi-objective RL agents using augmented states for non-linear utilities must retain access to reward signals after deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that multi-objective reinforcement learning with non-linear utility functions requires policies conditioned on both the current environmental state and the rewards accrued so far. This conditioning is achieved in practice by augmenting the state vector with the discounted sum of past rewards. Because the policy depends on this augmented information, the agent continues to need the reward signal (or a proxy) to select actions correctly, even once training ends and the policy is fixed. Prior literature has adopted augmented states without noting this ongoing requirement. A reader would care because it restricts where such agents can be deployed without continuous reward feedback.
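For readers who want the objects pinned down, here is a minimal formal sketch in the MORL notation that is standard in this literature (the symbols are ours and are assumed rather than quoted from the paper): with a non-linear utility u, the quantity being optimised and the augmented state the policy acts on are

```latex
% Standard MORL value formulations (notation assumed, not taken verbatim from the paper).
% Scalarised Expected Return (SER): the utility is applied outside the expectation.
V^{\pi}_{u,\mathrm{SER}} = u\!\left(\mathbb{E}\Bigl[\textstyle\sum_{i=0}^{\infty}\gamma^{i}\mathbf{r}_{i}\,\Big|\,\pi, s_{0}\Bigr]\right)
% Expected Scalarised Return (ESR): the utility is applied inside the expectation,
% suited to trade-offs that must be realised within each individual episode.
V^{\pi}_{u,\mathrm{ESR}} = \mathbb{E}\Bigl[u\bigl(\textstyle\sum_{i=0}^{\infty}\gamma^{i}\mathbf{r}_{i}\bigr)\,\Big|\,\pi, s_{0}\Bigr]
% Because u is non-linear, the optimal action at step t depends on the reward already
% accrued, so the policy is conditioned on an augmented state:
\tilde{s}_{t} = \bigl(s_{t},\ \mathbf{p}_{t}\bigr), \qquad
\mathbf{p}_{t} = \textstyle\sum_{i=0}^{t-1}\gamma^{i}\mathbf{r}_{i}, \qquad
a_{t} \sim \pi(\,\cdot \mid \tilde{s}_{t}\,)
```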

Core claim

The optimal policy for an MORL agent with a non-linear utility function must be conditioned on both the current environmental state and some measure of the previously accrued reward. When this conditioning is implemented via state augmentation with the discounted sum of rewards, the agent requires continued access to the reward signal after deployment, even if no further learning occurs.

What carries the argument

Augmented state created by concatenating the observed environmental state with the discounted sum of previous rewards, enabling the policy to condition on accrued rewards for non-linear utilities.
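As a rough sketch of that construction (function and variable names are illustrative; the paper does not prescribe an implementation), the accrued-reward component is maintained incrementally and concatenated onto the observed state before every action choice:

```python
import numpy as np

def accrue(accrued: np.ndarray, reward: np.ndarray, gamma: float, t: int) -> np.ndarray:
    """Add step t's vector reward to the discounted sum of rewards received so far."""
    return accrued + (gamma ** t) * reward

def augment(state: np.ndarray, accrued: np.ndarray) -> np.ndarray:
    """Concatenate the environmental state with the accrued-reward vector.

    The result is the input the non-linear-utility policy is conditioned on,
    so it must be computable whenever the policy is queried.
    """
    return np.concatenate([state, accrued])
```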

If this is right

  • Agents cannot be deployed in settings that provide no reward information after training without losing optimality.
  • Fixed policies still require reward access at every time step to compute the correct augmented state (see the sketch after this list).
  • Real-world applications must either supply ongoing rewards or use alternative conditioning mechanisms.
  • Deployment costs increase when rewards are expensive or unavailable post-training.
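Concretely, a minimal deployment-loop sketch (with a hypothetical `env` and a frozen `policy` trained on augmented states; none of these names come from the paper) shows where the reward signal is still consumed after learning has stopped:

```python
import numpy as np

def run_deployed_episode(env, policy, gamma: float, num_objectives: int):
    """Run one episode with a fixed, no-longer-learning augmented-state policy.

    `env` and `policy` are hypothetical stand-ins: env.step must still return a
    vector reward at deployment time, otherwise the accrued-reward component of
    the next augmented state cannot be formed and the policy's input is undefined.
    """
    state = env.reset()
    accrued = np.zeros(num_objectives)   # discounted sum of rewards so far
    t = 0
    done = False
    while not done:
        aug_state = np.concatenate([state, accrued])  # what the policy was trained on
        action = policy(aug_state)                    # inference only, no learning
        state, reward, done, _ = env.step(action)     # reward is still required here
        accrued = accrued + (gamma ** t) * reward     # update the accrued component
        t += 1
    return accrued
```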

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alternative conditioning approaches that do not embed rewards directly in the state could remove the post-deployment requirement.
  • The same issue would arise in any RL setting that augments states with internal signals the agent cannot observe at test time.
  • This constraint may favor linear scalarization methods in domains where reward access cannot be guaranteed after deployment.

Load-bearing premise

The standard implementation of conditioning on accrued reward uses state augmentation with the discounted sum, and this augmentation is required to achieve optimality under non-linear utilities.

What would settle it

An explicit optimal policy for a non-linear-utility MORL task that matches the performance of the augmented-state policy while making decisions without any reward signal after deployment.

Figures

Figures reproduced from arXiv: 2604.15757 by Cameron Foale, Peter Vamplew.

Figure 1
Figure 1. A simple MOMDP illustrating the need for an augmented state. The optimal choice of action at state s1 depends on the reward received when transiting from state s0 to s1.
Figure 2
Figure 2. A simple MOMDP illustrating that the optimal policy may be different for the fully-observable case where the actual reward on each transition is available, and the case where rewards cannot be observed and a proxy reward model is used instead.
read the original abstract

This research note identifies a previously overlooked distinction between multi-objective reinforcement learning (MORL), and more conventional single-objective reinforcement learning (RL). It has previously been noted that the optimal policy for an MORL agent with a non-linear utility function is required to be conditioned on both the current environmental state and on some measure of the previously accrued reward. This is generally implemented by concatenating the observed state of the environment with the discounted sum of previous rewards to create an augmented state. While augmented states have been widely-used in the MORL literature, one implication of their use has not previously been reported -- namely that they require the agent to have continued access to the reward signal (or a proxy thereof) after deployment, even if no further learning is required. This note explains why this is the case, and considers the practical repercussions of this requirement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that in multi-objective reinforcement learning (MORL) with non-linear utility functions, the standard practice of augmenting the state with the discounted sum of accrued rewards (to condition the policy on past returns) implies that the deployed agent must retain access to the reward signal or a proxy, even without further learning; this is presented as an overlooked distinction from single-objective RL, with discussion of practical deployment repercussions.

Significance. If the implication holds, the note usefully flags a deployment constraint for the widely adopted state-augmentation approach in MORL, which could affect real-world applications where post-deployment reward access is unavailable or costly. The observation is a direct logical consequence of existing MORL constructions rather than a new empirical result, so its value lies in surfacing an unremarked practical consequence.

minor comments (2)
  1. The abstract states that augmented states 'have been widely-used in the MORL literature' but does not cite specific representative papers or implementations; adding 1-2 canonical references would strengthen the grounding of the 'generally implemented' claim.
  2. The note would benefit from a brief concrete example (e.g., a simple two-objective environment) illustrating why the augmented state must be updated from observed rewards at test time, to make the requirement more tangible for readers unfamiliar with MORL.
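The kind of example being asked for might look like the following toy two-objective instance (the numbers and utility choice are ours, not the paper's; Figure 1 in the paper plays a similar role):

```python
import numpy as np

# Hypothetical two-objective illustration (our numbers, not the paper's).
# Non-linear utility over the two objectives: u(x, y) = min(x, y).
u = lambda ret: np.min(ret)

# Two trajectories reach the same environmental state s1 with different
# accrued rewards; the remaining choice is between actions a and b.
accrued_options = {"via_top": np.array([1.0, 0.0]), "via_bottom": np.array([0.0, 1.0])}
action_rewards = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}

for path, accrued in accrued_options.items():
    best = max(action_rewards, key=lambda a: u(accrued + action_rewards[a]))
    print(path, "-> best action:", best)
# via_top    -> best action: b  (balances the objectives, u = 1)
# via_bottom -> best action: a
# The same state s1 demands different actions depending on the accrued reward,
# so the deployed policy needs that quantity, and hence the reward signal.
```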

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The note's core observation—that state augmentation in MORL with non-linear utilities requires post-deployment reward access—is indeed a direct logical consequence of standard constructions, and we appreciate the referee's recognition of its practical relevance for deployment scenarios.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a short research note that identifies an unreported deployment consequence of the standard state-augmentation technique already used throughout the MORL literature for conditioning policies on accrued vector returns under non-linear scalarization. The argument proceeds by direct implication from the definition of augmented states (concatenating the environmental state with the discounted sum of observed rewards) and does not introduce any new equations, fitted parameters, predictions, or uniqueness theorems. No self-citations appear as load-bearing premises, no ansatz is smuggled, and no known result is merely renamed. The central claim is therefore an observation about existing practice rather than a derivation that reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No new mathematical objects or fitted values are introduced; the note relies on standard definitions of MORL, non-linear utility functions, and state augmentation already present in the prior literature.

pith-pipeline@v0.9.0 · 5435 in / 1056 out tokens · 36544 ms · 2026-05-10T09:34:23.110620+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 2 canonical work pages

  1. D. M. Roijers, P. Vamplew, S. Whiteson, R. Dazeley, A survey of multi-objective sequential decision-making, Journal of Artificial Intelligence Research 48 (2013) 67–113.
  2. C. F. Hayes, R. Rădulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, et al., A practical guide to multi-objective reinforcement learning and planning, Autonomous Agents and Multi-Agent Systems 36 (2022).
  3. F. Felten, E.-G. Talbi, G. Danoy, Multi-objective reinforcement learning based on decomposition: A taxonomy and framework, Journal of Artificial Intelligence Research 79 (2024) 679–723.
  4. Y. Cao, H. Zhan, Efficient multi-objective reinforcement learning via multiple-gradient descent with iteratively discovered weight-vector sets, Journal of Artificial Intelligence Research 70 (2021) 319–349.
  5. S. Parisi, M. Pirotta, M. Restelli, Multi-objective reinforcement learning through continuous Pareto manifold approximation, Journal of Artificial Intelligence Research 57 (2016) 187–227.
  6. Q. Bai, M. Agarwal, V. Aggarwal, Joint optimization of concave scalarized multi-objective reinforcement learning with policy gradient based algorithm, Journal of Artificial Intelligence Research 74 (2022) 1565–1597.
  7. P. Vamplew, C. Foale, R. Dazeley, The impact of environmental stochasticity on value-based multiobjective reinforcement learning, Neural Computing and Applications 34 (2022) 1783–1799.
  8. P. Geibel, Reinforcement learning for MDPs with constraints, in: European Conference on Machine Learning, Springer, 2006, pp. 646–653.
  9. R. Issabekov, P. Vamplew, An empirical comparison of two common multiobjective reinforcement learning algorithms, in: Australasian Joint Conference on Artificial Intelligence (AJCAI), Springer, 2012, pp. 626–636.
  10. M. Reymond, C. F. Hayes, D. Steckelmacher, D. M. Roijers, A. Nowé, Actor-critic multi-objective reinforcement learning for non-linear utility functions, Autonomous Agents and Multi-Agent Systems 37 (2023) 23.
  11. G. Yu, U. Siddique, P. Weng, Fair deep reinforcement learning with generalized Gini welfare functions, in: International Conference on Autonomous Agents and Multiagent Systems, Springer, 2023, pp. 3–29.
  12. F. Chouaki, A. Beynier, N. Maudet, P. Viappiani, Fairness in cooperative multi-agent multi-objective reinforcement learning using the expected scalarized return, in: Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, 2025, pp. 2469–2471.
  13. K. Ding, P. Vamplew, C. Foale, R. Dazeley, An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments, The Knowledge Engineering Review 40 (2025).
  14. G. Peng, E. Pauwels, H. Baier, Multi-objective utility actor critic with utility critic for nonlinear utility function, in: Eighteenth European Workshop on Reinforcement Learning, 2025.
  15. N. Peng, M. Tian, B. Fain, Multi-objective reinforcement learning with nonlinear preferences: Provable approximation for maximizing expected scalarized return, in: Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, 2025, pp. 1632–1640.
  16. S. Zhao, S. Fu, L. Bai, H. Liang, Q. Zhao, T. Li, Adaptive multi-objective reinforcement learning for non-linear and implicit utility functions, IEICE Transactions on Information and Systems (2025).
  17. P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs, et al., Outracing champion Gran Turismo drivers with deep reinforcement learning, Nature 602 (2022) 223–228.
  18. Z. Li, Q. Ji, X. Ling, Q. Liu, A comprehensive review of multi-agent reinforcement learning in video games, IEEE Transactions on Games (2025).
  19. L. Avramelou, P. Nousi, N. Passalis, A. Tefas, Deep reinforcement learning for financial trading using multi-modal features, Expert Systems with Applications 238 (2024) 121849.
  20. A. Singh, L. Yang, C. Finn, S. Levine, End-to-end robotic reinforcement learning without reward engineering, in: Proceedings of Robotics: Science and Systems, Freiburg im Breisgau, Germany, 2019. doi:10.15607/RSS.2019.XV.073.
  21. J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, S. Levine, How to train your robot with deep reinforcement learning: lessons we have learned, The International Journal of Robotics Research 40 (2021) 698–721.
  22. S. Parisi, M. Mohammedalamen, A. Kazemipour, M. E. Taylor, M. Bowling, Monitored Markov decision processes, in: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 2024, pp. 1549–1557.
  23. W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, Y. Wang, End-to-end active object tracking and its real-world deployment via reinforcement learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2019) 1317–1332.
  24. R. Yu, S. Wan, Y. Wang, C.-X. Gao, L. Gan, Z. Zhang, D.-C. Zhan, Reward models in deep reinforcement learning: A survey, arXiv preprint arXiv:2506.15421 (2025).
  25. T. Kaufmann, P. Weng, V. Bengs, E. Hüllermeier, A survey of reinforcement learning from human feedback, CoRR (2023).
  26. A. Baisero, B. Daley, C. Amato, Asymmetric DQN for partially observable reinforcement learning, in: Uncertainty in Artificial Intelligence, PMLR, 2022, pp. 107–117.
  27. A. Eberhard, H. Metni, G. Fahland, A. Stroh, P. Friederich, Actively learning costly reward functions for reinforcement learning, Machine Learning: Science and Technology 5 (2024) 015055.
  28. A. Wang, A. C. Li, T. Q. Klassen, R. T. Icarte, S. A. McIlraith, Learning belief representations for partially observable deep RL, in: International Conference on Machine Learning, PMLR, 2023, pp. 35970–35988.
  29. Y. Cai, X. Liu, A. Oikonomou, K. Zhang, Provable partially observable reinforcement learning with privileged information, Advances in Neural Information Processing Systems 37 (2024) 63790–63857.
  30. J. Li, E. Zhao, T. Wei, J. Xing, S. Xiang, Leveraging privileged information for partially observable reinforcement learning, IEEE Transactions on Games (2025).