AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
Pith reviewed 2026-05-08 13:42 UTC · model grok-4.3
The pith
State-dependent discounting becomes practical in deep actor-critic RL when paired with a return-consistency objective.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaGamma learns a state-dependent discount function jointly with the policy and value networks and regularizes it with a return-consistency objective that prevents the target manipulation and TD-error collapse that otherwise occur with naive state-dependent discounting. The paper establishes contraction and well-posedness properties for the associated Bellman operator. The resulting algorithm integrates directly into existing actor-critic methods and yields measurable performance gains on standard continuous-control tasks as well as in an online deployment.
What carries the argument
The return-consistency objective, which enforces agreement among multi-step return estimates computed under the learned state-dependent discount function and thereby regularizes the backup structure against degeneracy.
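The paper's exact loss is not reproduced here, so the following is only a minimal sketch of one plausible form such an objective could take, assuming PyTorch-style value and discount networks (`value_net`, `gamma_head`) and a two-step transition slice; the names and the specific horizon pairing are illustrative, not AdaGamma's implementation.

```python
import torch.nn.functional as F

def return_consistency_loss(value_net, gamma_head, s0, r0, s1, r1, s2):
    """Sketch: penalize disagreement between a one-step and a two-step
    return estimate from s0, both discounted with the learned gamma(s)."""
    g1 = gamma_head(s1)  # state-dependent discount evaluated at s1
    g2 = gamma_head(s2)  # state-dependent discount evaluated at s2

    # One-step estimate of the return from s0: r0 + gamma(s1) * V(s1)
    ret_1step = r0 + g1 * value_net(s1)
    # Two-step estimate of the same return: r0 + gamma(s1) * (r1 + gamma(s2) * V(s2))
    ret_2step = r0 + g1 * (r1 + g2 * value_net(s2))

    # Agreement between the two estimates constrains gamma(s): shrinking the
    # TD error by manipulating the discount would break this consistency.
    # Treating the longer rollout as the target (stop-gradient) is a design
    # choice here, not something the paper dictates.
    return F.mse_loss(ret_1step, ret_2step.detach())
```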
If this is right
- AdaGamma can be inserted into existing SAC and PPO implementations with only the addition of a learned discount head and the consistency loss (a minimal sketch of such a head follows this list).
- The method produces consistent improvements across standard continuous-control benchmark suites.
- An online A/B test on a real logistics platform shows statistically significant gains over fixed-discount baselines.
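As a rough illustration of what "a learned discount head" plus an otherwise unchanged critic update might look like in a SAC- or PPO-style codebase, here is a hedged PyTorch sketch; `GammaHead`, its bounds, and the `td_target` computation are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class GammaHead(nn.Module):
    """Hypothetical discount head: maps a state to gamma(s) in (gamma_min, gamma_max)."""
    def __init__(self, state_dim, hidden=64, gamma_min=0.8, gamma_max=0.999):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.gamma_min, self.gamma_max = gamma_min, gamma_max

    def forward(self, state):
        frac = torch.sigmoid(self.net(state))
        return self.gamma_min + (self.gamma_max - self.gamma_min) * frac

def td_target(reward, next_state, done, value_net, gamma_head):
    """One-step critic target r + gamma(s') * V(s'): the usual fixed constant
    is simply replaced by the learned state-dependent discount."""
    with torch.no_grad():
        gamma = gamma_head(next_state)   # (batch, 1), bounded away from 1
        v_next = value_net(next_state)   # (batch, 1)
    # Whether gradients should flow into gamma through the TD loss is exactly
    # the degenerate incentive the return-consistency objective is meant to
    # discipline; here the target is treated as a constant.
    return reward + (1.0 - done) * gamma * v_next
```

Keeping γ(s) bounded strictly below 1 also matches the boundedness assumption under which the paper's operator analysis is stated.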
Where Pith is reading between the lines
- State-dependent discounting could let agents automatically shorten planning horizons in states dominated by immediate outcomes and lengthen them where distant consequences matter.
- The same consistency-regularization pattern might stabilize other state-dependent hyperparameters such as per-state learning rates.
- In environments that change over time, learned per-state discounts could provide a built-in mechanism for forgetting outdated value estimates more selectively than a global discount.
Load-bearing premise
Adding the return-consistency objective is enough to stop the TD-error collapse and target manipulation that arise when state-dependent discounting is implemented without it.
What would settle it
Train an actor-critic agent with a learned state-dependent discount function but remove the return-consistency term and check whether training becomes unstable or yields no gain over a fixed-discount baseline.
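A minimal sketch of that check, assuming a hypothetical `train_agent(env_name, use_consistency_loss, seed)` routine that returns a final evaluation score; the environment name and the comparison are illustrative only.

```python
import numpy as np

def ablate_consistency_term(train_agent, env_name="HalfCheetah-v4", seeds=range(5)):
    """Train the same agent with and without the return-consistency term and
    compare final scores across seeds."""
    with_term = [train_agent(env_name, use_consistency_loss=True, seed=s) for s in seeds]
    without_term = [train_agent(env_name, use_consistency_loss=False, seed=s) for s in seeds]

    # The load-bearing premise survives only if removing the term either
    # destabilizes training or erases the gain over a fixed-discount baseline.
    print("with consistency term:    %.1f +/- %.1f" % (np.mean(with_term), np.std(with_term)))
    print("without consistency term: %.1f +/- %.1f" % (np.mean(without_term), np.std(without_term)))
```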
Original abstract
The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is conceptually appealing, naive deep actor-critic implementations can become unstable and degenerate toward TD-error collapse. We propose AdaGamma, a practical deep actor-critic method for state-dependent discounting that learns a state-dependent discount function together with a return-consistency objective to regularize the induced backup structure. On the theory side, we analyze the Bellman operator induced by state-dependent discounting and establish its basic well-posedness properties under suitable conditions. Empirically, AdaGamma integrates into both SAC and PPO, yielding consistent improvements on continuous-control benchmarks, and achieves statistically significant gains in an online A/B test on the JD Logistics platform. These results suggest that state-dependent discounting can be made effective in deep RL when coupled with a return-consistency objective that prevents degenerate target manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaGamma, a deep actor-critic method that learns a state-dependent discount function γ(s) together with a return-consistency objective. It analyzes the induced Bellman operator to establish basic well-posedness properties under suitable conditions, integrates the approach into SAC and PPO, and reports consistent gains on continuous-control benchmarks plus statistically significant improvements in an online A/B test on the JD Logistics platform. The central claim is that the return-consistency regularizer prevents the TD-error collapse and degenerate target manipulation that otherwise arise in naive state-dependent discounting implementations.
Significance. If the return-consistency objective reliably stabilizes the induced operator under neural function approximation, the work would offer a practical route to adaptive planning horizons in RL without sacrificing stability. The integration into standard algorithms and the real-world A/B test provide concrete evidence of utility beyond synthetic benchmarks. The theoretical analysis of the state-dependent Bellman operator is a useful contribution even if the empirical gains ultimately trace to other factors.
major comments (3)
- [§4] Bellman operator analysis: the contraction and well-posedness results are derived under Lipschitz continuity of γ(s) and boundedness assumptions that are not shown to be preserved when γ(s) is parameterized by a neural network; no subsequent argument or experiment demonstrates that the return-consistency term enforces these conditions in the function-approximation regime.
- [§5.2 and §6] Empirical validation: the claim that the return-consistency objective 'prevents degenerate target manipulation' is supported only by overall performance gains; there is no ablation isolating the objective, no measurement of TD-error magnitude or target variance with and without the term, and no statistical detail (sample sizes, p-values, confidence intervals) for the 'statistically significant' A/B-test result.
- [§3] Method: the return-consistency objective is introduced as a regularizer on the induced backup structure, yet the paper supplies neither a derivation showing necessity/sufficiency for preventing collapse nor a proof that the combined objective remains a contraction mapping once γ(s) is learned.
minor comments (2)
- Notation for the state-dependent discount γ(s) and the return-consistency loss should be introduced with explicit definitions before the operator analysis to avoid forward references.
- The continuous-control benchmark results would benefit from reporting both mean and standard deviation across seeds rather than aggregate curves alone.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on the manuscript's contributions while agreeing to revisions that strengthen the presentation without overstating the results.
Point-by-point responses
- Referee: [§4] Bellman operator analysis: the contraction and well-posedness results are derived under Lipschitz continuity of γ(s) and boundedness assumptions that are not shown to be preserved when γ(s) is parameterized by a neural network; no subsequent argument or experiment demonstrates that the return-consistency term enforces these conditions in the function-approximation regime.
Authors: The analysis in §4 derives contraction and well-posedness for the state-dependent Bellman operator under the explicit assumptions of Lipschitz continuity of γ(s) and boundedness. These are standard conditions for establishing operator properties and are not claimed to hold automatically for arbitrary neural-network parameterizations. The return-consistency objective is presented as a practical regularizer that stabilizes learning in the function-approximation regime, supported by the empirical results. We agree that an explicit argument or experiment linking the regularizer to preservation of the Lipschitz/boundedness conditions would be valuable. In the revision we will add a paragraph in §4 acknowledging this gap between the theoretical assumptions and neural implementations, and we will include new experiments that track the empirical Lipschitz constant and range of the learned γ(s) during training to provide supporting evidence. revision: yes
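The rebuttal does not spell out how that tracking would be implemented; one plausible sketch, assuming access to the trained discount network and a batch of visited states as a float tensor, estimates a crude empirical Lipschitz constant from pairwise differences and reports the range of γ(s).

```python
import torch

@torch.no_grad()
def gamma_diagnostics(gamma_head, states, eps=1e-8):
    """Empirical Lipschitz estimate and range of the learned discount over a
    batch of visited states (states: float tensor of shape (N, state_dim))."""
    g = gamma_head(states).squeeze(-1)                          # gamma(s_i), shape (N,)
    pairwise_dg = torch.abs(g.unsqueeze(0) - g.unsqueeze(1))    # |gamma(s_i) - gamma(s_j)|
    pairwise_ds = torch.cdist(states, states) + eps             # ||s_i - s_j|| + eps
    # The max ratio over sampled pairs lower-bounds the true Lipschitz constant.
    lipschitz_estimate = (pairwise_dg / pairwise_ds).max().item()
    return {"gamma_min": g.min().item(),
            "gamma_max": g.max().item(),
            "empirical_lipschitz": lipschitz_estimate}
```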
- Referee: [§5.2 and §6] Empirical validation: the claim that the return-consistency objective 'prevents degenerate target manipulation' is supported only by overall performance gains; there is no ablation isolating the objective, no measurement of TD-error magnitude or target variance with and without the term, and no statistical detail (sample sizes, p-values, confidence intervals) for the 'statistically significant' A/B-test result.
Authors: The manuscript reports consistent performance gains on continuous-control benchmarks and statistically significant improvement in the JD Logistics A/B test as evidence that the return-consistency term contributes to stability. We acknowledge that these aggregate results alone do not isolate the objective's effect on TD-error collapse or target manipulation. We will revise §5.2 and §6 to include: (i) an ablation comparing AdaGamma with and without the return-consistency term, (ii) plots and statistics of TD-error magnitude and target variance with/without the term, and (iii) the requested statistical details for the A/B test (sample sizes, p-values, and confidence intervals). These additions will be placed in the main text or supplementary material as appropriate. revision: yes
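A sketch of the kind of instrumentation those additions would require, assuming a hypothetical replay-batch layout and the same `value_net` / `gamma_head` names as in the earlier sketches; what counts as "collapse" is left to the eventual plots.

```python
import torch

@torch.no_grad()
def collapse_diagnostics(value_net, gamma_head, batch):
    """Log mean |TD error| and the variance of bootstrapped targets on a
    replay batch; TD error and target variance shrinking together is the
    signature of target manipulation rather than genuine learning."""
    state, action, reward, next_state, done = batch   # assumed batch layout
    gamma = gamma_head(next_state)
    target = reward + (1.0 - done) * gamma * value_net(next_state)
    td_error = target - value_net(state)
    return {"mean_abs_td_error": td_error.abs().mean().item(),
            "target_variance": target.var().item()}
```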
- Referee: [§3] Method: the return-consistency objective is introduced as a regularizer on the induced backup structure, yet the paper supplies neither a derivation showing necessity/sufficiency for preventing collapse nor a proof that the combined objective remains a contraction mapping once γ(s) is learned.
Authors: The return-consistency objective is motivated by the potential for TD-error collapse under naive state-dependent discounting, as analyzed via the induced Bellman operator in §4. We do not supply a formal derivation establishing necessity or sufficiency, nor a proof that the joint objective remains a contraction once γ(s) is learned jointly. In the revision we will expand the motivation in §3 with a clearer derivation from the observed instability of the backup operator when γ(s) varies, and we will explicitly state the scope of the theoretical guarantees: the contraction result applies to the operator with fixed γ(s) satisfying the Lipschitz and boundedness conditions, while the regularizer is offered as an empirical mechanism for stability during learning. revision: yes
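For reference, the fixed-γ(s) guarantee the authors describe as the scope of the result is the standard sup-norm contraction argument; the operator form and notation below are a reconstruction under the assumption sup_s γ(s) ≤ γ̄ < 1, not a quotation from the paper.

```latex
\[
  (T_\pi V)(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot\mid s),\; s' \sim P(\cdot\mid s,a)}
    \big[\, r(s,a) + \gamma(s')\, V(s') \,\big].
\]
If $0 \le \gamma(s) \le \bar\gamma < 1$ for all $s$, then for any bounded $V_1, V_2$,
\[
  \big|(T_\pi V_1)(s) - (T_\pi V_2)(s)\big|
  \;=\; \big|\mathbb{E}\big[\gamma(s')\,(V_1(s') - V_2(s'))\big]\big|
  \;\le\; \bar\gamma\, \|V_1 - V_2\|_\infty ,
\]
so $T_\pi$ is a $\bar\gamma$-contraction in the sup norm and has a unique fixed point
by the Banach fixed-point theorem. Once $\gamma(s)$ is itself updated during training,
the operator changes between steps, which is why the guarantee is stated for a fixed
$\gamma(s)$ only.
```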
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper proposes AdaGamma as a new method that learns a state-dependent discount function jointly with a return-consistency objective, then analyzes the induced Bellman operator for well-posedness under stated conditions. No derivation reduces a claimed prediction or result to a fitted parameter or self-citation by construction; the consistency objective is introduced as an explicit regularizer rather than derived from the discount itself. Empirical integration into SAC/PPO and platform results are presented as validation, not as forced outputs of the theory. The analysis relies on standard operator properties rather than renaming or smuggling prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Bellman operator induced by state-dependent discounting has basic well-posedness properties under suitable conditions.
invented entities (1)
- return-consistency objective (no independent evidence)