pith. sign in

arxiv: 1906.09035 · v1 · pith:OQLCKNVRnew · submitted 2019-06-21 · 📡 eess.SY · cs.LG· cs.SY

Revised Progressive-Hedging-Algorithm Based Two-layer Solution Scheme for Bayesian Reinforcement Learning

Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SY
keywords Bayesian reinforcement learningprogressive hedging algorithmdynamic programminglinear quadratic Gaussiandual controluncertainty decompositiontwo-layer schemestochastic control
0
0 comments X

The pith

A two-layer scheme approximates the optimal policy for Bayesian reinforcement learning by separating reducible and irreducible uncertainties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a two-layer solution scheme for Bayesian RL problems that involve both inherent noise and unknown parameters. The lower layer uses time-decomposition dynamic programming while the upper layer applies scenario-decomposition with a revised progressive hedging algorithm. This structure allows separating reducible system uncertainty from irreducible uncertainty. A reader would care because it offers a direct approximation of the optimal policy for challenging non-episodic cases like the linear-quadratic-Gaussian problem with unknown gain, a problem that has persisted for decades.

Core claim

The central claim is that combining time-decomposition based dynamic programming at the lower layer and scenario-decomposition based revised progressive hedging algorithm at the upper layer provides a two-layer scheme to approximate the optimal policy directly in a type of Bayesian RL problem, with the key feature being the separation of reducible system uncertainty from irreducible one at two different layers, as demonstrated in the linear-quadratic-Gaussian problem with unknown gain.

What carries the argument

The two-layer solution scheme that uses dynamic programming for time decomposition and revised progressive hedging algorithm for scenario decomposition to separate reducible from irreducible uncertainty.

If this is right

  • The scheme enables direct policy approximation rather than value function approximation in Bayesian RL.
  • It addresses the dual control challenge in linear-quadratic-Gaussian systems with unknown parameters.
  • By decomposing and conquering uncertainties at different layers, it improves handling of non-episodic online learning.
  • Existing approaches like Thompson sampling can be compared against this decomposition method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of uncertainty types could be tested in other stochastic control problems beyond LQG.
  • If effective, this might reduce the computational burden in high-dimensional Bayesian RL by handling scenarios separately.
  • Future work could integrate this with online learning to update the layers dynamically.

Load-bearing premise

The revised progressive hedging algorithm applies effectively at the upper layer to decompose scenarios and separate reducible from irreducible uncertainty in the Bayesian RL problem.

What would settle it

Simulation results on the linear-quadratic-Gaussian problem with unknown gain where the two-layer scheme fails to produce a policy with lower cost than standard Bayesian RL approximations.

read the original abstract

Stochastic control with both inherent random system noise and lack of knowledge on system parameters constitutes the core and fundamental topic in reinforcement learning (RL), especially under non-episodic situations where online learning is much more demanding. This challenge has been notably addressed in Bayesian RL recently where some approximation techniques have been developed to find suboptimal policies. While existing approaches mainly focus on approximating the value function, or on involving Thompson sampling, we propose a novel two-layer solution scheme in this paper to approximate the optimal policy directly, by combining the time-decomposition based dynamic programming (DP) at the lower layer and the scenario-decomposition based revised progressive hedging algorithm (PHA) at the upper layer, for a type of Bayesian RL problem. The key feature of our approach is to separate reducible system uncertainty from irreducible one at two different layers, thus decomposing and conquering. We demonstrate our solution framework more especially via the linear-quadratic-Gaussian problem with unknown gain, which, although seemingly simple, has been a notorious subject over more than half century in dual control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a novel two-layer solution scheme for a class of Bayesian RL problems (exemplified by the LQG problem with unknown gain). The lower layer applies time-decomposition dynamic programming while the upper layer applies scenario-decomposition via a revised progressive hedging algorithm; the central feature is the separation of reducible parameter uncertainty from irreducible noise at the two layers in order to approximate the optimal policy directly.

Significance. If the claimed decomposition and the extension of the revised PHA are valid, the work would supply a direct policy-approximation route for non-episodic dual-control problems that have resisted solution for decades, complementing existing value-function or Thompson-sampling approximations.

major comments (1)
  1. [Abstract] Abstract: the manuscript asserts that the revised PHA at the upper layer successfully separates reducible from irreducible uncertainty, yet supplies neither a derivation showing preservation of non-anticipativity constraints under the Bayesian update nor any verification that the convexity or penalty-update rules of standard PHA remain intact when the measure itself depends on the policy.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and will strengthen the manuscript with additional derivations as indicated.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the manuscript asserts that the revised PHA at the upper layer successfully separates reducible from irreducible uncertainty, yet supplies neither a derivation showing preservation of non-anticipativity constraints under the Bayesian update nor any verification that the convexity or penalty-update rules of standard PHA remain intact when the measure itself depends on the policy.

    Authors: We agree the current version does not contain an explicit derivation of non-anticipativity preservation or a formal check that convexity and penalty updates survive when the measure is policy-dependent. In the revision we will add a new subsection (or appendix) that (i) shows the Bayesian update occurs only at the upper layer after the lower-layer DP has produced a candidate policy, so that the scenario set remains non-anticipative with respect to the information available at each stage; (ii) verifies that the quadratic structure of the LQG cost preserves convexity of the augmented Lagrangian even under the posterior measure; and (iii) confirms that the standard PHA penalty-update rule continues to drive the iterates to a feasible non-anticipative solution because the measure update is independent of the intra-scenario decisions. These additions will be placed in Section 3 and will not alter the algorithmic claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in two-layer DP-PHA decomposition for Bayesian RL

full rationale

The paper's core contribution is a methodological proposal: a two-layer scheme that applies time-decomposition DP at the lower layer and scenario-decomposition via a revised PHA at the upper layer to separate reducible parameter uncertainty from irreducible noise in Bayesian RL (exemplified on LQG with unknown gain). This decomposition is introduced as a novel construction rather than derived from prior fitted parameters or self-referential definitions. No equations or steps in the abstract reduce a claimed prediction back to an input by construction, and the revision/extension of PHA is framed as part of the new scheme without load-bearing reliance on unverified self-citations for the separation logic. The derivation remains self-contained as an algorithmic framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the proposal relies on standard DP and PHA but details are absent.

pith-pipeline@v0.9.0 · 5719 in / 1115 out tokens · 37534 ms · 2026-05-25T18:53:47.878794+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    Aoki, M. (1967). Optimization of Stochastic Systems: Topics in Discrete-Time Systems , volume 32. Academic Press

  2. [2]

    str \"o m, K. J. and Helmersson, A. (1986). Dual control of an integrator with unknown gain. Computers & Mathematics with Applications , 12(6):653--662

  3. [3]

    Bar-Shalom, Y. (1981). Stochastic dynamic programming: Caution and probing. IEEE Transactions on Automatic Control , 26(5):1184--1195

  4. [4]

    Bertsekas, D. P. (2019). Reinforcement Learning and Optimal Control . Unpublished textbook manuscript, see https://web.mit.edu/dimitrib/www/RLbook.html

  5. [5]

    Dallaire, P., Besse, C., Ross, S., and Chaib-draa, B. (2009). Bayesian reinforcement learning in continuous POMDP s with G aussian processes. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems , pages 2604--2609. IEEE

  6. [6]

    Deshpande, J., Upadhyay, T., and Lainiotis, D. (1973). Adaptive control of linear stochastic systems. Automatica , 9(1):107--115

  7. [7]

    (1960--1961)

    Feldbaum, A. (1960--1961). Dual control theory I -- IV . Avtomatika i Telemekhanika , 21(9), 21(11), 22(1), 22(2)

  8. [8]

    Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al. (2015). Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning , 8(5-6):359--483

  9. [9]

    Kirk, D. E. (1970). Optimal Control Theory: An Introduction . Springer

  10. [10]

    Klenske, E. D. and Hennig, P. (2016). Dual control for approximate bayesian reinforcement learning. Journal of Machine Learning Research , 17:1--30

  11. [11]

    and Ng, W.-L

    Li, D. and Ng, W.-L. (2000). Optimal dynamic portfolio selection: Multiperiod mean-variance formulation. Mathematical Finance , 10(3):387--406

  12. [12]

    Li, D., Qian, F., and Fu, P. (2008). Optimal nominal dual control for discrete-time linear-quadratic gaussian problems with unknown parameters. Automatica , 44(1):119--127

  13. [13]

    Ouyang, Y., Gagrani, M., and Jain, R. (2017). Learning-based control of unknown linear systems with T hompson sampling. arXiv preprint arXiv:1709.04047

  14. [14]

    Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An analytic solution to discrete bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning , pages 697--704. ACM

  15. [15]

    Rockafellar, R. T. (2018). Progressive hedging in nonconvex stochastic optimization. In The Workshop on Variational Analysis and Stochastic Optimization , Hong Kong Polytechnic University

  16. [16]

    Rockafellar, R. T. and Wets, R. J.-B. (1991). Scenarios and policy aggregation in optimization under uncertainty. Mathematics of Operations Research , 16(1):119--147

  17. [17]

    Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction . MIT press

  18. [18]

    Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika , 25(3/4):285--294

  19. [19]

    and Bar-Shalom, Y

    Tse, E. and Bar-Shalom, Y. (1973). An actively adaptive control for linear systems with random parameters via the dual control approach. IEEE Transactions on Automatic Control , 18(2):109--117