pith. machine review for the scientific record.

arxiv: 2605.14297 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · math.OC · stat.ML

Recognition: 2 theorem links · Lean Theorem

Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC · stat.ML
keywords hybrid action spaces · policy gradients · mixed estimators · reinforcement learning · discrete-continuous control · inventory control · switched systems

The pith

Hybrid Policy Optimization mixes pathwise and score-function gradients to keep policy updates unbiased in hybrid discrete-continuous action spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning tasks often require choosing both a discrete regime and continuous values inside it, but pure score-function gradients suffer poor credit assignment while full backpropagation through simulators introduces bias at discontinuities. The paper introduces Hybrid Policy Optimization that backpropagates through the simulator for smooth segments and mixes in score-function terms only where needed, preserving unbiasedness overall. The same framework lets problems with action jumps be rewritten as hybrid problems. On inventory control and switched linear-quadratic regulator benchmarks the method beats PPO, and the advantage widens as the continuous action dimension grows. The authors further show that the cross term linking continuous actions to later discrete choices shrinks near a discrete best response, supporting simpler decentralized updates.

Core claim

Hybrid Policy Optimization maintains unbiasedness by combining pathwise derivatives through the simulator where dynamics are smooth with score-function estimators for discrete components, and problems with action discontinuities can be recast in this hybrid form to enable the same optimization technique.

What carries the argument

The mixed gradient estimator that adds pathwise gradients for continuous actions to score-function gradients for discrete actions while preserving unbiasedness.
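
To make the mechanics concrete, here is a minimal single-step sketch of such a mixed estimator. The bandit, its reward, and all names below are illustrative assumptions of this review, not the paper's code:

```python
import torch

# Minimal one-step sketch of a mixed pathwise + score-function estimator.
torch.manual_seed(0)
K = 3                                        # number of discrete regimes
logits = torch.zeros(K, requires_grad=True)  # κ: discrete-policy parameters
mu = torch.randn(K, requires_grad=True)      # ϕ: per-regime continuous means
sigma = 0.1

def reward(z, a):
    # Smooth in a within each fixed regime z (hypothetical reward).
    targets = torch.tensor([1.0, -0.5, 2.0])
    return -(a - targets[z]) ** 2

def mixed_loss(batch=4096):
    dist_z = torch.distributions.Categorical(logits=logits)
    z = dist_z.sample((batch,))              # discrete draw: no gradient path
    a = mu[z] + sigma * torch.randn(batch)   # continuous draw: reparameterized
    r = reward(z, a)
    pathwise = -r.mean()                     # backprop through r, regime held fixed
    # Score-function term for the discrete choice; r is detached so the
    # pathwise graph is not double-counted.
    score = -(r.detach() * dist_z.log_prob(z)).mean()
    return pathwise + score

mixed_loss().backward()
print(mu.grad)      # pathwise gradient for the continuous parameters
print(logits.grad)  # score-function gradient for the discrete parameters
```

Because this example is single-step, the discrete choice never depends on an earlier continuous action, so the cross term discussed below is identically zero; the multi-step setting is where it appears.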

If this is right

  • Problems featuring action discontinuities can be reformulated as hybrid discrete-continuous problems to apply the same optimization technique.
  • The cross term in the mixed gradient, which links continuous actions to future discrete decisions, becomes negligible near a discrete best response (sketched just after this list).
  • This negligibility enables approximate decentralized updates of continuous and discrete policy components with reduced variance.
  • Performance advantages over PPO grow as the dimension of the continuous action component increases.
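
A schematic of the second and third bullets, in generic notation of this review rather than the paper's: in a multi-step problem the continuous parameters ϕ reach the return both directly, through smooth dynamics, and indirectly, by moving the states at which later discrete choices are scored:

```latex
\nabla_\phi J(\theta)
= \underbrace{\mathbb{E}\Big[ \sum_t \frac{\partial R}{\partial a_t}
      \frac{\partial a_t}{\partial \phi} \Big]}_{\text{pathwise (PW) term}}
+ \underbrace{\mathbb{E}\Big[ \sum_t R \,
      \nabla_{s_t} \log \pi_\kappa(z_t \mid s_t) \,
      \frac{\partial s_t}{\partial \phi} \Big]}_{\text{cross term}}
```

The abstract's structural result is that the second term becomes negligible near a discrete best response, which is what licenses approximately decentralized ϕ- and κ-updates with lower variance.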

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reformulation step could let many existing operations-research models with regime switches be solved directly with differentiable simulators.
  • Decentralized updates near optimality might simplify training loops for hierarchical policies without sacrificing final performance.
  • Inventory and switched-linear systems are representative of broader classes of hybrid control problems in robotics and logistics that could adopt the same estimator.

Load-bearing premise

The combination of pathwise and score-function terms produces an estimator whose expectation equals the true policy gradient even when discrete choices affect continuous trajectories.
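
In symbols, a one-step schematic of this premise (generic notation of this review, not the paper's): draw regime z ~ π_κ, reparameterize the continuous action as a = f_ϕ(z, ε) with ε ~ p(ε), and write θ = (ϕ, κ); then

```latex
\nabla_\theta \, \mathbb{E}_{z \sim \pi_\kappa} \mathbb{E}_{\varepsilon}
  \big[ R\big(z, f_\phi(z, \varepsilon)\big) \big]
= \underbrace{\mathbb{E}_{z, \varepsilon}\big[
    \nabla_\phi f_\phi(z, \varepsilon)^{\top} \nabla_a R
  \big]}_{\text{pathwise term for } \phi}
+ \underbrace{\mathbb{E}_{z, \varepsilon}\big[
    R \, \nabla_\kappa \log \pi_\kappa(z)
  \big]}_{\text{score-function term for } \kappa}
```

Both terms come from differentiating under the integral and conditioning on z (law of total expectation); this is where smoothness of R in a within each fixed regime is needed, and multi-step problems add the cross term sketched earlier.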

What would settle it

Compute the exact policy gradient on a low-dimensional hybrid control task via finite differences or dynamic programming and check whether the mixed estimator matches it in expectation; or run HPO versus PPO on the inventory benchmark and verify whether the performance gap shrinks or reverses as continuous dimension increases.
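
A sketch of the first route on a toy problem, reusing the one-step hybrid bandit from above (an assumption of this review, not the paper's benchmark). Common random numbers via inverse-CDF sampling keep the difference quotients stable:

```python
import torch
import torch.nn.functional as F

# Finite-difference check of a gradient estimator on a one-step hybrid bandit.
K, sigma, h, n = 3, 0.1, 1e-3, 500_000
targets = torch.tensor([1.0, -0.5, 2.0])

def J(logits, mu, seed=0):
    g = torch.Generator().manual_seed(seed)
    u = torch.rand(n, generator=g)                    # shared uniforms
    cdf = F.softmax(logits, dim=0).cumsum(0)
    z = torch.searchsorted(cdf, u).clamp(max=K - 1)   # inverse-CDF discrete draw
    a = mu[z] + sigma * torch.randn(n, generator=g)
    return (-(a - targets[z]) ** 2).mean()

logits, mu = torch.zeros(K), torch.tensor([0.5, 0.0, 1.0])

fd = torch.zeros(K)
for i in range(K):                                    # central differences in mu
    e = torch.zeros(K)
    e[i] = h
    fd[i] = (J(logits, mu + e) - J(logits, mu - e)) / (2 * h)
print(fd)  # compare against the batch-mean of the mixed estimator's ϕ-gradient
```

Agreement within Monte Carlo error is the unbiasedness check; a persistent offset would falsify the load-bearing premise above.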

Figures

Figures reproduced from arXiv: 2605.14297 by Daniel Russo, Matias Alvo, Yash Kanoria.

Figure 1: Median number of policy updates required for HPO and PPO to reach … [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2: Alignment and RMSE of batch-level gradient estimates for the mixed and SF estimators for … [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3: (a) Gradient norms of the PW and cross terms for varying performance gaps in the JRP … [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4: Policy architecture for hybrid MDPs. Algorithm 1 (Hybrid Policy Optimization with PPO-style updates): initialize policy parameters θ = (ϕ, κ) and value parameters ψ; let D = {ξ^h_{0:T−1}}_{h=1}^H denote the fixed training dataset of exogenous scenarios; for each iteration, partition D into training batches and, for each batch H ⊂ D, roll out the policy πθ = (π_X… view at source ↗
Figure 5: Median number of policy updates required to reach a target validation-performance gap … [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗
Figure 6: Median number of policy updates required to reach a target validation-performance gap in … [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7: Median number of policy updates required to reach a target validation-performance gap in … [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
Figure 8: Relative cost difference between HPO and Riccati-based baselines as a function of the … [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
Figure 9: Cross alignment between batch-level PW and iteration-level mixed gradient estimates … [PITH_FULL_IMAGE:figures/full_fig_p030_9.png] view at source ↗
Figure 10: Gradient norm and RMSE of batch-level gradient estimates of the cross term in the JRP … [PITH_FULL_IMAGE:figures/full_fig_p031_10.png] view at source ↗
Figure 11: Probability of not placing an order as a function of total system inventory, for several … [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗
Figure 12: Median policy updates to reach a 10% target gap for HPO and six PPO hyperparameter configurations, as the continuous action dimension p varies. Configurations for which the median did not converge within the training budget are omitted. [PITH_FULL_IMAGE:figures/full_fig_p032_12.png] view at source ↗
Original abstract

We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Hybrid Policy Optimization (HPO) for RL in hybrid discrete-continuous action spaces, where a discrete component selects a regime and a continuous component optimizes within it. It proposes a mixed gradient estimator combining pathwise gradients (via backpropagation through smooth simulator segments) and score-function gradients, asserting that the combination remains unbiased. The approach includes reformulating discontinuous-action problems into hybrid form. Empirically, HPO outperforms PPO on inventory control and switched linear-quadratic regulator tasks, with larger gains as continuous action dimension increases. The paper also analyzes the mixed gradient structure, showing that the cross-term (continuous actions affecting future discrete decisions) becomes negligible near a discrete best response.

Significance. If the unbiasedness of the mixed estimator holds under the stated conditions, the work would meaningfully advance policy optimization for hybrid action spaces common in control and robotics. It bridges score-function methods (high variance) with differentiable simulation (biased at discontinuities), offers a practical reformulation for discontinuous problems, and provides a structural characterization that could support lower-variance decentralized updates near optimality. The reported empirical gains over PPO, scaling with continuous dimension, indicate potential practical impact if the theoretical claims are verified.

major comments (2)
  1. [Theoretical derivation of mixed gradient estimator] The central claim of unbiasedness for the mixed pathwise-SF estimator under discrete-continuous coupling is load-bearing but insufficiently detailed in the provided derivation. The abstract states that the estimator 'maintains unbiasedness' and that the cross-term 'becomes negligible near a discrete best response,' yet the precise decomposition (how pathwise gradients on continuous segments combine with SF on discrete switches without residual bias from regime selection) is not shown; this leaves open whether the expectation of the total estimator equals the true policy gradient when continuous actions influence discrete probabilities.
  2. [Reformulation section] The reformulation of action-discontinuity problems into hybrid form is presented as broadening applicability, but the manuscript does not specify the exact measure-theoretic conditions or approximation error introduced when mapping non-smooth dynamics onto the hybrid structure; if this step involves any relaxation, it could affect the unbiasedness guarantee.
minor comments (2)
  1. [Experiments] The abstract and empirical claims mention performance gaps but do not reference error bars, number of seeds, or statistical significance tests; these should be added to the experimental section and figures for reproducibility.
  2. [Preliminaries] Notation for the hybrid policy and the mixed estimator (e.g., how the pathwise component is restricted to differentiable segments) should be defined more explicitly early in the paper to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below. Where the comments identify areas needing greater clarity, we have revised the manuscript by expanding the relevant sections and proofs.

Point-by-point responses
  1. Referee: [Theoretical derivation of mixed gradient estimator] The central claim of unbiasedness for the mixed pathwise-SF estimator under discrete-continuous coupling is load-bearing but insufficiently detailed in the provided derivation. The abstract states that the estimator 'maintains unbiasedness' and that the cross-term 'becomes negligible near a discrete best response,' yet the precise decomposition (how pathwise gradients on continuous segments combine with SF on discrete switches without residual bias from regime selection) is not shown; this leaves open whether the expectation of the total estimator equals the true policy gradient when continuous actions influence discrete probabilities.

    Authors: We agree that the original derivation would benefit from greater explicitness. The mixed estimator is constructed so that the pathwise (differentiable simulation) component is applied only to the continuous dynamics conditional on a fixed discrete regime, while the score-function component handles the discrete regime selection and any dependence of regime probabilities on continuous actions. Unbiasedness follows from the law of total expectation: the conditional pathwise gradient is unbiased for the continuous contribution, and the score-function term is unbiased for the discrete choice; their combination yields the full policy gradient with no residual bias term. The cross-term analysis shows it vanishes in expectation near a discrete best response. To address the concern directly, the revised manuscript adds a complete step-by-step derivation (including the full expectation expansion) as a new subsection in Section 3 and a self-contained proof in Appendix B. revision: yes

  2. Referee: [Reformulation section] The reformulation of action-discontinuity problems into hybrid form is presented as broadening applicability, but the manuscript does not specify the exact measure-theoretic conditions or approximation error introduced when mapping non-smooth dynamics onto the hybrid structure; if this step involves any relaxation, it could affect the unbiasedness guarantee.

    Authors: The reformulation expresses problems with action discontinuities by introducing an auxiliary discrete variable that selects among smooth continuous regimes whose union recovers the original (possibly discontinuous) dynamics. When the discontinuity set has Lebesgue measure zero and the simulator is differentiable almost everywhere within each regime, the mapping is exact and introduces no approximation error to the policy gradient. We acknowledge that the original manuscript did not state these conditions explicitly. The revised version adds a dedicated paragraph in Section 4 that specifies the measure-theoretic requirements (discontinuities of measure zero, differentiability a.e. within regimes) under which the reformulation preserves the unbiasedness of the mixed estimator. revision: yes
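
As a concrete instance of the construction described in this response (an illustration by this review, not the paper's example): a fixed ordering cost makes the per-step cost c(q) = K·1{q > 0} + c·q discontinuous at q = 0, and an auxiliary order/no-order variable restores smoothness within each regime.

```python
import torch

# Illustrative reformulation of an action discontinuity into hybrid form.
K_fixed, c_var = 5.0, 1.2

def cost_original(q):
    # Discontinuous at q = 0: pathwise gradients are unreliable here.
    return K_fixed * (q > 0).float() + c_var * q

def cost_hybrid(z, q):
    # Auxiliary discrete z ∈ {0, 1} (no-order / order) selects the branch;
    # conditional on z the cost is smooth in q, so pathwise gradients apply,
    # while the choice of z is left to the score-function term.
    return z * (K_fixed + c_var * q)

q = torch.tensor(2.0, requires_grad=True)
cost_hybrid(torch.tensor(1.0), q).backward()
print(q.grad)  # tensor(1.2000): d cost / d q = c within the "order" regime
```

Here the union of the two regimes reproduces the original cost exactly and the discontinuity set {q = 0} has Lebesgue measure zero, matching the conditions stated in the response.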

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external estimators without self-reduction.

Full rationale

The paper proposes a mixed pathwise-SF gradient estimator for hybrid discrete-continuous actions and claims it remains unbiased. This claim rests on standard properties of pathwise derivatives (where dynamics are differentiable) and score-function estimators (unbiased by construction in policy gradients), both drawn from prior RL literature rather than fitted or defined within the paper itself. No equations reduce the target gradient to a parameter fit from the same data, nor does the central unbiasedness result collapse to a self-citation chain or ansatz smuggled from the authors' prior work. The reformulation of discontinuous problems into hybrid form is a modeling step, not a derivation that presupposes its own output. Empirical comparisons to PPO are independent of the theoretical claims. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that a mixed estimator can be constructed to be unbiased, which is a domain assumption in RL gradient estimation. No free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption: The simulator is differentiable in the continuous action components.
    Invoked to allow pathwise gradients through the simulator.

pith-pipeline@v0.9.0 · 5563 in / 1337 out tokens · 56937 ms · 2026-05-15T02:09:09.475787+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · 12 internal anchors
