arxiv: 2605.12261 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links

Delay-Empowered Causal Hierarchical Reinforcement Learning

Chenran Zhao, Chunping Qiu, Dianxi Shi, Haotian Wang, Mengzhu Wang, Shaowu Yang, Yaowen Zhang

Pith reviewed 2026-05-13 05:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords: delay-aware reinforcement learning · hierarchical reinforcement learning · causal modeling · stochastic delays · empowerment · temporal uncertainty · proactive exploration

The pith

DECHRL learns causal structures and stochastic delay distributions from delayed observations to drive empowerment-based exploration in hierarchical reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many real-world tasks involve actions whose effects appear only after random time lags, which disrupts standard reinforcement learning. Existing delay-aware methods typically need prior knowledge of the delay distribution or access to undelayed data, while hierarchical methods have been limited to fixed delays. The paper proposes DECHRL, which discovers both the underlying causal relationships in state transitions and the probability distributions of those delays directly from observations. It folds these models into a delay-aware empowerment objective that steers the agent toward states offering greater control despite timing noise. Tests in grid and Minecraft-style environments with added stochastic delays show the method models the delays effectively and outperforms baselines in decision quality.

Core claim

DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty.

What carries the argument

The delay-aware empowerment objective, built on learned causal models of state transitions and stochastic delay distributions inside a hierarchical reinforcement learning architecture.
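
For orientation, empowerment is usually defined as the channel capacity between an agent's actions and the states that follow, that is, the maximum over action distributions of the mutual information I(A; S'). This page does not reproduce the paper's delay-aware formulation, so the sketch below is only an illustration and not DECHRL's objective. It evaluates I(A; S') under a uniform action distribution (a lower bound on empowerment) for a hypothetical two-action state in which each action's effect lands within the horizon with probability `q`. The function names and the toy channel are assumptions made for this example.

```python
import math

def mutual_information(p_a, p_s_given_a):
    """I(A; S') in bits for a discrete channel.

    p_a: list of action probabilities.
    p_s_given_a: one row per action, each row a distribution over outcomes.
    """
    n_out = len(p_s_given_a[0])
    # Marginal outcome distribution p(s') = sum_a p(a) p(s'|a).
    p_s = [sum(p_a[a] * p_s_given_a[a][s] for a in range(len(p_a)))
           for s in range(n_out)]
    mi = 0.0
    for a, pa in enumerate(p_a):
        for s in range(n_out):
            joint = pa * p_s_given_a[a][s]
            if joint > 0.0:  # skip zero-probability cells
                mi += joint * math.log2(p_s_given_a[a][s] / p_s[s])
    return mi

def toy_channel(q):
    """Hypothetical state: outcome 0 means 'nothing happened yet',
    outcomes 1 and 2 are the distinct effects of actions 0 and 1.
    q is the probability that an effect lands within the horizon."""
    return [[1.0 - q, q, 0.0],
            [1.0 - q, 0.0, q]]

uniform = [0.5, 0.5]
mi_prompt = mutual_information(uniform, toy_channel(1.0))   # effects always on time
mi_delayed = mutual_information(uniform, toy_channel(0.5))  # half arrive too late
print(mi_prompt, mi_delayed)
```

In this toy channel the mutual information equals `q` bits, so a state whose effects are more likely to arrive in time scores higher, which is the intuition behind steering exploration toward controllable states.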

If this is right

Agents can manage variable and unknown delays without requiring advance knowledge of their statistics.
The combination of hierarchy, causal modeling, and delay-aware empowerment produces more robust decision-making than state-augmentation or prior-knowledge baselines.
Proactive exploration is directed toward states that remain controllable even when effects are stochastically delayed.
Performance gains appear in both grid-world and Minecraft-like domains once stochastic delays are introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same modeling approach could be applied to continuous-control domains such as robotic manipulation where actuator or sensor delays vary with load.
If the learned causal graphs remain stable under changing delay statistics, the method might serve as a building block for lifelong learning under non-stationary timing.
One could test whether the empowerment term still yields gains when delays are allowed to depend on the chosen actions rather than being independent.

Load-bearing premise

The causal structure of state transitions and the stochastic delay distributions can be accurately learned from delayed observations alone without prior knowledge or non-delayed data.

What would settle it

In a controlled test environment with known ground-truth causal structure and delay distributions, measure whether DECHRL recovers those distributions accurately and whether removing the causal or delay components causes a clear drop in performance relative to the full method.
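
One way to run the first half of that check is to compare a learned delay distribution against the known one with a distance such as total variation. The sketch below is an assumed setup, not the paper's evaluation protocol: the ground-truth distribution, the sample size, and the names `empirical_delay_distribution` and `total_variation` are all hypothetical, and it presumes each effect has already been attributed to the action that caused it.

```python
import random
from collections import Counter

def empirical_delay_distribution(delays, tau_max):
    """Histogram estimate of p(tau) over the support 0..tau_max."""
    counts = Counter(delays)
    n = len(delays)
    return [counts.get(tau, 0) / n for tau in range(tau_max + 1)]

def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical ground truth for a single transition, support 0..4.
TAU_MAX = 4
true_p = [0.1, 0.2, 0.4, 0.2, 0.1]

rng = random.Random(0)  # fixed seed so the run is repeatable
observed = rng.choices(range(TAU_MAX + 1), weights=true_p, k=20000)

learned_p = empirical_delay_distribution(observed, TAU_MAX)
tv = total_variation(true_p, learned_p)
print(learned_p, tv)
```

A small distance here only says the delay estimator works when attribution is given; the harder question of whether attribution itself is recoverable needs the ablation described above.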

Figures

Figures reproduced from arXiv: 2605.12261 by Chenran Zhao, Chunping Qiu, Dianxi Shi, Haotian Wang, Mengzhu Wang, Shaowu Yang, Yaowen Zhang.

**Figure 1.** The overall framework of DECHRL.

**Figure 2.** State-transition dynamics and their average delay characteristics involved in the …

**Figure 3.** The delay distribution of transitions on the MineCraft tasks under different …

**Figure 4.** The delay distribution of transitions on the MiniGrid tasks under …

**Figure 5.** The sub-goal training efficiency of different delay-handling methods across …

**Figure 6.** The ASR and ADC of DECHRL and enhanced HRL baselines across four tasks

**Figure 7.** DECHRL’s performance on the GetSilverore (τmax = 4) tasks with varying gradient estimators (REINFORCE, PPO, A2C).

**Figure 8.** DECHRL’s performance on the GetSilverore (τmax = 4) tasks with varying σdelay = {0.4, 0.6, 0.8, 1.0}. Increasing σdelay leads to greater uncertainty in state transition delays, posing challenges not only for delay detection but also for effective adaptation.

**Figure 9.** The ASR and ADC of Simplified-DECHRL on the …

**Figure 10.** DECHRL’s performance on the GetSilverore (τmax = 4, σdelay = 0.8) task with varying values of hyperparameters λ1, λ2. Based on these observations, we set both λ1 and λ2 to 0.05, which lies near the midpoint of the effective range (0.01 ∼ 0.1). We would like to emphasize that the brown and red curves only begin to produce meaningful results after approximately 25k training episodes.

**Figure 11.** DECHRL’s performance on the GetSilverore (τmax = 4, σdelay = 0.4) task with varying values of subgoal success ratio threshold. The achievement of higher-level subgoals relies on the reliable execution of lower-level subgoals.

Original abstract

Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their generalization. Hierarchical reinforcement learning, by contrast, inherently offers advantages in handling delays due to its hierarchical structure, yet existing methods are restricted to fixed delays. To address these limitations, we propose Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL). DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states, thereby improving performance under temporal uncertainty. We evaluate DECHRL in modified 2D-Minecraft and MiniGrid environments featuring stochastic delays. Experimental results show that DECHRL effectively models temporal delays and significantly outperforms baselines in decision-making under temporal uncertainty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DECHRL learns causal transitions and stochastic delays from delayed data, then folds them into a hierarchical empowerment objective, but the identifiability step looks under-supported.

DECHRL tries to learn both the causal structure over state transitions and the per-transition stochastic delay distributions directly from sequences of delayed observations, then plugs those into a delay-aware empowerment objective that encourages exploration of controllable states. The paper positions this as fixing gaps in prior delay-aware RL (which often needs state augmentation or known delay distributions) and fixed-delay HRL. That combination is the actual new piece: variable stochastic delays handled inside a hierarchical setup without those common restrictions. The experiments in modified 2D-Minecraft and MiniGrid environments with stochastic delays report that the method models the delays and beats baselines on decision-making under temporal uncertainty. That is concrete progress on a practical issue that shows up in robotics and games.

The soft spot is the recovery step. Each observed transition is a mixture over unknown lags, so multiple causal graphs plus delay kernels can produce the same marginal statistics. The abstract gives no identifiability argument, no proof, and no ablation that checks whether the learned graph and delay distributions match the ground truth in the test environments. If the full paper only shows downstream task performance without validating the intermediate causal and delay models, then the gains could be fragile or environment-specific. The empowerment objective then runs on whatever model was recovered, so any mis-specification carries through.

This is aimed at RL people who already work on hierarchical methods and need to handle real timing noise. It is worth sending to peer review because the problem matters and the framing is fresh, but the referee will need to see explicit checks on the causal learning part before the central claim can be trusted.
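
A small numerical case makes the mixture-of-lags worry concrete. Assuming, purely for illustration, that an intermediate variable Y is not observed, a chain X → Y → Z with two delay kernels yields exactly the same X-to-Z lag distribution as a direct edge X → Z whose kernel is their convolution. The kernels and names below are hypothetical and are not taken from the paper.

```python
def convolve(k1, k2):
    """Distribution of the sum of two independent integer delays.

    k1[i] is P(delay1 = i) and k2[j] is P(delay2 = j)."""
    out = [0.0] * (len(k1) + len(k2) - 1)
    for i, a in enumerate(k1):
        for j, b in enumerate(k2):
            out[i + j] += a * b
    return out

# Model A (chain): X -> Y with delay in {0, 1}, then Y -> Z with delay in {0, 1}.
k_xy = [0.5, 0.5]
k_yz = [0.5, 0.5]
lag_chain = convolve(k_xy, k_yz)

# Model B (direct edge): X -> Z with a single kernel over {0, 1, 2}.
lag_direct = [0.25, 0.5, 0.25]

print(lag_chain, lag_direct)
```

Both models predict the same observed lag statistics, so data on X and Z alone cannot tell them apart. Extra assumptions, such as observing Y or restricting the kernel family, are what would break the tie.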

Referee Report

2 major / 2 minor

Summary. The paper proposes Delay-Empowered Causal Hierarchical Reinforcement Learning (DECHRL) for RL tasks with stochastic action delays. It claims to jointly learn the causal structure of state transitions and the associated per-transition stochastic delay distributions p(τ|s,a,s') directly from delayed observation sequences, embed these into a delay-aware empowerment objective that promotes exploration of controllable states, and thereby achieve superior decision-making compared to baselines. Evaluation is performed on modified 2D-Minecraft and MiniGrid environments that inject stochastic delays; the abstract states that DECHRL effectively models delays and significantly outperforms existing methods.

Significance. If the identifiability and performance claims hold, the work would offer a meaningful step toward practical hierarchical RL in real-world settings with unknown temporal uncertainties, by removing the need for prior delay knowledge or non-delayed data that limits prior delay-aware methods. The integration of causal discovery with empowerment-style objectives is conceptually attractive and could generalize beyond the tested domains.

major comments (2)

[Abstract and §3 (Method)] The central claim that the causal graph and delay distribution p(τ|s,a,s') can be recovered solely from sequences of delayed (s_t, a_t, s_{t+τ}) tuples is load-bearing yet unsupported by any identifiability argument or set of sufficient conditions. In stochastic-delay regimes each observed transition is a mixture over unknown lags; without known delay support, parametric restrictions on the kernel, or access to non-delayed rollouts, multiple (graph, delay) pairs are consistent with the same marginal transition statistics. Consequently the delay-aware empowerment objective operates on a potentially mis-specified model, and any reported performance gain is conditional on successful disentanglement that has not been demonstrated.
[§4 (Experiments)] The abstract asserts that DECHRL 'significantly outperforms baselines' and 'effectively models temporal delays,' but the provided description contains no quantitative metrics, error bars, ablation results isolating the causal-modeling or delay-distribution components, or statistical significance tests. Without these, it is impossible to assess whether the empirical results actually support the superiority claim or merely reflect implementation details.

minor comments (2)

[§3] Notation for the delay distribution and empowerment objective should be introduced with explicit equations rather than descriptive prose alone.
[§4] The environments are described only as 'modified' 2D-Minecraft and MiniGrid; a precise description of how stochastic delays are injected (support, sampling procedure) belongs in the experimental setup.
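
As an example of the level of detail being requested, a delay-injection scheme could be specified along the following lines. This is a hypothetical sketch, not the authors' procedure: it assumes each effect's delay is drawn from a Gaussian with standard deviation `sigma_delay` around a nominal mean, rounded and clipped to the support 0..`tau_max`, and that pending effects wait in a queue until they are due. The class and parameter names are invented for the example.

```python
import heapq
import random

class DelayedEffectQueue:
    """Holds effects and releases each one after a sampled integer delay."""

    def __init__(self, tau_max, mean_delay, sigma_delay, seed=0):
        self.tau_max = tau_max
        self.mean_delay = mean_delay
        self.sigma_delay = sigma_delay
        self.rng = random.Random(seed)
        self._heap = []      # entries: (due_step, insertion_order, effect)
        self._counter = 0    # tie-breaker so effects are never compared

    def sample_delay(self):
        # Assumed sampling rule: rounded Gaussian clipped to [0, tau_max].
        raw = self.rng.gauss(self.mean_delay, self.sigma_delay)
        return min(self.tau_max, max(0, round(raw)))

    def push(self, step, effect):
        """Schedule an effect triggered at `step`; return its due step."""
        due = step + self.sample_delay()
        heapq.heappush(self._heap, (due, self._counter, effect))
        self._counter += 1
        return due

    def pop_due(self, step):
        """Return every effect whose due step is <= the current step."""
        released = []
        while self._heap and self._heap[0][0] <= step:
            released.append(heapq.heappop(self._heap)[2])
        return released

queue = DelayedEffectQueue(tau_max=4, mean_delay=2, sigma_delay=0.8)
due_steps = [queue.push(step=0, effect=f"effect_{i}") for i in range(5)]
released = []
for t in range(5):  # steps 0..4 cover the whole delay support
    released.extend(queue.pop_due(t))
print(due_steps, released)
```

Stating the support, the sampling rule, and the release order at this granularity would let a reader reproduce the delayed environments exactly.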

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address the two major concerns point by point below. Both points are valid and will be incorporated into a revised manuscript.

Referee: [Abstract and §3 (Method)] The central claim that the causal graph and delay distribution p(τ|s,a,s') can be recovered solely from sequences of delayed (s_t, a_t, s_{t+τ}) tuples is load-bearing yet unsupported by any identifiability argument or set of sufficient conditions. In stochastic-delay regimes each observed transition is a mixture over unknown lags; without known delay support, parametric restrictions on the kernel, or access to non-delayed rollouts, multiple (graph, delay) pairs are consistent with the same marginal transition statistics. Consequently the delay-aware empowerment objective operates on a potentially mis-specified model, and any reported performance gain is conditional on successful disentanglement that has not been demonstrated.

Authors: We agree that the manuscript lacks a formal identifiability argument or sufficient conditions for unique recovery of the causal graph and per-transition delay distributions from delayed observation sequences alone. This is a genuine limitation of the current presentation. Our method relies on joint optimization of the causal model parameters together with the delay-aware empowerment objective inside the hierarchical policy; the hierarchical decomposition and the controllability-driven exploration provide an inductive bias that promotes disentanglement in practice. We will revise §3 to explicitly state the absence of a general identifiability guarantee, discuss the mixture-of-lags problem, and articulate the modeling assumptions (e.g., finite delay support, Markovian state transitions, and the role of the empowerment term) under which the learned model remains useful. We will also add a short synthetic-data experiment illustrating recovery quality under controlled stochastic-delay conditions.

Revision: partial

Referee: [§4 (Experiments)] The abstract asserts that DECHRL 'significantly outperforms baselines' and 'effectively models temporal delays,' but the provided description contains no quantitative metrics, error bars, ablation results isolating the causal-modeling or delay-distribution components, or statistical significance tests. Without these, it is impossible to assess whether the empirical results actually support the superiority claim or merely reflect implementation details.

Authors: We concur that the experimental section requires substantially more quantitative detail. While the manuscript contains performance plots, the accompanying text does not report numerical values, variability measures, component ablations, or statistical tests. In the revised version we will expand §4 with tables reporting mean returns and standard deviations over at least five independent seeds, ablation variants that disable either the causal-graph learner or the explicit delay-distribution estimator, and paired t-test p-values comparing DECHRL against each baseline. These additions will allow readers to evaluate the magnitude and reliability of the reported gains.

Revision: yes
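
The statistics promised in the second response are simple to compute. The sketch below uses only the Python standard library and placeholder numbers that are not results from the paper. It reports the mean and sample standard deviation over five seeds together with the paired t statistic, which would then be compared against a t distribution with n − 1 degrees of freedom (for instance via `scipy.stats.ttest_rel`) to obtain a p-value.

```python
import math
import statistics

def summarize(returns):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(returns), statistics.stdev(returns)

def paired_t_statistic(x, y):
    """t statistic for paired samples x[i], y[i] (same seed, two methods)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder per-seed returns, invented for illustration only.
method_returns = [10.0, 12.0, 11.0, 13.0, 14.0]
baseline_returns = [9.0, 10.0, 10.0, 11.0, 11.0]

mean_m, std_m = summarize(method_returns)
t_stat = paired_t_statistic(method_returns, baseline_returns)
print(mean_m, std_m, t_stat)
```

Pairing by seed matters when both methods share environment randomness, because it removes seed-level variance from the comparison; with only five seeds, the resulting p-values should still be read cautiously.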

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes a pipeline of learning causal transition structure and stochastic delay distributions from delayed observations, then feeding those learned quantities into a delay-aware empowerment objective for exploration. This is a standard modeling-then-optimize structure in RL and does not reduce by construction to its inputs; the objective is defined on the outputs of the learned models rather than being a tautological re-expression of the fitting loss. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are indicated in the abstract or description. The method is evaluated on modified environments with external performance metrics, keeping the central claim falsifiable and independent of the fitting process itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the premise that causal structures and stochastic delays can be learned from delayed data and usefully integrated into an empowerment objective; this depends on domain assumptions about hierarchical RL advantages and the learnability of delays without additional data.

free parameters (1)

stochastic delay distribution parameters
The method explicitly models stochastic delay distributions, which in practice requires parameters estimated or fitted from observed delayed transitions.

axioms (1)

domain assumption: Hierarchical reinforcement learning inherently offers advantages in handling delays due to its hierarchical structure
Invoked in the abstract to contrast with existing methods restricted to fixed delays.

invented entities (1)

delay-aware empowerment objective (no independent evidence)
purpose: Drives proactive exploration toward highly controllable states under temporal uncertainty by incorporating modeled causal structures and delays
Introduced as the core mechanism of DECHRL; no independent evidence provided outside the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
DECHRL explicitly models both the causal structure of state transitions and their associated stochastic delay distributions. These are then incorporated into a delay-aware empowerment objective that drives proactive exploration toward highly controllable states
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We propose DECHRL, a novel Causal Hierarchical Reinforcement Learning (CHRL) framework that explicitly integrates stochastic delay distribution modeling with a delay-aware empowerment objective
