arxiv: 2605.07057 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

Jiamin Xu , Jacqueline Maasch , Kyra Gan

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords causal reinforcement learningMarkov propertystate representationdeep Q-networkscausal DAGsminimal statemulti-order exposurecontrolled redundancy

0 comments

The pith

Given a longitudinal causal graph over observations, a procedure builds a provably minimal Markov state for RL, yet deep networks require multi-order historical exposures to realize any gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies a procedure that turns a given longitudinal causal graph on observed variables into a state representation guaranteed to satisfy the Markov property. Real-world RL often lacks such states from raw data, so this construction fills a basic gap. Tests show that feeding only this minimal state into deep Q-networks produces no reliable improvement, which suggests that neural nets cannot exploit minimality on their own. MOSE instead supplies the Q-function with several orders of historical versions of the same minimal state at once, and this version beats both the pure minimal state and ordinary single-window policies on standard benchmarks and synthetic tasks. The results indicate that some controlled redundancy must accompany minimal causal states before their information becomes usable in practice.

Core claim

Given a longitudinal causal graph over observed variables, a procedure constructs a provably minimal state representation that satisfies the Markov property. In deep RL, the minimal representation alone fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. MOSE addresses this by feeding multi-order historical state constructions into the same Q-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. These results establish that minimal sufficiency is not enough and

What carries the argument

MOSE (Multi-Order State Exposure), the mechanism that augments a minimal Markov state derived from a causal DAG with multiple historical orders and supplies them jointly to a standard Q-network.

If this is right

A provably minimal Markovian state can be derived directly from any accurate longitudinal causal DAG over observed variables.
Standard deep Q-networks cannot exploit the minimality of a state without additional structure such as multi-order histories.
Multi-order exposure of historical states produces higher performance than either the pure minimal state or single-window policies.
Combining the minimal state with MOSE yields further gains beyond MOSE alone.
The performance pattern holds on common RL benchmarks and on synthetic datasets with known causal structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the causal graph must be learned from data rather than provided exactly, small errors could turn the derived state non-Markovian and erase the theoretical guarantee.
The same principle of controlled redundancy might apply to other deep RL architectures such as actor-critic or model-based methods.
An adaptive choice of which historical orders to expose could replace the fixed multi-order scheme and reduce unnecessary computation.
The construction could be tested in partially observable settings where some variables in the causal graph are hidden.

Load-bearing premise

An accurate longitudinal causal graph over the observed variables is supplied as input, and standard neural Q-networks can directly exploit the minimal state when it is augmented with multi-order histories without further architectural changes.

What would settle it

If, on a controlled benchmark where the true causal graph is known exactly, MOSE produces no improvement or produces worse performance than a non-causal baseline that ignores the graph, the claim that multi-order exposure unlocks the benefit of causal states would be refuted.

Figures

Figures reproduced from arXiv: 2605.07057 by Jacqueline Maasch, Jiamin Xu, Kyra Gan.

**Figure 1.** Figure 1: From DAG to MDP. (1) Time-series causal DAG shown at time t ∈ [0, 2], where S0, S1, and S2 are valid states selected by Algorithm F.1. (2) Causal DAG representation of the corresponding MDP [58], a subgraph of (1). In this work, we assume for simplicity, the action At can only affect Xt+1. When this assumption does not hold, we need to modify Theorem 4.1 to also include the parents of actions to the state … view at source ↗

**Figure 2.** Figure 2: Average per episode reward averaged over 17 instances, each with 2 repetitions. From [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Results on GOPHER. particular, the conventional practice of concatenating the most recent four frames may introduce substantial redundancy, since many pixels or past-frame components are only weakly related to the information needed for predicting future rewards and selecting actions in order to estimate the Q-function. As a result, the effective input space becomes unnecessarily large, which can make valu… view at source ↗

read the original abstract

Online reinforcement learning (RL) relies on the Markov property for guaranteed performance, but real-world applications often lack well-defined states given raw observed variables. While causal RL has attracted growing interest, existing work typically assumes Markovian states are provided and focuses on using causality to accelerate learning, leaving a fundamental gap: \emph{given a longitudinal causal graph over observed variables, how does one construct MDP states that provably satisfy the Markov property?} We address this by providing a procedure that constructs a provably minimal state representation. In deep RL, we observe that the minimal representation alone empirically fails to improve performance, indicating that neural networks cannot directly exploit Markovian minimality. To address this, we propose \textbf{MOSE} (Multi-Order State Exposure), which feeds multi-order historical state constructions into the same $Q$-function. MOSE consistently outperforms both the minimal state construction and single-window policies on common benchmarks and synthetic datasets. Including the minimal representation alongside MOSE can further improve performance. Our results establish a core principle for causal deep RL: minimal sufficiency is not enough, and \emph{controlled redundancy} is necessary to unlock the benefit of causal state information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete procedure to build provably minimal Markov states from a longitudinal causal graph for RL, then shows that deep RL still needs multi-order historical copies of those states to get gains.

read the letter

The main takeaway is a method that starts from a given causal DAG over time-indexed observations and extracts a smallest set of variables that still forms a Markov state for the MDP. They pair this with MOSE, which passes several lagged versions of the minimal state into the same Q-network instead of relying on the minimal version alone. The abstract reports that the minimal state by itself does not move the needle in deep RL, while the multi-order version does on both synthetic data and standard benchmarks, and that adding the minimal state back in can help further.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide a procedure that constructs a provably minimal Markovian state representation from a longitudinal causal graph over observed variables for use in RL. It observes that this minimal representation alone does not improve performance in deep RL, and proposes MOSE which feeds multi-order historical state constructions into the Q-function. MOSE is reported to consistently outperform the minimal state construction and single-window policies on common benchmarks and synthetic datasets. The paper concludes that minimal sufficiency is not enough and controlled redundancy is necessary to unlock the benefit of causal state information in deep RL.

Significance. If the results hold, this work is significant for causal deep RL as it provides a principled way to derive minimal Markov states from causal DAGs and highlights a key practical issue with using minimal representations in neural RL agents. The proposal of MOSE as a simple way to add controlled redundancy is a useful contribution. The paper explicitly credits the idea of using causal graphs but extends it to state construction and empirical validation of the redundancy principle. However, the significance depends on the rigor of the proof and experiments, which are not detailed in the abstract.

major comments (2)

[Construction procedure (likely §3)] The claim that the procedure constructs a 'provably minimal state representation' is central but the manuscript does not supply the algorithm steps or a proof sketch in the provided text, preventing assessment of whether the construction indeed satisfies the Markov property without additional assumptions.
[Empirical evaluation (likely §5)] The assertion that 'MOSE consistently outperforms' both minimal and single-window policies lacks any quantitative results, error bars, baseline details, or statistical tests in the abstract, which is load-bearing for the claim that minimal sufficiency is not enough.

minor comments (2)

[Abstract] The abstract mentions 'common benchmarks and synthetic datasets' but does not specify which ones, reducing clarity.
[Notation] The term 'multi-order historical state constructions' is introduced without a formal definition or equation in the summary text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below by referencing the relevant sections of the full manuscript and outlining the revisions we will make to improve clarity and accessibility of the key claims.

read point-by-point responses

Referee: [Construction procedure (likely §3)] The claim that the procedure constructs a 'provably minimal state representation' is central but the manuscript does not supply the algorithm steps or a proof sketch in the provided text, preventing assessment of whether the construction indeed satisfies the Markov property without additional assumptions.

Authors: Section 3 of the full manuscript presents the complete construction procedure as an algorithm that extracts a minimal set of variables from the longitudinal causal DAG such that the resulting state satisfies the Markov property for the RL process. The section includes pseudocode for the procedure and a proof sketch based on d-separation and the definition of minimal sufficient statistics for the transition and reward functions. The proof requires no assumptions beyond the given causal graph being a faithful representation of the data-generating process. We will revise the manuscript to move the proof sketch into the main text (currently in the appendix) and add an explicit statement that the construction is minimal by construction. revision: yes
Referee: [Empirical evaluation (likely §5)] The assertion that 'MOSE consistently outperforms' both minimal and single-window policies lacks any quantitative results, error bars, baseline details, or statistical tests in the abstract, which is load-bearing for the claim that minimal sufficiency is not enough.

Authors: Section 5 reports the full experimental results on standard RL benchmarks and synthetic datasets, including mean returns with standard error bars over 10 random seeds, explicit baseline implementations (minimal state, fixed-window history, and standard DQN), and statistical significance tests confirming MOSE's improvements. To make the abstract self-contained and address the load-bearing nature of the claim, we will revise it to include concise quantitative highlights such as average performance gains and significance levels. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core claim is a procedure that, given an external longitudinal causal graph over observed variables as input, constructs a provably minimal state representation satisfying the Markov property. This is presented as derived from causal graph properties rather than fitted parameters or self-referential definitions. The subsequent observation that the minimal state alone fails to improve deep RL performance (leading to the MOSE multi-order augmentation) is an empirical finding, not a mathematical reduction to the input. No load-bearing equations, uniqueness theorems, or ansatzes are shown to collapse by construction to the provided causal graph or to self-citations; the central result remains independent of the fitted Q-networks and rests on the external graph plus experimental validation. This is the expected honest non-finding for a method whose inputs are stated as given and whose outputs are not tautological renamings of those inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that a correct longitudinal causal graph is supplied and that the construction rule derived from it yields a state that neural networks can use once augmented with controlled redundancy.

axioms (1)

domain assumption A longitudinal causal graph over observed variables is given and correctly encodes the temporal dependencies.
The entire construction procedure begins from this supplied graph.

invented entities (1)

MOSE (Multi-Order State Exposure) no independent evidence
purpose: Feeding multiple differently ordered historical versions of the minimal state into the same Q-function.
New technique introduced to overcome the empirical failure of the minimal state alone.

pith-pipeline@v0.9.0 · 5509 in / 1491 out tokens · 48118 ms · 2026-05-11T02:29:21.405332+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We address this by providing a procedure that constructs a provably minimal state representation... minimal sufficiency is not enough, and controlled redundancy is necessary
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 (Graphical criterion for valid causal state space construction)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages

[1]

MIT press Cambridge, 1998

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

work page 1998
[2]

The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care

Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature medicine, 24(11):1716–1720, 2018

work page 2018
[3]

Rethinking progression of memory state in robotic manipulation: An object-centric perspective

Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa V o, Kashu Yamazaki, Chase Rainwater, Tung Kieu, et al. Rethinking progression of memory state in robotic manipulation: An object-centric perspective. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 3407–3415, 2026

work page 2026
[4]

State of the art—a survey of partially observable markov decision processes: theory, models, and algorithms.Management science, 28(1):1–16, 1982

George E Monahan. State of the art—a survey of partially observable markov decision processes: theory, models, and algorithms.Management science, 28(1):1–16, 1982

work page 1982
[5]

Deep recurrent q-learning for partially observable mdps

Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. InAAAI fall symposia, volume 45, page 141, 2015

work page 2015
[6]

Recurrent experience replay in distributed reinforcement learning

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. InInternational conference on learning representations, 2018

work page 2018
[7]

Agent57: Outperforming the atari human benchmark

Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvit- skyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the atari human benchmark. InInternational conference on machine learning, pages 507–517. PMLR, 2020

work page 2020
[8]

Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

work page 2021
[9]

Online decision transformer

Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer. Ininternational conference on machine learning, pages 27042–27059. PMLR, 2022

work page 2022
[10]

Does the markov decision process fit the data: Testing for the markov property in sequential decision making

Chengchun Shi, Runzhe Wan, Rui Song, Wenbin Lu, and Ling Leng. Does the markov decision process fit the data: Testing for the markov property in sequential decision making. In International Conference on Machine Learning, pages 8807–8817. PMLR, 2020

work page 2020
[11]

Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

work page 2015
[12]

Rainbow: Combining improve- ments in deep reinforcement learning

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improve- ments in deep reinforcement learning. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

work page 2018
[13]

Model based reinforcement learning for atari

Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Bła˙zej Osi´nski, Roy H Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, et al. Model based reinforcement learning for atari. InInternational Conference on Learning Representations, 2020

work page 2020
[14]

Finding the framestack: Learning what to remember for non-markovian reinforcement learning

Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, and Tim Klinger. Finding the framestack: Learning what to remember for non-markovian reinforcement learning. InFinding the Frame Workshop at RLC 2025, 2025

work page 2025
[15]

Automatic reward shaping from confounded offline data.arXiv preprint arXiv:2505.11478, 2025

Mingxuan Li, Junzhe Zhang, and Elias Bareinboim. Automatic reward shaping from confounded offline data.arXiv preprint arXiv:2505.11478, 2025

work page arXiv 2025
[16]

Confounding robust deep reinforcement learning: A causal approach

Mingxuan Li, Junzhe Zhang, and Elias Bareinboim. Confounding robust deep reinforcement learning: A causal approach. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 10

work page 2025
[17]

arXiv preprint arXiv:1804.06893 , year=

Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning.arXiv preprint arXiv:1804.06893, 2018

work page arXiv 2018
[18]

The primacy bias in deep reinforcement learning

Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. InInternational conference on machine learning, pages 16828–16847. PMLR, 2022

work page 2022
[19]

The dormant neuron phenomenon in deep reinforcement learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InInternational Conference on Machine Learning, pages 32145–32168. PMLR, 2023

work page 2023
[20]

Quantifying generalization in reinforcement learning

Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. InInternational conference on machine learning, pages 1282–1289. PMLR, 2019

work page 2019
[21]

Noisy networks for exploration

M Fortunato, MG Azar, B Piot, J Menick, M Hessel, I Osband, A Graves, V Mnih, R Munos, D Hassabis, O Pietquin, and S Blundell, C Legg. Noisy networks for exploration. InInterna- tional Conference on Learning Representations (ICLR), 2018

work page 2018
[22]

Reinforcement learning with augmented data.Advances in Neural Information Processing Systems, 33:19884–19895, 2020

Misha Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement learning with augmented data.Advances in Neural Information Processing Systems, 33:19884–19895, 2020

work page 2020
[23]

Curl: Contrastive unsupervised repre- sentations for reinforcement learning

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Curl: Contrastive unsupervised repre- sentations for reinforcement learning. InInternational conference on machine learning, pages 5639–5650. PMLR, 2020

work page 2020
[24]

Decoupling representation learning from reinforcement learning

Adam Stooke, Kimin Lee, Pieter Abbeel, and Michael Laskin. Decoupling representation learning from reinforcement learning. InInternational conference on machine learning, pages 9870–9879. PMLR, 2021

work page 2021
[25]

Multi-view reinforcement learning.Advances in Neural Information Processing Systems, 32, 2019

Minne Li, Lisheng Wu, Jun Wang, and Haitham Bou Ammar. Multi-view reinforcement learning.Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[26]

Unsupervised learning of visual 3d keypoints for control

Boyuan Chen, Pieter Abbeel, and Deepak Pathak. Unsupervised learning of visual 3d keypoints for control. InInternational Conference on Machine Learning, pages 1539–1549. PMLR, 2021

work page 2021
[27]

Look closer: Bridging egocentric and third-person views with transformers for robotic manipulation

Rishabh Jangir, Nicklas Hansen, Sambaran Ghosal, Mohit Jain, and Xiaolong Wang. Look closer: Bridging egocentric and third-person views with transformers for robotic manipulation. IEEE Robotics and Automation Letters, 7(2):3046–3053, 2022

work page 2022
[28]

Information-theoretic state space model for multi-view reinforcement learning

HyeongJoo Hwang, Seokin Seo, Youngsoo Jang, Sungyoon Kim, Geon-Hyeong Kim, Se- unghoon Hong, and Kee-Eung Kim. Information-theoretic state space model for multi-view reinforcement learning. InProceedings of the 40th International Conference on Machine Learning, pages 14249–14282, 2023

work page 2023
[29]

Testing for the markov property in timeseries.Econometric Theory, 28(1):130–178, 2012

Bin Chen and Yongmiao Hong. Testing for the markov property in timeseries.Econometric Theory, 28(1):130–178, 2012

work page 2012
[30]

Yunzhe Zhou, Chengchun Shi, Lexin Li, and Qiwei Yao. Testing for the markov property in time series via deep conditional generative learning.Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(4):1204–1222, 2023

work page 2023
[31]

Causal directed acyclic graph-informed reward design

Lutong Zou, Ziping Xu, Daiqi Gao, and Susan Murphy. Causal directed acyclic graph-informed reward design. InThe Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2025

work page 2025
[32]

Robust reward modeling via causal rubrics

Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, et al. Robust reward modeling via causal rubrics. InICML 2025 Workshop on Models of Human Feedback for AI Alignment, 2025

work page 2025
[33]

Confounding robust continuous control via automatic reward shaping.Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems, 2026

Mateo Juliani, Mingxuan Li, and Elias Bareinboim. Confounding robust continuous control via automatic reward shaping.Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems, 2026. 11

work page 2026
[34]

Designing optimal dynamic treatment regimes: A causal reinforcement learning approach

Junzhe Zhang and Elias Bareinboim. Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. InInternational Conference on Machine Learning. PMLR, 2020

work page 2020
[35]

Causal dynamics learning for task-independent state abstraction

Zizhao Wang, Xuesu Xiao, Zifan Xu, Yuke Zhu, and Peter Stone. Causal dynamics learning for task-independent state abstraction. InInternational Conference on Machine Learning, pages 23151–23180. PMLR, 2022

work page 2022
[36]

Invariant causal prediction for block mdps

Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block mdps. InInternational Conference on Machine Learning, pages 11214–11224. PMLR, 2020

work page 2020
[37]

Building minimal and reusable causal state abstractions for reinforcement learning

Zizhao Wang, Caroline Wang, Xuesu Xiao, Yuke Zhu, and Peter Stone. Building minimal and reusable causal state abstractions for reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15778–15786, 2024

work page 2024
[38]

Harnessing causality in reinforce- ment learning with bagged decision times

Daiqi Gao, Hsin-Yu Lai, Predrag Klasnja, and Susan Murphy. Harnessing causality in reinforce- ment learning with bagged decision times. InThe 28th International Conference on Artificial Intelligence and Statistics, 2025

work page 2025
[39]

State abstraction for programmable reinforcement learning agents

David Andre, Stuart J Russell, et al. State abstraction for programmable reinforcement learning agents. InAnnual AAAI Conference on Artificial Intelligence, 2002

work page 2002
[40]

State abstractions for lifelong reinforcement learning

David Abel, Dilip Arumugam, Lucas Lehnert, and Michael Littman. State abstractions for lifelong reinforcement learning. InInternational conference on machine learning, pages 10–19. PMLR, 2018

work page 2018
[41]

Bridging State and History Represen- tations: Understanding Self-Predictive RL, April 2024

Tianwei Ni, Benjamin Eysenbach, Erfan Seyedsalehi, Michel Ma, Clement Gehring, Aditya Mahajan, and Pierre-Luc Bacon. Bridging state and history representations: Understanding self-predictive rl.arXiv preprint arXiv:2401.08898, 2024

work page arXiv 2024
[42]

State representation learning for control: An overview.Neural Networks, 108:379–392, 2018

Timothée Lesort, Natalia Díaz-Rodríguez, Jean-Franois Goudou, and David Filliat. State representation learning for control: An overview.Neural Networks, 108:379–392, 2018

work page 2018
[43]

Can increasing input dimensionality improve deep reinforcement learning? InInternational conference on machine learning, pages 7424–7433

Kei Ota, Tomoaki Oiki, Devesh Jha, Toshisada Mariyama, and Daniel Nikovski. Can increasing input dimensionality improve deep reinforcement learning? InInternational conference on machine learning, pages 7424–7433. PMLR, 2020

work page 2020
[44]

Curiosity-driven exploration by self-supervised prediction

Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. InInternational Conference on Machine Learning, pages 2778–

work page
[45]

Image augmentation is all you need: Regu- larizing deep reinforcement learning from pixels

Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regu- larizing deep reinforcement learning from pixels. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[46]

Time-contrastive networks: Self-supervised learning from video

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In2018 IEEE international conference on robotics and automation (ICRA), pages 1134–1141. IEEE, 2018

work page 2018
[47]

Unsupervised reinforcement learning with contrastive intrinsic control.Advances in Neural Information Processing Systems, 35:34478–34491, 2022

Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control.Advances in Neural Information Processing Systems, 35:34478–34491, 2022

work page 2022
[48]

Contrastive learning as goal-conditioned reinforcement learning.Advances in Neural Information Process- ing Systems, 35:35603–35620, 2022

Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Russ R Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning.Advances in Neural Information Process- ing Systems, 35:35603–35620, 2022

work page 2022
[49]

Causality: Models, reasoning, and inference.Cambridge, UK: Cambridge University Press, 19(2):3, 2000

Judea Pearl. Causality: Models, reasoning, and inference.Cambridge, UK: Cambridge University Press, 19(2):3, 2000. 12

work page 2000
[50]

Correa, Duligur Ibeling, and Thomas Icard.On Pearl’s Hierarchy and the Foundations of Causal Inference, page 507–556

Elias Bareinboim, Juan D. Correa, Duligur Ibeling, and Thomas Icard.On Pearl’s Hierarchy and the Foundations of Causal Inference, page 507–556. Association for Computing Machinery, New York, NY , USA, 1 edition, 2022. ISBN 9781450395861. URLhttps://doi.org/10. 1145/3501714.3501743

work page arXiv 2022
[51]

Probabilities of Causation: Three Counterfactual Interpretations and Their Identifi- cation.Synthese, 121:93–149, 1999

Judea Pearl. Probabilities of Causation: Three Counterfactual Interpretations and Their Identifi- cation.Synthese, 121:93–149, 1999

work page 1999
[52]

Causal inference: A tale of three frameworks.arXiv preprint arXiv:2511.21516, 2025

Linbo Wang, Thomas Richardson, and James Robins. Causal inference: A tale of three frameworks.arXiv preprint arXiv:2511.21516, 2025

work page arXiv 2025
[53]

Causal network reconstruction from time series: From theoretical assumptions to practical estimation.Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7), 2018

Jakob Runge. Causal network reconstruction from time series: From theoretical assumptions to practical estimation.Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7), 2018

work page 2018
[54]

Survey and evaluation of causal discovery methods for time series.Journal of Artificial Intelligence Research, 73:767–819, 2022

Charles K Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time series.Journal of Artificial Intelligence Research, 73:767–819, 2022

work page 2022
[55]

A survey on causal discovery methods for iid and time series data.Transactions on Machine Learning Research, 2023

Uzma Hasan, Emam Hossain, and Md Osman Gani. A survey on causal discovery methods for iid and time series data.Transactions on Machine Learning Research, 2023

work page 2023
[56]

Causal inference on time series using restricted structural equation models.Advances in neural information processing systems, 26, 2013

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on time series using restricted structural equation models.Advances in neural information processing systems, 26, 2013

work page 2013
[57]

Near-optimal regret bounds for reinforcement learning

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. InAdvances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008

work page 2008
[58]

An introduction to causal reinforcement learning.arXiv preprint arXiv:2101.06498, 2025

Elias Bareinboim, Sanghack Lee, and Junzhe Zhang. An introduction to causal reinforcement learning.arXiv preprint arXiv:2101.06498, 2025

work page arXiv 2025
[59]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

work page 2018
[60]

Machado, Marc G

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents (extended abstract). InProceedings of the 27th International Joint Conference on Artificial Intelligence, page 5573–5577, 2018. ISBN 9780999241127

work page 2018
[61]

Deep Reinforcement Learning with Double Q-learning

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning, 2015. URLhttps://arxiv.org/abs/1509.06461

work page Pith review arXiv 2015
[62]

Nonlinear causal discovery with additive noise models.Advances in neural information processing systems, 21, 2008

Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models.Advances in neural information processing systems, 21, 2008

work page 2008
[63]

Directlingam: A direct method for learning a linear non-gaussian structural equation model.Journal of Machine Learning Research-JMLR, 12(Apr):1225–1248, 2011

Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvarinen, Yoshinobu Kawahara, Takashi Washio, Patrik O Hoyer, Kenneth Bollen, and Patrik Hoyer. Directlingam: A direct method for learning a linear non-gaussian structural equation model.Journal of Machine Learning Research-JMLR, 12(Apr):1225–1248, 2011

work page 2011
[64]

Causal discovery with continuous additive noise models.Journal of Machine Learning Research, 15:2009–2053, 2014

Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models.Journal of Machine Learning Research, 15:2009–2053, 2014

work page 2009
[65]

On the identifiability of the post-nonlinear causal model

K Zhang and A Hyvärinen. On the identifiability of the post-nonlinear causal model. In25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 647–655. AUAI Press, 2009

work page 2009
[66]

Investigating causal relations by econometric models and cross-spectral methods.Econometrica: journal of the Econometric Society, pages 424–438, 1969

Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods.Econometrica: journal of the Econometric Society, pages 424–438, 1969

work page 1969
[67]

Measurement of linear dependence and feedback between multiple time series

John Geweke. Measurement of linear dependence and feedback between multiple time series. Journal of the American statistical association, 77(378):304–313, 1982. 13

work page 1982
[68]

Interpretable models for granger causality using self-explaining neural networks.arXiv preprint arXiv:2101.07600, 2021

Riˇcards Marcinkevi ˇcs and Julia E V ogt. Interpretable models for granger causality using self-explaining neural networks.arXiv preprint arXiv:2101.07600, 2021

work page arXiv 2021
[69]

Neural granger causality

Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2022

work page 2022
[70]

Amortized causal discovery: Learning to infer causal graphs from time-series data

Sindy Löwe, David Madras, Richard Zemel, and Max Welling. Amortized causal discovery: Learning to infer causal graphs from time-series data. InConference on Causal Learning and Reasoning, pages 509–525. PMLR, 2022

work page 2022
[71]

Causal discovery for non-stationary non-linear time series data using just-in-time modeling

Daigo Fujiwara, Kazuki Koyama, Keisuke Kiritoshi, Tomomi Okawachi, Tomonori Izumitani, and Shohei Shimizu. Causal discovery for non-stationary non-linear time series data using just-in-time modeling. InConference on Causal Learning and Reasoning, pages 880–894. PMLR, 2023

work page 2023
[72]

On causal discovery from time series data using fci.Proba- bilistic graphical models, 16, 2010

Doris Entner and Patrik O Hoyer. On causal discovery from time series data using fci.Proba- bilistic graphical models, 16, 2010

work page 2010
[73]

Detecting and quantifying causal associations in large nonlinear time series datasets.Science advances, 5 (11):eaau4996, 2019

Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets.Science advances, 5 (11):eaau4996, 2019

work page 2019
[74]

Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets

Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. InProceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), pages 1388–1397, 2020

work page 2020
[75]

Causal discovery for time series from multiple datasets with latent contexts

Wiebke Günther, Urmi Ninad, and Jakob Runge. Causal discovery for time series from multiple datasets with latent contexts. InUncertainty in Artificial Intelligence, pages 766–776. PMLR, 2023

work page 2023
[76]

Dynotears: Structure learning from time-series data

Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam. Dynotears: Structure learning from time-series data. InInternational conference on artificial intelligence and statistics, pages 1595–1605. PMLR, 2020

work page 2020
[77]

Neural graphical modelling in continuous-time: consistency guarantees and algorithms

Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-time: consistency guarantees and algorithms. InInternational Conference on Learning Representations, 2022

work page 2022
[78]

Nts-notears: Learning nonpara- metric dbns with prior knowledge

Xiangyu Sun, Oliver Schulte, Guiliang Liu, and Pascal Poupart. Nts-notears: Learning nonpara- metric dbns with prior knowledge. InInternational Conference on Artificial Intelligence and Statistics, pages 1942–1964. PMLR, 2023

work page 1942
[79]

Conditional local independence testing for it\ˆ o processes with applications to dynamic causal discovery.arXiv preprint arXiv:2506.07844, 2025

Mingzhou Liu, Xinwei Sun, and Yizhou Wang. Conditional local independence testing for it\ˆ o processes with applications to dynamic causal discovery.arXiv preprint arXiv:2506.07844, 2025

work page arXiv 2025
[80]

Causal bandits: Learning good interven- tions via causal inference.Advances in neural information processing systems, 29, 2016

Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: Learning good interven- tions via causal inference.Advances in neural information processing systems, 29, 2016

work page 2016

Showing first 80 references.