A Lecture Note on Offline RL and IRL, Part II: Foundations of Inverse Reinforcement Learning and Dynamic Discrete Choice Models

Enoch Hyunwook Kang

arxiv: 2605.30843 · v1 · pith:7STW5G4Wnew · submitted 2026-05-29 · 💻 cs.LG · econ.EM

Pith reviewed 2026-06-28 23:31 UTC · model grok-4.3

keywords inverse reinforcement learning, dynamic discrete choice, offline reinforcement learning, reward identification, equivalence, structural econometrics, entropy regularization, nested fixed point

The pith

Entropy-regularized inverse reinforcement learning and dynamic discrete choice models describe the same probabilistic structure for recovering rewards from expert behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that structural econometricians studying dynamic discrete choice models and machine learning researchers studying entropy-regularized inverse reinforcement learning have developed identical models for inferring an expert's unknown reward from offline trajectory data. It proves the equivalence of these formulations and then reviews classical econometric identification results and algorithms alongside modern machine learning IRL techniques. A sympathetic reader would care because recognizing the shared model lets methods, identification theorems, and computational approaches transfer between the two fields without duplication.

Core claim

Two communities have been working on exactly the same probabilistic model under different names: structural econometricians studying dynamic discrete choice (DDC) and machine learners studying entropy-regularized IRL. The paper begins by proving their equivalence, then develops the classical identification result of Magnac and Thesmar together with the computational paradigms that grew out of it, and walks through modern ML/IRL methods while deriving each method's objective and what it does and does not identify.

What carries the argument

The shared probabilistic model in which an expert optimizes an unknown reward inside a known Markov decision process with additive random utility shocks; the paper's central work is proving that entropy-regularized IRL recovers the same object as the DDC formulation.
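
As a reminder of why the two formulations coincide, the standard textbook computation can be sketched as follows. The notation (r, \gamma, P, Q, V, \varepsilon, \pi) is assumed here for illustration and is not taken from the paper; the shocks are taken to be i.i.d. Gumbel with unit scale, recentered to mean zero, and the entropy weight is set to one.

```latex
% Assumed notation (not the paper's): reward r, discount \gamma, kernel P,
% i.i.d. unit-scale Gumbel shocks \varepsilon_a recentered to mean zero.
\begin{align*}
Q(s,a) &= r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[ V(s') \big], \\
\text{DDC:}\quad V(s) &= \mathbb{E}_{\varepsilon}\Big[ \max_{a} \{ Q(s,a) + \varepsilon_a \} \Big]
  = \log \sum_{a} \exp Q(s,a), \\
\Pr(a \mid s) &= \frac{\exp Q(s,a)}{\sum_{a'} \exp Q(s,a')}, \\
\text{IRL:}\quad V(s) &= \max_{\pi(\cdot \mid s)} \Big\{ \sum_{a} \pi(a \mid s)\, Q(s,a)
  + \mathcal{H}\big(\pi(\cdot \mid s)\big) \Big\} = \log \sum_{a} \exp Q(s,a), \\
\pi^{*}(a \mid s) &= \frac{\exp Q(s,a)}{\sum_{a'} \exp Q(s,a')}.
\end{align*}
```

Under these assumptions both recursions share the same fixed point and the same choice probabilities. With unrecentered Gumbel shocks the DDC values shift by a constant that is the same for every action, which leaves the choice probabilities unchanged.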

If this is right

Classical identification results from econometrics apply directly to entropy-regularized IRL.
Rust's nested fixed-point algorithm, Hotz-Miller conditional choice probabilities, and temporal-difference methods become available for IRL estimation.
Adversarial IRL, occupancy matching, and IQ-Learn each identify the reward only up to the specific regularizer or matching objective they optimize.
The empirical-risk-minimization framework yields a gradient-based estimator that works for both offline IRL and DDC.
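
To make the estimation routes listed above concrete, here is a minimal sketch, under stated assumptions, of the inner fixed point that a nested fixed-point estimator solves for each candidate reward, together with the log-ratio identity that conditional-choice-probability methods exploit. The toy MDP (`REWARD`, `TRANS`, `GAMMA`) and all function names are hypothetical illustrations, not the paper's code or the original authors' implementations.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(reward, trans, gamma, tol=1e-10, max_iter=10_000):
    """Iterate Q <- r + gamma * E[logsumexp Q(s', .)] until convergence.

    reward[s][a] is the per-period reward; trans[s][a][t] is P(t | s, a).
    """
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(max_iter):
        v = [logsumexp(row) for row in q]
        new_q = [
            [
                reward[s][a]
                + gamma * sum(trans[s][a][t] * v[t] for t in range(n_s))
                for a in range(n_a)
            ]
            for s in range(n_s)
        ]
        diff = max(
            abs(new_q[s][a] - q[s][a]) for s in range(n_s) for a in range(n_a)
        )
        q = new_q
        if diff < tol:
            break
    return q


def choice_probs(q):
    """Softmax choice probabilities implied by the choice-specific values."""
    return [[math.exp(x - logsumexp(row)) for x in row] for row in q]


def log_ratio_gap(q, pi, s, a, ref=0):
    """Difference between log(pi_a / pi_ref) and Q_a - Q_ref (zero in theory)."""
    return math.log(pi[s][a] / pi[s][ref]) - (q[s][a] - q[s][ref])


# Hypothetical two-state, two-action MDP used only for illustration.
REWARD = [[1.0, 0.0], [0.0, 0.5]]
TRANS = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
GAMMA = 0.9

Q = soft_value_iteration(REWARD, TRANS, GAMMA)
PI = choice_probs(Q)

if __name__ == "__main__":
    for s in range(2):
        print(s, [round(p, 4) for p in PI[s]], log_ratio_gap(Q, PI, s, 1))
```

A full nested fixed-point estimator would wrap this solve inside a likelihood maximization over reward parameters, while a CCP-style estimator would instead plug estimated choice probabilities into the log-ratio identity and avoid the repeated solve. Neither outer loop is shown here.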

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unification implies that dimensionality-reduction techniques developed in one literature can be imported to the other without re-deriving the underlying model.
Hybrid estimators could combine econometric moment conditions with scalable gradient methods from machine learning.
Empirical tests could check whether the additive-shock assumption holds in domains where transition kernels are only partially observed.

Load-bearing premise

The observed offline data is generated by an expert optimizing an unknown reward inside a known Markov decision process structure with additive random utility shocks.
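
The additive-shock premise can be illustrated in a single state with a small Monte Carlo check, assuming i.i.d. standard Gumbel shocks: choosing the action with the largest shocked value reproduces softmax choice frequencies. The values in `VALUES`, the seed, and the function names are hypothetical and chosen only for this sketch.

```python
import math
import random


def softmax(xs):
    """Softmax probabilities over a list of floats."""
    m = max(xs)
    weights = [math.exp(x - m) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_gumbel(rng):
    """Draw one standard Gumbel variate by inverse-CDF sampling."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return -math.log(-math.log(u))


def simulate_choice_frequencies(values, n_draws, seed=0):
    """Empirical frequency of argmax_a (values[a] + Gumbel shock)."""
    rng = random.Random(seed)
    counts = [0] * len(values)
    for _ in range(n_draws):
        shocked = [v + sample_gumbel(rng) for v in values]
        counts[shocked.index(max(shocked))] += 1
    return [c / n_draws for c in counts]


# Hypothetical choice-specific values for three actions in one state.
VALUES = [1.0, 0.0, -0.5]
FREQ = simulate_choice_frequencies(VALUES, n_draws=100_000)
PROB = softmax(VALUES)

if __name__ == "__main__":
    print("simulated:", [round(f, 3) for f in FREQ])
    print("softmax:  ", [round(p, 3) for p in PROB])
```

The agreement is only approximate at any finite sample size, and it depends on the Gumbel assumption; other shock distributions would generally imply different choice probabilities.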

What would settle it

A dataset in which observed choices cannot be rationalized by any reward function under the additive random utility model, yet can be explained by a non-equivalent IRL formulation, would falsify the claimed equivalence.

Original abstract

In the forward reinforcement-learning problem, the reward is fixed and known; the learner is asked to find a good policy or value function. Here we turn the question around. Given offline data generated by an expert, can we recover the reward the expert was optimizing? This is the inverse reinforcement learning problem, and remarkably, two communities, structural econometricians studying dynamic discrete choice (DDC) and machine learners studying entropy-regularized IRL, have been working on exactly the same probabilistic model under different names. We begin by proving their equivalence. We then develop the classical identification result of Magnac and Thesmar and the classical computational paradigms that grew out of it: Rust's nested fixed-point algorithm, the conditional-choice-probability approach of Hotz and Miller, and the two temporal-difference approaches of Adusumilli and Eckardt: linear semi-gradient TD and approximate value iteration. Each route has its limits: dimensionality, transition-kernel estimation, the deadly triad, or projected fixed-point bias. We then walk through the modern ML/IRL strand: adversarial IRL, occupancy matching, IQ-Learn, and offline ML-IRL, deriving each method's actual objective and stating precisely what it does and does not identify. We close with the empirical-risk-minimization framework of Kang et al., which yields a gradient-based estimator for offline IRL/DDC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A clear lecture note connecting IRL and DDC models, but the equivalence is already known and the work stays expository.

This note shows that entropy-regularized IRL and dynamic discrete choice models rest on the same random-utility MDP with Gumbel shocks. It proves the equivalence up front, covers Magnac-Thesmar identification, and then derives the objectives for Rust's NFXP, the Hotz-Miller CCP approach, the Adusumilli-Eckardt TD methods, and later ML approaches such as adversarial IRL, IQ-Learn, and occupancy matching. It closes with the Kang et al. ERM estimator.

What it does well is spell out each method's exact loss and what it identifies versus what it does not. The derivations are explicit, the citations to the classical papers are accurate, and the side-by-side treatment makes the overlap obvious. That connective work is useful for anyone moving between the two fields.
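
One well-known reason the "identifies versus does not identify" distinction matters can be sketched numerically: in the entropy-regularized model, adding a potential-based term gamma * E[phi(s')] - phi(s) to the reward leaves the soft-optimal policy unchanged, so behavior alone cannot separate the two rewards without a normalization. The toy MDP, the potential `PHI`, and the function names below are hypothetical; this is an illustration of the general phenomenon, not a restatement of any theorem in the note.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_policy(reward, trans, gamma, n_iter=2000):
    """Soft value iteration followed by a softmax over the resulting Q."""
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(n_iter):
        v = [logsumexp(row) for row in q]
        q = [
            [
                reward[s][a]
                + gamma * sum(trans[s][a][t] * v[t] for t in range(n_s))
                for a in range(n_a)
            ]
            for s in range(n_s)
        ]
    return [[math.exp(x - logsumexp(row)) for x in row] for row in q]


def shaped_reward(reward, trans, gamma, phi):
    """Return r(s, a) + gamma * E[phi(s') | s, a] - phi(s)."""
    n_s = len(reward)
    return [
        [
            reward[s][a]
            + gamma * sum(trans[s][a][t] * phi[t] for t in range(n_s))
            - phi[s]
            for a in range(len(reward[s]))
        ]
        for s in range(n_s)
    ]


# Hypothetical two-state, two-action MDP and potential function.
REWARD = [[1.0, 0.0], [0.0, 0.5]]
TRANS = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
GAMMA = 0.9
PHI = [3.0, -2.0]

REWARD_SHAPED = shaped_reward(REWARD, TRANS, GAMMA, PHI)
PI_ORIGINAL = soft_policy(REWARD, TRANS, GAMMA)
PI_SHAPED = soft_policy(REWARD_SHAPED, TRANS, GAMMA)

if __name__ == "__main__":
    print("original reward:", REWARD, "policy:", PI_ORIGINAL)
    print("shaped reward:  ", REWARD_SHAPED, "policy:", PI_SHAPED)
```

Fixing the reward of a reference action, alongside the discount factor and the shock distribution, is the kind of normalization that removes this ambiguity in the classical econometric treatment.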

The main limitation is that the core equivalence holds under the standard shared assumptions (additive Gumbel shocks, known transitions, infinite horizon) and has been recognized before; the note does not claim or deliver a new identification theorem or algorithm. No new data, experiments, or machine-checked proofs appear. The text is framed as a lecture note, so the derivations are presented for teaching rather than as original claims.

The piece is aimed at readers who already know one side and want a clean map to the other—graduate students or researchers crossing RL and structural econometrics. It is honest about limits like the deadly triad and kernel estimation. I would bring it to a reading group for the derivations, but I would not cite it as a new result. It does not need peer review as a research paper.

Referee Report

0 major / 2 minor

Summary. The manuscript is a lecture note claiming that structural dynamic discrete choice (DDC) models from econometrics (in the Magnac-Thesmar style, with additive Gumbel shocks) and entropy-regularized inverse reinforcement learning (IRL) from machine learning are the same probabilistic model. It proves their equivalence and derives the classical identification result of Magnac and Thesmar. It then walks through the estimators: Rust's nested fixed-point algorithm, Hotz-Miller conditional choice probabilities, the Adusumilli-Eckardt TD methods, adversarial IRL, occupancy matching, IQ-Learn, offline ML-IRL, and the empirical-risk-minimization framework of Kang et al. For each, it states what the method identifies and where it is limited (dimensionality, transition-kernel estimation, the deadly triad, projected fixed-point bias).

Significance. If the derivations hold, the note is a useful bridge between communities that have independently developed the same model, with explicit objective derivations and identification statements providing a clear reference for what each method recovers from offline data. The explicit walk-through of both classical econometric and modern ML estimators, including their limits, is a strength for cross-disciplinary work.

minor comments (2)

[Abstract] The statement that the note 'begins by proving their equivalence' would benefit from a forward reference to the specific section or theorem number containing the proof.
The manuscript should clarify whether the equivalence proof is presented as novel or as a re-derivation under the standard additive Gumbel random utility assumption, to set reader expectations.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and accurate summary of the manuscript, the positive assessment of its contribution as a bridge between the DDC and entropy-regularized IRL literatures, and the recommendation to accept. No major comments were raised.

Circularity Check

1 step flagged

Minor self-citation to author's prior ERM framework; central equivalence proof presented as independent derivation

specific steps

self citation load bearing [Abstract, final sentence]
"We close with the empirical-risk-minimization framework of Kang et al., which yields a gradient-based estimator for offline IRL/DDC."

The paper invokes its own prior work (Kang et al.) to present the final estimator. While this is a self-citation, it occurs only at the close and is not invoked to establish the central equivalence or identification theorems, so the circularity is minor and non-load-bearing for the main claims.

full rationale

The lecture note's core contribution is an explicit proof of equivalence between the Magnac-Thesmar DDC model and entropy-regularized IRL under additive Gumbel shocks. This is described as a fresh proof rather than a reduction to prior fitted quantities. The only self-reference is the closing mention of the Kang et al. ERM framework, which is not used to justify the equivalence or identification results. No equations reduce a claimed prediction to a fitted input by construction, and no uniqueness theorem is imported from overlapping-author citations to force the modeling choice. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The lecture note rests on the standard domain assumption that behavior data arises from reward optimization in an MDP with random utility shocks; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract description.

axioms (1)

domain assumption: The observed data are generated by an expert optimizing a reward in a known MDP with additive random shocks.
This shared probabilistic model is invoked as the basis for proving equivalence between IRL and DDC.

pith-pipeline@v0.9.1-grok · 5775 in / 1305 out tokens · 34183 ms · 2026-06-28T23:31:10.392359+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Temporal-Difference estimation of dynamic discrete choice models

Adusumilli, K. and D. Eckardt (2019). “Temporal-Difference estimation of dynamic discrete choice models”. In: arXiv preprint arXiv:1912.09509. Aguirregabiria, V. and P. Mira (2002). “Swapping the nested fixed point algorithm: A class of estimators for discrete Markov decision models”. In: Econometrica 70.4, pp. 1519–1543. – (2007). “Sequential estimation of ...

work page arXiv 2019
[2]

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

Proceedings of Machine Learning Research. PMLR, pp. 242–252. Antos, A., C. Szepesvari, and R. Munos (2008). “Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path”. In: Machine Learning 71, pp. 89–129. Arcidiacono, P., P. Bayer, J. R. Blevins, and P. B. Ellickson (2016). “Estimation of dyn...

work page arXiv 2008
[3]

A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models

Feng, Y., E. Khmelnitskaya, and D. Nekipelov (2020). “Global concavity and optimization in a class of dynamic discrete choice models”. In: International Conference on Machine Learning. PMLR, pp. 3082–3091. Finn, C., P. Christiano, P. Abbeel, and S. Levine (2016). “A connection between generative adversarial networks, inverse reinforcement learning, and ene...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[4]

Conservative Q-learning for offline reinforcement learning

Kumar, A., A. Zhou, G. Tucker, and S. Levine (2020). “Conservative Q-learning for offline reinforcement learning”. In: Advances in Neural Information Processing Systems 33, pp. 1179–

2020
[5]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Lagoudakis, M. G. and R. Parr (2003). “Least-squares policy iteration”. In: Journal of Machine Learning Research 4.Dec, pp. 1107–1149. Levine, S., A. Kumar, G. Tucker, and J. Fu (2020). “Offline reinforcement learning: Tutorial, review, and perspectives on open problems”. In: arXiv preprint arXiv:2005.01643. Li, Z., T. Xu, Y. Yang, and Z.-Q. Luo (2022). “Re...

work page internal anchor Pith review Pith/arXiv arXiv 2003
[6]

Behavioral Cloning from Observation

Sutton, R. S., C. Szepesvári, and H. R. Maei (2009). “A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation”. In: Advances in Neural Information Processing Systems. Vol. 21, pp. 1609–1616. Tesauro, G. et al. (1995). “Temporal difference learning and TD-Gammon”. In: Communications of the ACM 38.3, pp. 58...

work page internal anchor Pith review Pith/arXiv arXiv 2009
