pith. machine review for the scientific record.

arxiv: 2605.01778 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: unknown

Adversarial Imitation Learning with General Function Approximation: Theoretical Analysis and Practical Algorithms

Ruishuo Chen, Tian Xu, Yang Yu, Yihao Sun, Zexuan Chen, Zhilong Zhang


Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords adversarial imitation learning · general function approximation · provably efficient algorithms · sample complexity · reinforcement learning · policy optimization · imitation learning

The pith

Optimization-based adversarial imitation learning achieves polynomial sample complexity under general function approximation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the OPT-AIL framework for online adversarial imitation learning that handles general function approximation. It develops two concrete methods, model-free OPT-AIL and model-based OPT-AIL, each performing online optimization over rewards together with optimism-regularized optimization over policies. Both variants are shown to learn near-expert policies using only polynomial numbers of expert samples and environment interactions. The approach requires nothing beyond approximate optimization oracles for two objectives, closing the gap between prior theoretical restrictions to tabular or linear cases and practical neural-network implementations. Empirical results indicate that the resulting algorithms outperform earlier deep adversarial imitation learning techniques on several challenging tasks.
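As a concreteness aid, here is a minimal runnable toy in Python that follows the alternating template described above, on a horizon-1 tabular problem. The two approximate oracles are replaced by single gradient-style steps and the optimism term by a count-based bonus; all of it (the environment, step sizes, bonus) is a hypothetical stand-in for illustration, not the paper's algorithm.

```python
# Toy instantiation of the OPT-AIL control flow (sketch, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                                   # states, actions
expert = np.eye(A)[rng.integers(A, size=S)]   # deterministic expert policy

def occupancy(policy):
    # State-action occupancy under a uniform initial state, horizon 1.
    return policy / S

logits = np.zeros((S, A))   # learner policy parameters (softmax)
reward = np.zeros((S, A))   # learned reward, kept in [-1, 1]
counts = np.ones((S, A))    # pseudo visit counts driving the optimism bonus

for t in range(500):
    policy = np.exp(logits)
    policy /= policy.sum(axis=1, keepdims=True)

    # Online reward step: one ascent step on the occupancy-gap objective
    # <r, rho_expert - rho_learner>, projected back into the reward class.
    reward = np.clip(reward + 0.5 * (occupancy(expert) - occupancy(policy)),
                     -1.0, 1.0)

    # Optimism-regularized policy step: improve against the learned reward
    # plus a count-based exploration bonus (a stand-in for the real bonus).
    logits += 0.5 * (reward + 0.1 / np.sqrt(counts))
    counts += policy          # attribute on-policy visitation

gap = np.abs(occupancy(expert) - occupancy(policy)).sum()
print(f"final occupancy gap: {gap:.3f}")      # shrinks toward 0 as the learner imitates
```

The point of the sketch is the control flow: reward and policy are each touched through exactly one approximate optimization per round, which is what makes the framework implementable with generic function classes.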

Core claim

We introduce optimization-based AIL (OPT-AIL), which performs online optimization for reward learning coupled with optimism-regularized optimization for policy learning. Within this framework we develop model-free OPT-AIL and model-based OPT-AIL. Our theoretical analysis demonstrates that both variants achieve polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, they represent the first provably efficient AIL methods under general function approximation.

What carries the argument

The optimization-based AIL (OPT-AIL) framework that couples online reward optimization with optimism-regularized policy optimization and relies on approximate oracles for general function classes.
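Schematically, and with the caveat that the exact losses and the bonus term are our reconstruction from the abstract rather than the paper's definitions, the two coupled objectives can be written as:

```latex
% Online reward step: an online-optimization update on the occupancy-gap
% loss between the learner pi_t and the expert E (reconstruction, not the
% paper's exact loss):
\[
  \ell_t(r) \;=\; \mathbb{E}_{(s,a)\sim\rho^{\pi_t}}\big[r(s,a)\big]
            \;-\; \mathbb{E}_{(s,a)\sim\rho^{E}}\big[r(s,a)\big].
\]
% Optimism-regularized policy step: the policy oracle approximately solves
\[
  \pi_{t+1} \;\approx\; \operatorname*{arg\,max}_{\pi\in\Pi}\;
    \mathbb{E}_{(s,a)\sim\rho^{\pi}}\big[\, r_{t+1}(s,a) + b_t(s,a) \,\big],
\]
% where b_t is an assumed optimism bonus upper-bounding statistical uncertainty.
```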

If this is right

  • Both model-free and model-based OPT-AIL variants achieve polynomial expert sample complexity and interaction complexity.
  • Implementation requires only the approximate optimization of two objectives.
  • The methods are the first provably efficient AIL algorithms that apply to general function approximation rather than tabular or linear cases.
  • Empirical performance exceeds that of prior state-of-the-art deep AIL methods on multiple tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same optimism-regularization pattern may stabilize training when the policy and reward networks are very deep or recurrent.
  • The framework could be reused for other adversarial objectives in reinforcement learning that currently lack general-approximation guarantees.
  • The polynomial degree might be further reduced by tighter oracle assumptions or by combining model-free and model-based elements.

Load-bearing premise

The function approximation class admits efficient approximate optimization oracles for the reward and policy objectives, such that optimism-regularized updates yield polynomial regret bounds without exponential dependence on the approximation error or the dimension of the class.
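One way to read this premise is as an interface contract. The Python sketch below names the two oracles the premise assumes; the class names, signatures, and stated guarantees are illustrative assumptions, not the paper's API.

```python
# Hypothetical interface for the two approximate optimization oracles the
# load-bearing premise assumes (illustrative only, not the paper's API).
from typing import Any, Protocol, Sequence, Tuple

StateAction = Tuple[Any, Any]

class RewardOracle(Protocol):
    def step(self, expert_batch: Sequence[StateAction],
             learner_batch: Sequence[StateAction]) -> Any:
        """Return an updated reward whose cumulative online regret against
        the best fixed reward in the class grows polynomially in the number
        of rounds and the class-complexity measure."""

class PolicyOracle(Protocol):
    def solve(self, reward: Any, bonus: Any, tol: float) -> Any:
        """Return a policy maximizing reward plus optimism bonus up to
        additive error tol, at cost polynomial in 1/tol and in the policy
        class complexity (no exponential blow-up in dimension)."""
```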

What would settle it

An explicit construction of a function class and oracle pair where the interaction complexity required to reach near-expert performance grows super-polynomially with the approximation error or with the dimension of the class.

Figures

Figures reproduced from arXiv: 2605.01778 by Ruishuo Chen, Tian Xu, Yang Yu, Yihao Sun, Zexuan Chen, Zhilong Zhang.

Figure 1. Learning curves on 8 DMControl tasks over 5 random seeds using 1 expert trajectory.
Figure 2. Learning curves on 8 DMControl tasks over 5 random seeds using 1 expert trajectory.
Figure 3. Learning curves on 8 DMControl tasks over 5 random seeds using 4 expert trajectories.
Figure 4. Learning curves on 8 DMControl tasks over 5 random seeds using 7 expert trajectories.
Figure 5. Learning curves on 8 DMControl tasks over 5 random seeds using 10 expert trajectories.
Original abstract

Adversarial imitation learning (AIL), a prominent approach in imitation learning, has achieved significant practical success powered by neural network approximation. However, existing theoretical analyses of AIL are primarily confined to simplified settings, such as tabular and linear function approximation, and involve complex algorithmic designs that impede practical implementation. This creates a substantial gap between theory and practice. This paper bridges this gap by exploring the theoretical underpinnings of online AIL with general function approximation. We introduce a novel framework called optimization-based AIL (OPT-AIL), which performs online optimization for reward learning coupled with optimism-regularized optimization for policy learning. Within this framework, we develop two concrete methods: model-free OPT-AIL and model-based OPT-AIL. Our theoretical analysis demonstrates that both variants achieve polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, they represent the first provably efficient AIL methods under general function approximation. From a practical standpoint, OPT-AIL requires only the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods across several challenging tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes the OPT-AIL framework for adversarial imitation learning under general function approximation (FA). It combines online optimization for reward learning with optimism-regularized optimization for policy learning, yielding two concrete algorithms (model-free OPT-AIL and model-based OPT-AIL). The central claim is that both variants achieve polynomial expert-sample and interaction complexity for near-expert policies, constituting the first provably efficient AIL methods under general FA; the methods require only approximate optimization of two objectives and empirically outperform prior deep AIL baselines.

Significance. If the polynomial complexity guarantees hold without hidden exponential dependence on approximation error or covering numbers, the work would be significant: it would supply the first non-tabular, non-linear guarantees for AIL and directly address the theory-practice gap by reducing implementation to two optimization oracles. The empirical results on challenging tasks provide supporting evidence of practicality, though the theoretical contribution hinges on the oracle assumptions.

major comments (2)
  1. [Theoretical Analysis (main regret theorems)] The polynomial expert-sample and interaction complexity claims rest on the existence of approximate optimization oracles for the reward and policy objectives whose regret scales polynomially in the approximation error and function-class complexity. No explicit construction, regret bound, or reduction is supplied showing that such oracles can be realized for the specific non-convex objectives arising in the OPT-AIL updates; in general function approximation this leaves open the possibility of exponential dependence on 1/ε or on the covering number, rendering the overall polynomial bounds conditional rather than unconditional.
  2. [Policy optimization step and regret decomposition] The optimism-regularized policy update is presented as yielding polynomial regret under general FA, yet the analysis does not quantify how the optimism term interacts with the function-approximation error of the learned reward; without an explicit error-propagation lemma relating the two, it is unclear whether the claimed polynomial interaction complexity survives when the reward approximator is only ε-accurate.
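For orientation, the error-propagation bookkeeping the second major comment asks for typically takes the following form. This is a generic decomposition under the assumption of a uniformly ε-accurate reward, not the paper's lemma.

```latex
% Generic error propagation under a uniformly eps-accurate reward (standard
% bookkeeping, not the paper's lemma). If |r(s,a) - \hat r(s,a)| \le \varepsilon
% for all (s,a) and the horizon is H, then for every policy \pi
\[
  \big| J(\pi; r) - J(\pi; \hat r) \big| \;\le\; H\varepsilon .
\]
% Consequently, if \hat\pi is \delta-optimal under the learned reward \hat r,
\[
  J(\pi^\star; r) - J(\hat\pi; r)
  \;\le\; \underbrace{J(\pi^\star; \hat r) - J(\hat\pi; \hat r)}_{\le\,\delta}
  \;+\; 2H\varepsilon ,
\]
% so the claimed polynomial interaction complexity survives exactly when the
% optimism term keeps \delta polynomial while \varepsilon is driven down.
```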
minor comments (2)
  1. [Preliminaries] Notation for the function class and the approximate optimization oracles should be introduced with explicit dependence on the covering number or pseudo-dimension to make the polynomial bounds transparent.
  2. [Experiments] The empirical section would benefit from reporting the number of optimization oracle calls per iteration and wall-clock time, to substantiate the claim that only two approximate optimizations are needed in practice.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments and the recognition of our work's potential significance in bridging theory and practice for adversarial imitation learning under general function approximation. We address the major comments below.

Point-by-point responses
  1. Referee: The polynomial expert-sample and interaction complexity claims rest on the existence of approximate optimization oracles for the reward and policy objectives whose regret scales polynomially in the approximation error and function-class complexity. No explicit construction, regret bound, or reduction is supplied showing that such oracles can be realized for the specific non-convex objectives arising in the OPT-AIL updates; in general function approximation this leaves open the possibility of exponential dependence on 1/ε or on the covering number, rendering the overall polynomial bounds conditional rather than unconditional.

    Authors: We agree that our theoretical results are conditional on the existence of approximate optimization oracles that achieve regret scaling polynomially with the approximation error and function class complexity. The paper's main contribution is to show that, under such oracle assumptions, the OPT-AIL framework yields polynomial expert sample and interaction complexities for near-expert policies. We do not claim to construct such oracles for arbitrary non-convex objectives, as this would require advances in non-convex optimization that are beyond the scope of this work. Similar oracle-based analyses are standard in the literature on RL with general function approximation (e.g., in works on model-free RL). We will add a clearer statement in the introduction and conclusion to emphasize that the polynomial bounds hold assuming the oracles satisfy the polynomial regret property, to avoid any misunderstanding. revision: partial

  2. Referee: The optimism-regularized policy update is presented as yielding polynomial regret under general FA, yet the analysis does not quantify how the optimism term interacts with the function-approximation error of the learned reward; without an explicit error-propagation lemma relating the two, it is unclear whether the claimed polynomial interaction complexity survives when the reward approximator is only ε-accurate.

    Authors: We thank the referee for this observation. Our analysis in the main text and appendix decomposes the policy regret but does not include a standalone lemma explicitly relating the optimism term to the reward approximation error ε. We agree that such a lemma would strengthen the paper and clarify the conditions under which the polynomial bound holds, and we will add it in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation independent under oracle assumptions

Full rationale

The paper introduces the OPT-AIL framework performing online optimization for reward learning and optimism-regularized optimization for policy learning, then derives polynomial expert-sample and interaction complexity bounds for both model-free and model-based variants under general function approximation. These bounds are obtained from the stated algorithmic structure and the assumption that the function class admits efficient approximate optimization oracles yielding polynomial regret. No step reduces a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose content is unverified. The analysis supplies independent content conditional on the oracle primitive; the oracle assumption itself is an external modeling choice rather than a definitional loop. This is the normal case of a self-contained theoretical construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of approximate optimization oracles for the reward and policy objectives and on the function approximation class permitting polynomial regret bounds; these are domain assumptions not supplied with independent evidence in the abstract.

axioms (1)
  • domain assumption: General function approximation admits efficient approximate optimization oracles that yield polynomial regret when combined with optimism regularization.
    Invoked to obtain the stated polynomial sample and interaction complexity.
invented entities (1)
  • OPT-AIL framework (no independent evidence)
    purpose: Couples online reward optimization with optimism-regularized policy optimization to achieve provable efficiency.
    New framework introduced to bridge theory and practice.

pith-pipeline@v0.9.0 · 5518 in / 1179 out tokens · 44563 ms · 2026-05-10T15:18:44.551830+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 12 canonical work pages · 3 internal anchors

  1. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
  2. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  3. M. Janner, J. Fu, M. Zhang, and S. Levine, "When to trust your model: Model-based policy optimization," in Advances in Neural Information Processing Systems 32, 2019, pp. 12498–12509.
  4. X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, "Generative adversarial user model for reinforcement learning based recommendation system," in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 1052–1061.
  5. J. Shi, Y. Yu, Q. Da, S. Chen, and A. Zeng, "Virtual-Taobao: virtualizing real-world online retail environment for reinforcement learning," in Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 4902–4909.
  6. M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, "OpenVLA: An open-source vision-language-action model," in Conference on Robot Learning, 2024.
  7. K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., "π0: A vision-language-action flow model for general robot control," arXiv, vol. 2410.24164, 2024.
  8. D. Pomerleau, "Efficient training of artificial neural networks for autonomous navigation," Neural Computation, vol. 3, no. 1, pp. 88–97, 1991.
  9. S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
  10. K. Brantley, W. Sun, and M. Henaff, "Disagreement-regularized imitation learning," in Proceedings of the 8th International Conference on Learning Representations, 2020.
  11. F. Torabi, G. Warnell, and P. Stone, "Behavioral cloning from observation," in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4950–4957.
  12. I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, "Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning," in Proceedings of the 7th International Conference on Learning Representations, 2019.
  13. S. Jiang, J. Pang, and Y. Yu, "Offline imitation learning with a misspecified simulator," in Advances in Neural Information Processing Systems 33, 2020.
  14. L. Ke, M. Barnes, W. Sun, G. Lee, S. Choudhury, and S. S. Srinivasa, "Imitation learning as f-divergence minimization," arXiv, vol. 1905.12888, 2019.
  15. S. K. S. Ghasemipour, R. S. Zemel, and S. Gu, "A divergence minimization perspective on imitation learning methods," in Proceedings of the 3rd Annual Conference on Robot Learning, 2019, pp. 1259–1277.
  16. D. Garg, S. Chakraborty, C. Cundy, J. Song, and S. Ermon, "IQ-Learn: Inverse soft-Q learning for imitation," in Advances in Neural Information Processing Systems 34, 2021, pp. 4028–4039.
  17. F.-M. Luo, X. Cao, R.-J. Qin, and Y. Yu, "Transferable reward learning by dynamics-agnostic discriminator ensemble," arXiv, vol. 2206.00238, 2022.
  18. Z. Li, T. Xu, Z. Qin, Y. Yu, and Z.-Q. Luo, "Imitation learning from imperfection: Theoretical justifications and algorithms," in Advances in Neural Information Processing Systems 37, 2023.
  19. D. J. Foster, A. Block, and D. Misra, "Is behavior cloning all you need? Understanding horizon in imitation learning," arXiv, vol. 2407.15007, 2024.
  20. L. Shani, T. Zahavy, and S. Mannor, "Online apprenticeship learning," arXiv, vol. 2102.06924, 2021.
  21. T. Xu, Z. Li, Y. Yu, and Z.-Q. Luo, "Provably efficient adversarial imitation learning with unknown transitions," in Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence, 2023, pp. 2367–2378.
  22. Z. Liu, Y. Zhang, Z. Fu, Z. Yang, and Z. Wang, "Provably efficient generative adversarial imitation learning for online and offline setting with linear function approximation," arXiv, vol. 2108.08765, 2021.
  23. L. Viano, S. Skoulakis, and V. Cevher, "Imitation learning in discounted linear MDPs without exploration assumptions," in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 49471–49505.
  24. T. Xu, Z. Li, and Y. Yu, "Error bounds of imitating policies and environments," in Advances in Neural Information Processing Systems 33, 2020, pp. 15737–15749.
  25. T. Xu, Z. Li, Y. Yu, and Z.-Q. Luo, "Understanding adversarial imitation learning in small sample regime: A stage-coupled analysis," arXiv, vol. 2208.01899, 2022.
  26. N. Rajaraman, L. F. Yang, J. Jiao, and K. Ramchandran, "Toward the fundamental limits of imitation learning," in Advances in Neural Information Processing Systems 33, 2020, pp. 2914–2924.
  27. T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, P. Liu, and Z. Wang, "Exploration in deep reinforcement learning: a comprehensive survey," arXiv, vol. 2109.06668, 2021.
  28. D. Tiapkin, D. Belomestny, É. Moulines, A. Naumov, S. Samsonov, Y. Tang, M. Valko, and P. Ménard, "From Dirichlet to Rubin: Optimistic exploration in RL without bonuses," in Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 21380–21431.
  29. Z. Liu, M. Lu, W. Xiong, H. Zhong, H. Hu, S. Zhang, S. Zheng, Z. Yang, and Z. Wang, "Maximize to explore: One objective function fusing estimation, planning, and exploration," in Advances in Neural Information Processing Systems 36, 2024.
  30. H. Zhong, W. Xiong, S. Zheng, L. Wang, Z. Wang, Z. Yang, and T. Zhang, "A posterior sampling framework for interactive decision making," arXiv, vol. 2211.01962, 2022.
  31. P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 1–8.
  32. U. Syed and R. E. Schapire, "A game-theoretic approach to apprenticeship learning," in Advances in Neural Information Processing Systems 20, 2007, pp. 1449–1456.
  33. W. Sun, A. Vemula, B. Boots, and D. Bagnell, "Provably efficient imitation learning from observation alone," in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 6036–6045.
  34. N. Rajaraman, Y. Han, L. Yang, J. Liu, J. Jiao, and K. Ramchandran, "On the value of interaction and function approximation in imitation learning," in Advances in Neural Information Processing Systems 34, 2021, pp. 1325–1336.
  35. G. Swamy, N. Rajaraman, M. Peng, S. Choudhury, J. Bagnell, S. Z. Wu, J. Jiao, and K. Ramchandran, "Minimax optimal online imitation learning via replay estimation," in Advances in Neural Information Processing Systems 35, 2022, pp. 7077–7088.
  36. L. Viano, A. Kamoutsi, G. Neu, I. Krawczuk, and V. Cevher, "Proximal point imitation learning," in Advances in Neural Information Processing Systems 35, 2022, pp. 24309–24326.
  37. T. Xu, Z. Li, and Y. Yu, "Error bounds of imitating policies and environments for reinforcement learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  38. J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems 29, 2016, pp. 4565–4573.
  39. I. Kostrikov, O. Nachum, and J. Tompson, "Imitation learning via off-policy distribution matching," in Proceedings of the 8th International Conference on Learning Representations, 2020.
  40. N. Baram, O. Anschel, I. Caspi, and S. Mannor, "End-to-end differentiable adversarial imitation learning," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 390–399.
  41. V. Kolev, R. Rafailov, K. Hatch, J. Wu, and C. Finn, "Efficient imitation learning with conservative world models," in 6th Annual Learning for Dynamics & Control Conference, 2024, pp. 1777–1790.
  42. J. Ren, G. Swamy, Z. S. Wu, J. A. Bagnell, and S. Choudhury, "Hybrid inverse reinforcement learning," in Proceedings of the 41st International Conference on Machine Learning, 2024.
  43. I. Osband and B. V. Roy, "Model-based reinforcement learning and the eluder dimension," in Advances in Neural Information Processing Systems 27, 2014, pp. 1466–1474.
  44. C. Jin, Q. Liu, and S. Miryoosefi, "Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms," in Advances in Neural Information Processing Systems 34, 2021, pp. 13406–13418.
  45. S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010, pp. 661–668.
  46. M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.
  47. E. Hazan, "Introduction to online convex optimization," Foundations and Trends in Optimization, vol. 2, no. 3–4, pp. 157–325, 2016.
  48. A. S. Suggala and P. Netrapalli, "Online non-convex learning: Following the perturbed leader is optimal," in Proceedings of the 31st International Conference on Algorithmic Learning Theory, 2020, pp. 845–861.
  49. A. Antos, C. Szepesvári, and R. Munos, "Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path," Machine Learning, vol. 71, pp. 89–129, 2008.
  50. C. Jin, Z. Yang, Z. Wang, and M. I. Jordan, "Provably efficient reinforcement learning with linear function approximation," in Proceedings of the 33rd Annual Conference on Learning Theory, 2020, pp. 2137–2143.
  51. M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 214–223.
  52. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 1856–1865.
  53. C.-A. Cheng, T. Xie, N. Jiang, and A. Agarwal, "Adversarially trained actor critic for offline reinforcement learning," in Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 3852–3878.
  54. M. Bhardwaj, T. Xie, B. Boots, N. Jiang, and C.-A. Cheng, "Adversarial model for offline reinforcement learning," in Advances in Neural Information Processing Systems, vol. 36, 2024.
  55. Z. Li, T. Xu, and Y. Yu, "A note on target Q-learning for solving finite MDPs with a generative oracle," arXiv, vol. 2203.11489, 2022.
  56. M. Rigter, B. Lacerda, and N. Hawes, "RAMBO-RL: Robust adversarial model-based offline reinforcement learning," in Advances in Neural Information Processing Systems 35, 2022, pp. 16082–16097.
  57. Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., "DeepMind Control Suite," arXiv, vol. 1801.00690, 2018.
  58. D. Yarats, R. Fergus, A. Lazaric, and L. Pinto, "Mastering visual continuous control: Improved data-augmented reinforcement learning," in International Conference on Learning Representations, 2021.
  59. G. Swamy, D. Wu, S. Choudhury, D. Bagnell, and S. Wu, "Inverse reinforcement learning without reinforcement learning," in Proceedings of the 40th International Conference on Machine Learning, 2023.
  60. A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1638–1646.
  61. D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin, "The statistical complexity of interactive decision making," arXiv, vol. 2112.13487, 2021.
  62. A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-learning for offline reinforcement learning," in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1179–1191.