Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

Claire Vernade; Luca Viano; Volkan Cevher; Ziyad Sheebaelhamd

arxiv: 2605.12000 · v2 · pith:3IIHZ4GZnew · submitted 2026-05-12 · 💻 cs.LG

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

Ziyad Sheebaelhamd , Luca Viano , Volkan Cevher , Claire Vernade This is my paper

Pith reviewed 2026-05-20 22:36 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-objective imitation learningPareto-optimal policiesbehavioral cloningMOMDPmulti-objective MDPimitation learningPareto front

0 comments

The pith

MA-BC recovers Pareto-optimal policies faster than independent learners by splitting conflicting expert data and pooling the rest in multi-objective MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates recovering policies on the Pareto front from demonstrations by multiple Pareto-optimal experts in a multi-objective Markov decision process. Standard imitation methods produce dominated policies when trajectories conflict across objectives. MA-BC systematically partitions the state-action space into conflicting and non-conflicting regions, splitting the former and pooling the latter. It proves this yields a faster statistical convergence rate to Pareto-optimal policies than any method that processes each expert dataset in isolation. A matching lower bound establishes that the algorithm is minimax optimal, with supporting experiments in discrete environments and a continuous LQR task.

Core claim

Multi-Output Augmented Behavioral Cloning (MA-BC) systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed, converging to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently and achieving minimax optimality via a novel lower bound for multi-objective imitation learning.

What carries the argument

MA-BC algorithm that identifies regions of behavior conflict to split data and pools non-conflicting state-action pairs for joint learning across experts.

Load-bearing premise

The expert demonstrations come from Pareto-optimal policies in the MOMDP and identifiable regions exist where expert behaviors do not conflict.

What would settle it

In a MOMDP with known Pareto front, compare suboptimality gaps of MA-BC versus independent learners as sample size grows; if gaps do not decrease at the faster claimed rate or if non-conflicting regions cannot be identified from the data, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.12000 by Claire Vernade, Luca Viano, Volkan Cevher, Ziyad Sheebaelhamd.

**Figure 1.** Figure 1: (Left) Geometry of the MOMDP Return Space. The space of expected returns J is a convex polytope [26]. The Pareto front (bold blue line) consists of boundary faces whose outward normal vectors w lie entirely within the positive orthant R d ≥0 . Boundary faces with outward normal vectors containing negative components represent Pareto-dominated policies. All vertex policies are deterministic. (Right) Naive B… view at source ↗

**Figure 2.** Figure 2: Sample efficiency and suboptimality gaps across all three discrete multi-objective environments. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 2.** Figure 2: Visualization of Theorem 2.1 (Pareto Path Existence). Let two neighboring Pareto-optimal policies, πA and πB, differ from each other at exactly k = 4 states. As established by the theorem, we can find a sequence of intermediate policies that are a 1-state action flip from each other and are also guaranteed to be on the Pareto front. PF-policies, each differing by only a single state action. Theorem 2.1. [P… view at source ↗

**Figure 3.** Figure 3: Evaluation of dataset concentrability on MA-BC performance. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 3.** Figure 3: Sample efficiency and suboptimality gaps across all three discrete multi-objective environments. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Continuous state-action imitation results for the LQR drone control task. The plots report the [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 4.** Figure 4: Evaluation of dataset concentrability on MA-BC performance. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Lower bound construction. For the hard MDP see fig. [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗

**Figure 5.** Figure 5: Continuous state-action imitation results for the LQR drone control task. The plots report the [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Hard MDP from [38] with n states. The learner must identify the expert action a ∗ in states rarely visited (s2 . . . sn). Otherwise, it falls in the red absorbing state xend. Then, we have that for all i ∈ [d], we have that Ji(π ⋆ ) − Ji(ˆπ1) ≥ Ji(π1) − Ji(˜π1) = (1 − γ) −1 X x∈X ν π ⋆ (x) [PITH_FULL_IMAGE:figures/full_fig_p023_6.png] view at source ↗

**Figure 6.** Figure 6: Lower bound construction. For the hard MDP see fig. [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Action Consistency Graph on Xdiv. The nodes represent the conflicting state-action pairs observed at the divergent states. Edges connect actions that were observed co-occurring within the same trajectory. The graph partitions the conflicting actions into consistent, expert-specific datasets (Ddiv,ℓ), successfully binding the latent experts’ actions on the divergent set of states. 28 [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 7.** Figure 7: Hard MDP from [38] with n states. The learner must identify the expert action a ∗ in states rarely visited (s2 . . . sn). Otherwise, it falls in the red absorbing state xend. Then, we have that for all i ∈ [d], we have that Ji(π ⋆ ) − Ji(ˆπ1) ≥ Ji(π1) − Ji(˜π1) = (1 − γ) −1 X x∈X ν π ⋆ (x) [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of the discrete multi-objective environments used for empirical evaluation. [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗

**Figure 8.** Figure 8: Action Consistency Graph on Xdiv. The nodes represent the conflicting state-action pairs observed at the divergent states. Edges connect actions that were observed co-occurring within the same trajectory. The graph partitions the conflicting actions into consistent, expert-specific datasets (Ddiv,ℓ), successfully binding the latent experts’ actions on the divergent set of states. 28 [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 9.** Figure 9: The DST Return Polytope. The exact continuous Pareto frontier is bounded by deterministic optimal policies (green vertices). Aliased policies (where multiple distinct optimal policies yield the exact same expected return) are highlighted with purple circles. Imitation learning (Failure II) We now turn our attention to the imitation learning setting. We select two extreme experts: a time-optimizing expert (… view at source ↗

**Figure 9.** Figure 9: Overview of the discrete multi-objective environments used for empirical evaluation. [PITH_FULL_IMAGE:figures/full_fig_p032_9.png] view at source ↗

**Figure 10.** Figure 10: The Geometric Failure of Naive BC (Failure II). We collect data from two extreme experts, highlighted using the blue and red stars. Naive BC aggregates all of this data into a single joint dataset. Because the experts fundamentally disagree at bottleneck states, the resulting Maximum Likelihood Estimate policy diffuses its probability mass destructively, causing its expected return to sag deeply into the … view at source ↗

**Figure 10.** Figure 10: The DST Return Polytope. The exact continuous Pareto frontier is bounded by deterministic optimal policies (green vertices). Aliased policies (where multiple distinct optimal policies yield the exact same expected return) are highlighted with purple circles. Imitation learning (Failure II) We now turn our attention to the imitation learning setting. We select two extreme experts: a time-optimizing expert … view at source ↗

**Figure 11.** Figure 11: Failure II: Naive BC vs. Closest Optimal Mixture. (Left) The policy learned by Naive BC is overly stochastic at conflicting states (red cells). (Right) The closest theoretically optimal policy on the Pareto front. This policy achieves continuous returns by randomizing at only a single critical state, properly trading off between two adjacent treasures without diffusing entropy throughout the trajectory. I… view at source ↗

**Figure 11.** Figure 11: The Geometric Failure of Naive BC (Failure II). We collect data from two extreme experts, highlighted using the blue and red stars. Naive BC aggregates all of this data into a single joint dataset. Because the experts fundamentally disagree at bottleneck states, the resulting Maximum Likelihood Estimate policy diffuses its probability mass destructively, causing its expected return to sag deeply into the … view at source ↗

**Figure 12.** Figure 12: Failure I: The Failure of Isolated BC (N1 = N2 = 5). We depict the time-optimizing policy learned via Isolated BC (Left) versus MA-BC (Right) using only 5 trajectories. MA-BC aggressively pools non-divergent actions from the treasure-optimizing expert to fill in unvisited states. While largely helpful, MA-BC introduces temporary bias in the low-data regime (e.g., the actions above cells 16, 19, 20, and 22… view at source ↗

**Figure 13.** Figure 13: MA-BC Self-Correction (N1 = N2 = 50). As the sample size increases to 50 trajectories per expert, the bias previously introduced by MA-BC ( [PITH_FULL_IMAGE:figures/full_fig_p039_13.png] view at source ↗

**Figure 13.** Figure 13: Failure I: The Failure of Isolated BC (N1 = N2 = 5). We depict the time-optimizing policy learned via Isolated BC (Left) versus MA-BC (Right) using only 5 trajectories. MA-BC aggressively pools non-divergent actions from the treasure-optimizing expert to fill in unvisited states. While largely helpful, MA-BC introduces temporary bias in the low-data regime (e.g., the actions above cells 16, 19, 20, and 22… view at source ↗

**Figure 14.** Figure 14: MA-BC Self-Correction (N1 = N2 = 50). As the sample size increases to 50 trajectories per expert, the bias previously introduced by MA-BC ( [PITH_FULL_IMAGE:figures/full_fig_p039_14.png] view at source ↗

read the original abstract

This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MA-BC partitions conflicting expert data and pools the rest to get faster rates than independent behavioral cloning, plus a matching lower bound, but the finite-sample error in that partition step is the part that needs checking.

read the letter

The core idea is straightforward: when experts disagree on what to do in some states, treat those regions separately; where they agree, pool the data. MA-BC does this split-and-pool step and the paper claims it reaches Pareto-optimal policies at a better statistical rate than running behavioral cloning on each expert dataset alone. They also give a new minimax lower bound that matches their upper bound, which is the clearest sign of novelty here.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Multi-Output Augmented Behavioral Cloning (MA-BC) for multi-objective imitation learning in Multi-Objective Markov Decision Processes (MOMDPs). Given demonstrations from multiple Pareto-optimal experts, the algorithm partitions state-action pairs into conflicting versus non-conflicting regions and pools the latter while handling divergence separately. It claims that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner treating each expert dataset independently, establishes a matching minimax lower bound for the multi-objective imitation setting, and provides empirical validation on discrete environments plus an LQR control task.

Significance. If the central claims hold, the work would be significant as a provably efficient approach to multi-objective imitation learning that exploits non-conflicting regions to improve sample complexity over naive aggregation or per-expert baselines. The novel lower bound and the extension of the partitioning idea to continuous LQR tasks add value; the paper supplies a theoretical analysis together with a matching optimality result.

major comments (2)

[Abstract and theoretical analysis] Abstract and theoretical analysis section: the faster statistical rate for MA-BC relative to independent per-expert BC is stated to hold because the data-driven partition incurs negligible error. However, when conflict detection is performed via finite-sample estimates of behavior divergence (e.g., empirical occupancy or likelihood ratios), the misclassification probability scales with state-space size or number of experts; this term is not shown to be absorbed into the improved rate without additional assumptions on the MOMDP or expert policies. An explicit high-probability bound on the partition error that preserves the rate gain is needed to support the central claim.
[Lower-bound section] Lower-bound section: the minimax lower bound appears to be derived under an oracle partitioning assumption. It is unclear whether the same lower bound continues to hold when the learner must discover the conflicting regions from finite data; a clarification or extension of the lower-bound argument to the data-driven case would strengthen the optimality result.

minor comments (2)

[Empirical evaluation] The description of how the continuous-state partitioning is implemented for the LQR task would benefit from an explicit statement of the divergence estimator used and any discretization or kernel choices.
[Preliminaries] Notation for the Pareto front and the conflict indicator could be introduced earlier and used consistently to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. Below we respond point by point to the major comments and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract and theoretical analysis] Abstract and theoretical analysis section: the faster statistical rate for MA-BC relative to independent per-expert BC is stated to hold because the data-driven partition incurs negligible error. However, when conflict detection is performed via finite-sample estimates of behavior divergence (e.g., empirical occupancy or likelihood ratios), the misclassification probability scales with state-space size or number of experts; this term is not shown to be absorbed into the improved rate without additional assumptions on the MOMDP or expert policies. An explicit high-probability bound on the partition error that preserves the rate gain is needed to support the central claim.

Authors: We agree that an explicit high-probability bound on the partition error is needed to fully support the faster rate claim. In the revised manuscript we will add a lemma that bounds the misclassification probability of the data-driven conflict detector using standard concentration inequalities on empirical occupancies. Under the finite-state-space assumption already present in the paper, this error term is of strictly lower order than the leading statistical rate and is absorbed without requiring further assumptions on the MOMDP or expert policies. The updated analysis will appear in the theoretical section. revision: yes
Referee: [Lower-bound section] Lower-bound section: the minimax lower bound appears to be derived under an oracle partitioning assumption. It is unclear whether the same lower bound continues to hold when the learner must discover the conflicting regions from finite data; a clarification or extension of the lower-bound argument to the data-driven case would strengthen the optimality result.

Authors: The lower bound is stated as a minimax result over all learners for the multi-objective imitation problem. To remove any ambiguity we will revise the lower-bound section to explicitly incorporate the cost of learning the partition from finite samples. The argument will show that the additional uncertainty from data-driven discovery does not improve the minimax rate beyond what MA-BC achieves, thereby confirming optimality in the fully data-driven setting. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no load-bearing reductions to inputs or self-citations

full rationale

The paper defines MA-BC via explicit partitioning of conflicting vs. non-conflicting state-action pairs, then derives convergence rates and a minimax lower bound through standard finite-sample analysis of the resulting estimator in MOMDPs. The abstract and described claims contain no equations or steps where a 'prediction' is obtained by fitting a parameter to the target quantity itself, nor does the central optimality result rest on a self-citation whose content is unverified or defined circularly. The faster rate is obtained by showing that the pooling step reduces variance without introducing bias under the stated identifiability assumption, which is external to the algorithm output. This is the normal case of an independent theoretical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard MOMDP assumptions and the premise that experts are Pareto-optimal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Expert demonstrations are generated by Pareto-optimal policies in a Multi-Objective Markov Decision Process.
Stated as the problem setup for recovering policies on the Pareto front.

pith-pipeline@v0.9.0 · 5705 in / 1125 out tokens · 28574 ms · 2026-05-20T22:36:11.584473+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.1 and Corollary 3.1: εℓ bound involving |Xcommon|/N and |X∖Xcommon|/Nℓ with concentrability factor
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2.1 (Pareto Path Existence) via single-state action flips on neighboring PF-policies

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

[1]

Abbeel, D

P. Abbeel, D. Dolgov, A. Y. Ng, and S. Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008

work page 2008
[2]

Abbeel and A

P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), 2004

work page 2004
[3]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024
[4]

A data-driven dynamic obstacle avoidance method for liquid-carrying plant protection uavs.Agronomy, 12(4), 2022

Shibbir Ahmed, Baijing Qiu, Chun-Wei Kong, Huang Xin, Fiaz Ahmad, and Jinlong Lin. A data-driven dynamic obstacle avoidance method for liquid-carrying plant protection uavs.Agronomy, 12(4), 2022

work page 2022
[5]

Learning all optimal policies with multiple criteria

Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. InProceedings of the 25th International Conference on Machine Learning, ICML ’08, page 41–47, New York, NY, USA,

work page
[6]

Association for Computing Machinery

work page
[7]

Relative entropy inverse reinforcement learning

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 182–189. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[8]

Diffusion policy: Visuomotor policy learning via action diffusion, 2024

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024

work page 2024
[9]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback.arXiv preprint arXiv:2310.01377, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[11]

Is behavior cloning all you need? understanding horizon in imitation learning.arXiv preprint arXiv:2407.15007, 2024

Dylan J Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? understanding horizon in imitation learning.arXiv preprint arXiv:2407.15007, 2024

work page arXiv 2024
[12]

Select to perfect: Imitating desired behavior from large multi-agent data.arXiv preprint arXiv:2405.03735, 2024

Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, and Joao Henriques. Select to perfect: Imitating desired behavior from large multi-agent data.arXiv preprint arXiv:2405.03735, 2024

work page arXiv 2024
[13]

Learning equilibria from data: Provably efficient multi-agent imitation learning, 2025

Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Learning equilibria from data: Provably efficient multi-agent imitation learning, 2025. 11

work page 2025
[14]

Rate optimal learning of equilibria from data.arXiv preprint arXiv:2510.09325, 2025

Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Rate optimal learning of equilibria from data.arXiv preprint arXiv:2510.09325, 2025

work page arXiv 2025
[15]

Characterization of optimal policies in vector-valued markovian decision processes

Nagata Furukawa. Characterization of optimal policies in vector-valued markovian decision processes. Mathematics of operations research, 5(2):271–279, 1980

work page 1980
[16]

Ho and S

J. Ho and S. Ermon. Generative adversarial imitation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2016

work page 2016
[17]

J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. InInternational Conference on Machine Learning (ICML), 2016

work page 2016
[18]

Multi-objective lqr with linear scalarization.arXiv preprint arXiv:2408.04488, 2024

Ali Jadbabaie, Devavrat Shah, and Sean R Sinclair. Multi-objective lqr with linear scalarization.arXiv preprint arXiv:2408.04488, 2024

work page arXiv 2024
[19]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024

work page 2024
[21]

Pareto inverse reinforcement learning for diverse expert policy generation.arXiv preprint arXiv:2408.12110, 2024

Woo Kyung Kim, Minjong Yoo, and Honguk Woo. Pareto inverse reinforcement learning for diverse expert policy generation.arXiv preprint arXiv:2408.12110, 2024

work page arXiv 2024
[22]

Pretraining language models with human preferences

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International conference on machine learning, pages 17506–17533. PMLR, 2023

work page 2023
[23]

Coordinated multi-agent imitation learning

Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. InInternational Conference on Machine Learning, pages 1995–2003. PMLR, 2017

work page 1995
[24]

Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. InMulti-modal Foundation Model meets Embodied AI Workshop @ ICML2024, 2024

work page 2024
[25]

Learning strategy representation for imitation learning in multi-agent games

Shiqi Lei, Kanghoon Lee, Linjing Li, and Jinkyoo Park. Learning strategy representation for imitation learning in multi-agent games. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18163–18171, 2025

work page 2025
[26]

How to find the exact pareto front for multi-objective mdps? arXiv preprint arXiv:2410.15557, 2024

Yining Li, Peizhong Ju, and Ness B Shroff. How to find the exact pareto front for multi-objective mdps? arXiv preprint arXiv:2410.15557, 2024

work page arXiv 2024
[27]

Multi-objective reinforcement learning: Convexity, stationarity and pareto optimality

Haoye Lu, Daniel Herman, and Yaoliang Yu. Multi-objective reinforcement learning: Convexity, stationarity and pareto optimality. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[28]

Molter.Classic graph problems made temporal – a parameterized complexity analysis

H. Molter.Classic graph problems made temporal – a parameterized complexity analysis. Foundations of Computing. Universitätsverlag der TU Berlin, 2020

work page 2020
[29]

Inverse q-learning done right: Offline imitation learning inq π-realizable mdps, 2025

Antoine Moulin, Gergely Neu, and Luca Viano. Inverse q-learning done right: Offline imitation learning inq π-realizable mdps, 2025

work page 2025
[30]

Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900, 2025

work page arXiv 2025
[31]

A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), 2000

work page 2000
[32]

An algorithmic perspective on imitation learning.Foundations and Trends in Robotics, 2018

T Osa, J Pajarinen, G Neumann, JA Bagnell, P Abbeel, and J Peters. An algorithmic perspective on imitation learning.Foundations and Trends in Robotics, 2018. 12

work page 2018
[33]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[34]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

work page 2024
[35]

D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation.Neural Computation, 3(1):88–97, 1991

work page 1991
[36]

Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994

work page 1994
[37]

Improving language under- standing by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language under- standing by generative pre-training. 2018

work page 2018
[38]

On the value of interaction and function approximation in imitation learning.Advances in Neural Information Processing Systems, 34:1325–1336, 2021

Nived Rajaraman, Yanjun Han, Lin Yang, Jingbo Liu, Jiantao Jiao, and Kannan Ramchandran. On the value of interaction and function approximation in imitation learning.Advances in Neural Information Processing Systems, 34:1325–1336, 2021

work page 2021
[39]

Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

work page 2020
[40]

A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

work page 2013
[41]

PhD thesis, University of Amsterdam, 2016

Diederik Marijn Roijers.Multi-Objective Decision-Theoretic Planning. PhD thesis, University of Amsterdam, 2016

work page 2016
[42]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2011

work page 2011
[43]

Efficient reductions for imitation learning

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2010

work page 2010
[44]

Learning agents for uncertain environments (extended abstract)

Stuart Russell. Learning agents for uncertain environments (extended abstract). InAnnual Conference on Computational Learning Theory (COLT), 1998

work page 1998
[45]

Smith, and Yejin Choi

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. Social bias frames: Reasoning about social and power implications of language. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online, July 2020. Association for Computational Linguistics

work page 2020
[46]

Online apprenticeship learning.arXiv:2102.06924,

Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924, 2021

work page arXiv 2021
[47]

Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953

Lloyd S Shapley. Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953

work page 1953
[48]

Quantization-free autoregressive action transformer

Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, and Claire Vernade. Quantization-free autoregressive action transformer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[49]

Multi-agent generative adversarial imitation learning.Advances in neural information processing systems, 31, 2018

Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning.Advances in neural information processing systems, 31, 2018. 13

work page 2018
[50]

Of moments and matching: A game-theoretic framework for closing the imitation gap

Gokul Swamy, Sanjiban Choudhury, J Andrew Bagnell, and Steven Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap. InInternational Conference on Machine Learning, pages 10022–10032. PMLR, 2021

work page 2021
[51]

U. Syed, M. Bowling, and R.E. Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning (ICML), 2008

work page 2008
[52]

Syed and R

U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2007

work page 2007
[53]

Multi-agent imitation learning: Value is easy, regret is hard

Jingwu Tang, Gokul Swamy, Fei Fang, and Zhiwei Steven Wu. Multi-agent imitation learning: Value is easy, regret is hard. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 27790–27816. Curran Associates, Inc., 2024

work page 2024
[54]

Multi-agent imitation learning with function approximation: Linear markov games and beyond.arXiv preprint arXiv:2602.22810, 2026

Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Multi-agent imitation learning with function approximation: Linear markov games and beyond.arXiv preprint arXiv:2602.22810, 2026

work page arXiv 2026
[55]

Proximal point imitation learning.Advances in Neural Information Processing Systems, 35:24309–24326, 2022

Luca Viano, Angeliki Kamoutsi, Gergely Neu, Igor Krawczuk, and Volkan Cevher. Proximal point imitation learning.Advances in Neural Information Processing Systems, 35:24309–24326, 2022

work page 2022
[56]

Imitation learning in discounted linear MDPs without exploration assumptions

Luca Viano, Stratis Skoulakis, and Volkan Cevher. Imitation learning in discounted linear MDPs without exploration assumptions. InForty-first International Conference on Machine Learning, 2024

work page 2024
[57]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

work page 2023
[58]

Huang, and Nicolas Heess

Joe Watson, Sandy H. Huang, and Nicolas Heess. Coherent soft imitation learning, 2023

work page 2023
[59]

D.J. White. Multi-objective infinite-horizon discounted markov decision processes.Journal of mathemat- ical analysis and applications, 89(2):639–647, 1982

work page 1982
[60]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018

work page 2018
[61]

Imitating language via scalable inverse reinforcement learning.Advances in Neural Information Processing Systems, 37:90714–90735, 2024

Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jörg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, et al. Imitating language via scalable inverse reinforcement learning.Advances in Neural Information Processing Systems, 37:90714–90735, 2024

work page 2024
[62]

Provably efficient adversarial imitation learning with unknown transitions.arXiv preprint arXiv:2306.06563, 2023

Tian Xu, Ziniu Li, Yang Yu, and Zhi-Quan Luo. Provably efficient adversarial imitation learning with unknown transitions.arXiv preprint arXiv:2306.06563, 2023

work page arXiv 2023
[63]

A generalized algorithm for multi-objective reinforcement learning and policy adaptation

Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché- Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019
[64]

Multi-agent adversarial inverse reinforcement learning

Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. InInternational conference on machine learning, pages 7194–7201. PMLR, 2019

work page 2019
[65]

P Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

work page 2023
[66]

B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. InNational Conference on Artificial Intelligence (AAAI), 2008. 14 Contents of Appendix A Related Work 16 B Omitted Proofs 18 B.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Proof of Theorem...

work page 2008
[67]

for a complete overview of the classics. Recent theoretical developmentsThe first solid sample complexity investigation in imitation learning is by [38], which also introduced the lower bound construction on which we build on for ours to show that behavioral cloning is actually optimal across offline algorithms. Beyond the tabular setting, [10] proved tha...

work page
[68]

Aliasing:Multiple distinct deterministic policies collapse into the exact same geometric vertex of the return polytope

work page
[69]

distance-1

Edge-bound Policies:Deterministic policies may map to the flat edge connecting two vertices, rather than forming new extreme points. Because of these phenomena, the strict "distance-1" property between adjacent vertices breaks down. Our result in Theorem 2.1 holds in general, and is agnostic to the existence ofaliasedoredge-boundpolicies. B.2 Proof of The...

work page
[70]

24 Then, we have that max M∈{M 1,M2,M3,M4} E[J M,i(π⋆)−J M,i(ˆπ1)]≥ 1 8(1−γ) KX k=2 1 N1 + 1 1− 1 N1 + 1 N1 ≥ 1 8(1−γ) KX k=2 1 e(N1 + 1) ≥Ω K (1−γ)N 1

Then, switching the attention back to(3), taking the maximum over both sides and defining asJM,i(π)as the ith entry of the performance vector for policyπ in the environment M, we have that max M∈{M 1,M2,M3,M4} E[J M,i(π⋆)−J M,i(ˆπ1)]≥ 1 8(1−γ) X x∈X \Xcommon ν0(x)P[π 1(x)̸= ˆπ1(x)] = 1 8(1−γ) X x∈X \Xcommon ν0(x)P[x /∈ D1] = 1 8(1−γ) X x∈X \Xcommon ν0(x)(...

work page
[71]

falling off

in the graph, which is an NP-hard problem. Thus, we restrict our setting to pair-wise distinct actions, and leave the extension to arbitrary experts for future work. Cluster 1 (Expert 1 Actions) Cluster 2 (Expert 2 Actions) (x1, a1) (x2, a2) (x3, a3) (x4, a4) τA τB (x1, a′ 1) (x2, a′ 2) (x3, a′ 3) (x4, a′ 4) τC No Linking Trajectories (Mutually Exclusive)...

work page

[1] [1]

Abbeel, D

P. Abbeel, D. Dolgov, A. Y. Ng, and S. Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008

work page 2008

[2] [2]

Abbeel and A

P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), 2004

work page 2004

[3] [3]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

work page 2024

[4] [4]

A data-driven dynamic obstacle avoidance method for liquid-carrying plant protection uavs.Agronomy, 12(4), 2022

Shibbir Ahmed, Baijing Qiu, Chun-Wei Kong, Huang Xin, Fiaz Ahmad, and Jinlong Lin. A data-driven dynamic obstacle avoidance method for liquid-carrying plant protection uavs.Agronomy, 12(4), 2022

work page 2022

[5] [5]

Learning all optimal policies with multiple criteria

Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. InProceedings of the 25th International Conference on Machine Learning, ICML ’08, page 41–47, New York, NY, USA,

work page

[6] [6]

Association for Computing Machinery

work page

[7] [7]

Relative entropy inverse reinforcement learning

Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 182–189. JMLR Workshop and Conference Proceedings, 2011

work page 2011

[8] [8]

Diffusion policy: Visuomotor policy learning via action diffusion, 2024

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024

work page 2024

[9] [9]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. Ultrafeedback: Boosting language models with scaled ai feedback.arXiv preprint arXiv:2310.01377, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019

[11] [11]

Is behavior cloning all you need? understanding horizon in imitation learning.arXiv preprint arXiv:2407.15007, 2024

Dylan J Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? understanding horizon in imitation learning.arXiv preprint arXiv:2407.15007, 2024

work page arXiv 2024

[12] [12]

Select to perfect: Imitating desired behavior from large multi-agent data.arXiv preprint arXiv:2405.03735, 2024

Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, and Joao Henriques. Select to perfect: Imitating desired behavior from large multi-agent data.arXiv preprint arXiv:2405.03735, 2024

work page arXiv 2024

[13] [13]

Learning equilibria from data: Provably efficient multi-agent imitation learning, 2025

Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Learning equilibria from data: Provably efficient multi-agent imitation learning, 2025. 11

work page 2025

[14] [14]

Rate optimal learning of equilibria from data.arXiv preprint arXiv:2510.09325, 2025

Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Rate optimal learning of equilibria from data.arXiv preprint arXiv:2510.09325, 2025

work page arXiv 2025

[15] [15]

Characterization of optimal policies in vector-valued markovian decision processes

Nagata Furukawa. Characterization of optimal policies in vector-valued markovian decision processes. Mathematics of operations research, 5(2):271–279, 1980

work page 1980

[16] [16]

Ho and S

J. Ho and S. Ermon. Generative adversarial imitation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2016

work page 2016

[17] [17]

J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. InInternational Conference on Machine Learning (ICML), 2016

work page 2016

[18] [18]

Multi-objective lqr with linear scalarization.arXiv preprint arXiv:2408.04488, 2024

Ali Jadbabaie, Devavrat Shah, and Sean R Sinclair. Multi-objective lqr with linear scalarization.arXiv preprint arXiv:2408.04488, 2024

work page arXiv 2024

[19] [19]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In8th Annual Conference on Robot Learning, 2024

work page 2024

[21] [21]

Pareto inverse reinforcement learning for diverse expert policy generation.arXiv preprint arXiv:2408.12110, 2024

Woo Kyung Kim, Minjong Yoo, and Honguk Woo. Pareto inverse reinforcement learning for diverse expert policy generation.arXiv preprint arXiv:2408.12110, 2024

work page arXiv 2024

[22] [22]

Pretraining language models with human preferences

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In International conference on machine learning, pages 17506–17533. PMLR, 2023

work page 2023

[23] [23]

Coordinated multi-agent imitation learning

Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. InInternational Conference on Machine Learning, pages 1995–2003. PMLR, 2017

work page 1995

[24] [24]

Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. InMulti-modal Foundation Model meets Embodied AI Workshop @ ICML2024, 2024

work page 2024

[25] [25]

Learning strategy representation for imitation learning in multi-agent games

Shiqi Lei, Kanghoon Lee, Linjing Li, and Jinkyoo Park. Learning strategy representation for imitation learning in multi-agent games. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18163–18171, 2025

work page 2025

[26] [26]

How to find the exact pareto front for multi-objective mdps? arXiv preprint arXiv:2410.15557, 2024

Yining Li, Peizhong Ju, and Ness B Shroff. How to find the exact pareto front for multi-objective mdps? arXiv preprint arXiv:2410.15557, 2024

work page arXiv 2024

[27] [27]

Multi-objective reinforcement learning: Convexity, stationarity and pareto optimality

Haoye Lu, Daniel Herman, and Yaoliang Yu. Multi-objective reinforcement learning: Convexity, stationarity and pareto optimality. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[28] [28]

Molter.Classic graph problems made temporal – a parameterized complexity analysis

H. Molter.Classic graph problems made temporal – a parameterized complexity analysis. Foundations of Computing. Universitätsverlag der TU Berlin, 2020

work page 2020

[29] [29]

Inverse q-learning done right: Offline imitation learning inq π-realizable mdps, 2025

Antoine Moulin, Gergely Neu, and Luca Viano. Inverse q-learning done right: Offline imitation learning inq π-realizable mdps, 2025

work page 2025

[30] [30]

Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900,

Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning.arXiv preprint arXiv:2502.13900, 2025

work page arXiv 2025

[31] [31]

A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), 2000

work page 2000

[32] [32]

An algorithmic perspective on imitation learning.Foundations and Trends in Robotics, 2018

T Osa, J Pajarinen, G Neumann, JA Bagnell, P Abbeel, and J Peters. An algorithmic perspective on imitation learning.Foundations and Trends in Robotics, 2018. 12

work page 2018

[33] [33]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022

[34] [34]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

work page 2024

[35] [35]

D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation.Neural Computation, 3(1):88–97, 1991

work page 1991

[36] [36]

Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming

Martin L. Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994

work page 1994

[37] [37]

Improving language under- standing by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language under- standing by generative pre-training. 2018

work page 2018

[38] [38]

On the value of interaction and function approximation in imitation learning.Advances in Neural Information Processing Systems, 34:1325–1336, 2021

Nived Rajaraman, Yanjun Han, Lin Yang, Jingbo Liu, Jiantao Jiao, and Kannan Ramchandran. On the value of interaction and function approximation in imitation learning.Advances in Neural Information Processing Systems, 34:1325–1336, 2021

work page 2021

[39] [39]

Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning.Advances in Neural Information Processing Systems, 33:2914–2924, 2020

work page 2020

[40] [40]

A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making.Journal of Artificial Intelligence Research, 48:67–113, 2013

work page 2013

[41] [41]

PhD thesis, University of Amsterdam, 2016

Diederik Marijn Roijers.Multi-Objective Decision-Theoretic Planning. PhD thesis, University of Amsterdam, 2016

work page 2016

[42] [42]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2011

work page 2011

[43] [43]

Efficient reductions for imitation learning

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2010

work page 2010

[44] [44]

Learning agents for uncertain environments (extended abstract)

Stuart Russell. Learning agents for uncertain environments (extended abstract). InAnnual Conference on Computational Learning Theory (COLT), 1998

work page 1998

[45] [45]

Smith, and Yejin Choi

Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. Social bias frames: Reasoning about social and power implications of language. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online, July 2020. Association for Computational Linguistics

work page 2020

[46] [46]

Online apprenticeship learning.arXiv:2102.06924,

Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning.arXiv:2102.06924, 2021

work page arXiv 2021

[47] [47]

Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953

Lloyd S Shapley. Stochastic games.Proceedings of the national academy of sciences, 39(10):1095–1100, 1953

work page 1953

[48] [48]

Quantization-free autoregressive action transformer

Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, and Claire Vernade. Quantization-free autoregressive action transformer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026

[49] [49]

Multi-agent generative adversarial imitation learning.Advances in neural information processing systems, 31, 2018

Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning.Advances in neural information processing systems, 31, 2018. 13

work page 2018

[50] [50]

Of moments and matching: A game-theoretic framework for closing the imitation gap

Gokul Swamy, Sanjiban Choudhury, J Andrew Bagnell, and Steven Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap. InInternational Conference on Machine Learning, pages 10022–10032. PMLR, 2021

work page 2021

[51] [51]

U. Syed, M. Bowling, and R.E. Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning (ICML), 2008

work page 2008

[52] [52]

Syed and R

U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2007

work page 2007

[53] [53]

Multi-agent imitation learning: Value is easy, regret is hard

Jingwu Tang, Gokul Swamy, Fei Fang, and Zhiwei Steven Wu. Multi-agent imitation learning: Value is easy, regret is hard. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 27790–27816. Curran Associates, Inc., 2024

work page 2024

[54] [54]

Multi-agent imitation learning with function approximation: Linear markov games and beyond.arXiv preprint arXiv:2602.22810, 2026

Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Multi-agent imitation learning with function approximation: Linear markov games and beyond.arXiv preprint arXiv:2602.22810, 2026

work page arXiv 2026

[55] [55]

Proximal point imitation learning.Advances in Neural Information Processing Systems, 35:24309–24326, 2022

Luca Viano, Angeliki Kamoutsi, Gergely Neu, Igor Krawczuk, and Volkan Cevher. Proximal point imitation learning.Advances in Neural Information Processing Systems, 35:24309–24326, 2022

work page 2022

[56] [56]

Imitation learning in discounted linear MDPs without exploration assumptions

Luca Viano, Stratis Skoulakis, and Volkan Cevher. Imitation learning in discounted linear MDPs without exploration assumptions. InForty-first International Conference on Machine Learning, 2024

work page 2024

[57] [57]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

work page 2023

[58] [58]

Huang, and Nicolas Heess

Joe Watson, Sandy H. Huang, and Nicolas Heess. Coherent soft imitation learning, 2023

work page 2023

[59] [59]

D.J. White. Multi-objective infinite-horizon discounted markov decision processes.Journal of mathemat- ical analysis and applications, 89(2):639–647, 1982

work page 1982

[60] [60]

A broad-coverage challenge corpus for sentence understanding through inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018

work page 2018

[61] [61]

Imitating language via scalable inverse reinforcement learning.Advances in Neural Information Processing Systems, 37:90714–90735, 2024

Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jörg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, et al. Imitating language via scalable inverse reinforcement learning.Advances in Neural Information Processing Systems, 37:90714–90735, 2024

work page 2024

[62] [62]

Provably efficient adversarial imitation learning with unknown transitions.arXiv preprint arXiv:2306.06563, 2023

Tian Xu, Ziniu Li, Yang Yu, and Zhi-Quan Luo. Provably efficient adversarial imitation learning with unknown transitions.arXiv preprint arXiv:2306.06563, 2023

work page arXiv 2023

[63] [63]

A generalized algorithm for multi-objective reinforcement learning and policy adaptation

Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché- Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019

[64] [64]

Multi-agent adversarial inverse reinforcement learning

Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. InInternational conference on machine learning, pages 7194–7201. PMLR, 2019

work page 2019

[65] [65]

P Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

work page 2023

[66] [66]

B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. InNational Conference on Artificial Intelligence (AAAI), 2008. 14 Contents of Appendix A Related Work 16 B Omitted Proofs 18 B.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Proof of Theorem...

work page 2008

[67] [67]

for a complete overview of the classics. Recent theoretical developmentsThe first solid sample complexity investigation in imitation learning is by [38], which also introduced the lower bound construction on which we build on for ours to show that behavioral cloning is actually optimal across offline algorithms. Beyond the tabular setting, [10] proved tha...

work page

[68] [68]

Aliasing:Multiple distinct deterministic policies collapse into the exact same geometric vertex of the return polytope

work page

[69] [69]

distance-1

Edge-bound Policies:Deterministic policies may map to the flat edge connecting two vertices, rather than forming new extreme points. Because of these phenomena, the strict "distance-1" property between adjacent vertices breaks down. Our result in Theorem 2.1 holds in general, and is agnostic to the existence ofaliasedoredge-boundpolicies. B.2 Proof of The...

work page

[70] [70]

24 Then, we have that max M∈{M 1,M2,M3,M4} E[J M,i(π⋆)−J M,i(ˆπ1)]≥ 1 8(1−γ) KX k=2 1 N1 + 1 1− 1 N1 + 1 N1 ≥ 1 8(1−γ) KX k=2 1 e(N1 + 1) ≥Ω K (1−γ)N 1

Then, switching the attention back to(3), taking the maximum over both sides and defining asJM,i(π)as the ith entry of the performance vector for policyπ in the environment M, we have that max M∈{M 1,M2,M3,M4} E[J M,i(π⋆)−J M,i(ˆπ1)]≥ 1 8(1−γ) X x∈X \Xcommon ν0(x)P[π 1(x)̸= ˆπ1(x)] = 1 8(1−γ) X x∈X \Xcommon ν0(x)P[x /∈ D1] = 1 8(1−γ) X x∈X \Xcommon ν0(x)(...

work page

[71] [71]

falling off

in the graph, which is an NP-hard problem. Thus, we restrict our setting to pair-wise distinct actions, and leave the extension to arbitrary experts for future work. Cluster 1 (Expert 1 Actions) Cluster 2 (Expert 2 Actions) (x1, a1) (x2, a2) (x3, a3) (x4, a4) τA τB (x1, a′ 1) (x2, a′ 2) (x3, a′ 3) (x4, a′ 4) τC No Linking Trajectories (Mutually Exclusive)...

work page