arxiv: 2605.09183 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

Surbhi Goel , Jonathan Pei , James Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3

classification 💻 cs.LG

keywords selective imitation learningdynamics shiftstopping rulehorizon-free sample complexityvalidator policiesbehavior cloningimitation learning

0 comments

The pith

Selective imitation learning lets agents stop acting when dynamics shift makes expert demonstrations unreliable, using a small set of validator policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the failure of standard imitation learning when the test environment has different dynamics from training. It proposes allowing the learner to selectively stop imitating in states where it cannot act reliably based on available data. By leveraging unlabeled expert state trajectories from the test environment, the approach constructs a stopping rule that keeps the policy complete during training but sound during testing. This matters because it prevents arbitrary performance degradation in changing environments without requiring new demonstrations for every shift.

Core claim

SeqRejectron builds a stopping rule from a small collection of validator policies whose size is independent of the horizon and policy class, delivering horizon-free sample complexity of order log of policy class size over epsilon squared for deterministic policies under sparse costs, and similar guarantees for stochastic policies via cumulative Hellinger distance.

What carries the argument

SeqRejectron's validator policy set, a compact collection of policies used to determine when to reject an action and stop imitating.

Load-bearing premise

Unlabeled state trajectories from the same expert are available in the test environment, along with the sparse costs assumption for the deterministic case.

What would settle it

An empirical test where the regret of the selective policy before stopping exceeds the predicted bound when dynamics shift arbitrarily, or when no test trajectories are provided.

Figures

Figures reproduced from arXiv: 2605.09183 by James Wang, Jonathan Pei, Surbhi Goel.

read the original abstract

Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\epsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and different expert policies across train and test and obtain results that gracefully degrade with the amount of misspecification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SeqRejectron gives a clean selective imitation model with horizon-free bounds via a small validator set, but it needs unlabeled expert trajectories in the shifted environment.

read the letter

The paper's main advance is a selective imitation framework that lets the learner stop acting under arbitrary dynamics shift. It supplies labeled expert demos from training plus unlabeled state trajectories from the same expert at test time, then outputs a policy that stays complete on training data and incurs low regret before stopping on test data. SeqRejectron builds the stopping rule from a validator set whose size stays independent of horizon and policy class, which yields the horizon-free Õ(log|Π|/ε²) bound for deterministic policies under sparse costs and a Hellinger-based version for stochastic ones. The extension to misspecified experts that degrades gracefully is also useful. These pieces are new relative to standard behavior cloning and offline IL results. The formalization is careful and the sample-complexity claims look internally consistent once the assumptions are granted. The validator construction avoids obvious circularity. The main soft spots are the two assumptions the abstract flags: access to unlabeled expert trajectories in the test environment, which is not always easy to obtain, and the sparse-cost condition needed for the strongest deterministic bound. Those are stated clearly rather than hidden, so they do not undermine the derivation itself. The stochastic case is a bit weaker but still horizon-free. This work is aimed at imitation-learning theorists and people thinking about safe offline deployment in robotics or control. A reader who cares about sample-complexity guarantees under distribution shift will get concrete value from the validator construction and the bounds. I would send it to peer review; the central claims are coherent and the technical contribution is worth referee scrutiny even if the assumptions need discussion in the final version.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces SeqRejectron for selective imitation learning under arbitrary dynamics shift. Given labeled expert trajectories from a training environment and unlabeled state trajectories from the same expert in a test environment, the algorithm outputs a selective policy that is complete (rarely stops on training data) and sound (low regret before stopping on test data). The core construction uses a small set of validator policies whose cardinality is independent of the horizon and policy class size. For deterministic policies this yields horizon-free sample complexity Õ(log|Π|/ε²) under a sparse-cost assumption; for stochastic policies an analogous bound is obtained via a cumulative Hellinger stopping time. The framework is extended to misspecified experts and differing train/test expert policies, with guarantees that degrade gracefully with the degree of misspecification.

Significance. If the stated bounds hold, the result is significant because it supplies the first horizon-free sample-complexity guarantees for imitation learning under arbitrary dynamics shift, achieved by a stopping rule whose computational cost does not grow with horizon. The validator-set construction (size independent of |Π| and T) is a technically clean device that avoids the usual dependence on horizon in disagreement-based or disagreement-coefficient arguments. The paper also supplies explicit, falsifiable assumptions (sparse costs, availability of unlabeled test trajectories) together with graceful-degradation results for misspecification, which strengthens the practical relevance of the theory.

minor comments (2)

[Theorem 3.1 and Section 4] The sparse-cost assumption is stated only in the abstract and in the deterministic theorem; a single, self-contained definition (including the precise constant or support size) should appear in the main theorem statement and be referenced from the stochastic and misspecification extensions.
[Sections 3 and 5] Notation for the validator set V and the stopping time τ is introduced in Section 3 but reused with slightly different indexing in the stochastic case (Section 5); a unified notation table or consistent subscript convention would reduce reader effort.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and insightful review, as well as the recommendation for minor revision. The summary accurately reflects the core contributions of SeqRejectron, including the horizon-free sample-complexity guarantees and the use of a small validator set whose size is independent of both the horizon and the policy class. We are pleased that the technical cleanliness of the validator construction and the graceful degradation under misspecification are highlighted as strengthening the practical relevance of the results.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents SeqRejectron as an explicit algorithmic construction of a stopping rule from a small validator policy set whose size is stated to be independent of horizon and policy class. The horizon-free sample complexity bounds are derived under explicitly listed assumptions (unlabeled test trajectories from the expert and sparse costs for the deterministic case) rather than by fitting parameters to the same data used for the final guarantee or by reducing the claimed result to a self-citation chain. No load-bearing step in the abstract or described framework reduces by construction to its own inputs, and the derivation remains self-contained once the stated assumptions are granted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard statistical concentration tools and the modeling assumptions stated in the abstract; no free parameters or invented entities are introduced beyond the new selective-imitation framework itself.

axioms (1)

standard math Standard concentration inequalities suffice to obtain the stated log|Π|/ε² sample bounds
Invoked to derive the horizon-free sample complexity for the validator-based stopping rule.

pith-pipeline@v0.9.0 · 5529 in / 1297 out tokens · 48745 ms · 2026-05-12T03:18:58.353122+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 2 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

Beyond perturbations: Learning guarantees with arbitrary adversarial test examples , author=. Advances in Neural Information Processing Systems , volume=

work page
[2]

Advances in Neural Information Processing Systems , volume=

Is behavior cloning all you need? understanding horizon in imitation learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[3]

2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) , pages=

Domain randomization for transferring deep neural networks from simulation to the real world , author=. 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) , pages=. 2017 , organization=

work page 2017
[4]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Exploring the limitations of behavior cloning for autonomous driving , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

work page
[5]

Advances in neural information processing systems , volume=

Alvinn: An autonomous land vehicle in a neural network , author=. Advances in neural information processing systems , volume=

work page
[6]

Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=

Efficient reductions for imitation learning , author=. Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=. 2010 , organization=

work page 2010
[7]

Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=. 2011 , organization=

work page 2011
[8]

IEEE Transactions on information theory , volume=

On optimum recognition error and reject tradeoff , author=. IEEE Transactions on information theory , volume=. 2003 , publisher=

work page 2003
[9]

, author=

On the Foundations of Noise-free Selective Classification. , author=. Journal of Machine Learning Research , volume=

work page
[10]

Algorithmic Learning Theory , pages=

Efficient learning with arbitrary covariate shift , author=. Algorithmic Learning Theory , pages=. 2021 , organization=

work page 2021
[11]

Advances in Neural Information Processing Systems , volume=

Tolerant algorithms for learning with arbitrary covariate shift , author=. Advances in Neural Information Processing Systems , volume=

work page
[12]

Advances in neural information processing systems , volume=

Selective classification for deep neural networks , author=. Advances in neural information processing systems , volume=

work page
[13]

Journal of Computer and System Sciences , volume=

Efficient algorithms for online decision problems , author=. Journal of Computer and System Sciences , volume=. 2005 , publisher=

work page 2005
[14]

Behavioral Cloning from Observation

Behavioral cloning from observation , author=. arXiv preprint arXiv:1805.01954 , year=

work page Pith review arXiv
[15]

Advances in Neural Information Processing Systems , volume=

Toward the fundamental limits of imitation learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[16]

Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

The computational power of optimization in online learning , author=. Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

work page
[17]

arXiv preprint arXiv:2006.13916 , year=

Off-dynamics reinforcement learning: Training for transfer with domain classifiers , author=. arXiv preprint arXiv:2006.13916 , year=

work page arXiv 2006
[18]

Advances in Neural Information Processing Systems , volume=

Robust inverse reinforcement learning under transition dynamics mismatch , author=. Advances in Neural Information Processing Systems , volume=

work page
[19]

International Conference on Machine Learning , pages=

Robust imitation learning against variations in environment dynamics , author=. International Conference on Machine Learning , pages=. 2022 , organization=

work page 2022
[20]

arXiv preprint arXiv:2002.11879 , year=

State-only imitation with transition dynamics mismatch , author=. arXiv preprint arXiv:2002.11879 , year=

work page arXiv 2002
[21]

Advances in Neural Information Processing Systems , volume=

An imitation from observation approach to transfer learning with dynamics mismatch , author=. Advances in Neural Information Processing Systems , volume=

work page
[22]

Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections , author=. arXiv preprint arXiv:2512.14895 , year=

work page arXiv
[23]

The International Journal of Robotics Research , volume=

Learning dexterous in-hand manipulation , author=. The International Journal of Robotics Research , volume=. 2020 , publisher=

work page 2020
[24]

2019 International Conference on Robotics and Automation (ICRA) , pages=

Safe reinforcement learning with model uncertainty estimates , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

work page 2019
[25]

Advances in Neural Information Processing Systems , volume=

Bridging offline reinforcement learning and imitation learning: A tale of pessimism , author=. Advances in Neural Information Processing Systems , volume=

work page
[26]

Advances in neural information processing systems , volume=

Conservative q-learning for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

work page
[27]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Offline reinforcement learning: Tutorial, review, and perspectives on open problems , author=. arXiv preprint arXiv:2005.01643 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2005
[28]

arXiv preprint arXiv:2503.13162 , year=

Efficient imitation under misspecification , author=. arXiv preprint arXiv:2503.13162 , year=

work page arXiv
[29]

International Conference on Algorithmic Learning Theory , pages=

On the hardness of domain adaptation and the utility of unlabeled target samples , author=. International Conference on Algorithmic Learning Theory , pages=. 2012 , organization=

work page 2012
[30]

arXiv preprint arXiv:2110.03239 , year=

Understanding domain randomization for sim-to-real transfer , author=. arXiv preprint arXiv:2110.03239 , year=

work page arXiv
[31]

CAD2RL: Real Single-Image Flight without a Single Real Image

Cad2rl: Real single-image flight without a single real image , author=. arXiv preprint arXiv:1611.04201 , year=

work page Pith review arXiv
[32]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

work page
[33]

2018 IEEE international conference on robotics and automation (ICRA) , pages=

Sim-to-real transfer of robotic control with dynamics randomization , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

work page 2018
[34]

Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

Sim-to-real: Learning agile locomotion for quadruped robots , author=. arXiv preprint arXiv:1804.10332 , year=

work page Pith review arXiv
[35]

arXiv preprint arXiv:1910.07113 , year=

Solving rubik's cube with a robot hand , author=. arXiv preprint arXiv:1910.07113 , year=

work page arXiv 1910
[36]

arXiv preprint arXiv:2502.12310 , year=

Domain randomization is sample efficient for linear quadratic control , author=. arXiv preprint arXiv:2502.12310 , year=

work page arXiv
[37]

The Fourteenth International Conference on Learning Representations , year=

Statistical Guarantees for Offline Domain Randomization , author=. The Fourteenth International Conference on Learning Representations , year=

work page
[38]

2019 international conference on robotics and automation (ICRA) , pages=

Closing the sim-to-real loop: Adapting simulation randomization with real world experience , author=. 2019 international conference on robotics and automation (ICRA) , pages=. 2019 , organization=

work page 2019
[39]

Preparing for the Unknown: Learning a Universal Policy with Online System Identification

Preparing for the unknown: Learning a universal policy with online system identification , author=. arXiv preprint arXiv:1702.02453 , year=

work page Pith review arXiv
[40]

Rma: Rapid motor adaptation for legged robots,

Rma: Rapid motor adaptation for legged robots , author=. arXiv preprint arXiv:2107.04034 , year=

work page arXiv
[41]

Conference on Robot Learning , pages=

Active domain randomization , author=. Conference on Robot Learning , pages=. 2020 , organization=

work page 2020
[42]

Conference on robot learning , pages=

Sim-to-real robot learning from pixels with progressive nets , author=. Conference on robot learning , pages=. 2017 , organization=

work page 2017
[43]

Proceedings of the IEEE , volume=

A game theoretic approach to controller design for hybrid systems , author=. Proceedings of the IEEE , volume=. 2000 , publisher=

work page 2000
[44]

2019 18th European control conference (ECC) , pages=

Control barrier functions: Theory and applications , author=. 2019 18th European control conference (ECC) , pages=. 2019 , organization=

work page 2019
[45]

International workshop on hybrid systems: Computation and control , pages=

Safety verification of hybrid systems using barrier certificates , author=. International workshop on hybrid systems: Computation and control , pages=. 2004 , organization=

work page 2004
[46]

2017 IEEE 56th annual conference on decision and control (CDC) , pages=

Hamilton-jacobi reachability: A brief overview and recent advances , author=. 2017 IEEE 56th annual conference on decision and control (CDC) , pages=. 2017 , organization=

work page 2017
[47]

Mathematics of Operations Research , volume=

Robust dynamic programming , author=. Mathematics of Operations Research , volume=. 2005 , publisher=

work page 2005
[48]

Operations Research , volume=

Robust control of Markov decision processes with uncertain transition matrices , author=. Operations Research , volume=. 2005 , publisher=

work page 2005
[49]

Mathematics of Operations Research , volume=

Robust Markov decision processes , author=. Mathematics of Operations Research , volume=. 2013 , publisher=

work page 2013
[50]

International conference on machine learning , pages=

Robust adversarial reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[51]

arXiv preprint arXiv:1610.01283 , year=

Epopt: Learning robust neural network policies using model ensembles , author=. arXiv preprint arXiv:1610.01283 , year=

work page arXiv
[52]

Advances in neural information processing systems , volume=

Robust deep reinforcement learning against adversarial perturbations on state observations , author=. Advances in neural information processing systems , volume=

work page
[53]

International Conference on Machine Learning , pages=

Action robust reinforcement learning and applications in continuous control , author=. International Conference on Machine Learning , pages=. 2019 , organization=

work page 2019
[54]

International Conference on Artificial Intelligence and Statistics , pages=

Sample complexity of robust reinforcement learning with a generative model , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[55]

Journal of Machine Learning Research , volume=

Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity , author=. Journal of Machine Learning Research , volume=

work page
[56]

Advances in neural information processing systems , volume=

Generative adversarial imitation learning , author=. Advances in neural information processing systems , volume=

work page
[57]

Learning Robust Rewards with Adversarial Inverse Reinforcement Learning , url =

Learning robust rewards with adversarial inverse reinforcement learning , author=. arXiv preprint arXiv:1710.11248 , year=

work page arXiv
[58]

arXiv preprint arXiv:1912.05032 , year=

Imitation learning via off-policy distribution matching , author=. arXiv preprint arXiv:1912.05032 , year=

work page arXiv 1912
[59]

, author=

Maximum entropy inverse reinforcement learning. , author=. Aaai , volume=. 2008 , organization=

work page 2008
[60]

international conference on machine learning , pages=

Dropout as a bayesian approximation: Representing model uncertainty in deep learning , author=. international conference on machine learning , pages=. 2016 , organization=

work page 2016
[61]

Advances in neural information processing systems , volume=

Simple and scalable predictive uncertainty estimation using deep ensembles , author=. Advances in neural information processing systems , volume=

work page
[62]

Uncertainty-Aware Reinforcement Learning for Collision Avoidance

Uncertainty-aware reinforcement learning for collision avoidance , author=. arXiv preprint arXiv:1702.01182 , year=

work page Pith review arXiv
[63]

IEEE Robotics and Automation Letters , volume=

Safe planning in dynamic environments using conformal prediction , author=. IEEE Robotics and Automation Letters , volume=. 2023 , publisher=

work page 2023
[64]

Learning for Dynamics and Control Conference , pages=

Adaptive conformal prediction for motion planning among dynamic agents , author=. Learning for Dynamics and Control Conference , pages=. 2023 , organization=

work page 2023
[65]

Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023) , pages=

Conformal prediction for stl runtime verification , author=. Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023) , pages=

work page 2023
[66]

2024 IEEE 63rd Conference on Decision and Control (CDC) , pages=

Single trajectory conformal prediction , author=. 2024 IEEE 63rd Conference on Decision and Control (CDC) , pages=. 2024 , organization=

work page 2024
[67]

Foundations and Trends in Machine Learning , volume=

Conformal prediction: A gentle introduction , author=. Foundations and Trends in Machine Learning , volume=. 2023 , publisher=

work page 2023
[68]

Advances in neural information processing systems , volume=

Conformal prediction under covariate shift , author=. Advances in neural information processing systems , volume=

work page
[69]

Advances in Neural Information Processing Systems , volume=

Adaptive conformal inference under distribution shift , author=. Advances in Neural Information Processing Systems , volume=

work page
[70]

Journal of statistical planning and inference , volume=

Improving predictive inference under covariate shift by weighting the log-likelihood function , author=. Journal of statistical planning and inference , volume=. 2000 , publisher=

work page 2000
[71]

, author=

Covariate shift adaptation by importance weighted cross validation. , author=. Journal of Machine Learning Research , volume=

work page
[72]

Machine learning , volume=

A theory of learning from different domains , author=. Machine learning , volume=. 2010 , publisher=

work page 2010
[73]

arXiv preprint arXiv:0902.3430 , year=

Domain adaptation: Learning bounds and algorithms , author=. arXiv preprint arXiv:0902.3430 , year=

work page arXiv
[74]

1998 , publisher=

Reinforcement learning: An introduction , author=. 1998 , publisher=

work page 1998
[75]

Handbooks in operations research and management science , volume=

Markov decision processes , author=. Handbooks in operations research and management science , volume=. 1990 , publisher=

work page 1990
[76]

NeurIPS , year=

Efficient Discrepancy Testing for Learning with Distribution Shift , author=. NeurIPS , year=

work page
[77]

37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

Testable Learning with Distribution Shift , author=. 37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

work page 2024
[78]

37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

Learning Intersections of Halfspaces with Distribution Shift: Improved Algorithms and SQ Lower Bounds , author=. 37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

work page 2024
[79]

International Conference on Machine Learning , pages=

Of moments and matching: A game-theoretic framework for closing the imitation gap , author=. International Conference on Machine Learning , pages=. 2021 , organization=

work page 2021
[80]

Advances in Neural Information Processing Systems , volume=

Minimax optimal online imitation learning via replay estimation , author=. Advances in Neural Information Processing Systems , volume=

work page

Showing first 80 references.