pith. machine review for the scientific record. sign in

arxiv: 2605.09183 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords selective imitation learningdynamics shiftstopping rulehorizon-free sample complexityvalidator policiesbehavior cloningimitation learning
0
0 comments X

The pith

Selective imitation learning lets agents stop acting when dynamics shift makes expert demonstrations unreliable, using a small set of validator policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to address the failure of standard imitation learning when the test environment has different dynamics from training. It proposes allowing the learner to selectively stop imitating in states where it cannot act reliably based on available data. By leveraging unlabeled expert state trajectories from the test environment, the approach constructs a stopping rule that keeps the policy complete during training but sound during testing. This matters because it prevents arbitrary performance degradation in changing environments without requiring new demonstrations for every shift.

Core claim

SeqRejectron builds a stopping rule from a small collection of validator policies whose size is independent of the horizon and policy class, delivering horizon-free sample complexity of order log of policy class size over epsilon squared for deterministic policies under sparse costs, and similar guarantees for stochastic policies via cumulative Hellinger distance.

What carries the argument

SeqRejectron's validator policy set, a compact collection of policies used to determine when to reject an action and stop imitating.

Load-bearing premise

Unlabeled state trajectories from the same expert are available in the test environment, along with the sparse costs assumption for the deterministic case.

What would settle it

An empirical test where the regret of the selective policy before stopping exceeds the predicted bound when dynamics shift arbitrarily, or when no test trajectories are provided.

Figures

Figures reproduced from arXiv: 2605.09183 by James Wang, Jonathan Pei, Surbhi Goel.

Figure 1
Figure 1. Figure 1: Left: normalized target switched cost and source/target handoff rates as functions of the [PITH_FULL_IMAGE:figures/full_fig_p016_1.png] view at source ↗
read the original abstract

Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\epsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and different expert policies across train and test and obtain results that gracefully degrade with the amount of misspecification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces SeqRejectron for selective imitation learning under arbitrary dynamics shift. Given labeled expert trajectories from a training environment and unlabeled state trajectories from the same expert in a test environment, the algorithm outputs a selective policy that is complete (rarely stops on training data) and sound (low regret before stopping on test data). The core construction uses a small set of validator policies whose cardinality is independent of the horizon and policy class size. For deterministic policies this yields horizon-free sample complexity Õ(log|Π|/ε²) under a sparse-cost assumption; for stochastic policies an analogous bound is obtained via a cumulative Hellinger stopping time. The framework is extended to misspecified experts and differing train/test expert policies, with guarantees that degrade gracefully with the degree of misspecification.

Significance. If the stated bounds hold, the result is significant because it supplies the first horizon-free sample-complexity guarantees for imitation learning under arbitrary dynamics shift, achieved by a stopping rule whose computational cost does not grow with horizon. The validator-set construction (size independent of |Π| and T) is a technically clean device that avoids the usual dependence on horizon in disagreement-based or disagreement-coefficient arguments. The paper also supplies explicit, falsifiable assumptions (sparse costs, availability of unlabeled test trajectories) together with graceful-degradation results for misspecification, which strengthens the practical relevance of the theory.

minor comments (2)
  1. [Theorem 3.1 and Section 4] The sparse-cost assumption is stated only in the abstract and in the deterministic theorem; a single, self-contained definition (including the precise constant or support size) should appear in the main theorem statement and be referenced from the stochastic and misspecification extensions.
  2. [Sections 3 and 5] Notation for the validator set V and the stopping time τ is introduced in Section 3 but reused with slightly different indexing in the stochastic case (Section 5); a unified notation table or consistent subscript convention would reduce reader effort.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and insightful review, as well as the recommendation for minor revision. The summary accurately reflects the core contributions of SeqRejectron, including the horizon-free sample-complexity guarantees and the use of a small validator set whose size is independent of both the horizon and the policy class. We are pleased that the technical cleanliness of the validator construction and the graceful degradation under misspecification are highlighted as strengthening the practical relevance of the results.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents SeqRejectron as an explicit algorithmic construction of a stopping rule from a small validator policy set whose size is stated to be independent of horizon and policy class. The horizon-free sample complexity bounds are derived under explicitly listed assumptions (unlabeled test trajectories from the expert and sparse costs for the deterministic case) rather than by fitting parameters to the same data used for the final guarantee or by reducing the claimed result to a self-citation chain. No load-bearing step in the abstract or described framework reduces by construction to its own inputs, and the derivation remains self-contained once the stated assumptions are granted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard statistical concentration tools and the modeling assumptions stated in the abstract; no free parameters or invented entities are introduced beyond the new selective-imitation framework itself.

axioms (1)
  • standard math Standard concentration inequalities suffice to obtain the stated log|Π|/ε² sample bounds
    Invoked to derive the horizon-free sample complexity for the validator-based stopping rule.

pith-pipeline@v0.9.0 · 5529 in / 1297 out tokens · 48745 ms · 2026-05-12T03:18:58.353122+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · 2 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems , volume=

    Beyond perturbations: Learning guarantees with arbitrary adversarial test examples , author=. Advances in Neural Information Processing Systems , volume=

  2. [2]

    Advances in Neural Information Processing Systems , volume=

    Is behavior cloning all you need? understanding horizon in imitation learning , author=. Advances in Neural Information Processing Systems , volume=

  3. [3]

    2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) , pages=

    Domain randomization for transferring deep neural networks from simulation to the real world , author=. 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) , pages=. 2017 , organization=

  4. [4]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Exploring the limitations of behavior cloning for autonomous driving , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  5. [5]

    Advances in neural information processing systems , volume=

    Alvinn: An autonomous land vehicle in a neural network , author=. Advances in neural information processing systems , volume=

  6. [6]

    Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=

    Efficient reductions for imitation learning , author=. Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages=. 2010 , organization=

  7. [7]

    Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

    A reduction of imitation learning and structured prediction to no-regret online learning , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=. 2011 , organization=

  8. [8]

    IEEE Transactions on information theory , volume=

    On optimum recognition error and reject tradeoff , author=. IEEE Transactions on information theory , volume=. 2003 , publisher=

  9. [9]

    , author=

    On the Foundations of Noise-free Selective Classification. , author=. Journal of Machine Learning Research , volume=

  10. [10]

    Algorithmic Learning Theory , pages=

    Efficient learning with arbitrary covariate shift , author=. Algorithmic Learning Theory , pages=. 2021 , organization=

  11. [11]

    Advances in Neural Information Processing Systems , volume=

    Tolerant algorithms for learning with arbitrary covariate shift , author=. Advances in Neural Information Processing Systems , volume=

  12. [12]

    Advances in neural information processing systems , volume=

    Selective classification for deep neural networks , author=. Advances in neural information processing systems , volume=

  13. [13]

    Journal of Computer and System Sciences , volume=

    Efficient algorithms for online decision problems , author=. Journal of Computer and System Sciences , volume=. 2005 , publisher=

  14. [14]

    Behavioral Cloning from Observation

    Behavioral cloning from observation , author=. arXiv preprint arXiv:1805.01954 , year=

  15. [15]

    Advances in Neural Information Processing Systems , volume=

    Toward the fundamental limits of imitation learning , author=. Advances in Neural Information Processing Systems , volume=

  16. [16]

    Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

    The computational power of optimization in online learning , author=. Proceedings of the forty-eighth annual ACM symposium on Theory of Computing , pages=

  17. [17]

    arXiv preprint arXiv:2006.13916 , year=

    Off-dynamics reinforcement learning: Training for transfer with domain classifiers , author=. arXiv preprint arXiv:2006.13916 , year=

  18. [18]

    Advances in Neural Information Processing Systems , volume=

    Robust inverse reinforcement learning under transition dynamics mismatch , author=. Advances in Neural Information Processing Systems , volume=

  19. [19]

    International Conference on Machine Learning , pages=

    Robust imitation learning against variations in environment dynamics , author=. International Conference on Machine Learning , pages=. 2022 , organization=

  20. [20]

    arXiv preprint arXiv:2002.11879 , year=

    State-only imitation with transition dynamics mismatch , author=. arXiv preprint arXiv:2002.11879 , year=

  21. [21]

    Advances in Neural Information Processing Systems , volume=

    An imitation from observation approach to transfer learning with dynamics mismatch , author=. Advances in Neural Information Processing Systems , volume=

  22. [22]

    Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

    Imitation Learning for Multi-turn LM Agents via On-policy Expert Corrections , author=. arXiv preprint arXiv:2512.14895 , year=

  23. [23]

    The International Journal of Robotics Research , volume=

    Learning dexterous in-hand manipulation , author=. The International Journal of Robotics Research , volume=. 2020 , publisher=

  24. [24]

    2019 International Conference on Robotics and Automation (ICRA) , pages=

    Safe reinforcement learning with model uncertainty estimates , author=. 2019 International Conference on Robotics and Automation (ICRA) , pages=. 2019 , organization=

  25. [25]

    Advances in Neural Information Processing Systems , volume=

    Bridging offline reinforcement learning and imitation learning: A tale of pessimism , author=. Advances in Neural Information Processing Systems , volume=

  26. [26]

    Advances in neural information processing systems , volume=

    Conservative q-learning for offline reinforcement learning , author=. Advances in neural information processing systems , volume=

  27. [27]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Offline reinforcement learning: Tutorial, review, and perspectives on open problems , author=. arXiv preprint arXiv:2005.01643 , year=

  28. [28]

    arXiv preprint arXiv:2503.13162 , year=

    Efficient imitation under misspecification , author=. arXiv preprint arXiv:2503.13162 , year=

  29. [29]

    International Conference on Algorithmic Learning Theory , pages=

    On the hardness of domain adaptation and the utility of unlabeled target samples , author=. International Conference on Algorithmic Learning Theory , pages=. 2012 , organization=

  30. [30]

    arXiv preprint arXiv:2110.03239 , year=

    Understanding domain randomization for sim-to-real transfer , author=. arXiv preprint arXiv:2110.03239 , year=

  31. [31]

    CAD2RL: Real Single-Image Flight without a Single Real Image

    Cad2rl: Real single-image flight without a single real image , author=. arXiv preprint arXiv:1611.04201 , year=

  32. [32]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  33. [33]

    2018 IEEE international conference on robotics and automation (ICRA) , pages=

    Sim-to-real transfer of robotic control with dynamics randomization , author=. 2018 IEEE international conference on robotics and automation (ICRA) , pages=. 2018 , organization=

  34. [34]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    Sim-to-real: Learning agile locomotion for quadruped robots , author=. arXiv preprint arXiv:1804.10332 , year=

  35. [35]

    arXiv preprint arXiv:1910.07113 , year=

    Solving rubik's cube with a robot hand , author=. arXiv preprint arXiv:1910.07113 , year=

  36. [36]

    arXiv preprint arXiv:2502.12310 , year=

    Domain randomization is sample efficient for linear quadratic control , author=. arXiv preprint arXiv:2502.12310 , year=

  37. [37]

    The Fourteenth International Conference on Learning Representations , year=

    Statistical Guarantees for Offline Domain Randomization , author=. The Fourteenth International Conference on Learning Representations , year=

  38. [38]

    2019 international conference on robotics and automation (ICRA) , pages=

    Closing the sim-to-real loop: Adapting simulation randomization with real world experience , author=. 2019 international conference on robotics and automation (ICRA) , pages=. 2019 , organization=

  39. [39]

    Preparing for the Unknown: Learning a Universal Policy with Online System Identification

    Preparing for the unknown: Learning a universal policy with online system identification , author=. arXiv preprint arXiv:1702.02453 , year=

  40. [40]

    Rma: Rapid motor adaptation for legged robots,

    Rma: Rapid motor adaptation for legged robots , author=. arXiv preprint arXiv:2107.04034 , year=

  41. [41]

    Conference on Robot Learning , pages=

    Active domain randomization , author=. Conference on Robot Learning , pages=. 2020 , organization=

  42. [42]

    Conference on robot learning , pages=

    Sim-to-real robot learning from pixels with progressive nets , author=. Conference on robot learning , pages=. 2017 , organization=

  43. [43]

    Proceedings of the IEEE , volume=

    A game theoretic approach to controller design for hybrid systems , author=. Proceedings of the IEEE , volume=. 2000 , publisher=

  44. [44]

    2019 18th European control conference (ECC) , pages=

    Control barrier functions: Theory and applications , author=. 2019 18th European control conference (ECC) , pages=. 2019 , organization=

  45. [45]

    International workshop on hybrid systems: Computation and control , pages=

    Safety verification of hybrid systems using barrier certificates , author=. International workshop on hybrid systems: Computation and control , pages=. 2004 , organization=

  46. [46]

    2017 IEEE 56th annual conference on decision and control (CDC) , pages=

    Hamilton-jacobi reachability: A brief overview and recent advances , author=. 2017 IEEE 56th annual conference on decision and control (CDC) , pages=. 2017 , organization=

  47. [47]

    Mathematics of Operations Research , volume=

    Robust dynamic programming , author=. Mathematics of Operations Research , volume=. 2005 , publisher=

  48. [48]

    Operations Research , volume=

    Robust control of Markov decision processes with uncertain transition matrices , author=. Operations Research , volume=. 2005 , publisher=

  49. [49]

    Mathematics of Operations Research , volume=

    Robust Markov decision processes , author=. Mathematics of Operations Research , volume=. 2013 , publisher=

  50. [50]

    International conference on machine learning , pages=

    Robust adversarial reinforcement learning , author=. International conference on machine learning , pages=. 2017 , organization=

  51. [51]

    arXiv preprint arXiv:1610.01283 , year=

    Epopt: Learning robust neural network policies using model ensembles , author=. arXiv preprint arXiv:1610.01283 , year=

  52. [52]

    Advances in neural information processing systems , volume=

    Robust deep reinforcement learning against adversarial perturbations on state observations , author=. Advances in neural information processing systems , volume=

  53. [53]

    International Conference on Machine Learning , pages=

    Action robust reinforcement learning and applications in continuous control , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  54. [54]

    International Conference on Artificial Intelligence and Statistics , pages=

    Sample complexity of robust reinforcement learning with a generative model , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

  55. [55]

    Journal of Machine Learning Research , volume=

    Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity , author=. Journal of Machine Learning Research , volume=

  56. [56]

    Advances in neural information processing systems , volume=

    Generative adversarial imitation learning , author=. Advances in neural information processing systems , volume=

  57. [57]

    Learning Robust Rewards with Adversarial Inverse Reinforcement Learning , url =

    Learning robust rewards with adversarial inverse reinforcement learning , author=. arXiv preprint arXiv:1710.11248 , year=

  58. [58]

    arXiv preprint arXiv:1912.05032 , year=

    Imitation learning via off-policy distribution matching , author=. arXiv preprint arXiv:1912.05032 , year=

  59. [59]

    , author=

    Maximum entropy inverse reinforcement learning. , author=. Aaai , volume=. 2008 , organization=

  60. [60]

    international conference on machine learning , pages=

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning , author=. international conference on machine learning , pages=. 2016 , organization=

  61. [61]

    Advances in neural information processing systems , volume=

    Simple and scalable predictive uncertainty estimation using deep ensembles , author=. Advances in neural information processing systems , volume=

  62. [62]

    Uncertainty-Aware Reinforcement Learning for Collision Avoidance

    Uncertainty-aware reinforcement learning for collision avoidance , author=. arXiv preprint arXiv:1702.01182 , year=

  63. [63]

    IEEE Robotics and Automation Letters , volume=

    Safe planning in dynamic environments using conformal prediction , author=. IEEE Robotics and Automation Letters , volume=. 2023 , publisher=

  64. [64]

    Learning for Dynamics and Control Conference , pages=

    Adaptive conformal prediction for motion planning among dynamic agents , author=. Learning for Dynamics and Control Conference , pages=. 2023 , organization=

  65. [65]

    Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023) , pages=

    Conformal prediction for stl runtime verification , author=. Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023) , pages=

  66. [66]

    2024 IEEE 63rd Conference on Decision and Control (CDC) , pages=

    Single trajectory conformal prediction , author=. 2024 IEEE 63rd Conference on Decision and Control (CDC) , pages=. 2024 , organization=

  67. [67]

    Foundations and Trends in Machine Learning , volume=

    Conformal prediction: A gentle introduction , author=. Foundations and Trends in Machine Learning , volume=. 2023 , publisher=

  68. [68]

    Advances in neural information processing systems , volume=

    Conformal prediction under covariate shift , author=. Advances in neural information processing systems , volume=

  69. [69]

    Advances in Neural Information Processing Systems , volume=

    Adaptive conformal inference under distribution shift , author=. Advances in Neural Information Processing Systems , volume=

  70. [70]

    Journal of statistical planning and inference , volume=

    Improving predictive inference under covariate shift by weighting the log-likelihood function , author=. Journal of statistical planning and inference , volume=. 2000 , publisher=

  71. [71]

    , author=

    Covariate shift adaptation by importance weighted cross validation. , author=. Journal of Machine Learning Research , volume=

  72. [72]

    Machine learning , volume=

    A theory of learning from different domains , author=. Machine learning , volume=. 2010 , publisher=

  73. [73]

    arXiv preprint arXiv:0902.3430 , year=

    Domain adaptation: Learning bounds and algorithms , author=. arXiv preprint arXiv:0902.3430 , year=

  74. [74]

    1998 , publisher=

    Reinforcement learning: An introduction , author=. 1998 , publisher=

  75. [75]

    Handbooks in operations research and management science , volume=

    Markov decision processes , author=. Handbooks in operations research and management science , volume=. 1990 , publisher=

  76. [76]

    NeurIPS , year=

    Efficient Discrepancy Testing for Learning with Distribution Shift , author=. NeurIPS , year=

  77. [77]

    37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

    Testable Learning with Distribution Shift , author=. 37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

  78. [78]

    37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

    Learning Intersections of Halfspaces with Distribution Shift: Improved Algorithms and SQ Lower Bounds , author=. 37th Annual Conference on Learning Theory, COLT 2024 (to appear) , year=

  79. [79]

    International Conference on Machine Learning , pages=

    Of moments and matching: A game-theoretic framework for closing the imitation gap , author=. International Conference on Machine Learning , pages=. 2021 , organization=

  80. [80]

    Advances in Neural Information Processing Systems , volume=

    Minimax optimal online imitation learning via replay estimation , author=. Advances in Neural Information Processing Systems , volume=

Showing first 80 references.