Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity

Anuj Sharma; Ibne Farabi Shihab; Sanjeda Akter

arxiv: 2606.00414 · v1 · pith:ZYM5S3FAnew · submitted 2026-05-29 · 💻 cs.LG

Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity

Ibne Farabi Shihab , Sanjeda Akter , Anuj Sharma This is my paper

Pith reviewed 2026-06-28 22:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningpolicy auditingRashomon capacityoccupancy measuresquery complexitynear-optimal policieslower bounds

0 comments

The pith

When the audited class contains a high-capacity packing of near-optimal policies with sparse signatures, exact local-query auditing requires exponentially many queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an occupancy-measure analogue of Rashomon capacity as the metric entropy of the near-optimal occupancy region relative to an audited deployment class. It proves a conditional lower bound: if that class contains a 2/H-separated near-optimal packing with b-sparse local signatures that realizes deployment-class capacity, then exact local-query auditing needs Omega(M/b) queries. When b is constant the bound becomes exponential in the capacity. The authors attain the bound with a finite discounted hidden-branch MDP and extend the analysis to noisy sample oracles.

Core claim

If the audited deployment class contains a 2/H-separated near-optimal packing whose local signatures are b-sparse and that realizes the deployment-class capacity, then exact local-query auditing requires Omega(M/b) queries; when the packing realizes capacity and b is constant this becomes Omega(2 to the power of Hopt^F(eps)). A finite discounted hidden-branch MDP attains the bound, and a mixture lower bound of order M/beta holds for noisy hidden-trigger testing.

What carries the argument

Occupancy Rashomon capacity: the metric entropy of the near-optimal occupancy region computed relative to the audited deployment class, which counts the number of occupancy-distinct near-optimal policies.

If this is right

Exact local-query auditing requires Omega(M/b) queries under the stated packing condition.
Noisy hidden-trigger testing requires Omega(M/beta) samples where beta is the per-sample KL signal.
The audited capacity collapses under a canonical occupancy regularizer when a trusted reference occupancy is supplied.
A transcript-compatible oracle-cover verification upper bound exists alongside the lower bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Auditors may need to switch to noisy or reference-assisted methods in high-capacity environments rather than exact local queries.
Designing or selecting deployment classes with low occupancy Rashomon capacity could make auditing tractable in practice.
The hardness result may extend to continuous or visual control tasks where post-processed policies exhibit similar occupancy diversity.

Load-bearing premise

The audited deployment class contains a 2/H-separated near-optimal packing with b-sparse local signatures that realizes the deployment-class capacity.

What would settle it

An audited deployment class that contains many return-equivalent policies yet has low metric entropy in occupancy space, so that exact auditing succeeds with o(2 to the power of capacity) local queries.

Figures

Figures reproduced from arXiv: 2606.00414 by Anuj Sharma, Ibne Farabi Shihab, Sanjeda Akter.

**Figure 1.** Figure 1: Hidden-branch MDP (Theorem 8). All transitions are essentially deterministic except at s0, which routes uniformly to one of M branches. Rewards are identically one, so every policy in the deployment class FM = {π1, . . . , πM} is optimal. Policy πi plays the distinguished action only on branch i, producing 1-sparse exact local signatures: querying any branch state reveals which πi is deployed only when the… view at source ↗

read the original abstract

When many reinforcement-learning policies achieve near-optimal return, a post-hoc auditor may have to distinguish among many behaviorally distinct but return-equivalent policies. We formalize this phenomenon through an occupancy-measure analogue of Rashomon capacity: the metric entropy of the near-optimal occupancy region, computed relative to an audited deployment class. Because occupancy measures identify behavior only up to occupancy equivalence, we formulate auditing at the occupancy-class level and distinguish exact local-query oracles from noisy sample-query oracles. Our main exact-query result is conditional: if the audited class contains a $2/H$-separated near-optimal packing whose local signatures are $b$-sparse, then exact local-query auditing requires $\Omega(M/b)$ queries; when the packing realizes deployment-class capacity and $b=O(1)$, this becomes $\Omega(2^{\Hopt^\cF(\eps)})$. We give a finite discounted hidden-branch MDP attaining this bound and show the exact Bayes success law. For noisy hidden-trigger testing, we prove a mixture lower bound of order $M/\beta$, where $\beta$ is the per-sample KL signal, yielding $\Omega(2^{\Hopt^\cF(\eps)}/(\rho^2\Delta^2))$ for capacity-order packings with $\beta=O(\rho^2\Delta^2)$. We also provide a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, and a canonical occupancy regularizer whose regularized audited capacity collapses when a trusted reference occupancy is available. Controlled benchmarks distinguish positive sparse-signature instances from high-capacity negative controls where exact auditing is easy, and map the noisy-trigger law to post-processed continuous-control and visual-RL auditing regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives conditional exponential query lower bounds for auditing near-optimal RL policies using a new occupancy Rashomon capacity, backed by explicit MDP constructions and both exact and noisy oracle results.

read the letter

The main thing to know is that auditing near-optimal policies can require exponentially many queries under a specific structural condition on the policy class, formalized through occupancy Rashomon capacity.

The new element is the occupancy-measure version of Rashomon capacity, defined as the metric entropy of the near-optimal occupancy region relative to an audited deployment class. The central exact-query lower bound is conditional: if the class contains a 2/H-separated near-optimal packing with b-sparse local signatures, then exact local-query auditing needs Omega(M/b) queries, which scales to Omega(2^{Hopt^F(eps)}) when the packing realizes capacity and b is constant. They supply a finite discounted hidden-branch MDP that attains the bound, state the exact Bayes success law, and prove a mixture lower bound of order M/beta for the noisy case. The static target-recognition information lower bound, the transcript-compatible oracle-cover upper bound, and the canonical occupancy regularizer are additional pieces. The controlled benchmarks that separate sparse-signature positive instances from high-capacity negative controls where auditing is easy help clarify when the hardness applies.

The result stays conditional on the existence of that packing, which is stated plainly rather than hidden. This keeps the logic tight but means the exponential barrier does not apply to every deployment class. The paper does not overclaim universality, and the argument structure links the bound directly to the premise without visible internal gaps.

This work is for researchers in theoretical RL who study verification, query complexity, and Rashomon sets in sequential settings. A reader focused on information-theoretic limits or post-hoc auditing would get concrete value from the occupancy formulation and the matching constructions.

It deserves peer review. The formal setup, explicit constructions, and separation of cases provide enough substance for referees to evaluate.

Referee Report

0 major / 3 minor

Summary. The paper defines occupancy Rashomon capacity as the metric entropy of the near-optimal occupancy region relative to an audited deployment class. It proves a conditional exact-query lower bound: if the audited class contains a 2/H-separated near-optimal packing with b-sparse local signatures realizing deployment-class capacity, then exact local-query auditing requires Ω(M/b) queries, which becomes Ω(2^{Hopt^F(ε)}) for b=O(1). The paper supplies a finite discounted hidden-branch MDP attaining the bound, an exact Bayes success law, a mixture lower bound Ω(M/β) for noisy hidden-trigger testing (yielding Ω(2^{Hopt^F(ε)}/(ρ²Δ²)) for capacity-order packings), a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, a canonical occupancy regularizer, and controlled benchmarks separating sparse-signature positive instances from high-capacity negative controls.

Significance. If the conditional premise holds, the results establish that post-hoc auditing of near-optimal policies can be exponentially hard in the effective horizon or capacity when the deployment class admits a suitable packing. The explicit finite-MDP construction attaining the bound, the exact Bayes success law, the mixture lower bound for the noisy case, and the positive/negative benchmark controls are strengths that make the conditional claim falsifiable and technically grounded. The distinction between exact local-query and noisy sample-query oracles, together with the occupancy-class formulation, clarifies the setting in which the hardness applies.

minor comments (3)

[§2] The notation Hopt^F(ε) and the precise definition of the deployment-class capacity (relative to the audited class) should be stated in a single displayed equation early in §2 or §3 so that the conditional premise can be checked without cross-referencing the abstract.
[Benchmarks] In the benchmark section, the high-capacity negative-control instances should report the realized occupancy Rashomon capacity value alongside the query count to make the separation from the sparse-signature positive instances quantitative.
[Upper bound] The transcript-compatible oracle-cover verification upper bound is stated without an explicit dependence on the packing parameters; adding a short remark on how the upper bound scales with M and b would clarify its relation to the Ω(M/b) lower bound.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and accurate summary of the manuscript, the positive assessment of its significance, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper states conditional lower bounds on exact local-query auditing that hold only if the audited deployment class contains a 2/H-separated near-optimal packing with b-sparse local signatures realizing deployment-class capacity. It supplies an explicit finite discounted hidden-branch MDP construction attaining the bound, derives an exact Bayes success law, a mixture lower bound for the noisy case, and provides controlled benchmarks separating positive and negative instances. No load-bearing step reduces by definition, by renaming a fitted parameter as a prediction, or via a self-citation chain; all central claims rest on independent information-theoretic and query-complexity arguments plus the supplied construction. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the newly introduced occupancy Rashomon capacity together with standard MDP and oracle definitions; the exponential bound is conditional on the existence of a specific packing.

axioms (2)

domain assumption Finite discounted hidden-branch MDP
The example attaining the bound is defined on this class of MDPs.
standard math Occupancy measures identify behavior up to equivalence
Used to formulate auditing at the occupancy-class level.

invented entities (1)

Occupancy Rashomon capacity no independent evidence
purpose: Metric entropy of the near-optimal occupancy region relative to the audited deployment class
Newly defined quantity that organizes the hardness result.

pith-pipeline@v0.9.1-grok · 5852 in / 1356 out tokens · 31245 ms · 2026-06-28T22:54:21.947178+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 1 canonical work pages

[1]

Poisoning deep reinforcement learning agents with in- distribution triggers, 2021

Chace Ashcraft and Kiran Karra. Poisoning deep reinforcement learning agents with in- distribution triggers, 2021. URLhttps://arxiv.org/abs/2106.07798

arXiv 2021
[2]

Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical science, 16(3):199–231, 2001

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical science, 16(3):199–231, 2001

2001
[3]

Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J ´er´emy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Rapha ¨el Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, ...

Pith/arXiv arXiv 2023
[4]

Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070, 2018

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070, 2018

Pith/arXiv arXiv 2018
[5]

Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models si- multaneously.Journal of Machine Learning Research, 20(177):1–81, 2019

2019
[6]

D4RL: Datasets for deep data-driven reinforcement learning, 2020

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2020. URLhttps://arxiv.org/abs/2004. 07219

2020
[7]

Safe reinforcement learning via formal methods: Toward safe control through proof and learning

Nathan Fulton and Andr ´e Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

2018
[8]

Cautious reinforce- ment learning with logical constraints

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Cautious reinforce- ment learning with logical constraints. InProceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 483–491, 2020

2020
[9]

Rashomon capacity: A metric for predictive multiplicity in classification.Advances in Neural Information Processing Systems, 35:28988–29000, 2022

Hsiang Hsu and Flavio Calmon. Rashomon capacity: A metric for predictive multiplicity in classification.Advances in Neural Information Processing Systems, 35:28988–29000, 2022

2022
[10]

Trojdrl: evaluation of back- door attacks on deep reinforcement learning

Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. Trojdrl: evaluation of back- door attacks on deep reinforcement learning. In2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020. 9

2020
[11]

One solution is not all you need: Few-shot extrapolation via structured maxent rl.Advances in Neural Information Processing Systems, 33:8198–8210, 2020

Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured maxent rl.Advances in Neural Information Processing Systems, 33:8198–8210, 2020

2020
[12]

Convergence of estimates under dimensionality restrictions.The Annals of Statistics, pages 38–53, 1973

Lucien LeCam. Convergence of estimates under dimensionality restrictions.The Annals of Statistics, pages 38–53, 1973

1973
[13]

Predictive multiplicity in classification

Charles Marx, Flavio Calmon, and Berk Ustun. Predictive multiplicity in classification. In International conference on machine learning, pages 6765–6774. PMLR, 2020

2020
[14]

An empirical evaluation of the rashomon effect in explainable machine learning

Sebastian M ¨uller, Vanessa Toborek, Katharina Beckh, Matthias Jakobs, Christian Bauckhage, and Pascal Welke. An empirical evaluation of the rashomon effect in explainable machine learning. InJoint european conference on machine learning and knowledge discovery in databases, pages 462–478. Springer, 2023

2023
[15]

The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models, 2022. URLhttps://arxiv.org/abs/2201. 03544

2022
[16]

Stable-Baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021. URLhttps://jmlr.org/papers/v22/ 20-1364.html

2021
[17]

Policy teach- ing via environment poisoning: Training-time adversarial attacks against reinforcement learn- ing

Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teach- ing via environment poisoning: Training-time adversarial attacks against reinforcement learn- ing. InInternational Conference on Machine Learning, pages 7974–7984. PMLR, 2020

2020
[18]

Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys, 16:1–85, 2022

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys, 16:1–85, 2022

2022
[19]

On the existence of simpler machine learn- ing models

Lesia Semenova, Cynthia Rudin, and Ronald Parr. On the existence of simpler machine learn- ing models. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022

2022
[20]

Defining and char- acterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and char- acterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

2022
[21]

Tsybakov.Introduction to Nonparametric Estimation

Alexandre B. Tsybakov.Introduction to Nonparametric Estimation. Springer Series in Statis- tics. Springer, New York, NY , 2009. doi: 10.1007/b13794

work page doi:10.1007/b13794 2009
[22]

Backdoorl: Backdoor attack against competitive reinforcement learning.arXiv preprint arXiv:2105.00579, 2021

Lun Wang, Zaynah Javed, Xian Wu, Wenbo Guo, Xinyu Xing, and Dawn Song. Backdoorl: Backdoor attack against competitive reinforcement learning.arXiv preprint arXiv:2105.00579, 2021

arXiv 2021
[23]

Should tables be sorted?Journal of the ACM (JACM), 28(3):615–628, 1981

Andrew Chi-Chih Yao. Should tables be sorted?Journal of the ACM (JACM), 28(3):615–628, 1981

1981
[24]

Probabilistic computations: Toward a unified measure of complexity

Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE, 1977

1977
[25]

MinAtar: An Atari-inspired testbed for thorough and repro- ducible reinforcement learning experiments, 2019

Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and repro- ducible reinforcement learning experiments, 2019. URLhttps://arxiv.org/abs/1903. 03176

2019
[26]

Right 5 then Down5

Bin Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997. 10 A Additional Proof Details A.1 Yao reduction for Theorem 6 Theorem 6 is stated for deterministic auditors under a uniform prior. If a randomized auditor suc- ceeded with fewer queries against every index, the...

1997

[1] [1]

Poisoning deep reinforcement learning agents with in- distribution triggers, 2021

Chace Ashcraft and Kiran Karra. Poisoning deep reinforcement learning agents with in- distribution triggers, 2021. URLhttps://arxiv.org/abs/2106.07798

arXiv 2021

[2] [2]

Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical science, 16(3):199–231, 2001

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author).Statistical science, 16(3):199–231, 2001

2001

[3] [3]

Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J ´er´emy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Rapha ¨el Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, ...

Pith/arXiv arXiv 2023

[4] [4]

Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070, 2018

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070, 2018

Pith/arXiv arXiv 2018

[5] [5]

Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models si- multaneously.Journal of Machine Learning Research, 20(177):1–81, 2019

2019

[6] [6]

D4RL: Datasets for deep data-driven reinforcement learning, 2020

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2020. URLhttps://arxiv.org/abs/2004. 07219

2020

[7] [7]

Safe reinforcement learning via formal methods: Toward safe control through proof and learning

Nathan Fulton and Andr ´e Platzer. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

2018

[8] [8]

Cautious reinforce- ment learning with logical constraints

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Cautious reinforce- ment learning with logical constraints. InProceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 483–491, 2020

2020

[9] [9]

Rashomon capacity: A metric for predictive multiplicity in classification.Advances in Neural Information Processing Systems, 35:28988–29000, 2022

Hsiang Hsu and Flavio Calmon. Rashomon capacity: A metric for predictive multiplicity in classification.Advances in Neural Information Processing Systems, 35:28988–29000, 2022

2022

[10] [10]

Trojdrl: evaluation of back- door attacks on deep reinforcement learning

Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. Trojdrl: evaluation of back- door attacks on deep reinforcement learning. In2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2020. 9

2020

[11] [11]

One solution is not all you need: Few-shot extrapolation via structured maxent rl.Advances in Neural Information Processing Systems, 33:8198–8210, 2020

Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured maxent rl.Advances in Neural Information Processing Systems, 33:8198–8210, 2020

2020

[12] [12]

Convergence of estimates under dimensionality restrictions.The Annals of Statistics, pages 38–53, 1973

Lucien LeCam. Convergence of estimates under dimensionality restrictions.The Annals of Statistics, pages 38–53, 1973

1973

[13] [13]

Predictive multiplicity in classification

Charles Marx, Flavio Calmon, and Berk Ustun. Predictive multiplicity in classification. In International conference on machine learning, pages 6765–6774. PMLR, 2020

2020

[14] [14]

An empirical evaluation of the rashomon effect in explainable machine learning

Sebastian M ¨uller, Vanessa Toborek, Katharina Beckh, Matthias Jakobs, Christian Bauckhage, and Pascal Welke. An empirical evaluation of the rashomon effect in explainable machine learning. InJoint european conference on machine learning and knowledge discovery in databases, pages 462–478. Springer, 2023

2023

[15] [15]

The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models, 2022. URLhttps://arxiv.org/abs/2201. 03544

2022

[16] [16]

Stable-Baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations.Journal of Machine Learning Research, 22(268):1–8, 2021. URLhttps://jmlr.org/papers/v22/ 20-1364.html

2021

[17] [17]

Policy teach- ing via environment poisoning: Training-time adversarial attacks against reinforcement learn- ing

Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy teach- ing via environment poisoning: Training-time adversarial attacks against reinforcement learn- ing. InInternational Conference on Machine Learning, pages 7974–7984. PMLR, 2020

2020

[18] [18]

Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys, 16:1–85, 2022

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges.Statistic Surveys, 16:1–85, 2022

2022

[19] [19]

On the existence of simpler machine learn- ing models

Lesia Semenova, Cynthia Rudin, and Ronald Parr. On the existence of simpler machine learn- ing models. InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022

2022

[20] [20]

Defining and char- acterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and char- acterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460– 9471, 2022

2022

[21] [21]

Tsybakov.Introduction to Nonparametric Estimation

Alexandre B. Tsybakov.Introduction to Nonparametric Estimation. Springer Series in Statis- tics. Springer, New York, NY , 2009. doi: 10.1007/b13794

work page doi:10.1007/b13794 2009

[22] [22]

Backdoorl: Backdoor attack against competitive reinforcement learning.arXiv preprint arXiv:2105.00579, 2021

Lun Wang, Zaynah Javed, Xian Wu, Wenbo Guo, Xinyu Xing, and Dawn Song. Backdoorl: Backdoor attack against competitive reinforcement learning.arXiv preprint arXiv:2105.00579, 2021

arXiv 2021

[23] [23]

Should tables be sorted?Journal of the ACM (JACM), 28(3):615–628, 1981

Andrew Chi-Chih Yao. Should tables be sorted?Journal of the ACM (JACM), 28(3):615–628, 1981

1981

[24] [24]

Probabilistic computations: Toward a unified measure of complexity

Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pages 222–227. IEEE, 1977

1977

[25] [25]

MinAtar: An Atari-inspired testbed for thorough and repro- ducible reinforcement learning experiments, 2019

Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and repro- ducible reinforcement learning experiments, 2019. URLhttps://arxiv.org/abs/1903. 03176

2019

[26] [26]

Right 5 then Down5

Bin Yu. Assouad, fano, and le cam. InFestschrift for Lucien Le Cam: research papers in probability and statistics, pages 423–435. Springer, 1997. 10 A Additional Proof Details A.1 Yao reduction for Theorem 6 Theorem 6 is stated for deterministic auditors under a uniform prior. If a randomized auditor suc- ceeded with fewer queries against every index, the...

1997