Recognition: 2 theorem links
Angel or Demon: Investigating the Plasticity Interventions' Impact on Backdoor Threats in Deep Reinforcement Learning
Pith reviewed 2026-05-15 01:33 UTC · model grok-4.3
The pith
Most plasticity interventions reduce backdoor threats in deep reinforcement learning, but SAM makes them worse by amplifying backdoor gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Only the SAM intervention exacerbates backdoor threats in DRL while the others mitigate them; the worsening occurs through backdoor gradient amplification, and the protection arises from activation pathway disruption together with representation space compression.
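The claimed mechanism, SAM's ascent step amplifying gradients along sharp directions, can be illustrated with a minimal sketch of the standard two-step SAM update rule on a toy quadratic loss. The loss, the curvature matrix, and the perturbation radius `rho` below are illustrative assumptions, not the paper's setup; the backdoor loss component itself is not modeled.

```python
import numpy as np

def sam_gradient(w, grad_fn, rho=0.05):
    """Sharpness-Aware Minimization step: evaluate the gradient at an
    adversarially perturbed point w + rho * g / ||g||, rather than at w."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return grad_fn(w + eps)

# Toy quadratic loss L(w) = 0.5 * w^T A w with one sharp and one flat
# curvature direction (A is an illustrative assumption).
A = np.diag([10.0, 0.1])
grad_fn = lambda w: A @ w

w = np.array([1.0, 1.0])
g_plain = grad_fn(w)
g_sam = sam_gradient(w, grad_fn)

# For this convex toy loss, g_sam = g + rho * A g / ||g||, so the SAM
# gradient norm strictly exceeds the plain one whenever g^T A g > 0.
print(np.linalg.norm(g_plain), np.linalg.norm(g_sam))
```

In this toy setting the perturbed gradient is always at least as large as the plain one along directions of positive curvature, which is the kind of amplification the review attributes to the backdoor loss component.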
What carries the argument
The SCC conceptual framework that deconstructs the mechanistic interplay between plasticity interventions and backdoor attacks in DRL.
If this is right
- SAM should be used with extra caution in any DRL system that might face backdoor risks.
- Other plasticity methods can be chosen specifically to lower backdoor vulnerability without sacrificing their plasticity benefits.
- Loss-landscape sharpness measurements can be added to routine monitoring as an early warning for possible backdoor presence.
- The SCC framework supplies a structured way to test new combinations of interventions for backdoor resistance.
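The sharpness-monitoring idea in the list above can be sketched as a cheap random-perturbation proxy: probe a few unit directions at a fixed radius and record the worst-case loss increase. Everything here (the radius, the probe count, the toy losses) is an illustrative assumption, not the paper's proposed detector.

```python
import numpy as np

def sharpness_proxy(loss_fn, w, radius=0.05, n_probes=16, seed=0):
    """Worst-case loss increase over random perturbations of fixed norm
    `radius` -- a crude stand-in for loss-landscape sharpness, the
    quantity the paper flags as a backdoor indicator."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_probes):
        d = rng.standard_normal(w.shape)
        d *= radius / np.linalg.norm(d)
        worst = max(worst, loss_fn(w + d) - base)
    return worst

# Two toy quadratic losses around a minimum: one flat, one sharp.
flat = lambda w: 0.5 * w @ (0.1 * np.eye(4)) @ w
sharp = lambda w: 0.5 * w @ (50.0 * np.eye(4)) @ w

w = np.zeros(4)
print(sharpness_proxy(flat, w), sharpness_proxy(sharp, w))
```

A monitor built this way would alarm on an abnormally large proxy value relative to a clean-training baseline; calibrating that threshold is exactly the open question the review raises.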
Where Pith is reading between the lines
- Security evaluations of DRL agents should now include the specific plasticity method as a variable rather than treating all agents as vanilla.
- Designers could deliberately stack multiple mitigating interventions to create layered defenses that are harder for attackers to overcome.
- The gradient-amplification effect of SAM might be turned into a diagnostic tool for identifying which parts of a network are most attack-prone.
Load-bearing premise
The 14,664 tested cases capture enough of the variety in real DRL environments and attack methods for the observed patterns to hold more generally.
What would settle it
A new DRL environment and model combination in which SAM no longer increases attack success rate or in which the other interventions fail to reduce it.
Original abstract
Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study examining how plasticity interventions affect backdoor attacks in deep reinforcement learning (DRL). Through 14,664 tested cases across representative interventions and attack scenarios, it finds that only Sharpness-Aware Minimization (SAM) exacerbates backdoor threats by amplifying backdoor gradients, whereas other interventions mitigate threats by disrupting activation pathways and compressing representation spaces. The authors derive a conceptual framework called SCC for robust backdoor injection and identify abnormal loss landscape sharpness as a potential indicator for backdoor detection.
Significance. If the reported patterns are robust, this study is significant because it addresses a timely gap in DRL security by systematically evaluating built-in plasticity interventions that are now standard in modern agents. The large-scale empirical sweep provides concrete evidence of differential impacts, and the mechanistic pathological analysis offers insights beyond black-box observations. The proposed SCC framework and sharpness indicator represent constructive contributions that could guide both attack design and defense strategies in practical DRL deployments.
major comments (2)
- [Empirical Evaluation] The central claim that only SAM exacerbates backdoor threats (while others mitigate) rests on the 14,664-case survey, yet the manuscript provides no information on the number of random seeds per configuration, variance across runs, or statistical significance testing of the observed exacerbation/mitigation patterns. This is load-bearing for the generalization asserted in the abstract and results sections.
- [Pathological Analysis] In the pathological analysis, the attribution of SAM exacerbation to backdoor gradient amplification and mitigation to activation pathway disruption/representation compression is presented as observational. Without ablations or controls that isolate these mechanisms from confounding factors such as hyperparameter sensitivity or environment-specific dynamics, the causal explanations remain under-supported.
minor comments (2)
- [Abstract] The abstract refers to 'representative interventions and attack scenarios' without enumerating them; a short list or reference to the specific environments and models used would improve immediate readability.
- [Conceptual Framework] The SCC framework is introduced as a novel conceptual contribution, but its components and deconstruction of intervention-backdoor interplay would benefit from a dedicated formal subsection or diagram rather than being described only narratively.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help strengthen the rigor of our empirical study. We address each major comment point-by-point below and will revise the manuscript to incorporate additional details and clarifications where feasible.
Point-by-point responses
Referee: [Empirical Evaluation] The central claim that only SAM exacerbates backdoor threats (while others mitigate) rests on the 14,664-case survey, yet the manuscript provides no information on the number of random seeds per configuration, variance across runs, or statistical significance testing of the observed exacerbation/mitigation patterns. This is load-bearing for the generalization asserted in the abstract and results sections.
Authors: We agree that explicit reporting of random seeds, variance, and statistical testing is essential to support the claimed patterns. Our experiments were run with 5 independent random seeds per configuration to mitigate stochasticity in DRL training, and we observed consistent directional effects (SAM exacerbation, others mitigation) across seeds. We will revise the manuscript to state the seed count explicitly in the experimental setup, include standard deviation bars or tables for key metrics, and add statistical significance tests (paired t-tests with p-values) comparing backdoor success rates with and without each intervention. These additions will appear in Section 4 and the appendix.
Revision: yes
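The seed-level comparison the authors promise is a standard paired t-test; the sketch below implements the statistic directly (equivalent to `scipy.stats.ttest_rel` up to the p-value) on synthetic, made-up attack-success-rate numbers, since the paper's per-seed data is not reproduced here.

```python
import math

def paired_t(a, b):
    """Paired t statistic for matched samples a, b -- e.g. backdoor
    attack success rate with vs. without an intervention, one value
    per random seed."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Synthetic per-seed attack success rates (illustrative numbers only).
asr_sam      = [0.92, 0.95, 0.90, 0.93, 0.94]  # with SAM
asr_baseline = [0.80, 0.78, 0.82, 0.79, 0.81]  # vanilla training

t = paired_t(asr_sam, asr_baseline)
# Two-sided critical value for df = 4 at alpha = 0.05 is 2.776, so
# |t| above that rejects "no difference across seeds".
print(t)
```

With 5 seeds the test has little power, so reporting effect sizes and per-seed tables alongside p-values, as the authors propose, matters as much as the test itself.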
Referee: [Pathological Analysis] In the pathological analysis, the attribution of SAM exacerbation to backdoor gradient amplification and mitigation to activation pathway disruption/representation compression is presented as observational. Without ablations or controls that isolate these mechanisms from confounding factors such as hyperparameter sensitivity or environment-specific dynamics, the causal explanations remain under-supported.
Authors: We acknowledge that the mechanistic attributions are derived from observational patterns (gradient norms, activation maps, and representation distances) measured across the 14,664 cases rather than from controlled ablations that hold all other factors fixed. The consistency of these patterns across multiple environments, attack types, and intervention strengths provides supporting evidence, but we agree it falls short of causal isolation. In revision we will (1) explicitly label the analysis as correlational/observational, (2) discuss potential confounders such as hyperparameter sensitivity, and (3) add a short paragraph on the limitations of the current evidence. Full isolating ablations would require a new experimental campaign that exceeds the scope of the present work; we therefore treat this as a partial revision.
Revision: partial
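The observational measurement the authors describe, per-layer gradient norms under triggered versus clean inputs, can be sketched on a toy two-layer network. The architecture, data, and additive trigger below are all illustrative assumptions, and a ratio far from 1 is only a correlational signal, as the rebuttal concedes.

```python
import numpy as np

def layer_grads(x, t, W1, W2):
    """Manual backprop through y = tanh(x @ W1) @ W2 under squared
    error, returning per-layer weight gradients."""
    h = np.tanh(x @ W1)
    y = h @ W2
    e = y - t
    dW2 = h.T @ e
    dW1 = x.T @ ((e @ W2.T) * (1.0 - h ** 2))
    return dW1, dW2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 4))
W2 = rng.standard_normal((4, 1))
x_clean = rng.standard_normal((32, 2))
t = rng.standard_normal((32, 1))

# A fixed additive patch stands in for a backdoor trigger.
trigger = np.array([2.0, 0.0])
x_trig = x_clean + trigger

# Per-layer triggered / clean gradient-norm ratios.
ratios = [np.linalg.norm(gt) / np.linalg.norm(gc)
          for gc, gt in zip(layer_grads(x_clean, t, W1, W2),
                            layer_grads(x_trig, t, W1, W2))]
print(ratios)
```

In the paper's setting such ratios would be tracked across seeds and environments; turning them into a causal claim would still require the isolating ablations the referee asks for.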
Circularity Check
No significant circularity in empirical claims
full rationale
The paper reports results from an empirical sweep of 14,664 intervention-attack combinations in DRL settings. All central findings (SAM exacerbates via gradient amplification; others mitigate via pathway disruption or representation compression) are presented as direct observations from experiments and post-hoc pathological analysis. No equations, fitted parameters renamed as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the derivation chain. The SCC framework and loss-landscape indicator are introduced as conceptual summaries of the observed patterns rather than independent derivations that reduce to the inputs by construction. The study is therefore self-contained against external benchmarks with no circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Plasticity interventions are now standard built-in components of modern DRL agents.
invented entities (2)
- SCC conceptual framework (no independent evidence)
- Abnormal loss landscape sharpness (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: `washburn_uniqueness_aczel` (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: `reality_from_one_distinction` (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.