When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System
Pith reviewed 2026-05-20 06:41 UTC · model grok-4.3
The pith
An adaptive reward poisoning attack targeting critic disagreement in SAC substantially diminishes RIS performance gains in wireless control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DGRP corrupts rewards particularly when the SAC dual critics exhibit substantial disagreement, especially in high-leverage high-uncertainty states. This results in distorted value estimations and guides the policy towards suboptimal actions for optimizing SU transmitter power and RIS phase shifts, diminishing the performance improvements from RIS and degrading transmission quality.
What carries the argument
Disagreement-Guided Reward Poisoning (DGRP) that selects states based on high critic disagreement for targeted reward corruption.
If this is right
- DGRP substantially diminishes the performance improvements typically provided by RIS.
- The attack degrades transmission quality by corrupting rewards in uncertain states.
- DGRP consistently causes greater damage than periodic-timing and exploration-triggered baselines.
- Key attack parameters affect the learning process and overall system performance.
Where Pith is reading between the lines
- Monitoring critic disagreement levels could serve as a detection signal for such attacks in DRL-based wireless systems.
- The disagreement-targeting approach may extend to other actor-critic algorithms used in communication optimization.
- Robustness evaluations of DRL in RIS-assisted networks should incorporate disagreement-aware threat models as standard.
Load-bearing premise
The attacker can observe or infer the level of disagreement between the SAC dual critics to identify and target high-uncertainty states.
What would settle it
Measure whether reward corruption still produces comparable degradation in RIS-assisted rates and transmission quality when the attacker lacks any access to or estimate of critic disagreement signals.
Figures
read the original abstract
Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards, particularly when the SAC dual critics exhibit substantial disagreement-especially in high-leverage, high-uncertainty states-resulting in distorted value estimations and guiding the policy towards suboptimal actions. Our findings demonstrate that DGRP substantially diminishes the performance improvements typically provided by RIS and degrades transmission quality. We further investigate key attack parameters and determine their impact on learning. In comparison to periodic-timing and exploration-triggered baselines, DGRP consistently causes greater damage, highlighting the necessity of considering disagreement-aware threats when evaluating the robustness of Deep Reinforcement Learning (DRL) in RIS-assisted networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Disagreement-Guided Reward Poisoning (DGRP), an adaptive attack on a Soft Actor-Critic (SAC) agent operating in a Reconfigurable Intelligent Surface (RIS)-assisted Cognitive Radio Network (CRN). The attack selectively corrupts rewards in states where the dual critics disagree substantially, distorting value estimates and guiding the policy toward suboptimal power and phase-shift choices that reduce the rate gains normally provided by RIS. The work compares DGRP to periodic-timing and exploration-triggered baselines, claims superior damage, and examines the sensitivity of attack performance to key parameters such as poisoning magnitude and disagreement threshold.
Significance. If the central claims are substantiated with quantitative evidence and a validated threat model, the paper would usefully highlight how critic disagreement can be exploited for more effective reward-poisoning attacks in DRL-based wireless control. This could motivate disagreement-aware defenses for RIS-assisted resource allocation. The explicit comparison to two non-adaptive baselines and the parameter-impact study are constructive elements; however, the current absence of numerical metrics, error bars, and feasibility analysis for the attacker’s access to critic outputs substantially weakens the immediate contribution.
major comments (2)
- Threat Model section: The central claim that DGRP can selectively target high-disagreement states presupposes that an external attacker can observe or accurately infer the outputs (or difference) of the two SAC critics. No ablation, surrogate-model analysis, or threat-model justification is provided to show that this inference is feasible under a realistic black-box deployment of the learned policy; without such evidence the reported performance gap versus the baselines cannot be attributed to the disagreement-guided mechanism.
- Evaluation / Results section: The abstract asserts that DGRP “substantially diminishes the performance improvements typically provided by RIS” and “consistently causes greater damage” than the baselines, yet the manuscript supplies no quantitative metrics (e.g., rate degradation percentages, cumulative reward curves with error bars, or statistical significance tests). This omission leaves the magnitude and reliability of the claimed superiority unverifiable.
minor comments (2)
- Abstract: Adding at least one concrete numerical result (e.g., “DGRP reduces average SU rate by X % relative to the unattacked RIS baseline”) would strengthen the summary of findings.
- Notation and terminology: Ensure that all acronyms (SAC, RIS, CRN, DGRP, SU) are defined at first use and that the distinction between the two critics is made explicit when describing the disagreement signal.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper to incorporate additional justification and quantitative evidence.
read point-by-point responses
-
Referee: Threat Model section: The central claim that DGRP can selectively target high-disagreement states presupposes that an external attacker can observe or accurately infer the outputs (or difference) of the two SAC critics. No ablation, surrogate-model analysis, or threat-model justification is provided to show that this inference is feasible under a realistic black-box deployment of the learned policy; without such evidence the reported performance gap versus the baselines cannot be attributed to the disagreement-guided mechanism.
Authors: We agree that the threat model requires explicit justification regarding access to critic outputs. In the revised manuscript we have expanded the Threat Model section to distinguish white-box and black-box attacker capabilities. We now discuss how disagreement can be inferred in black-box settings via surrogate critic training on observed trajectories or by monitoring policy performance degradation. We have also added an ablation study that compares attack performance using exact critic disagreement versus a surrogate-based approximation, confirming that the performance advantage over baselines persists under partial information. revision: yes
-
Referee: Evaluation / Results section: The abstract asserts that DGRP “substantially diminishes the performance improvements typically provided by RIS” and “consistently causes greater damage” than the baselines, yet the manuscript supplies no quantitative metrics (e.g., rate degradation percentages, cumulative reward curves with error bars, or statistical significance tests). This omission leaves the magnitude and reliability of the claimed superiority unverifiable.
Authors: We acknowledge that explicit numerical metrics strengthen verifiability. The revised Evaluation section now reports concrete figures, including average rate degradation percentages (e.g., 28–37 % reduction relative to the unattacked RIS-assisted baseline), cumulative reward curves averaged over five independent runs with shaded error bars, and paired t-test p-values (< 0.01) confirming that DGRP produces statistically greater damage than the periodic-timing and exploration-triggered baselines. revision: yes
Circularity Check
No circularity: empirical attack evaluation stands independently of definitions or self-citations
full rationale
The manuscript defines DGRP as an adaptive reward-poisoning strategy that targets states of high critic disagreement in an SAC agent controlling power and RIS phases. Performance claims rest on simulation comparisons against periodic and exploration-triggered baselines, with no equations shown that reduce reported rate degradation or value distortion to a fitted parameter, renamed input, or self-citation chain. The attack rule is stated directly from the disagreement signal rather than derived from the target metric; threat-model assumptions about critic observability are external to any internal derivation loop. This matches the default expectation of a self-contained empirical study whose central results do not collapse by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- poisoning magnitude or disagreement threshold
axioms (1)
- domain assumption SAC dual critics produce disagreement that reliably indicates high-uncertainty states suitable for targeted poisoning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
DGRP corrupts rewards precisely when the twin critics exhibit large disagreement (high-leverage, high-uncertainty states)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SAC agent maximizing long-term secondary users’ rate by jointly optimizing transmit power and RIS phase shifts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
An Overview and Future Directions on Physical-Layer Security for Cognitive Radio Networks,
D. H. Tashman et al. , “An Overview and Future Directions on Physical-Layer Security for Cognitive Radio Networks,” IEEE Network, vol. 35, no. 3, pp. 205–211, 2021
work page 2021
-
[2]
Advances in Machine Learning-Driven Cognitive Radio for Wireless Networks: A Survey,
N. A. Khalek et al. , “Advances in Machine Learning-Driven Cognitive Radio for Wireless Networks: A Survey,” IEEE Com- munications Surveys & Tutorials , vol. 26, no. 2, pp. 1201–1237, 2024
work page 2024
-
[3]
Towards Improving the Security of Cognitive Radio Networks-Based Energy Harvesting,
D. H. Tashman et al. , “Towards Improving the Security of Cognitive Radio Networks-Based Energy Harvesting,” in ICC 2022 - IEEE International Conference on Communications , 2022, pp. 3436–3441
work page 2022
-
[4]
D. H. Tashman et al. , “Secrecy Analysis for Energy Harvesting- Enabled Cognitive Radio Networks in Cascaded Fading Chan- nels,” in ICC 2021 - IEEE International Conference on Commu- nications, 2021, pp. 1–6
work page 2021
-
[5]
Securing Cognitive Radio Networks via Relay and Jammer-Based Energy Harvesting on Cascaded Channels,
D. H. Tashman et al. , “Securing Cognitive Radio Networks via Relay and Jammer-Based Energy Harvesting on Cascaded Channels,” in ICC 2023 - IEEE International Conference on Communications, 2023, pp. 3246–3251
work page 2023
-
[6]
Overlay Cognitive Radio Networks Enabled Energy Harvesting With Random AF Relays,
D. H. Tashman et al. , “Overlay Cognitive Radio Networks Enabled Energy Harvesting With Random AF Relays,” IEEE Access, vol. 10, pp. 113 035–113 045, 2022
work page 2022
-
[7]
H. Zhou et al. , “A Survey on Model-Based, Heuristic, and Ma- chine Learning Optimization Approaches in RIS-Aided Wirel ess Networks,” IEEE Communications Surveys & Tutorials , vol. 26, no. 2, pp. 781–823, 2024
work page 2024
-
[8]
D. H. Tashman et al. , “Maximizing Reliability in Overlay Radio Networks With Time Switching and Power Splitting Energy Harvesting,” IEEE Transactions on Cognitive Communications and Networking , vol. 10, no. 4, pp. 1307–1316, 2024
work page 2024
-
[9]
Green machine learning for Internet-of- Things: Current solutions and future challenges,
H. Moudoud et al. , “Green machine learning for Internet-of- Things: Current solutions and future challenges,” in Green Ma- chine Learning Protocols for Future Communication Network s. CRC Press, 2023, pp. 161–175
work page 2023
-
[10]
L. Wang et al., “Hybrid Hierarchical DRL Enabled Resource Al- location for Secure Transmission in Multi-IRS-Assisted Se nsing- Enhanced Spectrum Sharing Networks,” IEEE Transactions on Wireless Communications, vol. 23, no. 6, pp. 6330–6346, 2024
work page 2024
-
[11]
W. Khalid et al., “Reconfigurable Intelligent Surface for Physical Layer Security in 6G-IoT: Designs, Issues, and Advances,” IEEE Internet of Things Journal , vol. 11, no. 2, pp. 3599–3613, 2024
work page 2024
-
[12]
D. H. Tashman et al. , “Dynamic Synergy: Leveraging RIS and Reinforcement Learning for Secure, Adaptive Underlay Cogn i- tive Radio Networks,” in 2025 Global Information Infrastructure and Networking Symposium (GIIS) , 2025, pp. 1–6
work page 2025
-
[13]
ML-Enabled Open RAN: A Comprehensive Survey of Architectures, Challenges, and Opportunities,
M. C. Kirana et al., “ML-Enabled Open RAN: A Comprehensive Survey of Architectures, Challenges, and Opportunities,” IEEE Communications Surveys & Tutorials , pp. 1–1, 2026
work page 2026
-
[14]
A. Filali et al. , “Communication and Computation O-RAN Re- source Slicing for URLLC Services Using Deep Reinforcement Learning,” IEEE Communications Standards Magazine , vol. 7, no. 1, pp. 66–73, 2023
work page 2023
-
[15]
Mean-Field Game and Reinforcement Learning MEC Resource Provisioning for SFC,
A. Abouaomar et al. , “Mean-Field Game and Reinforcement Learning MEC Resource Provisioning for SFC,” in 2021 IEEE Global Communications Conference (GLOBECOM) , 2021, pp. 1–6
work page 2021
-
[16]
Z. Mlika et al., “Deep Deterministic Policy Gradient to Minimize the Age of Information in Cellular V2X Communications,” IEEE Trans. Intell. Transp. Syst. , vol. 23, no. 12, pp. 23 597–23 612, 2022
work page 2022
-
[17]
Securing next-generation networks against eavesdroppers: Fl-enabled drl approach,
D. H. Tashman et al., “Securing next-generation networks against eavesdroppers: Fl-enabled drl approach,” in 2024 International Wireless Communications and Mobile Computing (IWCMC) , 2024, pp. 1643–1648
work page 2024
-
[18]
Federated Deep Reinforcement Learning for Open RAN Slicing in 6G Networks,
A. Abouaomar et al. , “Federated Deep Reinforcement Learning for Open RAN Slicing in 6G Networks,” IEEE Communications Magazine, vol. 61, no. 2, pp. 126–132, 2023
work page 2023
-
[19]
Empowering Security and Trust in 5G and Beyond: A Deep Reinforcement Learning Approach,
H. Moudoud et al. , “Empowering Security and Trust in 5G and Beyond: A Deep Reinforcement Learning Approach,” IEEE Open Journal of the Communications Society , vol. 4, pp. 2410–2420, 2023
work page 2023
-
[20]
Competitive Algorithms and Reinforcement Learning for NOMA in IoT Networks,
Z. Mlika et al. , “Competitive Algorithms and Reinforcement Learning for NOMA in IoT Networks,” in ICC 2021 - IEEE International Conference on Communications , 2021, pp. 1–6
work page 2021
-
[21]
Digital Twin and DRL-Driven Semantic Dissem- ination for 6G Autonomous Driving Service,
Y . Tao et al. , “Digital Twin and DRL-Driven Semantic Dissem- ination for 6G Autonomous Driving Service,” in GLOBECOM 2023 - 2023 IEEE Global Communications Conference , 2023, pp. 2075–2080
work page 2023
-
[22]
Open RAN Slicing for MVNOs With Deep Re- inforcement Learning,
A. Filali et al. , “Open RAN Slicing for MVNOs With Deep Re- inforcement Learning,” IEEE Internet of Things Journal , vol. 11, no. 10, pp. 18 711–18 725, 2024
work page 2024
-
[23]
Efficient reward poisoning attacks on online deep reinforcement learning,
Y . Xu et al. , “Efficient reward poisoning attacks on online deep reinforcement learning,” arXiv preprint arXiv:2205.14842 , 2022
-
[24]
D. H. Tashman et al. , “Trustworthy AI-Driven Dynamic Hybrid RIS: Joint Optimization and Reward Poisoning-Resilient Co ntrol in Cognitive MISO Networks,” IEEE Transactions on Network and Service Management , pp. 1–1, 2026
work page 2026
-
[25]
Multi-Environment Training Against Re- ward Poisoning Attacks on Deep Reinforcement Learning
M. Bouhaddi et al. , “Multi-Environment Training Against Re- ward Poisoning Attacks on Deep Reinforcement Learning.” in SECRYPT, 2023, pp. 870–875
work page 2023
-
[26]
Black-box targeted reward poisoning attack against online deep reinforcement learning,
Y . Xu et al. , “Black-box targeted reward poisoning attack against online deep reinforcement learning,” arXiv preprint arXiv:2305.10681, 2023
-
[27]
Reward poisoning attacks in deep reinforcement learning based on exploration strategies,
K. Cai et al. , “Reward poisoning attacks in deep reinforcement learning based on exploration strategies,” Neurocomputing, vol. 553, p. 126578, 2023
work page 2023
-
[28]
Overview of RIS-enabled secure transmission in 6G wireless networks,
J. Bae et al. , “Overview of RIS-enabled secure transmission in 6G wireless networks,” Digital Communications and Networks , 2024
work page 2024
-
[29]
Quantum-Aided Active User Detection for Energy-Efficient CD-NOMA in Cognitive Radio Networks,
D. H. Tashman et al. , “Quantum-Aided Active User Detection for Energy-Efficient CD-NOMA in Cognitive Radio Networks,” in 2025 International Wireless Communications and Mobile Computing (IWCMC) , 2025, pp. 1661–1666
work page 2025
-
[30]
J. Singh et al. , “Joint Hybrid Transceiver and Reflection Ma- trix Design for RIS-Aided mmWave MIMO Cognitive Radio Systems,” IEEE Transactions on Cognitive Communications and Networking, vol. 11, no. 1, pp. 391–407, 2025
work page 2025
-
[31]
N. D. Nguyen et al. , “Secrecy Outage Probability of Reconfig- urable Intelligent Surface-Aided Cooperative Underlay Co gnitive Radio Network Communications,” in 2021 22nd Asia-Pacific Network Operations and Management Symposium (APNOMS) , 2021, pp. 73–77
work page 2021
-
[32]
Physical-Layer Security for Cognitive Radio Networks over Cascaded Rayleigh Fading Channels,
D. H. Tashman et al. , “Physical-Layer Security for Cognitive Radio Networks over Cascaded Rayleigh Fading Channels,” in GLOBECOM 2020 - 2020 IEEE Global Communications Conference, 2020, pp. 1–6
work page 2020
-
[33]
D. H. Tashman et al., “Physical-Layer Security on Maximal Ratio Combining for SIMO Cognitive Radio Networks Over Cascaded κ -µ Fading Channels,” IEEE Transactions on Cognitive Commu- nications and Networking , vol. 7, no. 4, pp. 1244–1252, 2021
work page 2021
-
[34]
D. H. Tashman et al. , “On Securing Cognitive Radio Networks- Enabled SWIPT Over Cascaded κ -µ Fading Channels With Multiple Eavesdroppers,” IEEE Transactions on V ehicular Tech- nology, vol. 71, no. 1, pp. 478–488, 2022
work page 2022
-
[35]
D. H. Tashman et al. , “Performance Optimization of Energy- Harvesting Underlay Cognitive Radio Networks Using Rein- forcement Learning,” in 2023 International Wireless Communi- cations and Mobile Computing (IWCMC) , 2023, pp. 1160–1165
work page 2023
-
[36]
Federated Learning-based MARL for Strengthening Physical-Layer Security in B5G Networks,
D. H. Tashman et al. , “Federated Learning-based MARL for Strengthening Physical-Layer Security in B5G Networks,” i n ICC 2024 - IEEE International Conference on Communications , 2024, pp. 293–298
work page 2024
-
[37]
Network Slicing with MEC and Deep Rein- forcement Learning for the Internet of V ehicles,
Z. Mlika et al. , “Network Slicing with MEC and Deep Rein- forcement Learning for the Internet of V ehicles,” IEEE Network, vol. 35, no. 3, pp. 132–138, 2021
work page 2021
-
[38]
A. Filali et al. , “Dynamic SDN-Based Radio Access Network Slicing With Deep Reinforcement Learning for URLLC and eMBB Services,” IEEE Trans. Network Sci. Eng. , vol. 9, no. 4, pp. 2174–2187, 2022
work page 2022
-
[39]
A Deep Reinforcement Learning Approach for Service Migration in MEC-enabled V ehicular Networks,
A. Abouaomar et al., “A Deep Reinforcement Learning Approach for Service Migration in MEC-enabled V ehicular Networks,” in 2021 IEEE 46th Conference on Local Computer Networks (LCN) , 2021, pp. 273–280
work page 2021
-
[40]
D. H. Tashman et al. , “Optimizing Cognitive Networks: Rein- forcement Learning Meets Energy Harvesting Over Cascaded Channels,” IEEE Systems Journal , vol. 18, no. 4, pp. 1839–1848, 2024
work page 2024
-
[41]
Securing Cognitive IoT Networks: Re- inforcement Learning for Adaptive Physical Layer Defense,
D. H. Tashman et al. , “Securing Cognitive IoT Networks: Re- inforcement Learning for Adaptive Physical Layer Defense, ” in 2024 6th International Conference on Communications, Sign al Processing, and their Applications (ICCSPA) , 2024, pp. 1–6
work page 2024
-
[42]
B. Saglam et al. , “Deep Reinforcement Learning Based Joint Downlink Beamforming and RIS Configuration in RIS-Aided MU-MISO Systems Under Hardware Impairments and Imperfect CSI,” in 2023 IEEE International Conference on Communica- tions W orkshops (ICC W orkshops), 2023, pp. 66–72
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.