Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents
Pith reviewed 2026-05-11 01:34 UTC · model grok-4.3
The pith
RL training on empathetic agents improves responsiveness but leaves emotional state tracking unchanged under adversarial conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 versus 0.761), achieves zero dialogue collapses, and detects hidden intentions 47 percent more often. Yet the Emotional Consistency Score remains nearly flat and statistically indistinguishable from the base model, leading the authors to conclude that RL training improves emotional responsiveness without measurable gains in observable state tracking.
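The reported effect size (r = 0.688) is consistent with a rank-biserial correlation from a Mann-Whitney-style comparison of per-dialogue scores, though the review does not state which test was used. A minimal sketch under that assumption:

```python
def rank_biserial(scores_a, scores_b):
    """Rank-biserial effect size: the proportion of (a, b) pairs where a
    sample from scores_a exceeds one from scores_b, minus the reverse
    proportion. Ranges from -1 to 1; 0 means no stochastic dominance."""
    greater = sum(1 for a in scores_a for b in scores_b if a > b)
    lesser = sum(1 for a in scores_a for b in scores_b if a < b)
    return (greater - lesser) / (len(scores_a) * len(scores_b))
```

Identical score distributions give an effect near zero; complete separation of the two conditions gives 1.0.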
What carries the argument
The Adversarial Empathy Benchmark (AEB), comprising six psychologically grounded adversarial trajectory types, and the Emotional Consistency Score (ECS), which disentangles a model's capacity to track user emotional states from its capacity to improve them.
Load-bearing premise
The six psychologically grounded adversarial trajectory types in AEB and the ECS metric validly capture real-world emotional manipulation dynamics and successfully disentangle state tracking from responsiveness without confounding from the simulator or reward design.
What would settle it
An RL method that directly rewards accurate tracking of user emotional states: if such training produces significantly higher ECS scores than the base model on AEB, that would confirm both that standard RLVER genuinely lacks state-tracking gains and that ECS is sensitive enough to detect them.
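The proposed probe can be sketched as a composite reward that adds an explicit state-tracking term to the usual empathy signal. All names and the weight `alpha` here are hypothetical, not from the paper:

```python
def composite_reward(empathy_reward, predicted_state, true_state, alpha=0.5):
    """Blend the ordinary empathy reward with a bonus for correctly
    predicting the simulated user's current emotional state.
    `alpha` (an assumption) weights the tracking term."""
    tracking_bonus = 1.0 if predicted_state == true_state else 0.0
    return (1 - alpha) * empathy_reward + alpha * tracking_bonus
```

Training against such a reward and then re-running the AEB evaluation would directly test whether ECS responds to optimization pressure on state tracking.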
Original abstract
Reinforcement learning from verifiable emotion rewards (RLVER) has produced language models with strong empathetic performance, evaluated on benchmarks that assume cooperative, honest users. Yet real emotional interactions systematically violate this assumption: users gaslight, escalate, and pressure AI systems for unconditional validation, dynamics that cooperative benchmarks cannot surface. We construct the Adversarial Empathy Benchmark (AEB) and introduce the Emotional Consistency Score (ECS) to evaluate empathetic robustness under adversarial conditions. AEB comprises six psychologically grounded adversarial trajectory types with discriminative reward structures that penalize formulaic responses; ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them. In a controlled experiment across eight scenario-matched conditions (think and no-think conditions on two RLVER models and two base models, Qwen 1.5B and 7B, with 480 adversarial dialogues), RLVER-PPO-Think substantially outperforms the same-scale untuned baseline (0.963 vs. 0.761, \(p<0.001\), \(r=0.688\)), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS remains nearly flat and is not significantly different for RLVER-PPO-Think versus Base-7B-Think (\(p=0.650\)): RL training improves emotional responsiveness without measurable gains in observable state tracking. We interpret the ECS–FS (Final Score) gap as a behavioral/legibility dissociation inside this simulator family, not as evidence about internal understanding or clinical readiness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Adversarial Empathy Benchmark (AEB) with six psychologically grounded adversarial trajectory types and the Emotional Consistency Score (ECS) to probe the robustness of RLVER models (RL from verifiable emotion rewards) under non-cooperative user interactions. In experiments across eight conditions on Qwen 1.5B/7B models using 480 adversarial dialogues, RLVER-PPO-Think outperforms the untuned baseline on responsiveness (0.963 vs. 0.761, p<0.001, r=0.688) with zero collapses and higher hidden-intention detection, but ECS shows no significant difference (p=0.650), which the authors interpret as a behavioral dissociation between emotional responsiveness and observable state tracking.
Significance. If the dissociation holds under independent validation of AEB and ECS, the result would be significant for understanding limitations of RL in training empathetic agents, showing that gains in responsiveness metrics need not translate to better state tracking under adversarial pressure. The controlled experiment design, statistical reporting with p-values and effect sizes, and introduction of new evaluation tools are strengths that could inform future work on robust alignment.
major comments (2)
- [AEB construction] AEB construction section: the six adversarial trajectory types and their 'discriminative reward structures' are presented without validation details, generation procedure, or ablations confirming independence from the RLVER reward simulator. This is load-bearing for the central claim, as any shared dependencies would make the flat ECS result (p=0.650) an artifact rather than evidence of unchanged state tracking.
- [ECS definition] ECS definition and computation: no equations, pseudocode, or practical implementation details are given for how ECS disentangles state tracking from responsiveness (e.g., how the state-tracking component is scored independently of response quality). Without this, the dissociation interpretation cannot be assessed, especially given the significant FS improvement in the same models.
minor comments (1)
- [Results] The results section reports 480 dialogues but lacks a table or breakdown showing per-condition sample sizes, variance, or exact ECS/FS formulas used in the statistical tests.
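The per-condition breakdown the comment asks for could be produced with a few lines of bookkeeping. A minimal sketch, assuming each dialogue record carries a condition label and a score field (the field names are illustrative; the paper's actual data schema is not given):

```python
from collections import defaultdict
from statistics import mean, stdev

def per_condition_summary(dialogues, score_key="fs"):
    """Group dialogue records by condition and report n, mean, and SD."""
    by_cond = defaultdict(list)
    for d in dialogues:
        by_cond[d["condition"]].append(d[score_key])
    return {
        cond: {
            "n": len(scores),
            "mean": mean(scores),
            "sd": stdev(scores) if len(scores) > 1 else 0.0,
        }
        for cond, scores in by_cond.items()
    }
```

Reporting this table for all eight conditions would also make the variance inputs to the significance tests auditable.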
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential significance of the observed dissociation between emotional responsiveness and state tracking under adversarial conditions. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [AEB construction] AEB construction section: the six adversarial trajectory types and their 'discriminative reward structures' are presented without validation details, generation procedure, or ablations confirming independence from the RLVER reward simulator. This is load-bearing for the central claim, as any shared dependencies would make the flat ECS result (p=0.650) an artifact rather than evidence of unchanged state tracking.
Authors: We agree that additional details on AEB construction are needed to support the central claim. The submitted manuscript describes the six trajectory types as psychologically grounded (drawing from established patterns such as gaslighting, escalation, and validation pressure) with reward structures designed to penalize formulaic responses, but it lacks the full generation procedure, expert validation metrics, and ablations for independence from the RLVER simulator. We will add a dedicated subsection detailing the template-based synthesis process followed by human filtering, inter-rater reliability scores from expert annotations, and new ablations that recompute ECS under independently varied reward structures to confirm the flat result (p=0.650) is not an artifact. These changes will be included in the revised manuscript. revision: yes
- Referee: [ECS definition] ECS definition and computation: no equations, pseudocode, or practical implementation details are given for how ECS disentangles state tracking from responsiveness (e.g., how the state-tracking component is scored independently of response quality). Without this, the dissociation interpretation cannot be assessed, especially given the significant FS improvement in the same models.
Authors: We acknowledge that the ECS definition requires more explicit implementation details. The manuscript states that ECS formally disentangles state tracking from responsiveness improvement, with state tracking scored via separate annotations on inferred emotional states from dialogue history (distinct from response quality) and responsiveness via the Final Score (FS). However, we did not include the equation, pseudocode, or annotation protocol. We will revise Section 3 to add the formal definition ECS = state_tracking_accuracy_delta - responsiveness_improvement, along with pseudocode for the scoring pipeline and clarification that state-tracking annotations are performed independently by raters blind to model responses. This will enable direct assessment of the dissociation, particularly alongside the observed FS gains. revision: yes
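Taking the rebuttal's stated form at face value, the promised definition can be sketched as follows. This is a hedged reconstruction, not the paper's actual metric: the input scales and annotation sources are assumptions.

```python
def ecs_gap(tracking_acc_model, tracking_acc_base, fs_model, fs_base):
    """Dissociation score per the rebuttal's proposed form:
    state-tracking accuracy delta minus responsiveness improvement.
    A tracking delta near zero alongside a large FS gap yields a
    strongly negative value, matching the dissociation the paper reports."""
    state_tracking_delta = tracking_acc_model - tracking_acc_base
    responsiveness_improvement = fs_model - fs_base
    return state_tracking_delta - responsiveness_improvement
```

With the reported FS values (0.963 vs. 0.761) and a flat tracking delta, the score is roughly -0.202, i.e., all of the gain shows up in responsiveness rather than tracking.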
Circularity Check
No significant circularity: empirical evaluation is self-contained
full rationale
The paper reports a controlled empirical study comparing RLVER-trained models to untuned baselines on held-out adversarial dialogues from the AEB benchmark, using direct performance metrics and statistical tests (p-values, effect sizes). No mathematical derivation chain is present that reduces a claimed result to a fitted parameter, self-referential definition, or self-citation load-bearing premise; ECS is introduced as a new disentangling metric whose application yields the flat result as an observed outcome rather than a constructed tautology. The interpretation of responsiveness gains without state-tracking improvement follows from the experimental data without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The six adversarial trajectory types are psychologically grounded and produce discriminative reward structures that penalize formulaic responses.
- domain assumption ECS formally disentangles a model's capacity to track user emotional states from its capacity to improve them.
invented entities (2)
- Adversarial Empathy Benchmark (AEB): no independent evidence
- Emotional Consistency Score (ECS): no independent evidence
Reference graph
Works this paper leans on
- [1] Godfrey T. Barrett-Lennard. Dimensions of therapist response as causal factors in therapeutic change. Psychological Monographs: General and Applied, 76(43):1–36, 1962.
- [2] Myra Cheng, Cinoo Lee, Pranav Khadpe, Sunny Yu, Dyllan Han, and Dan Jurafsky. Sycophantic AI decreases prosocial intentions and promotes dependence. Science, 391(6792):eaec8352, 2026a. doi: 10.1126/science.aec8352. Myra Cheng, Sunny Yu, Cinoo Lee, Pranav Khadpe, Lujain Ibrahim, and Dan Jurafsky. ELEPHANT: Measuring and understandin...
- [3] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2023.
- [4] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
- [5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [6] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [7] Eunkyung Jo, Daniel A. Epstein, Hyunhoon Jung, and Young-Ho Kim. Understanding the benefits and challenges of deploying conversational AI leveraging large language models for public health intervention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023.
- [8] Victoria E. Johnson, Kevin L. Nadal, Dani R. G. Sissoko, and Rukiya King. Gaslighting, emotional abuse, and the manipulation of reality. Women & Therapy, 42(1–2):1–13, 2019.
- [9] Elita Lobo, Chirag Agarwal, and Himabindu Lakkaraju. On the impact of fine-tuning on chain-of-thought reasoning. arXiv preprint arXiv:2411.15382, 2024.
- [10] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, 2022.
- [11] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- [12] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [13] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yu Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [14] Peisong Wang, Ruotian Ma, Bang Zhang, Xingyu Chen, Zhiwei He, Kang Luo, Qingsong Lv, Qingxuan Jiang, Zheng Xie, Shanyi Wang, Yuan Li, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. RLVER: Reinforcement learning with verifiable emotion rewards for empathetic agents. arXiv preprint arXiv:2507.03112, 2025.
- [15] Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, and Xiaolong Li. Sentient agent as a judge: Evaluating higher-order social cognition in large language models. arXiv preprint arXiv:2505.02847, 2025.
- [16] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.