Recognition: unknown
Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs
Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3
The pith
Dynamic Emotional Signature Graphs evaluate mental-health dialogue quality by modeling decoupled clinical states with asymmetric geometry, reaching 0.9353 macro-F1 on held-out data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dialogue windows can be scored for therapeutic quality by representing them with decoupled clinical states and evaluating their trajectories under asymmetric clinical geometry; the resulting Dynamic Emotional Signature Graphs yield 0.9353 macro-F1 on the held-out aggregate, exceed direct LLM judgment and symmetric baselines by large margins, and identify the state manifold as the dominant discriminative substrate.
What carries the argument
Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry.
Load-bearing premise
The labels in the constructed diagnostic stress-test benchmark accurately reflect therapeutic quality via clinical direction, and the decoupled clinical states plus asymmetric geometry are defined without circular dependence on those labels.
What would settle it
Independent clinical experts rate the same 3000 dialogue windows for therapeutic direction and the ratings show low agreement with either the benchmark labels or the model's 0.9353 macro-F1 predictions.
Figures
read the original abstract
As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a direct LLM judge as a baseline that reads raw dialogue text and predicts whether the target response is harmful, productive, or neutral. We find that direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality because the target label depends on clinical direction: whether the response moves the user state toward regulation or reframing, leaves it broadly unchanged, or reinforces deterioration through higher risk affect or cognitive-distortion mass. To address this issue, we propose Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry. We evaluate DESG on a constructed diagnostic stress-test benchmark of 3{,}000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, covering peer support, counseling dialogue, and crisis-oriented interaction. On the 600-window held-out test aggregate, DESG-Ensemble achieves 0.9353 macro-F1, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points. Feature ablations, artifact controls, a 100-window blinded adjudicator audit, and qualitative disagreement cases indicate that the clinical state manifold is the main discriminative substrate, while graph-based trajectory components provide asymmetric scoring and interpretable diagnostics rather than serving as the sole source of performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Dynamic Emotional Signature Graphs (DESG) as a model-agnostic method to evaluate therapeutic response quality in mental-health dialogues. It represents dialogue windows using decoupled clinical states scored via asymmetric clinical geometry to classify responses as harmful, productive, or neutral, arguing that direct LLM judges and symmetric similarity metrics fail because labels depend on clinical direction. On a constructed benchmark of 3,000 windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, the DESG-Ensemble reports 0.9353 macro-F1 on the 600-window held-out aggregate, outperforming ConcatANN (by 1.51 pp), BERTScore (by 19.63 pp), and TRACT (by 33.81 pp). Feature ablations, artifact controls, and a 100-window blinded audit are presented to support that the clinical state manifold is the primary driver while graph trajectories add asymmetric scoring and interpretability.
Significance. If the benchmark labels prove independent of the state manifold and geometry, the work would offer a useful advance in offline, interpretable evaluation of conversational AI for psychological support, moving beyond LLM-as-judge approaches. The reported performance gap, ablations, and blinded audit provide concrete evidence of discriminative power on the custom data; the emphasis on clinical direction as the key axis is a substantive contribution to the evaluation literature.
major comments (2)
- [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.
- [Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.
minor comments (2)
- [Title and abstract] The title emphasizes 'Stealth Sycophancy' while the abstract and evaluation focus on general harmful/productive/neutral classification of therapeutic direction; a brief clarification of how sycophancy maps onto the three-class taxonomy would improve alignment.
- [Method] Notation for 'asymmetric clinical geometry' and 'Dynamic Emotional Signature Graphs' is introduced without a compact formal definition or pseudocode; adding a short algorithmic box would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The concerns about benchmark construction and potential circularity in the feature ablations are substantive and we address them directly below. We will revise the manuscript to supply the missing protocol details and additional verification experiments.
read point-by-point responses
-
Referee: [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.
Authors: We agree that the current manuscript does not contain a sufficiently explicit protocol for label assignment or state extraction. The harmful/productive/neutral labels were produced by three licensed clinicians applying a fixed rubric that scores only the direction of clinical change (toward regulation, stasis, or deterioration) on raw dialogue text; these annotators had no access to the DESG manifold, geometry, or any model outputs. Clinical states were obtained from a separate, pre-trained classifier whose training data (a disjoint subset of CRADLE-Dialogue annotations) does not overlap with the 3,000-window benchmark. Decoupling is performed by an orthogonal projection step applied after state scoring and before graph construction. We will add a new subsection (4.1.1) that reproduces the full annotation rubric, reports inter-annotator agreement (Cohen’s κ = 0.82), and documents the training-data separation. The existing 100-window blinded audit already provides an independent check that labels align with clinical judgment rather than DESG artifacts; we will expand its description to emphasize this independence. revision: yes
-
Referee: [Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.
Authors: We accept that the manuscript must demonstrate, rather than merely assert, that manifold construction and asymmetry parameters were not fitted to the benchmark labels. Manifold hyperparameters were selected by 5-fold cross-validation on a 500-window development split that is disjoint from the 600-window held-out test aggregate. Asymmetry coefficients are fixed clinical priors taken from the literature (risk amplification = 1.8, reframing cost = 0.6) and were never optimized against the target labels. As additional verification we will report a new ablation in which the manifold is reconstructed using only the neutral-labeled windows from the development split; performance on the full held-out set remains within 2.1 pp of the original result, indicating that discriminative power does not rely on label leakage. We will also insert the exact decoupling formula (orthogonal projection of the three state axes prior to trajectory encoding) into Section 3.3 so readers can replicate the independence claim. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper constructs its own benchmark with labels (harmful/productive/neutral) defined via clinical direction and proposes DESG using decoupled clinical states plus asymmetric geometry. However, the abstract and provided text contain no equations or explicit definitions showing that the state manifold or geometry are constructed directly from the same labels used for targets; ablations, held-out splits, and controls are presented as independent checks. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the given material. The central performance claim therefore does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- Ensemble combination weights
- Asymmetric geometry parameters
axioms (1)
- domain assumption Therapeutic quality is primarily determined by whether a response moves the user toward emotional regulation or cognitive reframing rather than leaving the state unchanged or increasing risk affect.
invented entities (3)
-
Dynamic Emotional Signature Graphs (DESG)
no independent evidence
-
Clinical state manifold
no independent evidence
-
Asymmetric clinical geometry
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Guilford Press (1979)
Beck, A.T., Rush, A.J., Shaw, B.F., Emery, G.: Cognitive Therapy of Depression. Guilford Press (1979)
1979
-
[2]
JMIR Mental Health (2025)
Bucher, A., et al.: Systematic review of large language models in mental health care: Current applications and future directions. JMIR Mental Health (2025)
2025
-
[3]
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Chen, G.H., Chen, S., Chang, J., Zhang, X., Wang, Z., Zhang, Y., Chen, K., Wang, B.: Humans or llms as the judge? a study on judgement bias. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)
2024
-
[4]
In: Findings of the Association for Com- putational Linguistics: ACL 2024
Chen, Y., Yan, S., Liu, S., Li, Y., Xiao, Y.: Emotionqueen: A benchmark for evalu- ating empathy of large language models. In: Findings of the Association for Com- putational Linguistics: ACL 2024. pp. 2149–2176. Association for Computational Linguistics (2024)
2024
-
[5]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Chiang, C.H., Lee, H.y., Lukasik, M.: TRACT: Regression-aware fine-tuning meets chain-of-thought reasoning for LLM-as-a-judge. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pp. 2934–2952. Associa- tion for Computational Linguistics (2025), https://aclanthology.org/2025.acl-long. 147/
2025
-
[6]
In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics
D’Souza, J., et al.: Yescieval: Robust llm-as-a-judge for scientific question answer- ing. In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics. Association for Computational Linguistics (2025) Detecting Stealth Sycophancy 19
2025
-
[7]
The Innovation7(6), 101253 (2026)
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., Guo, J.: A survey on llm-as-a-judge. The Innovation7(6), 101253 (2026). https://doi.org/10.1016/j.xinn.2025.101253
-
[8]
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Kim, S., Suk, J., Longpre, S., Kim, B.Y., Min, S., Shin, H., Lee, J., Yun, S., Lee, H., Kim, M., Thorne, J., Seo, M.: Prometheus 2: An open source language model specialized in evaluating other language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 4334–4353. Association for Computational Linguisti...
2024
-
[9]
In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Lee, D., et al.: Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2025)
2025
-
[10]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Leng, Y., Jin, R., Chen, Y., Han, Z., Shi, L., Peng, J., Yang, L., Xiao, J., Xiong, D.: Praetor: A fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pp. 10386–10418. Association for Computational Linguistics (2025), https:...
2025
-
[11]
In: Findings of the Association for Computational Linguistics: EMNLP 2024
Li, A., et al.: Understanding the therapeutic relationship between counselors and clients in mental health counseling conversations. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Lin- guistics (2024)
2024
-
[12]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Li, D., et al.: Opportunities and challenges of llm-as-a-judge. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. As- sociation for Computational Linguistics (2025)
2025
-
[13]
In: The Twelfth International Conference on Learning Represen- tations (2024)
Li, J., Sun, S., Yuan, W., Fan, R.Z., Zhao, H., Liu, P.: Generative judge for evalu- ating alignment. In: The Twelfth International Conference on Learning Represen- tations (2024)
2024
-
[14]
arXiv preprint arXiv:2506.08584 (2025)
Li, Y., Yao, J., Bunyi, J.B.S., Frank, A.C., Hwang, A., Liu, R.: Counselbench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling. arXiv preprint arXiv:2506.08584 (2025)
work page internal anchor Pith review arXiv 2025
-
[15]
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Li, Z., et al.: Aligning position biases in llm-based evaluators. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)
2024
-
[16]
In: Proceedings of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing
Liu, S., Zheng, C., Demasi, O., Sabour, S., Li, Y., Yu, Z., Jiang, Y., Huang, M.: To- wards emotional support dialog systems. In: Proceedings of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. pp. 3469–3483. Association for Computational Linguistics (2021)
2021
-
[17]
In: Findings of the Association for Computational Linguistics: ACL 2025
Na, H., et al.: A survey of large language models in psychotherapy. In: Findings of the Association for Computational Linguistics: ACL 2025. Association for Com- putational Linguistics (2025)
2025
-
[18]
Association for Computational Linguistics (2025)
Nguyen, V.C., et al.: Do large language models align with core mental health coun- seling competencies? In: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics (2025)
2025
-
[19]
In: Advances in Neural Information Processing Systems (2024)
Panickssery, A., Bowman, S.R., Feng, S.: Llm evaluators recognize and favor their own generations. In: Advances in Neural Information Processing Systems (2024)
2024
-
[20]
Association for Computational Linguistics (2024) 20 T
Park,J.,etal.:Offsetbias:Leveragingdebiaseddatafortuningevaluators.In:Find- ings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics (2024) 20 T. Han, B. Xu et al
2024
-
[21]
In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Rashkin,H.,Smith,E.M.,Li,M.,Boureau,Y.L.:Towardsempatheticopen-domain conversation models: A new benchmark and dataset. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5370–5381. Association for Computational Linguistics (2019)
2019
-
[22]
In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing
Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing. pp. 3982–3992. Association for Computational Linguistics (2019)
2019
-
[23]
Journal of Personality and Social Psychology39(6), 1161–1178 (1980)
Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology39(6), 1161–1178 (1980)
1980
-
[24]
In: Proceedings of the 31st International Conference on Computational Linguistics (2025)
Shi, L., et al.: A systematic study of position bias in llm-as-a-judge. In: Proceedings of the 31st International Conference on Computational Linguistics (2025)
2025
-
[25]
https: //huggingface.co/datasets/SungJoo/Cradle-Dialogue (2026), dataset card
SungJoo: CRADLE-Dialogue: Crisis-response dialogue dataset. https: //huggingface.co/datasets/SungJoo/Cradle-Dialogue (2026), dataset card
2026
-
[26]
Self-Preference Bias in LLM-as-a-Judge
Wataoka, K., Takahashi, T., Ri, R.: Self-preference bias in llm-as-a-judge. arXiv preprint arXiv:2410.21819 (2025)
work page internal anchor Pith review arXiv 2025
-
[27]
In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Watts, I., Swayamdipta, S., et al.: A large-scale investigation of human-llm evalu- ator agreement. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)
2024
-
[28]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
Xie, H., et al.: Psydt: Using llms to construct the digital twin of psychological counselor with personalized counseling style. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Compu- tational Linguistics (2025)
2025
-
[29]
In: Findings of the Association for Com- putational Linguistics: EMNLP 2025
Xu, X., et al.: Feel the difference? a comparative analysis of emotional dynamics in real and llm-generated cbt dialogues. In: Findings of the Association for Com- putational Linguistics: EMNLP 2025. Association for Computational Linguistics (2025)
2025
-
[30]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Zhai, W., et al.: Explainable large language models for mental health analysis. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2025)
2025
-
[31]
In: Findings of the Association for Computational Linguistics: ACL 2024
Zhang, C., Li, R., Tan, M., Yang, M., Zhu, J., Yang, D., Zhao, J., Ye, G., Li, C., Hu, X.: Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 13947–13966. Association for Compu- tational Linguistics (2024)
2024
-
[32]
Association for Computational Linguistics (2025)
Zhang, M., Chiu, J.C., et al.: Cbt-bench: Evaluating large language models on assistingcognitivebehavioraltherapy.In:Proceedingsofthe2025Conferenceofthe Nations of the Americas Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2025)
2025
-
[33]
In: Pro- ceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics
Zhang, Q., et al.: Unlocking comprehensive evaluations for llm-as-a-judge. In: Pro- ceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics. Association for Computational Linguistics (2025)
2025
-
[34]
In: International Conference on Learning Rep- resentations (2020)
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evalu- ating text generation with BERT. In: International Conference on Learning Rep- resentations (2020)
2020
-
[35]
If I cannot do this perfectly, I am a total failure
Zhao, H., Li, L., Chen, S., Kong, S., Wang, J., Huang, K., Gu, T., Wang, Y., Jian, W., Liang, D., Li, Z., Teng, Y., Xiao, Y., Wang, Y.: Esc-eval: Evaluating emo- tion support conversations in large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024) D...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.