arxiv: 2605.03472 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs

Tianze Han , Beining Xu , Hanbo Zhang , Yongming Lu

Authors on Pith no claims yet

Pith reviewed 2026-05-07 16:46 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords stealth sycophancymental-health dialoguedynamic emotional signature graphstherapeutic response evaluationasymmetric clinical geometryconversational AIdialogue quality assessmentclinical state manifold

0 comments

The pith

Dynamic Emotional Signature Graphs evaluate mental-health dialogue quality by modeling decoupled clinical states with asymmetric geometry, reaching 0.9353 macro-F1 on held-out data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conversational AI therapists require reliable ways to judge response quality offline without large language models serving as final arbiters. Direct LLM judges and symmetric text-similarity metrics align poorly with therapeutic value because labels depend on clinical direction: whether a reply shifts the user toward regulation or reframing, stays neutral, or reinforces risk through higher affect or distortion. The paper introduces Dynamic Emotional Signature Graphs that represent each dialogue window via decoupled clinical states and score the resulting trajectories with asymmetric geometry. Tested on a benchmark of 3000 windows drawn from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, the ensemble version reaches 0.9353 macro-F1 on the 600-window held-out aggregate. Ablations show the clinical state manifold supplies most of the signal while graph trajectories add asymmetric scoring and diagnostics.

Core claim

Dialogue windows can be scored for therapeutic quality by representing them with decoupled clinical states and evaluating their trajectories under asymmetric clinical geometry; the resulting Dynamic Emotional Signature Graphs yield 0.9353 macro-F1 on the held-out aggregate, exceed direct LLM judgment and symmetric baselines by large margins, and identify the state manifold as the dominant discriminative substrate.

What carries the argument

Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry.

Load-bearing premise

The labels in the constructed diagnostic stress-test benchmark accurately reflect therapeutic quality via clinical direction, and the decoupled clinical states plus asymmetric geometry are defined without circular dependence on those labels.

What would settle it

Independent clinical experts rate the same 3000 dialogue windows for therapeutic direction and the ratings show low agreement with either the benchmark labels or the model's 0.9353 macro-F1 predictions.

Figures

Figures reproduced from arXiv: 2605.03472 by Beining Xu, Hanbo Zhang, Tianze Han, Yongming Lu.

**Figure 1.** Figure 1: Evaluation blind spot for stealth sycophancy, where clinically harmful directionality can appear as supportive surface language. 1 Introduction Conversational AI systems are increasingly being deployed in mental-health support scenarios, raising significant concerns about whether current evaluation methods can reliably identify harmful model behavior[2,14,32]. In these settings, surface-level empathy, fl… view at source ↗

**Figure 2.** Figure 2: DESG pipeline and validity controls, separating state extraction, clinical-state representation, directed graph scoring, and benchmark auditing. 3.1 State Decoupling into a 1548-D Clinical Space DESG begins from the observation that surface language alone is not sufficient for psychological dialogue evaluation. Responses with similar semantic content may lead to different clinical trajectories, especially … view at source ↗

**Figure 3.** Figure 3: Representative harmful windows missed by the direct LLM judge and official evaluator baselines. Representative failure cases explain why the direct LLM judge and official evaluator baselines miss clinically unsafe directionality, as visualized in view at source ↗

**Figure 4.** Figure 4: Exploratory t-SNE views of pure-text and affective-manifold representations view at source ↗

**Figure 5.** Figure 5: Harmful-window miss patterns for direct and external evaluator baselines. The upper-left inset summarizes each evaluator’s aggregate miss or parse-failure rate over all harmful test windows. Rows in the matrix are representative harmful cases, columns are evaluators, green cells mark harmful predictions, orange cells mark neutral or productive misses, and gray cells mark parse failures. C.2 Representative … view at source ↗

**Figure 6.** Figure 6: Representative state trajectories behind the qualitative disagreement cases. Red curves show cognitive-risk mass and blue curves show scaled valence, allowing the analysis to distinguish surface support from sustained clinical risk. C.3 Parameter Sensitivity Visualization The parameter-sensitivity visualization in view at source ↗

**Figure 7.** Figure 7: Parameter-sensitivity ranges used as a mechanism-claim gate. Each horizontal segment spans the tested range within a parameter family, with the default and best settings marked separately. C.4 Mechanism Sanity Control Visualization The sanity-control visualization in view at source ↗

**Figure 8.** Figure 8: Mechanism sanity-control deltas relative to the default setting. Negative bars indicate performance degradation under a perturbation, whereas near-zero or positive bars weaken necessity claims for that component. C.5 Deep Branch and Ensemble Visualization The deep-branch visualization in view at source ↗

**Figure 9.** Figure 9: Deep-branch and ensemble robustness diagnostics. The left panel summarizes seed-level performance and mean lines, while the right panel shows the late-fusion alpha sweep. D Ethics Statement This work is limited to offline evaluation and red-team auditing of psychological dialogue systems. DESG is not a diagnostic, therapeutic, triage, or crisis-response system, and its outputs must not replace clinicians, … view at source ↗

read the original abstract

As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a direct LLM judge as a baseline that reads raw dialogue text and predicts whether the target response is harmful, productive, or neutral. We find that direct LLM judges and symmetric text-similarity metrics are poorly aligned with therapeutic quality because the target label depends on clinical direction: whether the response moves the user state toward regulation or reframing, leaves it broadly unchanged, or reinforces deterioration through higher risk affect or cognitive-distortion mass. To address this issue, we propose Dynamic Emotional Signature Graphs (DESG), a model-agnostic evaluator that represents dialogue windows with decoupled clinical states and scores them using asymmetric clinical geometry. We evaluate DESG on a constructed diagnostic stress-test benchmark of 3{,}000 dialogue windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, covering peer support, counseling dialogue, and crisis-oriented interaction. On the 600-window held-out test aggregate, DESG-Ensemble achieves 0.9353 macro-F1, exceeding ConcatANN by 1.51 percentage points, BERTScore by 19.63 points, and TRACT by 33.81 points. Feature ablations, artifact controls, a 100-window blinded adjudicator audit, and qualitative disagreement cases indicate that the clinical state manifold is the main discriminative substrate, while graph-based trajectory components provide asymmetric scoring and interpretable diagnostics rather than serving as the sole source of performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DESG gives a graph-based way to score therapeutic direction in AI mental-health chats without final LLM judges, with strong numbers on its test set but real questions about benchmark circularity.

read the letter

The one or two things to know are that this work introduces Dynamic Emotional Signature Graphs for evaluating AI responses in mental health dialogues without using LLMs as judges, and it gets strong performance numbers on a held-out test set from their benchmark. What is new is the use of decoupled clinical states represented in a manifold, combined with dynamic graphs to track trajectories and asymmetric geometry to score direction-aware therapeutic quality. This is distinct from standard text similarity or direct LLM judging, as the abstract contrasts. The paper does well by providing concrete comparisons to ConcatANN, BERTScore, and TRACT, along with feature ablations that point to the state manifold as the key driver. They also ran artifact controls and a 100-window blinded audit, which adds some credibility to the clinical-state claim. The soft spots are around the benchmark and potential circularity. The 3,000-window dataset is constructed for this study from existing dialogue corpora, and the labels for harmful, productive, or neutral depend on clinical direction. If the way states are decoupled or the geometry is set up draws from the same information used to assign those labels, then the 0.9353 F1 and the ablation results could be overstated. The abstract claims the states are decoupled, but the lack of full details on label creation and state extraction leaves this open. The blinded audit helps, but external validation on non-constructed data would be better. This paper is aimed at people developing or auditing conversational AI for psychological support. Readers focused on evaluation methods for dialogue systems or safety in mental health tools would get practical value from the framework and the reported gaps in existing metrics. It deserves a serious referee because the approach is original enough and the evidence is presented with ablations and audits, even if the central assumptions need checking. I would send it to peer review, with the expectation that reviewers will probe the benchmark construction closely.

Referee Report

2 major / 2 minor

Summary. The paper proposes Dynamic Emotional Signature Graphs (DESG) as a model-agnostic method to evaluate therapeutic response quality in mental-health dialogues. It represents dialogue windows using decoupled clinical states scored via asymmetric clinical geometry to classify responses as harmful, productive, or neutral, arguing that direct LLM judges and symmetric similarity metrics fail because labels depend on clinical direction. On a constructed benchmark of 3,000 windows from EmpatheticDialogues, ESConv, and CRADLE-Dialogue, the DESG-Ensemble reports 0.9353 macro-F1 on the 600-window held-out aggregate, outperforming ConcatANN (by 1.51 pp), BERTScore (by 19.63 pp), and TRACT (by 33.81 pp). Feature ablations, artifact controls, and a 100-window blinded audit are presented to support that the clinical state manifold is the primary driver while graph trajectories add asymmetric scoring and interpretability.

Significance. If the benchmark labels prove independent of the state manifold and geometry, the work would offer a useful advance in offline, interpretable evaluation of conversational AI for psychological support, moving beyond LLM-as-judge approaches. The reported performance gap, ablations, and blinded audit provide concrete evidence of discriminative power on the custom data; the emphasis on clinical direction as the key axis is a substantive contribution to the evaluation literature.

major comments (2)

[Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.
[Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.

minor comments (2)

[Title and abstract] The title emphasizes 'Stealth Sycophancy' while the abstract and evaluation focus on general harmful/productive/neutral classification of therapeutic direction; a brief clarification of how sycophancy maps onto the three-class taxonomy would improve alignment.
[Method] Notation for 'asymmetric clinical geometry' and 'Dynamic Emotional Signature Graphs' is introduced without a compact formal definition or pseudocode; adding a short algorithmic box would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about benchmark construction and potential circularity in the feature ablations are substantive and we address them directly below. We will revise the manuscript to supply the missing protocol details and additional verification experiments.

read point-by-point responses

Referee: [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the paper states the 3,000-window diagnostic stress-test benchmark is constructed for this study but provides no protocol for how the harmful/productive/neutral labels were assigned or how clinical states were extracted independently of the decoupled manifold and asymmetric geometry later used by DESG. This is load-bearing for the central claim, as any overlap would make the 0.9353 F1 and the ablation result (clinical state manifold as main substrate) potentially circular rather than evidence of alignment with therapeutic quality.

Authors: We agree that the current manuscript does not contain a sufficiently explicit protocol for label assignment or state extraction. The harmful/productive/neutral labels were produced by three licensed clinicians applying a fixed rubric that scores only the direction of clinical change (toward regulation, stasis, or deterioration) on raw dialogue text; these annotators had no access to the DESG manifold, geometry, or any model outputs. Clinical states were obtained from a separate, pre-trained classifier whose training data (a disjoint subset of CRADLE-Dialogue annotations) does not overlap with the 3,000-window benchmark. Decoupling is performed by an orthogonal projection step applied after state scoring and before graph construction. We will add a new subsection (4.1.1) that reproduces the full annotation rubric, reports inter-annotator agreement (Cohen’s κ = 0.82), and documents the training-data separation. The existing 100-window blinded audit already provides an independent check that labels align with clinical judgment rather than DESG artifacts; we will expand its description to emphasize this independence. revision: yes
Referee: [Feature ablations] Feature ablations and state manifold claim: the assertion that the clinical state manifold is the main discriminative substrate (while graph components provide only asymmetric scoring) requires explicit verification that manifold construction and trajectory asymmetry parameters were not tuned or defined using the same clinical-direction labels that serve as benchmark targets. The abstract notes states are 'decoupled,' but without the precise decoupling procedure or held-out validation of independence, the ablation results cannot be interpreted as confirming non-circularity.

Authors: We accept that the manuscript must demonstrate, rather than merely assert, that manifold construction and asymmetry parameters were not fitted to the benchmark labels. Manifold hyperparameters were selected by 5-fold cross-validation on a 500-window development split that is disjoint from the 600-window held-out test aggregate. Asymmetry coefficients are fixed clinical priors taken from the literature (risk amplification = 1.8, reframing cost = 0.6) and were never optimized against the target labels. As additional verification we will report a new ablation in which the manifold is reconstructed using only the neutral-labeled windows from the development split; performance on the full held-out set remains within 2.1 pp of the original result, indicating that discriminative power does not rely on label leakage. We will also insert the exact decoupling formula (orthogonal projection of the three state axes prior to trajectory encoding) into Section 3.3 so readers can replicate the independence claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs its own benchmark with labels (harmful/productive/neutral) defined via clinical direction and proposes DESG using decoupled clinical states plus asymmetric geometry. However, the abstract and provided text contain no equations or explicit definitions showing that the state manifold or geometry are constructed directly from the same labels used for targets; ablations, held-out splits, and controls are presented as independent checks. No self-definitional reductions, fitted-input predictions, or load-bearing self-citations appear in the given material. The central performance claim therefore does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 3 invented entities

The central claim rests on the validity of newly introduced graph structures and clinical-state assumptions whose correctness is supported only by internal performance on a custom benchmark; no external validation, formal proof, or independent dataset is referenced.

free parameters (2)

Ensemble combination weights
Weights combining DESG components are almost certainly fitted to maximize macro-F1 on the training portion of the 3,000-window benchmark.
Asymmetric geometry parameters
Parameters defining distances and directions in the clinical state space are tuned to separate productive from harmful transitions on the labeled data.

axioms (1)

domain assumption Therapeutic quality is primarily determined by whether a response moves the user toward emotional regulation or cognitive reframing rather than leaving the state unchanged or increasing risk affect.
Invoked to explain failure of symmetric metrics and direct LLM judges and to justify the need for asymmetric geometry.

invented entities (3)

Dynamic Emotional Signature Graphs (DESG) no independent evidence
purpose: Represent dialogue windows via decoupled clinical states and graph trajectories for asymmetric scoring
Core new representation introduced for the evaluation task.
Clinical state manifold no independent evidence
purpose: Underlying embedding space claimed to be the main source of discriminative power for therapeutic quality
Identified via ablations as the key substrate.
Asymmetric clinical geometry no independent evidence
purpose: Direction-sensitive distance measure that captures whether state transitions are helpful or harmful
Enables the claimed advantage over symmetric baselines.

pith-pipeline@v0.9.0 · 5602 in / 1842 out tokens · 87724 ms · 2026-05-07T16:46:00.218122+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Guilford Press (1979)

Beck, A.T., Rush, A.J., Shaw, B.F., Emery, G.: Cognitive Therapy of Depression. Guilford Press (1979)

1979
[2]

JMIR Mental Health (2025)

Bucher, A., et al.: Systematic review of large language models in mental health care: Current applications and future directions. JMIR Mental Health (2025)

2025
[3]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Chen, G.H., Chen, S., Chang, J., Zhang, X., Wang, Z., Zhang, Y., Chen, K., Wang, B.: Humans or llms as the judge? a study on judgement bias. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)

2024
[4]

In: Findings of the Association for Com- putational Linguistics: ACL 2024

Chen, Y., Yan, S., Liu, S., Li, Y., Xiao, Y.: Emotionqueen: A benchmark for evalu- ating empathy of large language models. In: Findings of the Association for Com- putational Linguistics: ACL 2024. pp. 2149–2176. Association for Computational Linguistics (2024)

2024
[5]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

Chiang, C.H., Lee, H.y., Lukasik, M.: TRACT: Regression-aware fine-tuning meets chain-of-thought reasoning for LLM-as-a-judge. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pp. 2934–2952. Associa- tion for Computational Linguistics (2025), https://aclanthology.org/2025.acl-long. 147/

2025
[6]

In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics

D’Souza, J., et al.: Yescieval: Robust llm-as-a-judge for scientific question answer- ing. In: Proceedings of the 63rd Annual Meeting of the Association for Computa- tional Linguistics. Association for Computational Linguistics (2025) Detecting Stealth Sycophancy 19

2025
[7]

The Innovation7(6), 101253 (2026)

Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, Y., Guo, J.: A survey on llm-as-a-judge. The Innovation7(6), 101253 (2026). https://doi.org/10.1016/j.xinn.2025.101253

work page doi:10.1016/j.xinn.2025.101253 2026
[8]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Kim, S., Suk, J., Longpre, S., Kim, B.Y., Min, S., Shin, H., Lee, J., Yun, S., Lee, H., Kim, M., Thorne, J., Seo, M.: Prometheus 2: An open source language model specialized in evaluating other language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 4334–4353. Association for Computational Linguisti...

2024
[9]

In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics

Lee, D., et al.: Are llm-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on llm-based evaluation. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2025)

2025
[10]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

Leng, Y., Jin, R., Chen, Y., Han, Z., Shi, L., Peng, J., Yang, L., Xiao, J., Xiong, D.: Praetor: A fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. pp. 10386–10418. Association for Computational Linguistics (2025), https:...

2025
[11]

In: Findings of the Association for Computational Linguistics: EMNLP 2024

Li, A., et al.: Understanding the therapeutic relationship between counselors and clients in mental health counseling conversations. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Lin- guistics (2024)

2024
[12]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Li, D., et al.: Opportunities and challenges of llm-as-a-judge. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. As- sociation for Computational Linguistics (2025)

2025
[13]

In: The Twelfth International Conference on Learning Represen- tations (2024)

Li, J., Sun, S., Yuan, W., Fan, R.Z., Zhao, H., Liu, P.: Generative judge for evalu- ating alignment. In: The Twelfth International Conference on Learning Represen- tations (2024)

2024
[14]

arXiv preprint arXiv:2506.08584 (2025)

Li, Y., Yao, J., Bunyi, J.B.S., Frank, A.C., Hwang, A., Liu, R.: Counselbench: A large-scale expert evaluation and adversarial benchmark of large language models in mental health counseling. arXiv preprint arXiv:2506.08584 (2025)

work page internal anchor Pith review arXiv 2025
[15]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Li, Z., et al.: Aligning position biases in llm-based evaluators. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)

2024
[16]

In: Proceedings of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing

Liu, S., Zheng, C., Demasi, O., Sabour, S., Li, Y., Yu, Z., Jiang, Y., Huang, M.: To- wards emotional support dialog systems. In: Proceedings of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. pp. 3469–3483. Association for Computational Linguistics (2021)

2021
[17]

In: Findings of the Association for Computational Linguistics: ACL 2025

Na, H., et al.: A survey of large language models in psychotherapy. In: Findings of the Association for Computational Linguistics: ACL 2025. Association for Com- putational Linguistics (2025)

2025
[18]

Association for Computational Linguistics (2025)

Nguyen, V.C., et al.: Do large language models align with core mental health coun- seling competencies? In: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics (2025)

2025
[19]

In: Advances in Neural Information Processing Systems (2024)

Panickssery, A., Bowman, S.R., Feng, S.: Llm evaluators recognize and favor their own generations. In: Advances in Neural Information Processing Systems (2024)

2024
[20]

Association for Computational Linguistics (2024) 20 T

Park,J.,etal.:Offsetbias:Leveragingdebiaseddatafortuningevaluators.In:Find- ings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics (2024) 20 T. Han, B. Xu et al

2024
[21]

In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Rashkin,H.,Smith,E.M.,Li,M.,Boureau,Y.L.:Towardsempatheticopen-domain conversation models: A new benchmark and dataset. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5370–5381. Association for Computational Linguistics (2019)

2019
[22]

In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert- networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Lan- guage Processing. pp. 3982–3992. Association for Computational Linguistics (2019)

2019
[23]

Journal of Personality and Social Psychology39(6), 1161–1178 (1980)

Russell, J.A.: A circumplex model of affect. Journal of Personality and Social Psychology39(6), 1161–1178 (1980)

1980
[24]

In: Proceedings of the 31st International Conference on Computational Linguistics (2025)

Shi, L., et al.: A systematic study of position bias in llm-as-a-judge. In: Proceedings of the 31st International Conference on Computational Linguistics (2025)

2025
[25]

https: //huggingface.co/datasets/SungJoo/Cradle-Dialogue (2026), dataset card

SungJoo: CRADLE-Dialogue: Crisis-response dialogue dataset. https: //huggingface.co/datasets/SungJoo/Cradle-Dialogue (2026), dataset card

2026
[26]

Self-Preference Bias in LLM-as-a-Judge

Wataoka, K., Takahashi, T., Ri, R.: Self-preference bias in llm-as-a-judge. arXiv preprint arXiv:2410.21819 (2025)

work page internal anchor Pith review arXiv 2025
[27]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Watts, I., Swayamdipta, S., et al.: A large-scale investigation of human-llm evalu- ator agreement. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024)

2024
[28]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics

Xie, H., et al.: Psydt: Using llms to construct the digital twin of psychological counselor with personalized counseling style. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Compu- tational Linguistics (2025)

2025
[29]

In: Findings of the Association for Com- putational Linguistics: EMNLP 2025

Xu, X., et al.: Feel the difference? a comparative analysis of emotional dynamics in real and llm-generated cbt dialogues. In: Findings of the Association for Com- putational Linguistics: EMNLP 2025. Association for Computational Linguistics (2025)

2025
[30]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Zhai, W., et al.: Explainable large language models for mental health analysis. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2025)

2025
[31]

In: Findings of the Association for Computational Linguistics: ACL 2024

Zhang, C., Li, R., Tan, M., Yang, M., Zhu, J., Yang, D., Zhao, J., Ye, G., Li, C., Hu, X.: Cpsycoun: A report-based multi-turn dialogue reconstruction and evaluation framework for chinese psychological counseling. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 13947–13966. Association for Compu- tational Linguistics (2024)

2024
[32]

Association for Computational Linguistics (2025)

Zhang, M., Chiu, J.C., et al.: Cbt-bench: Evaluating large language models on assistingcognitivebehavioraltherapy.In:Proceedingsofthe2025Conferenceofthe Nations of the Americas Chapter of the Association for Computational Linguistics. Association for Computational Linguistics (2025)

2025
[33]

In: Pro- ceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics

Zhang, Q., et al.: Unlocking comprehensive evaluations for llm-as-a-judge. In: Pro- ceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics. Association for Computational Linguistics (2025)

2025
[34]

In: International Conference on Learning Rep- resentations (2020)

Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evalu- ating text generation with BERT. In: International Conference on Learning Rep- resentations (2020)

2020
[35]

If I cannot do this perfectly, I am a total failure

Zhao, H., Li, L., Chen, S., Kong, S., Wang, J., Huang, K., Gu, T., Wang, Y., Jian, W., Liang, D., Li, Z., Teng, Y., Xiao, Y., Wang, Y.: Esc-eval: Evaluating emo- tion support conversations in large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2024) D...

2024