AI generates well-liked but templatic empathic responses
Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3
The pith
Large language models rely on one recurring sequence of empathic tactics in most of their responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs have learned and consistently deploy a well-liked template for expressing empathy. Across two studies totaling more than 4,500 responses, a structured sequence of the ten tactics matches 83 to 90 percent of LLM responses and covers 81 to 92 percent of each matched response. Human-written responses prove more diverse in how they combine the tactics.
What carries the argument
A taxonomy of 10 empathic language tactics assembled into a single recurring template that organizes most AI-generated replies.
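To make "matches" and "covers" concrete, here is a minimal Python sketch. It assumes each response has already been annotated as an ordered list of tactic labels, one per segment, and it uses a greedy in-order alignment as the matching rule. The tactic names, the four-step `TEMPLATE`, the `min_steps` threshold, and the alignment rule itself are illustrative assumptions; this review does not state the paper's ten labels or its actual matching procedure.

```python
# Hypothetical tactic labels and template; not the paper's actual taxonomy.
TEMPLATE = ["acknowledge", "validate", "paraphrase", "offer_support"]


def align_to_template(labels, template):
    """Greedily align annotated segments to template steps, in order.

    Returns (covered, steps_hit): the indices of segments that align to
    some template step, and the set of template positions that were used.
    A segment may repeat the current step or jump ahead, never go back.
    """
    covered = []
    steps_hit = set()
    pos = 0  # earliest template position still allowed
    for i, label in enumerate(labels):
        try:
            j = template.index(label, pos)
        except ValueError:
            continue  # label is off-template or would move backwards
        pos = j
        steps_hit.add(j)
        covered.append(i)
    return covered, steps_hit


def summarize(responses, template, min_steps=3):
    """Return (match rate, mean coverage among matched responses)."""
    coverages = []
    for labels in responses:
        covered, steps_hit = align_to_template(labels, template)
        if labels and len(steps_hit) >= min_steps:
            coverages.append(len(covered) / len(labels))
    match_rate = len(coverages) / len(responses) if responses else 0.0
    mean_coverage = sum(coverages) / len(coverages) if coverages else 0.0
    return match_rate, mean_coverage


# Toy annotations for illustration only.
templatic = ["acknowledge", "validate", "validate", "advice", "offer_support"]
free_form = ["advice", "paraphrase", "acknowledge"]
print(summarize([templatic, free_form], TEMPLATE))
```

Under this rule, the match rate is the share of responses that hit enough template steps in order, and coverage is the share of a matched response's segments that fall on the template. A stricter or looser rule would change both numbers, which is one reason the matching procedure matters for interpreting the reported percentages.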
If this is right
- AI empathic replies will tend to include the same core elements in roughly the same order.
- The template's high coverage rate may help explain why these responses receive strong ratings for empathy.
- Human responses draw on a wider range of tactic orders and combinations, producing greater variety.
- Training on large human datasets may have steered models toward an average but effective pattern.
- Future systems could start from this template and then deliberately vary the order or add extra tactics.
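One simple way to quantify "a wider range of tactic orders and combinations" is the Shannon entropy of the distribution over whole tactic sequences. The sketch below is an assumption about how diversity could be measured, not the paper's metric, and the label sequences are toy values.

```python
import math
from collections import Counter


def sequence_entropy(sequences):
    """Shannon entropy (in bits) of the distribution over tactic sequences.

    0.0 means every response uses the same sequence; higher values mean
    responses are spread over more distinct orders and combinations.
    """
    counts = Counter(tuple(seq) for seq in sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sequence")
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Toy data: hypothetical labels, not taken from the paper.
uniform_style = [["validate", "paraphrase"]] * 4
varied_style = [
    ["validate", "paraphrase"],
    ["paraphrase", "validate"],
    ["advice"],
    ["validate"],
]
print(sequence_entropy(uniform_style), sequence_entropy(varied_style))
```

Entropy over exact sequences is sensitive to response length and sample size, so a comparison between AI and human responses would need matched samples to be meaningful.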
Where Pith is reading between the lines
- Users may grow to expect this steady style from AI and notice departures from it.
- Repeated exposure to the same structure could eventually make AI empathy feel less personal.
- Applying the taxonomy to other AI social tasks such as giving advice would test whether similar templates appear elsewhere.
- The finding suggests a broader pattern in which AI produces safe, average versions of complex social language.
Load-bearing premise
The taxonomy of ten tactics captures the main functional parts of empathic language in both AI and human writing without leaving out key differences or introducing systematic bias in how responses are labeled.
What would settle it
If a fresh collection of several hundred human empathic responses, annotated with the same taxonomy, matched the AI template at rates above 80 percent, the reported difference in diversity would be undermined.
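Whether an observed match rate is credibly above 80 percent depends on sample size. A minimal sketch, assuming a Wilson score interval is an acceptable way to put bounds on the rate (the counts below are hypothetical, and the paper does not say which interval, if any, it uses):

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical outcome: 270 of 300 human responses match the template.
low, high = wilson_interval(270, 300)
print(f"match rate interval: {low:.3f} to {high:.3f}")
print("lower bound above 0.80:", low > 0.80)
```

If the lower bound of the interval clears 0.80, the human sample would match the template about as often as the AI responses do, which is the outcome described above as undermining the diversity contrast.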
Original abstract
Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template, a structured sequence of tactics, that matches between 83-90% of LLM responses (and 60-83% in a held out sample), and when those are matched, covers 81-92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs produce more formulaic empathic responses than humans by consistently deploying a template consisting of a structured sequence of 10 empathic tactics (e.g., validating feelings, paraphrasing). Across two studies with 3,265 AI-generated responses from six models and 1,290 human-written responses, this template matches 83-90% of LLM outputs (60-83% in a held-out sample) and covers 81-92% of their content when matched, while human responses exhibit greater diversity at the discourse-functional level. The work suggests this templatic structure may explain why users rate LLM empathy highly.
Significance. If the taxonomy is shown to be valid and independent, the results would provide a concrete empirical account of why LLM empathic responses are preferred and could inform the design of less formulaic AI systems for emotional support. The large sample sizes, multi-model comparison, and direct human-AI contrast are clear strengths that support the empirical core of the work.
major comments (3)
- [Methods (Taxonomy Development)] Methods section on taxonomy development: the paper provides no details on how the 10 empathic tactics were derived, validated against external frameworks, or assessed for inter-rater reliability. This is load-bearing for the central claim because, without evidence that the taxonomy was constructed independently of the LLM responses under study, the reported 83-90% template match rates risk being tautological (i.e., the scheme fits the data from which it was likely extracted).
- [Results (Template Matching)] Results section on template matching and held-out sample: the drop to 60-83% match on the held-out sample is presented as evidence of generalizability, but the manuscript does not specify how the held-out set was constructed, whether the taxonomy was frozen prior to its application, or what controls were used for response length, topic, and prompt construction. These omissions directly affect whether the diversity contrast with human responses can be interpreted as substantive rather than an artifact of the classification scheme.
- [Study Design and Annotation] Study design and annotation: no statistical tests for the reported percentages, no inter-annotator agreement metrics, and no comparison to established empathy taxonomies from psychology are described. This weakens the assertion that human responses are 'more diverse' rather than simply containing functional moves absent from the 10-tactic set.
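For reference, the inter-annotator agreement statistic requested above is commonly reported as Cohen's kappa for two coders. The following is a minimal sketch of that calculation, assuming each coder assigns exactly one tactic label per segment; the labels are hypothetical, and nothing here reflects how the paper's annotation was actually carried out.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for four segments (hypothetical tactic names).
coder_a = ["validate", "validate", "paraphrase", "paraphrase"]
coder_b = ["validate", "validate", "paraphrase", "validate"]
print(cohens_kappa(coder_a, coder_b))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when a few tactics dominate the annotations.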
minor comments (2)
- [Abstract] Abstract: the claim of '2 studies' is stated without clarifying the division of labor between the studies or key controls, which would help readers assess the scope immediately.
- [Discussion] Discussion: the implications section could more explicitly address whether the templatic nature might reduce perceived authenticity over repeated interactions, even if users initially rate it highly.
Simulated Authors' Rebuttal
We appreciate the referee's constructive feedback and recommendation for major revision. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.
- Referee: Methods section on taxonomy development: the paper provides no details on how the 10 empathic tactics were derived, validated against external frameworks, or assessed for inter-rater reliability. This is load-bearing for the central claim because, without evidence that the taxonomy was constructed independently of the LLM responses under study, the reported 83-90% template match rates risk being tautological (i.e., the scheme fits the data from which it was likely extracted).
  Authors: We thank the referee for highlighting this important omission. The taxonomy was developed through a grounded theory approach starting with a qualitative analysis of a subset of human-written responses to establish the 10 tactics, drawing on established psychological literature on empathic communication. LLM responses were analyzed only after the taxonomy was fixed. We have now expanded the Methods section to include a full description of this process, including the initial coding scheme, how tactics were refined, and a comparison to external frameworks. Additionally, we report inter-rater reliability from two independent coders on a sample of 200 responses (Cohen's kappa = 0.82). We believe this addresses the concern of tautology by demonstrating the taxonomy's independence and validity.
  Revision: yes
- Referee: Results section on template matching and held-out sample: the drop to 60-83% match on the held-out sample is presented as evidence of generalizability, but the manuscript does not specify how the held-out set was constructed, whether the taxonomy was frozen prior to its application, or what controls were used for response length, topic, and prompt construction. These omissions directly affect whether the diversity contrast with human responses can be interpreted as substantive rather than an artifact of the classification scheme.
  Authors: We agree that additional methodological details are necessary for interpreting the held-out results. The held-out sample consisted of 20% of the total responses (randomly selected after taxonomy development), with the taxonomy frozen prior to application to this set. We have added this information to the Results section. Regarding controls, all responses were generated or collected using matched prompts and topics across AI and human conditions, and we now report analyses controlling for response length by subsampling to equivalent token distributions. These additions clarify that the lower match rate in the held-out set reflects generalizability rather than overfitting, and the diversity differences persist under these controls.
  Revision: yes
- Referee: Study design and annotation: no statistical tests for the reported percentages, no inter-annotator agreement metrics, and no comparison to established empathy taxonomies from psychology are described. This weakens the assertion that human responses are 'more diverse' rather than simply containing functional moves absent from the 10-tactic set.
  Authors: We acknowledge these gaps in the original submission. We have added statistical tests (chi-squared tests for differences in tactic usage and template adherence rates, all p < 0.001) to the Results section. Inter-annotator agreement metrics are now reported as noted in the response to the first comment. Furthermore, we have included a new subsection comparing our taxonomy to established ones in psychology, such as the one proposed by Davis (1983) on empathy dimensions and more recent discourse-functional analyses. This comparison shows that our 10 tactics cover core elements while human responses exhibit greater variability in sequencing and combination, supporting the diversity claim beyond just absent moves.
  Revision: yes
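The chi-squared comparison mentioned in the last response can be sketched for its simplest case: a 2x2 table of template adherence (matched vs. unmatched) by source (AI vs. human). This is a hedged illustration only. The counts are toy values rather than the paper's data, and the rebuttal does not say whether a continuity correction or larger tables were used.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    No continuity correction. Returns (statistic, p_value) with 1 degree
    of freedom, where the p-value uses the identity
    P(X > x) = erfc(sqrt(x / 2)) for a chi-squared variable with df = 1.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy counts (not from the paper):
#            matched  unmatched
# AI            30        10
# human         10        30
statistic, p_value = chi2_2x2(30, 10, 10, 30)
print(statistic, p_value)
```

A small p-value here says only that adherence rates differ between sources; it does not by itself show that the taxonomy captures every functional move in the human responses.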
Circularity Check
No significant circularity in empirical taxonomy application
The paper reports an empirical study that develops a taxonomy of 10 empathic tactics and applies it to annotate and compare n = 3,265 LLM and n = 1,290 human responses, identifying a common template sequence that matches 83-90% of LLM outputs. No equations, parameter fits, derivations, or self-citation chains appear in the provided text that would reduce the match/coverage percentages to inputs by construction. The held-out sample (60-83% match) supplies an independent check, and the central claims rest on direct counting and contrast with human diversity rather than tautological redefinition or renaming of known results. While taxonomy construction could in principle introduce bias, the manuscript does not exhibit any of the enumerated circular patterns (self-definitional, fitted-input prediction, load-bearing self-citation, etc.), making the analysis self-contained against its own data.
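A held-out sample works as an independent check only if the template is fixed before the held-out responses are examined. A minimal sketch of that procedure, assuming a seeded random split: the 20% fraction echoes the simulated rebuttal, while the seed, the function names, and the `matches_template` predicate are hypothetical.

```python
import random


def split_held_out(items, fraction=0.2, seed=0):
    """Randomly set aside a fraction of items as a held-out sample.

    Returns (development, held_out), each in original order. The template
    should be derived from `development` only and then left unchanged.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(len(items)))
    rng.shuffle(indices)
    k = round(len(items) * fraction)
    held_idx = sorted(indices[:k])
    dev_idx = sorted(indices[k:])
    return [items[i] for i in dev_idx], [items[i] for i in held_idx]


def match_rate(items, matches_template):
    """Share of items for which the frozen predicate returns True."""
    if not items:
        return 0.0
    return sum(1 for item in items if matches_template(item)) / len(items)


# Toy usage: integers stand in for annotated responses.
development, held_out = split_held_out(list(range(10)), fraction=0.2, seed=1)
print(len(development), len(held_out))
```

Comparing `match_rate` on the two partitions with the same frozen predicate shows how much of the in-sample fit carries over to unseen responses.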
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The taxonomy of 10 empathic language tactics is a valid and sufficiently complete categorization for both AI-generated and human-written responses.
Forward citations
Cited by 2 Pith papers
- When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations
  AI companion platforms' hidden response policies in vulnerable conversations are inferred via a new taxonomy and IRL on 48k turns, revealing avoidance of corrective friction.
- When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
  Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.